Facade Commits
The facade phase extracts commit data from git repositories using git log. This page covers the bare clone design, log parsing, per-file commit rows, and aggregate computation.
Bare clone vs full clone
Aveloxis uses two types of git clones for different purposes:
| Type | Command | Persistence | Purpose |
|---|---|---|---|
| Bare clone | git clone --bare | Permanent | Facade phase (git log parsing) |
| Full clone | Local checkout from bare clone | Temporary | Analysis phase (dependency scanning, scc) |
Bare clones
Bare clones contain only the git object database (no working tree). They are:
Smaller than full clones (no checked-out files)
Permanent – stored in repo_clone_dir and reused across collection cycles
Updated via git fetch --all on subsequent runs
Full clones (temporary)
When the analysis phase needs to read file contents (for dependency scanning and code complexity), a full checkout is created locally from the bare clone:
git clone /path/to/bare.git /path/to/temp-checkout
This is a local operation (no network request). After analysis completes, the temporary checkout is deleted.
Disk usage
Bare clones: Permanent. Plan for 10 MB to 5+ GB per repo depending on history size.
Full clones: Temporary. Roughly double the bare clone size while they exist, then deleted.
For large instances (400K repos), bare clones can consume tens of terabytes.
Git log parsing
The facade phase runs git log with a custom format string to extract commit data.
Format string
The format uses custom field and record separators to reliably parse multi-line output:
git log --all --numstat --pretty=format:'<COMMIT>%H<SEP>%an<SEP>%ae<SEP>%ad<SEP>%cn<SEP>%ce<SEP>%cd<SEP>%P<SEP>%s'
Where:
| Placeholder | Field |
|---|---|
| %H | Full commit hash |
| %an | Author name |
| %ae | Author email |
| %ad | Author date |
| %cn | Committer name |
| %ce | Committer email |
| %cd | Committer date |
| %P | Parent hashes (space-separated) |
| %s | Subject line (commit message first line) |
The --numstat flag appends per-file statistics after each commit:
12 5 src/main.go
3 1 README.md
- - binary-file.bin
Each line shows lines added, lines removed, and the file path. Binary files show - for both counts.
Parsing logic
The parser:
Splits output on the <COMMIT> record separator
For each commit, splits the header on <SEP> field separators
Reads subsequent lines as numstat entries until the next commit
Handles binary files (lines added/removed = 0 when - is encountered)
Extracts date components for aggregate computation
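The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's parser: the names parse_git_log, ParsedCommit, and HEADER_FIELDS are hypothetical, numstat lines are assumed to be tab-separated (as git emits them), and date-component extraction is left out.

```python
from dataclasses import dataclass, field

COMMIT_MARK = "<COMMIT>"  # record separator from the format string
FIELD_SEP = "<SEP>"       # field separator from the format string
# Hypothetical field names, in the same order as the format string.
HEADER_FIELDS = ("hash", "author_name", "author_email", "author_date",
                 "committer_name", "committer_email", "committer_date",
                 "parents", "subject")


@dataclass
class ParsedCommit:
    header: dict
    files: list = field(default_factory=list)  # (added, removed, path)


def parse_git_log(output: str) -> list:
    commits = []
    for record in output.split(COMMIT_MARK):
        if not record.strip():
            continue  # text before the first marker is empty
        lines = record.splitlines()
        # The subject is the last field, so cap the split to keep any
        # literal "<SEP>" inside the subject intact.
        values = lines[0].split(FIELD_SEP, len(HEADER_FIELDS) - 1)
        commit = ParsedCommit(header=dict(zip(HEADER_FIELDS, values)))
        for line in lines[1:]:
            if not line.strip():
                continue  # blank line between commits
            added, removed, path = line.split("\t", 2)
            # Binary files report "-" for both counts; store them as 0.
            commit.files.append((
                0 if added == "-" else int(added),
                0 if removed == "-" else int(removed),
                path,
            ))
        commits.append(commit)
    return commits
```

Each ParsedCommit then expands into one row per entry in files, which is the shape the commits table expects.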
Per-file commit rows
Following Augur’s data model, the commits table stores one row per file per commit. A commit that touches 10 files produces 10 rows, all sharing the same cmt_commit_hash.
Columns populated from git log
| Column | Source | Description |
|---|---|---|
| repo_id | Context | The repo being collected |
| cmt_commit_hash | %H | Full SHA-1 hash |
| | %an | Author name |
| | %ae | Author email as-is |
| | %ae | Initially same as raw; updated by commit resolver |
| | %ad | Author date string |
| cmt_author_timestamp | Parsed from %ad | Parsed timestamp |
| | %cn | Committer name |
| | %ce | Committer email as-is |
| | %ce | Initially same as raw; updated by commit resolver |
| | %cd | Committer date string |
| | Parsed from %cd | Parsed timestamp |
| cmt_added | numstat | Lines added in this file |
| cmt_removed | numstat | Lines removed in this file |
| cmt_whitespace | Computed | Always 0 (reserved) |
| cmt_filename | numstat | File path |
Upsert behavior
Commits are upserted with ON CONFLICT (repo_id, cmt_commit_hash, cmt_filename) DO UPDATE. This means:
New commits are inserted
Existing commits are updated (e.g., after commit resolver fills in cmt_author_platform_username)
Re-running facade on the same repo is safe and idempotent
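The idempotence claim can be checked with a small, self-contained sketch. It uses Python's built-in sqlite3 as a stand-in for the real database, an unqualified commits table reduced to five columns, and a hypothetical upsert_rows helper; only the ON CONFLICT target is taken from the text above, and the SET list is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE commits (
        repo_id INTEGER,
        cmt_commit_hash TEXT,
        cmt_filename TEXT,
        cmt_added INTEGER,
        cmt_removed INTEGER,
        UNIQUE (repo_id, cmt_commit_hash, cmt_filename)
    )
""")

# Same conflict target as the documented upsert; the updated columns
# are illustrative only.
UPSERT = """
    INSERT INTO commits
        (repo_id, cmt_commit_hash, cmt_filename, cmt_added, cmt_removed)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (repo_id, cmt_commit_hash, cmt_filename)
    DO UPDATE SET cmt_added = excluded.cmt_added,
                  cmt_removed = excluded.cmt_removed
"""


def upsert_rows(rows):
    """Insert new per-file rows and update rows that already exist."""
    with conn:  # commit on success, roll back on error
        conn.executemany(UPSERT, rows)


def row_count():
    return conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
```

Calling upsert_rows twice with the same rows leaves row_count() unchanged, which is the property that makes a facade re-run safe.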
Commit parents
Parent-child relationships are extracted from the %P placeholder (space-separated parent hashes) and inserted into the commit_parents table.
INSERT INTO aveloxis_data.commit_parents (cmt_id, parent_id)
VALUES ($1, $2)
ON CONFLICT DO NOTHING;
This enables:
Reconstructing the commit DAG
Identifying merge commits (commits with 2+ parents)
Analyzing branching and merging patterns
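As a rough illustration of how the %P field maps to parent edges, the sketch below splits the field and flags merge commits. The function names are hypothetical, and it works on hashes only; translating hashes into the cmt_id and parent_id values used by commit_parents is outside its scope.

```python
def parent_hashes(parents_field: str) -> list:
    # str.split() with no argument returns [] for an empty string,
    # which is what a root commit (no parents) produces for %P.
    return parents_field.split()


def is_merge_commit(parents_field: str) -> bool:
    # A merge commit has two or more parents.
    return len(parent_hashes(parents_field)) >= 2


def parent_edges(commit_hash: str, parents_field: str) -> list:
    # One (child, parent) pair per parent: the edges of the commit DAG.
    return [(commit_hash, parent) for parent in parent_hashes(parents_field)]
```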
Commit messages
Commit messages are stored in the commit_messages table, deduplicated per repo and commit hash:
INSERT INTO aveloxis_data.commit_messages (repo_id, cmt_hash, cmt_msg)
VALUES ($1, $2, $3)
ON CONFLICT (repo_id, cmt_hash) DO NOTHING;
The stored message is the subject line (%s), the only message field in the format string. It is kept separately from the per-file commit rows to avoid duplicating message text across all file rows for the same commit.
Affiliation resolution
During facade processing, commit author and committer emails are matched against the contributor_affiliations table to resolve organizational affiliations.
Resolution logic
The affiliation resolver:
Loads all active rules from contributor_affiliations on first use (lazy initialization, cached in memory)
Extracts the domain from the email address (e.g., user@redhat.com -> redhat.com)
Exact domain match first (e.g., redhat.com -> Red Hat)
Parent domain fallback (e.g., mail.google.com -> google.com -> Google)
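The lookup order can be sketched as follows. This is an assumption-laden illustration rather than the actual resolver: rules is a plain dict standing in for the cached contributor_affiliations rows, the function names are hypothetical, domains are lowercased before matching, and the fallback walks up one label at a time but never matches on a bare top-level label.

```python
from typing import Optional


def email_domain(email: str) -> Optional[str]:
    # Take everything after the last "@"; return None for malformed input.
    _, sep, domain = email.rpartition("@")
    return domain.lower() if sep and domain else None


def resolve_affiliation(email: str, rules: dict) -> Optional[str]:
    domain = email_domain(email)
    if domain is None:
        return None
    labels = domain.split(".")
    # First candidate is the exact domain; later candidates drop one
    # leading label each (mail.google.com, then google.com).
    for i in range(max(1, len(labels) - 1)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return rules[candidate]
    return None  # caller leaves the affiliation column NULL
```

For example, with rules = {"redhat.com": "Red Hat", "google.com": "Google"}, an address at mail.google.com resolves through the parent-domain fallback.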
Populated columns
| Column | Value |
|---|---|
| cmt_author_affiliation | Organization name for the author’s email domain |
| | Organization name for the committer’s email domain |
If no match is found, these columns are left NULL.
Adding affiliations
Affiliations are stored in the contributor_affiliations table:
INSERT INTO aveloxis_data.contributor_affiliations
(ca_domain, ca_affiliation, ca_active)
VALUES ('redhat.com', 'Red Hat', 1);
After adding new affiliations, existing commits can be re-processed by re-running the facade phase for affected repos.
Facade aggregates
After all commits for a repo are inserted, aggregate tables are refreshed by SQL aggregation over the commits table.
Aggregate tables
| Table | Granularity | Key |
|---|---|---|
| dm_repo_annual | Year | (repo_id, email, affiliation, year) |
| | Month | (repo_id, email, affiliation, year, month) |
| | Week | (repo_id, email, affiliation, year, week) |
| | Year | (repo_group_id, email, affiliation, year) |
| | Month | (repo_group_id, email, affiliation, year, month) |
| | Week | (repo_group_id, email, affiliation, year, week) |
Aggregate columns
Each aggregate row contains:
| Column | Description |
|---|---|
| email | Contributor email |
| affiliation | Organizational affiliation |
| added | Total lines added |
| removed | Total lines removed |
| whitespace | Total whitespace changes |
| files | Distinct files changed |
| patches | Number of commits/patches |
Refresh SQL
The aggregates are computed by SQL queries that group commits rows by the appropriate time period. For example, the annual aggregate:
DELETE FROM aveloxis_data.dm_repo_annual WHERE repo_id = $1;
INSERT INTO aveloxis_data.dm_repo_annual
(repo_id, email, affiliation, year, added, removed, whitespace, files, patches)
SELECT
repo_id,
cmt_author_email,
cmt_author_affiliation,
EXTRACT(YEAR FROM cmt_author_timestamp)::SMALLINT,
SUM(cmt_added),
SUM(cmt_removed),
SUM(cmt_whitespace),
COUNT(DISTINCT cmt_filename),
COUNT(DISTINCT cmt_commit_hash)
FROM aveloxis_data.commits
WHERE repo_id = $1
AND cmt_author_timestamp IS NOT NULL
GROUP BY repo_id, cmt_author_email, cmt_author_affiliation,
EXTRACT(YEAR FROM cmt_author_timestamp);
Aggregates are refreshed per-repo after each facade run, not globally. This keeps the cost proportional to the repo’s commit count.
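Because the commits table holds one row per file per commit, it may help to see why files and patches are distinct counts rather than row counts. The sketch below mirrors the annual query in plain Python over in-memory rows; the function name and the dict keys are hypothetical, and the whitespace column is omitted since it is always 0.

```python
from collections import defaultdict


def annual_aggregates(rows):
    """Group per-file commit rows by (email, affiliation, year)."""
    groups = defaultdict(lambda: {"added": 0, "removed": 0,
                                  "files": set(), "hashes": set()})
    for row in rows:
        ts = row["timestamp"]
        if ts is None:  # mirrors: cmt_author_timestamp IS NOT NULL
            continue
        g = groups[(row["email"], row["affiliation"], ts.year)]
        g["added"] += row["added"]            # SUM(cmt_added)
        g["removed"] += row["removed"]        # SUM(cmt_removed)
        g["files"].add(row["filename"])       # COUNT(DISTINCT cmt_filename)
        g["hashes"].add(row["commit_hash"])   # COUNT(DISTINCT cmt_commit_hash)
    return {
        key: {"added": g["added"], "removed": g["removed"],
              "files": len(g["files"]), "patches": len(g["hashes"])}
        for key, g in groups.items()
    }
```

A commit touching two files contributes two rows to the sums but only one entry to patches.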
Resilience
Fetch failure recovery
If git fetch --all fails on an existing bare clone (e.g., due to corruption):
The existing bare clone is deleted
A fresh git clone --bare is attempted
If that also fails, the facade phase is skipped for this repo (logged as an error)
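The recovery sequence can be sketched as below. This is an illustration under assumptions, not the project's code: refresh_bare_clone and _git are hypothetical names, the git runner is injectable so the flow can be exercised without a network, and a False return stands for "skip the facade phase and log an error".

```python
import shutil
import subprocess
from pathlib import Path


def _git(*args) -> bool:
    """Run a git command and report whether it exited with status 0."""
    return subprocess.run(["git", *args], capture_output=True).returncode == 0


def refresh_bare_clone(url: str, bare_path: str, git=_git) -> bool:
    path = Path(bare_path)
    if path.exists():
        if git("-C", str(path), "fetch", "--all"):
            return True
        # Fetch failed (for example, a corrupted object store):
        # delete the existing bare clone and start over.
        shutil.rmtree(path, ignore_errors=True)
    # Either no clone existed yet or the old one was just deleted.
    if git("clone", "--bare", url, str(path)):
        return True
    return False  # caller skips the facade phase for this repo
```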
Incremental collection
On subsequent collection cycles, git fetch --all retrieves only new commits since the last fetch. The git log is re-parsed in full, but upserts with ON CONFLICT ensure only truly new data is inserted.
GitLab repo_info.commit_count backfill (v0.16.9+)
GitLab’s GET /projects/:id?statistics=true sometimes reports commit_count = 0 even for non-empty projects:
The statistics object is omitted when the token lacks Reporter+ access on a private project, or on self-managed instances with custom permission rules.
statistics.commit_count is populated by a GitLab background worker, so freshly-imported, mirrored, or recently-pushed-to projects can report 0 until the worker runs. This is common for pull-mirror setups.
After a successful facade run for a PlatformGitLab repo, the scheduler calls store.BackfillGitLabCommitCount(repoID), which:
Reads SELECT COUNT(DISTINCT cmt_commit_hash) FROM aveloxis_data.commits WHERE repo_id = $1, which is the facade’s ground truth.
If that gathered count is non-zero, runs UPDATE aveloxis_data.repo_info SET commit_count = <gathered> WHERE repo_id = $1 AND commit_count = 0 AND repo_info_id = <latest>. This touches only the most recent snapshot, and only when the API value is explicitly 0.
Safety properties:
Never overwrites a real non-zero API count (WHERE commit_count = 0 guard).
Never writes a no-op zero (short-circuit when gathered is 0).
Idempotent: the second call after a successful backfill touches zero rows because commit_count is no longer 0.
Scope is GitLab only: GitHub repos skip the call entirely, so that path is byte-for-byte unchanged.
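The guard conditions can be exercised with a small sketch against an in-memory SQLite database. This is not store.BackfillGitLabCommitCount itself: the Python function, the make_demo_db helper, and the reduced table definitions are hypothetical, table names are unqualified, and "latest snapshot" is assumed to mean the highest repo_info_id for the repo.

```python
import sqlite3


def make_demo_db():
    """Create throwaway tables with only the columns the sketch needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE commits (
            repo_id INTEGER, cmt_commit_hash TEXT, cmt_filename TEXT);
        CREATE TABLE repo_info (
            repo_info_id INTEGER PRIMARY KEY,
            repo_id INTEGER, commit_count INTEGER);
    """)
    return conn


def backfill_commit_count(conn, repo_id: int) -> int:
    """Return the number of repo_info rows updated (0 or 1)."""
    gathered = conn.execute(
        "SELECT COUNT(DISTINCT cmt_commit_hash) FROM commits WHERE repo_id = ?",
        (repo_id,),
    ).fetchone()[0]
    if gathered == 0:
        return 0  # short-circuit: never write a no-op zero
    cur = conn.execute(
        """
        UPDATE repo_info SET commit_count = ?
        WHERE repo_id = ?
          AND commit_count = 0          -- never overwrite a real API count
          AND repo_info_id = (SELECT MAX(repo_info_id)
                              FROM repo_info WHERE repo_id = ?)
        """,
        (gathered, repo_id, repo_id),
    )
    conn.commit()
    return cur.rowcount
```

A second call for the same repo returns 0, because the latest snapshot no longer has commit_count = 0.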
Observability: gitlab.Client.FetchRepoInfo now logs a WARN when statistics is nil (“token may lack Reporter+ access”) and an INFO when commit_count = 0 (“will backfill from facade if non-empty”). The scheduler logs gitlab commit_count backfilled from facade with the repo ID when the UPDATE actually writes a row.
Next steps
Contributor Resolution – how commit authors are resolved to GitHub users
Analysis – dependency scanning and code complexity from full clones
Overview – system architecture overview