Facade Commits

The facade phase extracts commit data from git repositories using git log. This page covers the bare clone design, log parsing, per-file commit rows, and aggregate computation.

Bare clone vs full clone

Aveloxis uses two types of git clones for different purposes:

Type	Command	Persistence	Purpose
Bare clone	`git clone --bare`	Permanent	Facade phase (git log parsing)
Full clone	Local checkout from bare clone	Temporary	Analysis phase (dependency scanning, scc)

Bare clones

Bare clones contain only the git object database (no working tree). They are:

Smaller than full clones (no checked-out files)
Permanent – stored in repo_clone_dir and reused across collection cycles
Updated via git fetch --all on subsequent runs

Full clones (temporary)

When the analysis phase needs to read file contents (for dependency scanning and code complexity), a full checkout is created locally from the bare clone:

git clone /path/to/bare.git /path/to/temp-checkout

This is a local operation (no network request). After analysis completes, the temporary checkout is deleted.

Disk usage

Bare clones: Permanent. Plan for 10 MB to 5+ GB per repo depending on history size.
Full clones: Temporary. Roughly double the bare clone size while they exist, then deleted.

For large instances (400K repos), bare clones can consume tens of terabytes.

Git log parsing

The facade phase runs git log with a custom format string to extract commit data.

Format string

The format uses custom field and record separators to reliably parse multi-line output:

git log --all --numstat --pretty=format:'<COMMIT>%H<SEP>%an<SEP>%ae<SEP>%ad<SEP>%cn<SEP>%ce<SEP>%cd<SEP>%P<SEP>%s'

Where:

Placeholder	Field
`%H`	Full commit hash
`%an`	Author name
`%ae`	Author email
`%ad`	Author date
`%cn`	Committer name
`%ce`	Committer email
`%cd`	Committer date
`%P`	Parent hashes (space-separated)
`%s`	Subject line (commit message first line)

The --numstat flag appends per-file statistics after each commit:

12    5    src/main.go
3     1    README.md
-     -    binary-file.bin

Each line shows lines added, lines removed, and the file path; in git's raw output the three fields are separated by tabs. Binary files show - for both counts.
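A single numstat line could be parsed along these lines. This is a sketch under assumptions: `fileStat` and `parseNumstat` are hypothetical names, and the code relies on git separating the three fields with tab characters, which lets paths containing spaces survive intact.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fileStat is a hypothetical type holding one --numstat entry.
type fileStat struct {
	Added, Removed int
	Path           string
	Binary         bool
}

// parseNumstat parses one "added<TAB>removed<TAB>path" line.
func parseNumstat(line string) (fileStat, error) {
	parts := strings.SplitN(line, "\t", 3)
	if len(parts) != 3 {
		return fileStat{}, fmt.Errorf("malformed numstat line: %q", line)
	}
	fs := fileStat{Path: parts[2]}
	if parts[0] == "-" && parts[1] == "-" {
		fs.Binary = true // binary file: both counts stay 0
		return fs, nil
	}
	var err error
	if fs.Added, err = strconv.Atoi(parts[0]); err != nil {
		return fileStat{}, fmt.Errorf("added count: %w", err)
	}
	if fs.Removed, err = strconv.Atoi(parts[1]); err != nil {
		return fileStat{}, fmt.Errorf("removed count: %w", err)
	}
	return fs, nil
}

func main() {
	for _, l := range []string{"12\t5\tsrc/main.go", "-\t-\tbinary-file.bin"} {
		fs, err := parseNumstat(l)
		fmt.Printf("%+v err=%v\n", fs, err)
	}
}
```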

Parsing logic

The parser:

Splits output on the <COMMIT> record separator
For each commit, splits the header on <SEP> field separators
Reads subsequent lines as numstat entries until the next commit
Records 0 lines added and 0 lines removed for binary files (where numstat reports -)
Extracts date components for aggregate computation
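The first three steps can be sketched as below. This is an illustration, not the actual parser: `rawCommit` and `parseLog` are hypothetical names, and the sketch assumes the literal `<COMMIT>` and `<SEP>` markers and the nine header fields of the format string above. Blank lines between records are skipped, and numstat lines are kept as raw strings.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	recordSep  = "<COMMIT>"
	fieldSep   = "<SEP>"
	fieldCount = 9 // %H %an %ae %ad %cn %ce %cd %P %s
)

// rawCommit is a hypothetical holder for one parsed record.
type rawCommit struct {
	Header  []string // the nine header fields, in format-string order
	Numstat []string // raw numstat lines, still unparsed
}

// parseLog splits git log output into records, then each record into its
// header fields and the numstat lines that follow the header.
func parseLog(out string) ([]rawCommit, error) {
	var commits []rawCommit
	for _, rec := range strings.Split(out, recordSep) {
		if strings.TrimSpace(rec) == "" {
			continue // nothing precedes the first separator
		}
		lines := strings.Split(rec, "\n")
		// SplitN keeps any "<SEP>" that appears inside the subject intact.
		header := strings.SplitN(lines[0], fieldSep, fieldCount)
		if len(header) != fieldCount {
			return nil, fmt.Errorf("expected %d header fields, got %d", fieldCount, len(header))
		}
		c := rawCommit{Header: header}
		for _, l := range lines[1:] {
			if strings.TrimSpace(l) != "" {
				c.Numstat = append(c.Numstat, l)
			}
		}
		commits = append(commits, c)
	}
	return commits, nil
}

func main() {
	sample := "<COMMIT>abc<SEP>Ann<SEP>ann@example.com<SEP>d1<SEP>Ann<SEP>ann@example.com<SEP>d1<SEP><SEP>Initial commit\n3\t1\tREADME.md\n"
	commits, err := parseLog(sample)
	fmt.Println(len(commits), err)
}
```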

Per-file commit rows

Following Augur’s data model, the commits table stores one row per file per commit. A commit that touches 10 files produces 10 rows, all sharing the same cmt_commit_hash.
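The one-row-per-file expansion can be pictured with the sketch below. The `commitRow` and `fileStat` types and the `rowsForCommit` helper are hypothetical and cover only a handful of the columns. How commits without any file entries are stored is not covered here; the sketch simply yields no rows for them.

```go
package main

import "fmt"

// fileStat is a hypothetical per-file numstat entry.
type fileStat struct {
	Path           string
	Added, Removed int
}

// commitRow mirrors a few columns of the commits table.
type commitRow struct {
	RepoID     int64
	CommitHash string // cmt_commit_hash
	Filename   string // cmt_filename
	Added      int    // cmt_added
	Removed    int    // cmt_removed
}

// rowsForCommit expands one commit into one row per touched file. Every row
// carries the same commit hash.
func rowsForCommit(repoID int64, hash string, files []fileStat) []commitRow {
	rows := make([]commitRow, 0, len(files))
	for _, f := range files {
		rows = append(rows, commitRow{
			RepoID:     repoID,
			CommitHash: hash,
			Filename:   f.Path,
			Added:      f.Added,
			Removed:    f.Removed,
		})
	}
	return rows
}

func main() {
	rows := rowsForCommit(1, "abc123", []fileStat{
		{Path: "src/main.go", Added: 12, Removed: 5},
		{Path: "README.md", Added: 3, Removed: 1},
	})
	for _, r := range rows {
		fmt.Printf("%+v\n", r)
	}
}
```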

Columns populated from git log

Column	Source	Description
`repo_id`	Context	The repo being collected
`cmt_commit_hash`	`%H`	Full SHA-1 hash
`cmt_author_name`	`%an`	Author name
`cmt_author_raw_email`	`%ae`	Author email as-is
`cmt_author_email`	`%ae`	Initially same as raw; updated by commit resolver
`cmt_author_date`	`%ad`	Author date string
`cmt_author_timestamp`	Parsed from `%ad`	Parsed timestamp
`cmt_committer_name`	`%cn`	Committer name
`cmt_committer_raw_email`	`%ce`	Committer email as-is
`cmt_committer_email`	`%ce`	Initially same as raw; updated by commit resolver
`cmt_committer_date`	`%cd`	Committer date string
`cmt_committer_timestamp`	Parsed from `%cd`	Parsed timestamp
`cmt_added`	numstat	Lines added in this file
`cmt_removed`	numstat	Lines removed in this file
`cmt_whitespace`	Computed	Always 0 (reserved)
`cmt_filename`	numstat	File path

Upsert behavior

Commits are upserted with ON CONFLICT (repo_id, cmt_commit_hash, cmt_filename) DO UPDATE. This means:

New commits are inserted
Existing commits are updated (e.g., after commit resolver fills in cmt_author_platform_username)
Re-running facade on the same repo is safe and idempotent

Commit parents

Parent-child relationships are extracted from the %P placeholder (space-separated parent hashes) and inserted into the commit_parents table.

INSERT INTO aveloxis_data.commit_parents (cmt_id, parent_id)
VALUES ($1, $2)
ON CONFLICT DO NOTHING;

This enables:

Reconstructing the commit DAG
Identifying merge commits (commits with 2+ parents)
Analyzing branching and merging patterns
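Splitting the `%P` field and detecting merges takes only a few lines of Go. The helper names `parentHashes` and `isMergeCommit` are hypothetical; the sketch relies only on the space-separated format described above.

```go
package main

import (
	"fmt"
	"strings"
)

// parentHashes splits the %P field into individual parent hashes.
// A root commit has an empty %P and yields no parents.
func parentHashes(p string) []string {
	return strings.Fields(p)
}

// isMergeCommit reports whether the commit has two or more parents.
func isMergeCommit(p string) bool {
	return len(parentHashes(p)) >= 2
}

func main() {
	for _, p := range []string{"", "a1b2c3", "a1b2c3 d4e5f6"} {
		fmt.Printf("%q parents=%d merge=%v\n", p, len(parentHashes(p)), isMergeCommit(p))
	}
}
```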

Commit messages

Commit messages are stored in the commit_messages table, with one row per repo and commit hash:

INSERT INTO aveloxis_data.commit_messages (repo_id, cmt_hash, cmt_msg)
VALUES ($1, $2, $3)
ON CONFLICT (repo_id, cmt_hash) DO NOTHING;

The subject line (%s) is used for the message. This is stored separately from the per-file commit rows to avoid duplicating message text across all file rows for the same commit.

Affiliation resolution

During facade processing, commit author and committer emails are matched against the contributor_affiliations table to resolve organizational affiliations.

Resolution logic

The affiliation resolver:

Loads all active rules from contributor_affiliations on first use (lazy initialization, cached in memory)
Extracts the domain from the email address (e.g., user@redhat.com -> redhat.com)
Exact domain match first (e.g., redhat.com -> Red Hat)
Parent domain fallback (e.g., mail.google.com -> google.com -> Google)
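The lookup order above can be sketched as follows. This is not the actual resolver: the `resolver` type and its `load` callback are hypothetical stand-ins for the query against contributor_affiliations, and lowercasing the domain before the lookup is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// resolver caches domain-to-organization rules after the first lookup.
type resolver struct {
	once  sync.Once
	load  func() map[string]string // hypothetical loader for active rules
	rules map[string]string
}

// Resolve returns the organization for an email address. It tries the exact
// domain first, then strips one leading label at a time to try parent domains.
func (r *resolver) Resolve(email string) (string, bool) {
	r.once.Do(func() { r.rules = r.load() }) // lazy initialization

	at := strings.LastIndex(email, "@")
	if at < 0 || at == len(email)-1 {
		return "", false
	}
	domain := strings.ToLower(email[at+1:]) // assumption: case-insensitive match
	for domain != "" {
		if org, ok := r.rules[domain]; ok {
			return org, true
		}
		dot := strings.Index(domain, ".")
		if dot < 0 {
			break
		}
		domain = domain[dot+1:] // e.g. mail.google.com -> google.com
	}
	return "", false
}

func main() {
	r := &resolver{load: func() map[string]string {
		return map[string]string{"redhat.com": "Red Hat", "google.com": "Google"}
	}}
	for _, e := range []string{"user@redhat.com", "someone@mail.google.com", "x@example.org"} {
		org, ok := r.Resolve(e)
		fmt.Printf("%s -> %q (matched=%v)\n", e, org, ok)
	}
}
```

When the second return value is false, the caller would leave the affiliation column NULL.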

Populated columns

Column	Value
`cmt_author_affiliation`	Organization name for the author’s email domain
`cmt_committer_affiliation`	Organization name for the committer’s email domain

If no match is found, these columns are left NULL.

Adding affiliations

Affiliations are stored in the contributor_affiliations table:

INSERT INTO aveloxis_data.contributor_affiliations
  (ca_domain, ca_affiliation, ca_active)
VALUES ('redhat.com', 'Red Hat', 1);

After adding new affiliations, existing commits can be re-processed by re-running the facade phase for affected repos.

Facade aggregates

After all commits for a repo are inserted, aggregate tables are refreshed by SQL aggregation over the commits table.

Aggregate tables

Table	Granularity	Key
`dm_repo_annual`	Year	(repo_id, email, affiliation, year)
`dm_repo_monthly`	Month	(repo_id, email, affiliation, year, month)
`dm_repo_weekly`	Week	(repo_id, email, affiliation, year, week)
`dm_repo_group_annual`	Year	(repo_group_id, email, affiliation, year)
`dm_repo_group_monthly`	Month	(repo_group_id, email, affiliation, year, month)
`dm_repo_group_weekly`	Week	(repo_group_id, email, affiliation, year, week)

Aggregate columns

Each aggregate row contains:

Column	Description
`email`	Contributor email
`affiliation`	Organizational affiliation
`added`	Total lines added
`removed`	Total lines removed
`whitespace`	Total whitespace changes
`files`	Distinct files changed
`patches`	Number of commits/patches

Refresh SQL

The aggregates are computed by SQL queries that group commits rows by the appropriate time period. For example, the annual aggregate:

DELETE FROM aveloxis_data.dm_repo_annual WHERE repo_id = $1;

INSERT INTO aveloxis_data.dm_repo_annual
  (repo_id, email, affiliation, year, added, removed, whitespace, files, patches)
SELECT
  repo_id,
  cmt_author_email,
  cmt_author_affiliation,
  EXTRACT(YEAR FROM cmt_author_timestamp)::SMALLINT,
  SUM(cmt_added),
  SUM(cmt_removed),
  SUM(cmt_whitespace),
  COUNT(DISTINCT cmt_filename),
  COUNT(DISTINCT cmt_commit_hash)
FROM aveloxis_data.commits
WHERE repo_id = $1
  AND cmt_author_timestamp IS NOT NULL
GROUP BY repo_id, cmt_author_email, cmt_author_affiliation,
         EXTRACT(YEAR FROM cmt_author_timestamp);

Aggregates are refreshed per-repo after each facade run, not globally. This keeps the cost proportional to the repo’s commit count.

Resilience

Fetch failure recovery

If git fetch --all fails on an existing bare clone (e.g., due to corruption):

The existing bare clone is deleted
A fresh git clone --bare is attempted
If that also fails, the facade phase is skipped for this repo (logged as an error)

Incremental collection

On subsequent collection cycles, git fetch --all retrieves only the objects added since the last fetch. The git log is still re-parsed in full, but because commits are upserted with ON CONFLICT, rows that already exist are updated in place rather than duplicated, so only new commits add rows.

GitLab `repo_info.commit_count` backfill (v0.16.9+)

GitLab’s GET /projects/:id?statistics=true endpoint can report commit_count = 0 even for non-empty projects, for two reasons:

The statistics object is omitted when the token lacks Reporter+ access on a private project, or on self-managed instances with custom permission rules.
statistics.commit_count is populated by a GitLab background worker, so freshly-imported, mirrored, or recently-pushed-to projects can report 0 until the worker runs. This is common for pull-mirror setups.

After a successful facade run for a PlatformGitLab repo, the scheduler calls store.BackfillGitLabCommitCount(repoID), which:

Reads SELECT COUNT(DISTINCT cmt_commit_hash) FROM aveloxis_data.commits WHERE repo_id = $1 — the facade’s ground truth.
If that gathered count is non-zero, UPDATE aveloxis_data.repo_info SET commit_count = <gathered> WHERE repo_id = $1 AND commit_count = 0 AND repo_info_id = <latest> — only the most recent snapshot, only when the API value is explicitly 0.

Safety properties:

Never overwrites a real non-zero API count (WHERE commit_count = 0 guard).
Never writes a no-op zero (short-circuit when gathered is 0).
Idempotent: the second call after a successful backfill touches zero rows because commit_count is no longer 0.
Scope is GitLab only — GitHub repos skip the call entirely so that path is byte-for-byte unchanged.
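The first three safety properties can be modelled as a small pure function. This is an in-memory sketch of the decision only, with a hypothetical name (`backfillCount`); in the description above, the zero guard is enforced by the WHERE clause of the UPDATE rather than in Go.

```go
package main

import "fmt"

// backfillCount decides whether the stored commit count should be replaced
// by the count gathered by the facade. It returns the resulting value and
// whether a write would happen.
func backfillCount(apiCount, gathered int64) (int64, bool) {
	if gathered == 0 {
		return apiCount, false // never write a no-op zero
	}
	if apiCount != 0 {
		return apiCount, false // never overwrite a real API count
	}
	return gathered, true
}

func main() {
	cases := [][2]int64{{0, 42}, {10, 42}, {0, 0}, {42, 42}}
	for _, c := range cases {
		v, wrote := backfillCount(c[0], c[1])
		fmt.Printf("api=%d gathered=%d -> count=%d wrote=%v\n", c[0], c[1], v, wrote)
	}
}
```

Feeding the result of a successful backfill back into the function produces no further write, which mirrors the idempotence property.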

Observability: gitlab.Client.FetchRepoInfo logs a WARN when statistics is nil (“token may lack Reporter+ access”) and an INFO when commit_count = 0 (“will backfill from facade if non-empty”). When the UPDATE actually writes a row, the scheduler logs “gitlab commit_count backfilled from facade” together with the repo ID.

Next steps

Contributor Resolution – how commit authors are resolved to GitHub users
Analysis – dependency scanning and code complexity from full clones
Overview – system architecture overview