# Facade Commits

The facade phase extracts commit data from git repositories using `git log`. This page covers the bare clone design, log parsing, per-file commit rows, commit parents and messages, affiliation resolution, aggregate computation, and resilience.

---

## Bare clone vs full clone

Aveloxis uses two types of git clones for different purposes:

| Type | Command | Persistence | Purpose |
|---|---|---|---|
| **Bare clone** | `git clone --bare` | Permanent | Facade phase (git log parsing) |
| **Full clone** | Local checkout from bare clone | Temporary | Analysis phase (dependency scanning, scc) |

### Bare clones

Bare clones contain only the git object database (no working tree). They are:

- **Smaller** than full clones (no checked-out files)
- **Permanent** -- stored in `repo_clone_dir` and reused across collection cycles
- **Updated** via `git fetch --all` on subsequent runs

### Full clones (temporary)

When the analysis phase needs to read file contents (for dependency scanning and code complexity), a full checkout is created locally from the bare clone:

```bash
git clone /path/to/bare.git /path/to/temp-checkout
```

This is a local operation (no network request). After analysis completes, the temporary checkout is deleted.

### Disk usage

- **Bare clones:** Permanent. Plan for 10 MB to 5+ GB per repo depending on history size.
- **Full clones:** Temporary. Roughly double the bare clone size while they exist, then deleted.

For large instances (400K repos), bare clones can consume tens of terabytes.
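
As a rough planning aid, the arithmetic behind that estimate can be sketched as follows. The 50 MB average is a hypothetical figure chosen for illustration, not a measured Aveloxis value; substitute an average observed on your own instance.

```python
def estimate_bare_clone_tb(repo_count: int, avg_clone_mb: float) -> float:
    """Estimate total bare clone storage in decimal terabytes (1 TB = 1,000,000 MB)."""
    return repo_count * avg_clone_mb / 1_000_000


# 400K repos at a hypothetical 50 MB average (an assumption, not a measurement)
print(estimate_bare_clone_tb(400_000, 50))
```

Temporary full clones add to this only while an analysis is running, so peak usage depends on how many analyses run concurrently.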

---

## Git log parsing

The facade phase runs `git log` with a custom format string to extract commit data.

### Format string

The format uses custom field and record separators to reliably parse multi-line output:

```bash
git log --all --numstat --pretty=format:'<COMMIT>%H<SEP>%an<SEP>%ae<SEP>%ad<SEP>%cn<SEP>%ce<SEP>%cd<SEP>%P<SEP>%s'
```

Where:

| Placeholder | Field |
|---|---|
| `%H` | Full commit hash |
| `%an` | Author name |
| `%ae` | Author email |
| `%ad` | Author date |
| `%cn` | Committer name |
| `%ce` | Committer email |
| `%cd` | Committer date |
| `%P` | Parent hashes (space-separated) |
| `%s` | Subject line (commit message first line) |

The `--numstat` flag appends per-file statistics after each commit:

```
12    5    src/main.go
3     1    README.md
-     -    binary-file.bin
```

Each line shows lines added, lines removed, and the file path; in git's actual output the three fields are separated by tabs. Binary files show `-` for both counts.

### Parsing logic

The parser:

1. Splits output on the `<COMMIT>` record separator
2. For each commit, splits the header on `<SEP>` field separators
3. Reads subsequent lines as numstat entries until the next commit
4. Records 0 lines added and 0 lines removed for binary files, which numstat reports as `-`
5. Extracts date components for aggregate computation
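
The first four steps can be sketched in Python as below. This is an illustrative sketch, not the Aveloxis parser: the names `Commit`, `FileStat`, and `parse_git_log` are hypothetical, the date values in the sample are placeholders, and date extraction (step 5) is left out because the date format depends on how `git log` is invoked.

```python
from dataclasses import dataclass, field

COMMIT_MARK = "<COMMIT>"
FIELD_SEP = "<SEP>"
# Order matches the placeholders in the format string (%H, %an, ..., %s)
HEADER_FIELDS = ["hash", "author_name", "author_email", "author_date",
                 "committer_name", "committer_email", "committer_date",
                 "parents", "subject"]


@dataclass
class FileStat:
    added: int
    removed: int
    path: str


@dataclass
class Commit:
    header: dict
    files: list = field(default_factory=list)


def parse_count(value: str) -> int:
    # numstat prints "-" for binary files; record that as 0 lines
    return 0 if value == "-" else int(value)


def parse_git_log(output: str) -> list:
    commits = []
    # Step 1: split on the record separator; the text before the first marker is empty
    for record in output.split(COMMIT_MARK)[1:]:
        lines = record.splitlines()
        if not lines:
            continue
        # Step 2: the first line is the header; maxsplit keeps a literal
        # "<SEP>" inside the subject from producing extra fields
        values = lines[0].split(FIELD_SEP, len(HEADER_FIELDS) - 1)
        commit = Commit(header=dict(zip(HEADER_FIELDS, values)))
        # Steps 3-4: the remaining non-blank lines are tab-separated numstat entries
        for line in lines[1:]:
            if not line.strip():
                continue
            added, removed, path = line.split("\t", 2)
            commit.files.append(FileStat(parse_count(added), parse_count(removed), path))
        commits.append(commit)
    return commits


# Hypothetical sample input; dates are placeholders
sample = (
    "<COMMIT>abc123<SEP>Ada<SEP>ada@example.com<SEP>2024-01-01<SEP>"
    "Ada<SEP>ada@example.com<SEP>2024-01-01<SEP>def456<SEP>Fix parser\n"
    "12\t5\tsrc/main.go\n"
    "-\t-\tbinary-file.bin\n"
)
for commit in parse_git_log(sample):
    print(commit.header["hash"], [(f.path, f.added, f.removed) for f in commit.files])
```

The sketch does not handle renames, which numstat can report with an `old => new` path form.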

---

## Per-file commit rows

Following Augur's data model, the `commits` table stores **one row per file per commit**. A commit that touches 10 files produces 10 rows, all sharing the same `cmt_commit_hash`.

### Columns populated from git log

| Column | Source | Description |
|---|---|---|
| `repo_id` | Context | The repo being collected |
| `cmt_commit_hash` | `%H` | Full SHA-1 hash |
| `cmt_author_name` | `%an` | Author name |
| `cmt_author_raw_email` | `%ae` | Author email as-is |
| `cmt_author_email` | `%ae` | Initially same as raw; updated by commit resolver |
| `cmt_author_date` | `%ad` | Author date string |
| `cmt_author_timestamp` | Parsed from `%ad` | Parsed timestamp |
| `cmt_committer_name` | `%cn` | Committer name |
| `cmt_committer_raw_email` | `%ce` | Committer email as-is |
| `cmt_committer_email` | `%ce` | Initially same as raw; updated by commit resolver |
| `cmt_committer_date` | `%cd` | Committer date string |
| `cmt_committer_timestamp` | Parsed from `%cd` | Parsed timestamp |
| `cmt_added` | numstat | Lines added in this file |
| `cmt_removed` | numstat | Lines removed in this file |
| `cmt_whitespace` | Computed | Always 0 (reserved) |
| `cmt_filename` | numstat | File path |

### Upsert behavior

Commits are upserted with `ON CONFLICT (repo_id, cmt_commit_hash, cmt_filename) DO UPDATE`. This means:

- New commits are inserted
- Existing commits are updated (e.g., after commit resolver fills in `cmt_author_platform_username`)
- Re-running facade on the same repo is safe and idempotent
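
The idempotency property can be demonstrated with a small Python sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL (SQLite 3.24+ accepts the same `ON CONFLICT ... DO UPDATE` form), a reduced column set, and an assumed list of updated columns; the real statement's `SET` clause is not shown on this page.

```python
import sqlite3

# Reduced, hypothetical version of the commits table with the same conflict key
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE commits (
        repo_id INTEGER, cmt_commit_hash TEXT, cmt_filename TEXT,
        cmt_added INTEGER, cmt_removed INTEGER,
        UNIQUE (repo_id, cmt_commit_hash, cmt_filename)
    )""")

UPSERT = """
    INSERT INTO commits (repo_id, cmt_commit_hash, cmt_filename, cmt_added, cmt_removed)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (repo_id, cmt_commit_hash, cmt_filename)
    DO UPDATE SET cmt_added = excluded.cmt_added, cmt_removed = excluded.cmt_removed
"""


def run_facade(conn, rows) -> int:
    """Upsert all rows and return the resulting row count."""
    conn.executemany(UPSERT, rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]


rows = [(1, "abc123", "src/main.go", 12, 5), (1, "abc123", "README.md", 3, 1)]
first = run_facade(conn, rows)
second = run_facade(conn, rows)  # same input again: rows are updated, not duplicated
print(first, second)
```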

---

## Commit parents

Parent-child relationships are extracted from the `%P` placeholder (space-separated parent hashes) and inserted into the `commit_parents` table.

```sql
INSERT INTO aveloxis_data.commit_parents (cmt_id, parent_id)
VALUES ($1, $2)
ON CONFLICT DO NOTHING;
```

This enables:

- Reconstructing the commit DAG
- Identifying merge commits (commits with 2+ parents)
- Analyzing branching and merging patterns
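
A minimal Python sketch of how the `%P` value maps to parent edges and merge detection. The function names are hypothetical, and the edges here are expressed as hashes; how Aveloxis maps hashes to the `cmt_id` and `parent_id` values is not covered on this page.

```python
def split_parents(parents_field: str) -> list:
    """Split the %P value into parent hashes; a root commit yields an empty list."""
    return parents_field.split()


def parent_edges(commit_hash: str, parents_field: str) -> list:
    """Return one (child, parent) pair per parent, suitable for one insert each."""
    return [(commit_hash, parent) for parent in split_parents(parents_field)]


def is_merge(parents_field: str) -> bool:
    """A merge commit has two or more parents."""
    return len(split_parents(parents_field)) >= 2


print(parent_edges("abc123", "def456 0a1b2c"))
print(is_merge("def456 0a1b2c"), is_merge("def456"), is_merge(""))
```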

---

## Commit messages

Commit messages are stored in the `commit_messages` table, deduplicated per repo and commit hash:

```sql
INSERT INTO aveloxis_data.commit_messages (repo_id, cmt_hash, cmt_msg)
VALUES ($1, $2, $3)
ON CONFLICT (repo_id, cmt_hash) DO NOTHING;
```

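The stored message is the subject line (`%s`), the only message field the format string captures, so `cmt_msg` holds the first line of the commit message rather than its full body. Messages are stored separately from the per-file commit rows to avoid duplicating the text across every file row for the same commit.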

---

## Affiliation resolution

During facade processing, commit author and committer emails are matched against the `contributor_affiliations` table to resolve organizational affiliations.

### Resolution logic

The affiliation resolver:

1. **Loads all active rules** from `contributor_affiliations` on first use (lazy initialization, cached in memory)
2. **Extracts the domain** from the email address (e.g., `user@redhat.com` -> `redhat.com`)
3. **Exact domain match first** (e.g., `redhat.com` -> `Red Hat`)
4. **Parent domain fallback** (e.g., `mail.google.com` -> `google.com` -> `Google`)
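
The matching order in steps 2 to 4 can be sketched as follows. The `rules` dict stands in for the cached rule set; lowercasing the domain and stopping before the bare top-level domain are assumptions of this sketch, not documented Aveloxis behavior.

```python
def resolve_affiliation(email: str, rules: dict):
    """Return the affiliation for an email address, or None when no rule matches."""
    _, at, domain = email.rpartition("@")
    if not at or not domain:
        return None  # not a usable email address
    labels = domain.lower().split(".")  # lowercasing is an assumption
    # Try the exact domain first, then drop the leftmost label one at a time.
    # The loop stops before the bare TLD ("com"), which is never used as a key.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return rules[candidate]
    return None


rules = {"redhat.com": "Red Hat", "google.com": "Google"}
print(resolve_affiliation("user@redhat.com", rules))
print(resolve_affiliation("user@mail.google.com", rules))
print(resolve_affiliation("user@example.org", rules))
```

Because the exact domain is tried before any parent, a more specific rule for a subdomain takes precedence over a rule for its parent domain.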

### Populated columns

| Column | Value |
|---|---|
| `cmt_author_affiliation` | Organization name for the author's email domain |
| `cmt_committer_affiliation` | Organization name for the committer's email domain |

If no match is found, these columns are left `NULL`.

### Adding affiliations

Affiliations are stored in the `contributor_affiliations` table:

```sql
INSERT INTO aveloxis_data.contributor_affiliations
  (ca_domain, ca_affiliation, ca_active)
VALUES ('redhat.com', 'Red Hat', 1);
```

After adding new affiliations, existing commits can be re-processed by re-running the facade phase for affected repos.

---

## Facade aggregates

After all commits for a repo are inserted, aggregate tables are refreshed by SQL aggregation over the `commits` table.

### Aggregate tables

| Table | Granularity | Key |
|---|---|---|
| `dm_repo_annual` | Year | (repo_id, email, affiliation, year) |
| `dm_repo_monthly` | Month | (repo_id, email, affiliation, year, month) |
| `dm_repo_weekly` | Week | (repo_id, email, affiliation, year, week) |
| `dm_repo_group_annual` | Year | (repo_group_id, email, affiliation, year) |
| `dm_repo_group_monthly` | Month | (repo_group_id, email, affiliation, year, month) |
| `dm_repo_group_weekly` | Week | (repo_group_id, email, affiliation, year, week) |

### Aggregate columns

Each aggregate row contains:

| Column | Description |
|---|---|
| `email` | Contributor email |
| `affiliation` | Organizational affiliation |
| `added` | Total lines added |
| `removed` | Total lines removed |
| `whitespace` | Total whitespace changes |
| `files` | Distinct files changed |
| `patches` | Number of distinct commits (patches) |

### Refresh SQL

The aggregates are computed by SQL queries that group `commits` rows by the appropriate time period. For example, the annual aggregate:

```sql
DELETE FROM aveloxis_data.dm_repo_annual WHERE repo_id = $1;

INSERT INTO aveloxis_data.dm_repo_annual
  (repo_id, email, affiliation, year, added, removed, whitespace, files, patches)
SELECT
  repo_id,
  cmt_author_email,
  cmt_author_affiliation,
  EXTRACT(YEAR FROM cmt_author_timestamp)::SMALLINT,
  SUM(cmt_added),
  SUM(cmt_removed),
  SUM(cmt_whitespace),
  COUNT(DISTINCT cmt_filename),
  COUNT(DISTINCT cmt_commit_hash)
FROM aveloxis_data.commits
WHERE repo_id = $1
  AND cmt_author_timestamp IS NOT NULL
GROUP BY repo_id, cmt_author_email, cmt_author_affiliation,
         EXTRACT(YEAR FROM cmt_author_timestamp);
```

Aggregates are refreshed per-repo after each facade run, not globally. This keeps the cost proportional to the repo's commit count.
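
To make the grouping concrete, the following Python sketch reproduces the annual query's logic over in-memory rows. It is an illustration of what the SQL computes, not code Aveloxis runs; the helper `make_row` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def annual_aggregates(rows):
    """Group per-file commit rows by (repo_id, email, affiliation, year)."""
    groups = defaultdict(lambda: {"added": 0, "removed": 0, "whitespace": 0,
                                  "files": set(), "patches": set()})
    for row in rows:
        ts = row["cmt_author_timestamp"]
        if ts is None:  # mirrors: AND cmt_author_timestamp IS NOT NULL
            continue
        key = (row["repo_id"], row["cmt_author_email"],
               row["cmt_author_affiliation"], ts.year)
        group = groups[key]
        group["added"] += row["cmt_added"]
        group["removed"] += row["cmt_removed"]
        group["whitespace"] += row["cmt_whitespace"]
        group["files"].add(row["cmt_filename"])       # COUNT(DISTINCT cmt_filename)
        group["patches"].add(row["cmt_commit_hash"])  # COUNT(DISTINCT cmt_commit_hash)
    return {
        key: {**group, "files": len(group["files"]), "patches": len(group["patches"])}
        for key, group in groups.items()
    }


def make_row(commit_hash, filename, added, removed, when):
    """Build a hypothetical per-file commit row for repo 1."""
    return {"repo_id": 1, "cmt_commit_hash": commit_hash, "cmt_filename": filename,
            "cmt_author_email": "ada@example.com", "cmt_author_affiliation": None,
            "cmt_author_timestamp": when, "cmt_added": added, "cmt_removed": removed,
            "cmt_whitespace": 0}


rows = [
    make_row("abc123", "src/main.go", 12, 5, datetime(2024, 3, 1)),
    make_row("abc123", "README.md", 3, 1, datetime(2024, 3, 1)),
    make_row("def456", "README.md", 3, 1, datetime(2024, 6, 9)),
    make_row("0a1b2c", "notes.txt", 9, 9, None),  # skipped: no timestamp
]
print(annual_aggregates(rows))
```

Because `commits` holds one row per file per commit, `patches` must count distinct hashes rather than rows; the two commits above span three counted rows but yield `patches = 2`.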

---

## Resilience

### Fetch failure recovery

If `git fetch --all` fails on an existing bare clone (e.g., due to corruption):

1. The existing bare clone is deleted
2. A fresh `git clone --bare` is attempted
3. If that also fails, the facade phase is skipped for this repo (logged as an error)
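
The recovery sequence can be sketched in Python as below. The function names and the injectable `run` parameter are hypothetical choices made so the flow can be exercised without a real repository; only the `git fetch --all` and `git clone --bare` commands come from this page.

```python
import shutil
import subprocess


def git_ok(args) -> bool:
    """Run a git command and report whether it exited with status 0."""
    return subprocess.run(["git", *args], capture_output=True).returncode == 0


def refresh_bare_clone(clone_url: str, bare_path: str, run=git_ok) -> bool:
    """Return True when the bare clone is usable, False when facade should be skipped."""
    # Normal path: update the existing bare clone in place
    if run(["-C", bare_path, "fetch", "--all"]):
        return True
    # Step 1: delete the existing (possibly corrupt) bare clone
    shutil.rmtree(bare_path, ignore_errors=True)
    # Step 2: attempt a fresh bare clone.
    # Step 3: a False result tells the caller to log an error and skip facade.
    return run(["clone", "--bare", clone_url, bare_path])
```

A caller would typically log the failure and move on to the next repo when this returns `False`, so one broken remote does not stall the rest of the collection cycle.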

### Incremental collection

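On subsequent collection cycles, `git fetch --all` retrieves only the objects added since the last fetch. The git log is still re-parsed in full, but because rows are upserted with `ON CONFLICT ... DO UPDATE`, existing rows are updated in place and only genuinely new rows are inserted, so no duplicates are created.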

### GitLab `repo_info.commit_count` backfill (v0.16.9+)

GitLab's `GET /projects/:id?statistics=true` sometimes reports `commit_count = 0` even for non-empty projects:

- The `statistics` object is omitted when the token lacks Reporter+ access on a private project, or on self-managed instances with custom permission rules.
- `statistics.commit_count` is populated by a GitLab background worker, so freshly-imported, mirrored, or recently-pushed-to projects can report 0 until the worker runs. This is common for pull-mirror setups.

After a successful facade run for a `PlatformGitLab` repo, the scheduler calls `store.BackfillGitLabCommitCount(repoID)`, which:

1. Counts the distinct commits gathered by the facade, which serve as ground truth: `SELECT COUNT(DISTINCT cmt_commit_hash) FROM aveloxis_data.commits WHERE repo_id = $1`.
2. If that count is non-zero, runs `UPDATE aveloxis_data.repo_info SET commit_count = <gathered> WHERE repo_id = $1 AND commit_count = 0 AND repo_info_id = <latest>`. This touches only the most recent snapshot, and only when the stored API value is exactly 0.

Safety properties:

- Never overwrites a real non-zero API count (`WHERE commit_count = 0` guard).
- Never writes a no-op zero (short-circuit when gathered is 0).
- Idempotent: the second call after a successful backfill touches zero rows because `commit_count` is no longer 0.
- Scope is GitLab only: GitHub repos skip the call entirely, so their collection path is unaffected.
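
The decision logic behind these properties can be sketched as a pure Python function. This models the guards only; the function name and return shape are hypothetical, and the real implementation performs the check inside the SQL `UPDATE`.

```python
def backfill_commit_count(api_count: int, gathered_count: int):
    """Return (new_count, wrote) for the latest repo_info snapshot."""
    if gathered_count == 0:
        return api_count, False  # nothing gathered: never write a no-op zero
    if api_count != 0:
        return api_count, False  # real API count: never overwrite it
    return gathered_count, True  # API reported 0 but facade found commits


count, wrote = backfill_commit_count(0, 120)
print(count, wrote)
# Second call with the backfilled value: the guard now prevents any write
print(backfill_commit_count(count, 120))
```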

Observability: `gitlab.Client.FetchRepoInfo` logs a WARN when `statistics` is nil ("token may lack Reporter+ access") and an INFO when `commit_count = 0` ("will backfill from facade if non-empty"). The scheduler logs `gitlab commit_count backfilled from facade` with the repo ID when the UPDATE actually writes a row.

---

## Next steps

- [Contributor Resolution](contributor-resolution.md) -- how commit authors are resolved to GitHub users
- [Analysis](analysis.md) -- dependency scanning and code complexity from full clones
- [Overview](overview.md) -- system architecture overview