Contributor Resolution
Contributor resolution is the process of mapping platform user references — login names, numeric platform user IDs, commit-author emails, mailing-list sender emails — to canonical contributor records in the database. It is one of the most heavily-exercised paths in aveloxis: every issue reporter, PR author, review author, message author, event actor, commit author, and mailing-list sender triggers it.
The whole subsystem has three layers worth keeping distinct in your head:
Data origins — where a contributor reference enters the system (API actors, git commit authors, mailing-list senders). Each origin hands off a
login, aplatform_user_id, an email, or some subset.Shared processing — the resolution logic the origins funnel through. Login/ID references go through
ContributorResolver.Resolve; bare emails (commit authors, mailing-list senders) go through one shared chain,ResolveEmailToIdentity.Identity — the canonical
contributorsrow (cntrb_id) every origin ultimately maps to, and the guarantees about it.
This document is the public contract for contributor data in aveloxis. The contract has thirteen rules covering identity-key determinism, immutability, deduplication, enrichment, and FK integrity. Operators rely on these guarantees when writing analytics queries against aveloxis_data.contributors and its child tables.
If you are looking for the high-level resolution flow, jump to Resolution flow. If you are diagnosing a specific data-quality concern, jump to Data-quality FAQ or Diagnostic queries.
What a contributor record represents
A row in aveloxis_data.contributors represents one person or bot that interacts with repositories aveloxis tracks. The same person may have a GitHub identity (gh_user_id, gh_login), a GitLab identity (gl_id, gl_username), one or more commit-author emails, or any combination of these. The cntrb_id UUID on the row is stable and is referenced from 16+ child tables (commits, issues, PRs, messages, reviews, events).
The schema separates three concerns:
aveloxis_data.contributors— one row per individual; carries denormalizedgh_*/gl_*columns for backward-compatible Augur queries and the canonical email (cntrb_canonical).aveloxis_data.contributor_identities— one row per(platform, platform_user_id)observation; the relational truth for “this person has this account on this platform.”aveloxis_data.contributors_aliases— one row per distinct commit-author email; maps alternate emails back to the canonical contributor.
A contributor with both a GitHub and a GitLab identity has one contributors row and two contributor_identities rows. A contributor with three different commit emails has one contributors row and three contributors_aliases rows.
Where contributor data comes from
Three independent origins feed contributor data. They differ in what identifier they carry and which resolution path they take, but all converge on a single contributors row.
Origin |
Entry point |
Identifier it carries |
Resolution path |
|---|---|---|---|
Platform API actors (issue/PR/review/comment authors, event actors) |
Staged collector → |
|
Login/ID lookup → deterministic-UUID INSERT (Layer 1) |
Git commit authors |
Facade → |
a bare commit-author email (rarely a noreply login) |
Commit-specific SHA lookup, then the shared email→identity chain (Layer 2) |
Mailing-list senders |
|
a bare sender email (+ display name) |
The key structural point: the two email-bearing origins (commits, mailing-list senders) share one resolution chain. A commit author and a mailing-list sender with the same email resolve to the same contributors row, and an email that resolves to a platform identity later (as the DB fills) is reconciled by the same background machinery for both. This is deliberate — see Shared email→identity resolution.
Contract rules
The thirteen rules below define what the system promises about contributor data. They are referenced by ID throughout the codebase (search for “R1”, “R2”, etc. in code comments) and in the private summary/02-contract.md design document.
R1: Identity-key determinism
When (platform, platform_user_id) is known with platform_user_id > 0, cntrb_id MUST equal PlatformUUID(platform, platform_user_id). This produces an Augur-byte-compatible UUID encoding the platform byte and the user ID. Two independent collections of the same GitHub or GitLab user produce the same UUID — cross-database joins on cntrb_id work without content drift.
When platform_user_id == 0 (commit author with no resolved platform user), cntrb_id is a random gen_random_uuid(). This is the only legal source of nondeterminism, typically affecting <1% of contributors.
R2: Identity-key immutability
Once cntrb_id is written, it MUST NOT change. There is no operation in aveloxis that updates cntrb_id on an existing row. If a row was inserted with a random UUID and the contributor later turns out to have a gh_user_id, the search-resolve background task adds the gh_user_id to the existing row but does NOT migrate the row to the deterministic UUID. The two would differ only in the value of cntrb_id, and changing it would orphan every FK reference (commits, issues, PRs, messages — 16+ columns).
This rule is the reason aveloxis does not ship a 16-table merge migration. The cleanup tools in augur-contributor-resolver exist precisely because Augur did the opposite — letting cntrb_id mutate post-hoc and accumulating orphans. Aveloxis prevents the bad state at the write boundary.
R3: cntrb_login partial uniqueness
The partial unique index idx_contributors_login enforces that no two rows share the same non-empty cntrb_login. Empty logins are unconstrained — multiple email-only contributors can coexist with cntrb_login = ''.
Rename edge case. When a platform user renames between observations, three outcomes are possible:
The new login is unobserved elsewhere —
Resolveupdates the existing row’scntrb_loginto the new value via theON CONFLICT (cntrb_id) DO UPDATEbranch. Clean.The new login is already observed under a different
cntrb_idAND the partial unique index trips at write time — theUpsertContributorFull23505 fallback UPDATEs the row’s other fields without touchingcntrb_login. Two rows representing the same person coexist with different logins, both visible in lookup queries. This was the steady-state behavior pre-v0.20.2.(v0.20.2 onward) When
runSearchResolvelater identifies the duplicate via the search API result and callsLinkContributorToGitHubUser, the function now performs a logical merge (soft-delete pattern). It picks a winner — preferring the row whosecntrb_idmatchesPlatformUUID(1, ghUserID)per R1, falling back to the older row — copies non-empty fields from the loser(s) into the winner, insertscontributors_aliasesrows so the loser’s emails resolve to the winner, and setscntrb_deleted = 1on the loser(s). The loser rows are NOT deleted physically (preserving R2 identity-key immutability and R10 FK integrity); they’re filtered out of every lookup query viaCOALESCE(cntrb_deleted, 0) = 0. Analytics/aggregate queries that JOIN oncntrb_idcontinue to see the loser rows so historical activity stays attributed correctly.
The v0.20.2 logical merge means outcome #2 is now self-correcting: the next search-resolve cycle that processes the duplicate’s email cleans it up automatically.
cntrb_deleted semantics (v0.20.2):
0(default) — active row, returned by lookup queries.1— loser of a rename merge. Filtered out ofResolve’s lookup-by-login,FindLoginByEmail,FindContributorIDByLogin,GetThinContributorLogins,GetContributorsNeedingSearch,GetContributorsMissingCanonical, andPopulateAffiliationscandidate selection. Still visible to analytics queries that JOIN oncntrb_iddirectly (so historical FK references continue to resolve).The column already existed in the schema (legacy Augur compatibility); v0.20.2 repurposes it for the soft-delete merge semantics. Pre-v0.20.2 rows are all
cntrb_deleted = 0or NULL — both treated as active byCOALESCE(cntrb_deleted, 0) = 0.
R4: Identity rows are denormalized truth
Every observation of (platform_id, platform_user_id) produces a row in contributor_identities (UNIQUE on the pair). The denormalized gh_* / gl_* columns on the contributors row are convenience copies; when they disagree with contributor_identities, the latter wins.
Backfill from contributor_identities to contributors.gh_* happens via backfillGHColumns and the inline UPDATE inside UpsertContributorBatch.
R5: Email aliasing
Every distinct commit-author email observed for a resolved contributor lands in contributors_aliases with alias_email = email and cntrb_id pointing at the contributor. The UNIQUE (alias_email) constraint enforces that one email maps to exactly one contributor.
If an email is observed under two different cntrb_ids, that is a duplicate-contributor bug, not a multi-mapping. Use the Diagnostic queries to detect this case.
R6: Enrichment is best-effort, cooldown-bounded
A “thin” contributor has empty cntrb_company AND empty cntrb_location. The periodic enrichment ticker calls client.EnrichContributor(login) to fill these fields. If the platform profile is genuinely empty (the user has not set company / location), the row’s cntrb_last_enriched_at is stamped to NOW() to suppress retry for the cooldown window (30 days).
Two separate cooldown columns track two distinct background tasks:
cntrb_last_enriched_at—runEnrichmentandResolveEmailsToCanonicalcooldown.cntrb_last_search_attempted_at—runSearchResolve(search-by-email) cooldown.
Without these cooldowns, every periodic tick would re-process the same “genuinely empty” rows forever, wasting GitHub Search API and core API quota.
R7: Cached resolution on the hot path
Every UserRef observed during staged collection MUST go through ContributorResolver.Resolve (defined in internal/db/contributors.go), not a direct UpsertContributor call. The resolver caches (platform, user_id) → cntrb_id for the lifetime of the collection job. Two observations of the same user inside one repo’s data hit the cache; across workers (different repos), each worker pays the cache-miss SELECT.
This rule is the input to the planned process-wide cache (Phase C of summary/04-refactoring-plan.md), which lifts the cache to fleet scope.
R8: Bulk operations dedup before flushing
UpsertContributorBatch MUST deduplicate by login in memory (“richest data wins”) before flushing in a single transaction. Per-row UpsertContributor calls in a tight loop are forbidden — that pattern produced 14,000 single-row transactions per enrichment cycle on a 100K-repo fleet (v0.18.28 pre-fix) and induced deadlocks on the contributors hot index.
The batch path is used by:
EnrichThinContributors(after the periodic enrichment ticker)The staged processor’s
processBatch(forEntityContributorpayloads)The legacy
Collector.collectContributors
R9: Idempotent re-collection
Every contributor INSERT carries an ON CONFLICT clause. Re-collecting the same repo produces the same final state — DO UPDATE coalesces non-empty incoming fields into existing fields without overwriting non-empty existing values. This is verified by the TestAllDataInsertTablesHaveOnConflict source-contract test.
R10: Foreign-key integrity
Every cntrb_id referenced from a child table MUST exist in contributors. The resolver creates the parent row before any child row references it. This is what prevents the orphan-FK problem that drove the augur-contributor-resolver post-hoc fix scripts.
R11: Search-resolve never modifies identity
The runSearchResolve background task and its LinkContributorToGitHubUser worker function MUST NOT modify cntrb_id (R2) or cntrb_login (R3). They only backfill gh_user_id, gh_login, the audit column cntrb_last_search_attempted_at, and add a contributor_identities row.
R13: Documentation as a first-class deliverable
This document IS R13’s deliverable. The contract is public. Operator-facing changes to contributor data shape MUST be reflected here in the same release that ships the code change. The source-contract test TestContributorResolutionDocReferencesCanonicalFunctions fails the build if any of ContributorResolver.Resolve, UpsertContributorBatch, or LinkContributorToGitHubUser is renamed without updating this doc.
Resolution flow
Layer 1: API-phase resolution (during collection)
When the staged collector observes a UserRef from issues, PRs, events, or messages:
UserRef (login + platform_user_id + name + email + ...)
|
v
ContributorResolver.Resolve
|
├── 1. In-memory cache by (platform, user_id) — hit returns immediately
|
├── 2. SELECT contributor_identities WHERE platform_id=? AND platform_user_id=?
| — hit caches and returns
|
├── 3. SELECT contributors WHERE cntrb_login=?
| — hit reuses existing row, backfills identities row, caches, returns
|
└── 4. INSERT contributors + contributor_identities in one transaction
- userID > 0: ON CONFLICT (cntrb_id) DO UPDATE — R1, R3
- userID == 0: ON CONFLICT (cntrb_login) DO UPDATE — R3 partial
Step 3 (lookup-by-login before INSERT) was added in v0.19.2 to prevent the partial-unique-index race that produced “duplicate key value violates unique constraint idx_contributors_login” floods in production logs. Without it, two workers observing the same renamed user under the new login both attempt INSERT with different deterministic UUIDs and one trips the constraint.
Layer 2: Git-phase resolution (after facade)
After the facade phase loads commits, CommitResolver.ResolveCommits walks every commit with cmt_author_platform_username IS NULL and tries the seven-step chain in R12. On a hit, it calls UpsertContributorFull to create or update the contributor row, then EnsureContributorAlias to record the commit email.
For each unresolved commit:
|
v
resolveOne(hash, email)
|
├── hash cache → email cache → noreply parse → bot skip → DB lookup → Commits API → shared API tail (Search → global commit-search)
|
v
on hit:
|
├── SetCommitAuthorLogin(repo, hash, login)
├── UpsertContributorFull(deterministic-uuid, login, gh_user_id, email) — R1, R2
└── EnsureContributorAlias(cntrb_id, email) — R5
After all commits resolved:
|
└── BackfillCommitAuthorIDs(repo) — bulk UPDATE commits SET cmt_ght_author_id = ...
UpsertContributorFull carries the most defensive logic in the codebase. The 23505 fallback (v0.19.2) catches the partial-unique-index trip described in R3 and degrades gracefully — the rename edge case becomes a logged Debug line, not a job-killing error.
Layer 3: Background tasks
Three periodic tickers run inside aveloxis serve. Each is rate-limited and cooldown-bounded per R6:
Ticker |
Cadence |
What it does |
Cooldown column |
|---|---|---|---|
|
30 min |
≤14000 thin contributors → |
|
|
1 hour |
≤100 contributors with email + no |
|
|
1 hour |
≤N mailing-list senders (≥ message threshold) with no resolved |
|
|
configurable |
Discovers cross-repo contributor activity via Events API; writes |
n/a (per-contributor priority) |
The two email-resolve tickers (runSearchResolve for contributor rows, runMailingListSenderResolve for mailing-list senders) are the convergence machinery: an email observed before its owner’s platform identity is known gets re-attempted as the DB fills, so email-only rows acquire a gh_user_id over days without re-collection. In addition to the tickers, every per-job collection runs the commit resolver (Layer 2) and a ResolveEmailsToCanonical pass that fills cntrb_canonical for ≤500 contributors per call.
GithubUUID / PlatformUUID
The deterministic UUID encoding is byte-compatible with Augur:
Byte(s) |
Content |
|---|---|
0 |
Platform ID (1 = GitHub, 2 = GitLab) |
1–4 |
Platform user ID (big-endian uint32, when |
1–8 |
Platform user ID (big-endian uint64, when |
Remaining |
Zero-filled |
Example: GitHub user 12345 (platform=1):
Byte 0: 0x01
Bytes 1-4: 0x00003039 (12345 big-endian)
Remaining: 0x00000000000000000000
UUID: 01003039-0000-0000-0000-000000000000
The 8-byte fallback for IDs above 2³² is non-Augur-compatible by necessity (Augur predates GitHub’s overflow concern). Existing rows in either layout are never re-encoded. The function is in internal/db/github_uuid.go.
Rename handling: which columns track renames, which don’t
GitHub and GitLab both allow users to rename their account. The numeric user_id is stable across renames, so the deterministic cntrb_id = PlatformUUID(platform, user_id) keeps every cross-table reference valid. But the login field has multiple representations in the schema, and they behave differently on a rename. This is the contract.
Three tables, three rename behaviors
aveloxis_data.contributors (one row per person)
Column |
Updated on rename? |
Purpose |
|---|---|---|
|
Never |
Stable identity. All FK references (16 columns across 15 tables — see R10) depend on this. |
|
Never (R2 invariant) |
Durable audit trail — the login as first observed for this contributor. Subject to |
|
Yes |
The “current display name” mirror. Updated by |
|
Never (write-once via COALESCE) |
Stable platform integer ID. Drives the deterministic |
A renamed contributor ends up with cntrb_login='oldname' and gh_login='newname' on the same row. Two different login values, each authoritative for a different question.
aveloxis_data.contributor_identities (one row per (platform_id, platform_user_id))
The unique key is (platform_id, platform_user_id), so there is one identity row per platform user — not one per observation. On every observation:
INSERT INTO contributor_identities (cntrb_id, platform_id, platform_user_id, login, ...)
VALUES (...)
ON CONFLICT (platform_id, platform_user_id) DO UPDATE SET
login = EXCLUDED.login, -- overwritten in place
name = EXCLUDED.name,
email = COALESCE(NULLIF(EXCLUDED.email,''), contributor_identities.email),
avatar_url = EXCLUDED.avatar_url,
profile_url = EXCLUDED.profile_url;
The previous login is overwritten on the existing identity row. We do NOT create a new identity row for a rename. The identity row is a “current state per platform user” record, not a history. After a rename, the previous login at the identity layer is lost.
aveloxis_data.commits (one row per file per commit)
cmt_author_platform_username TEXT— a frozen raw string set once at commit-resolution time. No FK constraint; just text. It stays at whatever login was current when the commit was attributed. Renames after the fact do not update it.cmt_ght_author_id UUID— the stable FK tocontributors(cntrb_id). This is what keeps commit attribution correct across renames, because the cntrb_id is deterministic from gh_user_id and the gh_user_id never changes.
BackfillCommitAuthorIDs resolves unresolved commits via LOWER(cmt_author_platform_username) = LOWER(cn.gh_login) (v0.20.12 Fix H, case-insensitive). It joins against the current gh_login, not cntrb_login. The function only updates rows with cmt_ght_author_id IS NULL, so once a commit is attributed it stays attributed regardless of subsequent renames.
v0.25.6 coverage asymmetry closure. Production audit on 2026-05-29 of aveloxis_large (474M commit rows) found cmt_ght_author_id populated on 91.95% of rows but cmt_author_platform_username populated on only 77.58%. The asymmetry comes from API paths that stamp the UUID without updating the login text on every matching commit row (noreply email parsing, Commits API resolution, email-based backfill). v0.25.6 ships a one-shot migration backfill cmt_author_platform_username from cmt_ght_author_id that UPDATEs commits SET platform_username = c.gh_login wherever the UUID is populated but the login text isn’t. After deploy, login-based queries against commits match the UUID-based queries’ coverage. Soft-deleted contributors (R10 / v0.20.2 merge losers) are skipped per COALESCE(c.cntrb_deleted, 0) = 0.
Operational consequences
“Show me commits by user X (where X is the current login)” — JOIN commits ON cmt_ght_author_id, then filter on contributors.gh_login = 'X'. The FK is stable; the gh_login mirror reflects current state. Works.
“What was this commit’s author at write time?” — Read commits.cmt_author_platform_username directly. Frozen string. Useful for audit logs and historical accuracy.
“What was this user’s first observed login on aveloxis?” — Read contributors.cntrb_login. Never mutates.
“What login has this person been known by between then and now?” — Stored as of v0.23.0 in aveloxis_data.contributor_login_history. Every observed (cntrb_id, platform_id, login) triple gets a row, with first_seen preserved across re-observations and last_seen advancing. The source column tags the row with WHY it was created (observation, rename_recovery, rename_breadth, backfill). Pre-v0.23.0 only cntrb_login (first observation) and gh_login (current state) survived; intermediate renames between those two were lost. See the “Login history table (v0.23.0)” section below for query patterns.
Login history table (v0.23.0)
aveloxis_data.contributor_login_history is the append-only audit trail of every login a contributor has been observed under. Schema:
CREATE TABLE aveloxis_data.contributor_login_history (
history_id BIGSERIAL PRIMARY KEY,
cntrb_id UUID NOT NULL REFERENCES contributors(cntrb_id)
ON UPDATE CASCADE ON DELETE RESTRICT DEFERRABLE INITIALLY DEFERRED,
platform_id SMALLINT NOT NULL REFERENCES platforms(platform_id) DEFERRABLE INITIALLY DEFERRED,
login TEXT NOT NULL,
first_seen TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_seen TIMESTAMPTZ NOT NULL DEFAULT NOW(),
source TEXT NOT NULL DEFAULT 'observation',
tool_source TEXT NOT NULL DEFAULT 'aveloxis',
tool_version TEXT NOT NULL DEFAULT '',
UNIQUE (cntrb_id, platform_id, login)
);
source values:
Value |
Written by |
Trigger |
|---|---|---|
|
|
Steady-state per-cycle write — most rows have this source. |
|
|
Pkey collision recovery: same gh_user_id, new login arrived in a batch. |
|
|
Breadth worker hit 404, looked up by id, found a different current login. |
|
One-shot migration step at v0.23.0 install |
Seeded from existing |
Common queries:
-- Every login this person has ever used, oldest first
SELECT login, first_seen, last_seen, source
FROM aveloxis_data.contributor_login_history
WHERE cntrb_id = $1::uuid
ORDER BY first_seen ASC;
-- Find all rename events (any contributor whose history has more than
-- one login on the same platform)
SELECT cntrb_id, platform_id, count(*) AS login_count,
array_agg(login ORDER BY first_seen ASC) AS chronological
FROM aveloxis_data.contributor_login_history
GROUP BY cntrb_id, platform_id
HAVING count(*) > 1;
-- Resolve a historical login to a current contributor (e.g. when
-- joining against an external dataset that captured a now-stale login)
SELECT c.cntrb_id, c.gh_login AS current_login
FROM aveloxis_data.contributor_login_history h
JOIN aveloxis_data.contributors c USING (cntrb_id)
WHERE h.login = $1 AND h.platform_id = 1
ORDER BY h.last_seen DESC
LIMIT 1;
The Go store method PostgresStore.GetContributorLoginHistory(ctx, cntrbID) returns the rows ordered chronologically — useful for the contributor detail page in the web GUI (display planned for a follow-up release).
Why the two write paths exist (v0.22.12 vs v0.22.13)
Both paths update gh_login on the existing contributor row and never touch cntrb_login. They differ in trigger:
Path |
Trigger |
Code |
|---|---|---|
|
Breadth worker hits HTTP 404 on |
|
|
Enrichment ticker / staged collector calls batch upsert with new login; deterministic-UUID INSERT trips |
|
The v0.22.13 path is inlined inside the batch transaction (rather than calling out to RenameContributorGhLogin) because the batch tx already holds the savepoint scope and a nested transaction would deadlock against itself. The shape of the UPDATE statement is identical between the two paths: SET gh_login = $newLogin, cntrb_email = COALESCE(...) — preserving the contract uniformly.
Data-quality FAQ
Why do I see two rows for the same person?
Most likely cause: a platform-user rename, where the old login still appears in historical observations and a new login already exists in the table from a separate observation. See R3. The two rows have different cntrb_login values but represent the same person. To detect:
-- Two cntrb rows sharing a gh_user_id (definite duplicate)
SELECT gh_user_id, count(*) AS rows, array_agg(cntrb_login)
FROM aveloxis_data.contributors
WHERE gh_user_id IS NOT NULL
GROUP BY gh_user_id
HAVING count(*) > 1;
To detect via identities:
-- Same (platform, platform_user_id) pointing at multiple cntrb_ids (orphan risk)
SELECT platform_id, platform_user_id, count(DISTINCT cntrb_id)
FROM aveloxis_data.contributor_identities
GROUP BY platform_id, platform_user_id
HAVING count(DISTINCT cntrb_id) > 1;
A future release will add a logical merge via cntrb_deleted = 1 (Phase D of the refactor plan); for now, treat duplicates as a known data-quality limitation and coalesce in queries.
Why is cntrb_canonical empty?
Three legal reasons (post-v0.25.6):
The user has set their email to private on GitHub.
EnrichContributorreturns no email;ResolveEmailsToCanonicalcannot help. After the 30-day cooldown the row is retried in case the user changed their setting, but typically stays empty.The contributor is
email-only(nogh_user_id, nogl_id) — created from a commit author with a non-noreply email but no resolvable platform account.cntrb_canonicalmay be set to the commit email itself, or empty ifUpsertContributorFullwas called withcommitEmail = ''.The contributor was created very recently and the enrichment ticker has not yet reached them. They appear in the next enrichment cycle.
Pre-v0.25.6 there was a fourth cause: the commit resolver’s ensureAlias (strategies 2 — DB lookup — and 4 — Search API) populated the contributors_aliases table but did NOT update cntrb_canonical on the parent contributor row. The forward fix in v0.25.6 calls SetContributorCanonical on every alias insert, and the one-shot v0.25.6 migration backfill cntrb_canonical from contributors_aliases reads MIN(alias_email) per cntrb_id and fills empty canonicals from the alias table. After running aveloxis migrate on v0.25.6, this fourth cause is closed; rows that remain empty fall into one of the three reasons above.
What does gh_user_id IS NULL mean?
The contributor was created from a commit-only observation (no GitHub account observed yet), or from a UserRef that did not include a numeric platform_user_id. The runSearchResolve background task tries to backfill gh_user_id by searching GitHub for the email. If found, R11 backfills gh_user_id and gh_login without changing cntrb_id. If not found, cntrb_last_search_attempted_at is stamped and the row is skipped for the cooldown window (30 days).
GitLab users follow the same pattern with gl_id, but the GitLab /users/search?email= endpoint is admin-only on gitlab.com, so search-resolve produces no GitLab hits in practice.
Why is cntrb_id random instead of the deterministic UUID?
The contributor was created when no gh_user_id (or gl_id) was known — typically from a commit author with no platform account observation. Per R2, the cntrb_id does NOT migrate to the deterministic UUID even if a gh_user_id is later discovered (search-resolve backfills gh_user_id on the same random-UUID row). Two columns side by side: random cntrb_id, populated gh_user_id. This is correct.
Why does my query show contributors with no commits?
Contributors are created lazily from any UserRef — issue reporter, PR author, message author, etc. — not just commit authors. A contributor who only opened an issue but never authored a commit is legitimate.
Diagnostic queries
Run these against the production DB (see operator-private docs for connection details). All are read-only.
-- 1. Counts by data quality bucket
SELECT
count(*) FILTER (WHERE gh_user_id IS NOT NULL) AS with_gh_user_id,
count(*) FILTER (WHERE gl_id IS NOT NULL) AS with_gl_id,
count(*) FILTER (WHERE gh_user_id IS NULL AND gl_id IS NULL) AS email_only,
count(*) FILTER (WHERE cntrb_canonical = '' AND gh_login != '') AS gh_login_no_canonical,
count(*) FILTER (WHERE cntrb_company = '' AND cntrb_location = '') AS thin,
count(*) AS total
FROM aveloxis_data.contributors;
-- 2. Find duplicate-by-gh_user_id (R3 rename casualties or worse)
SELECT gh_user_id, count(*) AS dup_count, array_agg(cntrb_login ORDER BY data_collection_date DESC)
FROM aveloxis_data.contributors
WHERE gh_user_id IS NOT NULL
GROUP BY gh_user_id
HAVING count(*) > 1
ORDER BY dup_count DESC
LIMIT 20;
-- 3. Identity rows pointing at multiple cntrb_ids (FK-orphan risk)
SELECT platform_id, platform_user_id, count(DISTINCT cntrb_id) AS distinct_parents
FROM aveloxis_data.contributor_identities
GROUP BY platform_id, platform_user_id
HAVING count(DISTINCT cntrb_id) > 1;
-- 4. Aliases pointing at multiple cntrb_ids (data-quality bug)
SELECT alias_email, count(DISTINCT cntrb_id)
FROM aveloxis_data.contributors_aliases
GROUP BY alias_email
HAVING count(DISTINCT cntrb_id) > 1;
-- 5. Enrichment progress: how many rows past cooldown still need a pass?
SELECT
count(*) FILTER (WHERE cntrb_last_enriched_at IS NULL) AS never_enriched,
count(*) FILTER (WHERE cntrb_last_enriched_at < NOW() - INTERVAL '30 days') AS past_cooldown,
count(*) FILTER (WHERE cntrb_last_enriched_at >= NOW() - INTERVAL '30 days') AS within_cooldown
FROM aveloxis_data.contributors
WHERE cntrb_login != '' AND cntrb_company = '' AND cntrb_location = '';
-- 6. Search-resolve queue depth
SELECT count(*) AS candidates
FROM aveloxis_data.contributors
WHERE cntrb_email != ''
AND gh_user_id IS NULL
AND cntrb_email NOT LIKE '%noreply%'
AND (cntrb_last_search_attempted_at IS NULL
OR cntrb_last_search_attempted_at < NOW() - INTERVAL '30 days');
-- 7. Commits with unresolved authors per repo (Layer 2 progress)
SELECT r.repo_owner || '/' || r.repo_name AS repo,
count(*) AS unresolved
FROM aveloxis_data.commits c
JOIN aveloxis_data.repos r ON r.repo_id = c.repo_id
WHERE c.cmt_author_platform_username IS NULL OR c.cmt_author_platform_username = ''
GROUP BY r.repo_owner, r.repo_name
ORDER BY unresolved DESC
LIMIT 20;
-- 8. FK integrity check: child rows referencing missing contributors
-- (Should always return 0 if R10 holds.)
SELECT 'commits' AS tbl, count(*)
FROM aveloxis_data.commits c
LEFT JOIN aveloxis_data.contributors ct ON ct.cntrb_id = c.cmt_ght_author_id
WHERE c.cmt_ght_author_id IS NOT NULL AND ct.cntrb_id IS NULL
UNION ALL
SELECT 'issues', count(*)
FROM aveloxis_data.issues i
LEFT JOIN aveloxis_data.contributors ct ON ct.cntrb_id = i.reporter_id
WHERE i.reporter_id IS NOT NULL AND ct.cntrb_id IS NULL
UNION ALL
SELECT 'pull_requests', count(*)
FROM aveloxis_data.pull_requests p
LEFT JOIN aveloxis_data.contributors ct ON ct.cntrb_id = p.author_id
WHERE p.author_id IS NOT NULL AND ct.cntrb_id IS NULL;
Intentional limitations
These are accepted trade-offs, NOT bugs. Operators should be aware before filing issues.
The rename edge case (largely closed by v0.20.2)
Per R3, a platform user who renames between observations historically produced two contributor rows when the new login was already observed under a different cntrb_id. v0.20.2 added a logical-merge path in LinkContributorToGitHubUser that resolves these duplicates the next time runSearchResolve processes the relevant email — see R3 for the full description.
The remaining residual case: a duplicate where the loser’s email never gets re-observed by search-resolve (e.g., the email is private, or the user has set noreply). Those duplicates persist until either the email surfaces or an operator manually invokes the merge. For this rare case, a coalesce query is still useful:
-- Coalesce duplicate rows by gh_user_id, picking the most recently collected
SELECT DISTINCT ON (gh_user_id) cntrb_id, gh_user_id, cntrb_login, ...
FROM aveloxis_data.contributors
WHERE gh_user_id IS NOT NULL
ORDER BY gh_user_id, data_collection_date DESC;
Random-UUID rows for email-only contributors
Per R1, commit authors with no resolvable platform account get a random cntrb_id. Per R2, this UUID is not later migrated even if gh_user_id is discovered. Operators should not rely on cntrb_id byte-equality for joins across databases for the email-only subset.
GitLab parity gaps
GitLab does not expose all the fields GitHub does. CLAUDE.md’s “GitHub/GitLab Parity Gaps — Closure Plan” section enumerates the accepted limitations:
Watcher count — no GitLab equivalent;
star_countis the closest analog but semantically different.GraphQL node IDs (
gh_node_id) — GitLab uses numeric project IDs only;SrcRepoIDnumeric serves the same purpose.Contributor identity URL fields (
gh_followers_url,gh_starred_url, etc.) — GitHub-specific denormalized fields with no GitLab equivalent.Search-by-email on GitLab — admin-only on gitlab.com;
runSearchResolveproduces no GitLab hits in practice.
Cooldown windows are 30 days, not configurable
Per R6, cntrb_last_enriched_at and cntrb_last_search_attempted_at use a hard-coded 30-day cooldown. A user who unlocks their public profile within the cooldown window will not have their cntrb_company / cntrb_location updated until the window expires. The cooldown was chosen empirically to balance freshness against API quota.
Search-resolve is GitHub-only, low-yield
runSearchResolve only uses GitHub’s /search/users?q=email+in:email because GitLab’s equivalent endpoint is admin-only. Even on GitHub, success rate is moderate (~20–40% depending on the email cohort) because users frequently set email to private. The task is intentionally low-rate (100 candidates/hour) to stay within the 30/min/token search-API budget.
The cache is per-job, not fleet-wide (today)
Per R7, the ContributorResolver cache is scoped to one repo’s collection. Two workers processing different repos that share contributors each pay the cache-miss SELECT. Phase C of the refactor plan lifts the cache to fleet scope — track that work in summary/04-refactoring-plan.md.
GitLab vs GitHub: column-by-column parity matrix (v0.20.3)
Aveloxis collects contributor data from both GitHub and GitLab and stores them in the same aveloxis_data.contributors table. Some columns map cleanly between platforms; others are intentionally GitHub-only or GitLab-only. This matrix is the contract for what to expect when querying contributor data on a mixed-platform fleet.
GitHub column |
GitLab column |
API source |
Status |
|---|---|---|---|
|
|
GitHub |
✓ both populated ( |
|
|
same |
✓ both populated ( |
|
|
same |
✓ both populated ( |
|
|
same |
✓ both populated ( |
|
|
same |
✓ both populated ( |
|
|
GitHub |
✓ both populated when available; private-email users get |
|
same |
both via the enrichment endpoint |
✓ both populated when set on the user profile |
|
(gl_state implies isAdmin via |
GitHub |
≈ approximate. GitLab “Owner” role on a project maps to admin; the deployment-wide “is admin” GitHub field has no GitLab single-flag equivalent. Stored as a stringified bool on |
|
|
(aveloxis-internal — set to |
Added in v0.20.12 to mirror gl_state. GitHub’s |
(none) |
|
GitLab |
GitLab-only field, populated as of v0.20.3. Useful for filtering blocked/deactivated users out of contributor analytics. GitHub has no equivalent — its |
|
(none) |
GitHub GraphQL global node ID |
accepted limitation. GitLab uses numeric project/user IDs; there’s no GraphQL globally-unique node ID. |
|
(none) |
GitHub |
accepted limitation. GitLab doesn’t classify user accounts as Bot vs Organization at this granularity. Bot detection on GitLab data uses email patterns / heuristics. |
|
(none — |
GitHub |
accepted limitation. GitLab’s |
|
(none) |
GitHub |
accepted limitation (8 fields). These are GitHub REST hypermedia links; GitLab’s REST API uses path-based URLs derived from |
Querying contributor data on mixed-platform fleets
When you need a single value across both platforms, prefer the platform-agnostic cntrb_* columns over the platform-specific gh_* / gl_* ones. The cntrb_login, cntrb_email, cntrb_company, cntrb_location, cntrb_full_name fields are populated regardless of platform and are the recommended targets for analytics queries.
Use the platform-specific columns when you specifically need the GitHub or GitLab perspective — e.g., gh_user_id to join against externally-collected GitHub data, or gl_state to filter out blocked GitLab accounts.
Closable gaps that aren’t planned
The following GitHub fields don’t have a GitLab equivalent we plan to add. They appear above as “accepted limitation”:
gh_node_id: GitLab has no GraphQL globally-unique node ID system. Numericgl_idserves the same purpose at the GitLab side;cntrb_id(deterministic UUID per R1) serves the cross-platform role.gh_type: GitLab user-account taxonomy is binary (User vs not-a-User); the Bot/Organization distinction GitHub exposes doesn’t exist there.gh_followers_urletc.: GitLab’s REST URL scheme is path-derivable fromgl_web_url. Storing duplicates would just be aveloxis copying a hyperlink that’s reconstructable.
These are documented for transparency, not blocking issues. Closing them would require fabricating data — which would be worse for analyst trust than leaving the cells empty.
Next steps
Facade Commits — how git log data feeds Layer 2.
Staged Pipeline — how staging feeds Layer 1.
Mailing-list ingestion — how mailing-list senders feed Layer 3 (the third data origin).
Overview — system architecture.