Contributor Resolution
Contributor resolution is the process of mapping platform user references — login names, numeric platform user IDs, commit author emails — to canonical contributor records in the database. It is one of the most heavily-exercised paths in aveloxis: every issue reporter, PR author, review author, message author, event actor, and commit author triggers it.
This document is the public contract for contributor data in aveloxis. The contract has thirteen rules covering identity-key determinism, immutability, deduplication, enrichment, and FK integrity. Operators rely on these guarantees when writing analytics queries against aveloxis_data.contributors and its child tables.
If you are looking for the high-level resolution flow, jump to Resolution flow. If you are diagnosing a specific data-quality concern, jump to Data-quality FAQ or Diagnostic queries.
What a contributor record represents
A row in aveloxis_data.contributors represents one person or bot that interacts with repositories aveloxis tracks. The same person may have a GitHub identity (gh_user_id, gh_login), a GitLab identity (gl_id, gl_username), one or more commit-author emails, or any combination of these. The cntrb_id UUID on the row is stable and is referenced from 16+ child tables (commits, issues, PRs, messages, reviews, events).
The schema separates three concerns:
aveloxis_data.contributors— one row per individual; carries denormalizedgh_*/gl_*columns for backward-compatible Augur queries and the canonical email (cntrb_canonical).aveloxis_data.contributor_identities— one row per(platform, platform_user_id)observation; the relational truth for “this person has this account on this platform.”aveloxis_data.contributors_aliases— one row per distinct commit-author email; maps alternate emails back to the canonical contributor.
A contributor with both a GitHub and a GitLab identity has one contributors row and two contributor_identities rows. A contributor with three different commit emails has one contributors row and three contributors_aliases rows.
Contract rules
The thirteen rules below define what the system promises about contributor data. They are referenced by ID throughout the codebase (search for “R1”, “R2”, etc. in code comments) and in the private summary/02-contract.md design document.
R1: Identity-key determinism
When (platform, platform_user_id) is known with platform_user_id > 0, cntrb_id MUST equal PlatformUUID(platform, platform_user_id). This produces an Augur-byte-compatible UUID encoding the platform byte and the user ID. Two independent collections of the same GitHub or GitLab user produce the same UUID — cross-database joins on cntrb_id work without content drift.
When platform_user_id == 0 (commit author with no resolved platform user), cntrb_id is a random gen_random_uuid(). This is the only legal source of nondeterminism, typically affecting <1% of contributors.
R2: Identity-key immutability
Once cntrb_id is written, it MUST NOT change. There is no operation in aveloxis that updates cntrb_id on an existing row. If a row was inserted with a random UUID and the contributor later turns out to have a gh_user_id, the search-resolve background task adds the gh_user_id to the existing row but does NOT migrate the row to the deterministic UUID. The two would differ only in the value of cntrb_id, and changing it would orphan every FK reference (commits, issues, PRs, messages — 16+ columns).
This rule is the reason aveloxis does not ship a 16-table merge migration. The cleanup tools in augur-contributor-resolver exist precisely because Augur did the opposite — letting cntrb_id mutate post-hoc and accumulating orphans. Aveloxis prevents the bad state at the write boundary.
R3: cntrb_login partial uniqueness
The partial unique index idx_contributors_login enforces that no two rows share the same non-empty cntrb_login. Empty logins are unconstrained — multiple email-only contributors can coexist with cntrb_login = ''.
Rename edge case. When a platform user renames between observations, three outcomes are possible:
The new login is unobserved elsewhere —
Resolveupdates the existing row’scntrb_loginto the new value via theON CONFLICT (cntrb_id) DO UPDATEbranch. Clean.The new login is already observed under a different
cntrb_idAND the partial unique index trips at write time — theUpsertContributorFull23505 fallback UPDATEs the row’s other fields without touchingcntrb_login. Two rows representing the same person coexist with different logins, both visible in lookup queries. This was the steady-state behavior pre-v0.20.2.(v0.20.2 onward) When
runSearchResolvelater identifies the duplicate via the search API result and callsLinkContributorToGitHubUser, the function now performs a logical merge (soft-delete pattern). It picks a winner — preferring the row whosecntrb_idmatchesPlatformUUID(1, ghUserID)per R1, falling back to the older row — copies non-empty fields from the loser(s) into the winner, insertscontributors_aliasesrows so the loser’s emails resolve to the winner, and setscntrb_deleted = 1on the loser(s). The loser rows are NOT deleted physically (preserving R2 identity-key immutability and R10 FK integrity); they’re filtered out of every lookup query viaCOALESCE(cntrb_deleted, 0) = 0. Analytics/aggregate queries that JOIN oncntrb_idcontinue to see the loser rows so historical activity stays attributed correctly.
The v0.20.2 logical merge means outcome #2 is now self-correcting: the next search-resolve cycle that processes the duplicate’s email cleans it up automatically.
cntrb_deleted semantics (v0.20.2):
0(default) — active row, returned by lookup queries.1— loser of a rename merge. Filtered out ofResolve’s lookup-by-login,FindLoginByEmail,FindContributorIDByLogin,GetThinContributorLogins,GetContributorsNeedingSearch,GetContributorsMissingCanonical, andPopulateAffiliationscandidate selection. Still visible to analytics queries that JOIN oncntrb_iddirectly (so historical FK references continue to resolve).The column already existed in the schema (legacy Augur compatibility); v0.20.2 repurposes it for the soft-delete merge semantics. Pre-v0.20.2 rows are all
cntrb_deleted = 0or NULL — both treated as active byCOALESCE(cntrb_deleted, 0) = 0.
R4: Identity rows are denormalized truth
Every observation of (platform_id, platform_user_id) produces a row in contributor_identities (UNIQUE on the pair). The denormalized gh_* / gl_* columns on the contributors row are convenience copies; when they disagree with contributor_identities, the latter wins.
Backfill from contributor_identities to contributors.gh_* happens via backfillGHColumns and the inline UPDATE inside UpsertContributorBatch.
R5: Email aliasing
Every distinct commit-author email observed for a resolved contributor lands in contributors_aliases with alias_email = email and cntrb_id pointing at the contributor. The UNIQUE (alias_email) constraint enforces that one email maps to exactly one contributor.
If an email is observed under two different cntrb_ids, that is a duplicate-contributor bug, not a multi-mapping. Use the Diagnostic queries to detect this case.
R6: Enrichment is best-effort, cooldown-bounded
A “thin” contributor has empty cntrb_company AND empty cntrb_location. The periodic enrichment ticker calls client.EnrichContributor(login) to fill these fields. If the platform profile is genuinely empty (the user has not set company / location), the row’s cntrb_last_enriched_at is stamped to NOW() to suppress retry for the cooldown window (30 days).
Two separate cooldown columns track two distinct background tasks:
cntrb_last_enriched_at—runEnrichmentandResolveEmailsToCanonicalcooldown.cntrb_last_search_attempted_at—runSearchResolve(search-by-email) cooldown.
Without these cooldowns, every periodic tick would re-process the same “genuinely empty” rows forever, wasting GitHub Search API and core API quota.
R7: Cached resolution on the hot path
Every UserRef observed during staged collection MUST go through ContributorResolver.Resolve (defined in internal/db/contributors.go), not a direct UpsertContributor call. The resolver caches (platform, user_id) → cntrb_id for the lifetime of the collection job. Two observations of the same user inside one repo’s data hit the cache; across workers (different repos), each worker pays the cache-miss SELECT.
This rule is the input to the planned process-wide cache (Phase C of summary/04-refactoring-plan.md), which lifts the cache to fleet scope.
R8: Bulk operations dedup before flushing
UpsertContributorBatch MUST deduplicate by login in memory (“richest data wins”) before flushing in a single transaction. Per-row UpsertContributor calls in a tight loop are forbidden — that pattern produced 14,000 single-row transactions per enrichment cycle on a 100K-repo fleet (v0.18.28 pre-fix) and induced deadlocks on the contributors hot index.
The batch path is used by:
EnrichThinContributors(after the periodic enrichment ticker)The staged processor’s
processBatch(forEntityContributorpayloads)The legacy
Collector.collectContributors
R9: Idempotent re-collection
Every contributor INSERT carries an ON CONFLICT clause. Re-collecting the same repo produces the same final state — DO UPDATE coalesces non-empty incoming fields into existing fields without overwriting non-empty existing values. This is verified by the TestAllDataInsertTablesHaveOnConflict source-contract test.
R10: Foreign-key integrity
Every cntrb_id referenced from a child table MUST exist in contributors. The resolver creates the parent row before any child row references it. This is what prevents the orphan-FK problem that drove the augur-contributor-resolver post-hoc fix scripts.
R11: Search-resolve never modifies identity
The runSearchResolve background task and its LinkContributorToGitHubUser worker function MUST NOT modify cntrb_id (R2) or cntrb_login (R3). They only backfill gh_user_id, gh_login, the audit column cntrb_last_search_attempted_at, and add a contributor_identities row.
R13: Documentation as a first-class deliverable
This document IS R13’s deliverable. The contract is public. Operator-facing changes to contributor data shape MUST be reflected here in the same release that ships the code change. The source-contract test TestContributorResolutionDocReferencesCanonicalFunctions fails the build if any of ContributorResolver.Resolve, UpsertContributorBatch, or LinkContributorToGitHubUser is renamed without updating this doc.
Resolution flow
Layer 1: API-phase resolution (during collection)
When the staged collector observes a UserRef from issues, PRs, events, or messages:
UserRef (login + platform_user_id + name + email + ...)
|
v
ContributorResolver.Resolve
|
├── 1. In-memory cache by (platform, user_id) — hit returns immediately
|
├── 2. SELECT contributor_identities WHERE platform_id=? AND platform_user_id=?
| — hit caches and returns
|
├── 3. SELECT contributors WHERE cntrb_login=?
| — hit reuses existing row, backfills identities row, caches, returns
|
└── 4. INSERT contributors + contributor_identities in one transaction
- userID > 0: ON CONFLICT (cntrb_id) DO UPDATE — R1, R3
- userID == 0: ON CONFLICT (cntrb_login) DO UPDATE — R3 partial
Step 3 (lookup-by-login before INSERT) was added in v0.19.2 to prevent the partial-unique-index race that produced “duplicate key value violates unique constraint idx_contributors_login” floods in production logs. Without it, two workers observing the same renamed user under the new login both attempt INSERT with different deterministic UUIDs and one trips the constraint.
Layer 2: Git-phase resolution (after facade)
After the facade phase loads commits, CommitResolver.ResolveCommits walks every commit with cmt_author_platform_username IS NULL and tries the seven-step chain in R12. On a hit, it calls UpsertContributorFull to create or update the contributor row, then EnsureContributorAlias to record the commit email.
For each unresolved commit:
|
v
resolveOne(hash, email)
|
├── hash cache → email cache → noreply parse → bot skip → DB lookup → Commits API → Search API
|
v
on hit:
|
├── SetCommitAuthorLogin(repo, hash, login)
├── UpsertContributorFull(deterministic-uuid, login, gh_user_id, email) — R1, R2
└── EnsureContributorAlias(cntrb_id, email) — R5
After all commits resolved:
|
└── BackfillCommitAuthorIDs(repo) — bulk UPDATE commits SET cmt_ght_author_id = ...
UpsertContributorFull carries the most defensive logic in the codebase. The 23505 fallback (v0.19.2) catches the partial-unique-index trip described in R3 and degrades gracefully — the rename edge case becomes a logged Debug line, not a job-killing error.
Layer 3: Background tasks
Three periodic tickers run inside aveloxis serve. Each is rate-limited and cooldown-bounded per R6:
Ticker |
Cadence |
What it does |
Cooldown column |
|---|---|---|---|
|
30 min |
≤14000 thin contributors → |
|
|
1 hour |
≤100 contributors with email + no |
|
|
configurable |
Discovers cross-repo contributor activity via Events API; writes |
n/a (per-contributor priority) |
In addition to the tickers, every per-job collection runs the commit resolver (Layer 2) and a ResolveEmailsToCanonical pass that fills cntrb_canonical for ≤500 contributors per call.
GithubUUID / PlatformUUID
The deterministic UUID encoding is byte-compatible with Augur:
Byte(s) |
Content |
|---|---|
0 |
Platform ID (1 = GitHub, 2 = GitLab) |
1–4 |
Platform user ID (big-endian uint32, when |
1–8 |
Platform user ID (big-endian uint64, when |
Remaining |
Zero-filled |
Example: GitHub user 12345 (platform=1):
Byte 0: 0x01
Bytes 1-4: 0x00003039 (12345 big-endian)
Remaining: 0x00000000000000000000
UUID: 01003039-0000-0000-0000-000000000000
The 8-byte fallback for IDs above 2³² is non-Augur-compatible by necessity (Augur predates GitHub’s overflow concern). Existing rows in either layout are never re-encoded. The function is in internal/db/github_uuid.go.
Data-quality FAQ
Why do I see two rows for the same person?
Most likely cause: a platform-user rename, where the old login still appears in historical observations and a new login already exists in the table from a separate observation. See R3. The two rows have different cntrb_login values but represent the same person. To detect:
-- Two cntrb rows sharing a gh_user_id (definite duplicate)
SELECT gh_user_id, count(*) AS rows, array_agg(cntrb_login)
FROM aveloxis_data.contributors
WHERE gh_user_id IS NOT NULL
GROUP BY gh_user_id
HAVING count(*) > 1;
To detect via identities:
-- Same (platform, platform_user_id) pointing at multiple cntrb_ids (orphan risk)
SELECT platform_id, platform_user_id, count(DISTINCT cntrb_id)
FROM aveloxis_data.contributor_identities
GROUP BY platform_id, platform_user_id
HAVING count(DISTINCT cntrb_id) > 1;
A future release will add a logical merge via cntrb_deleted = 1 (Phase D of the refactor plan); for now, treat duplicates as a known data-quality limitation and coalesce in queries.
Why is cntrb_canonical empty?
Three legal reasons:
The user has set their email to private on GitHub.
EnrichContributorreturns no email;ResolveEmailsToCanonicalcannot help. After the 30-day cooldown the row is retried in case the user changed their setting, but typically stays empty.The contributor is
email-only(nogh_user_id, nogl_id) — created from a commit author with a non-noreply email but no resolvable platform account.cntrb_canonicalmay be set to the commit email itself, or empty ifUpsertContributorFullwas called withcommitEmail = ''.The contributor was created very recently and the enrichment ticker has not yet reached them. They appear in the next enrichment cycle.
What does gh_user_id IS NULL mean?
The contributor was created from a commit-only observation (no GitHub account observed yet), or from a UserRef that did not include a numeric platform_user_id. The runSearchResolve background task tries to backfill gh_user_id by searching GitHub for the email. If found, R11 backfills gh_user_id and gh_login without changing cntrb_id. If not found, cntrb_last_search_attempted_at is stamped and the row is skipped for the cooldown window (30 days).
GitLab users follow the same pattern with gl_id, but the GitLab /users/search?email= endpoint is admin-only on gitlab.com, so search-resolve produces no GitLab hits in practice.
Why is cntrb_id random instead of the deterministic UUID?
The contributor was created when no gh_user_id (or gl_id) was known — typically from a commit author with no platform account observation. Per R2, the cntrb_id does NOT migrate to the deterministic UUID even if a gh_user_id is later discovered (search-resolve backfills gh_user_id on the same random-UUID row). Two columns side by side: random cntrb_id, populated gh_user_id. This is correct.
Why does my query show contributors with no commits?
Contributors are created lazily from any UserRef — issue reporter, PR author, message author, etc. — not just commit authors. A contributor who only opened an issue but never authored a commit is legitimate.
Diagnostic queries
Run these against the production DB (see operator-private docs for connection details). All are read-only.
-- 1. Counts by data quality bucket
SELECT
count(*) FILTER (WHERE gh_user_id IS NOT NULL) AS with_gh_user_id,
count(*) FILTER (WHERE gl_id IS NOT NULL) AS with_gl_id,
count(*) FILTER (WHERE gh_user_id IS NULL AND gl_id IS NULL) AS email_only,
count(*) FILTER (WHERE cntrb_canonical = '' AND gh_login != '') AS gh_login_no_canonical,
count(*) FILTER (WHERE cntrb_company = '' AND cntrb_location = '') AS thin,
count(*) AS total
FROM aveloxis_data.contributors;
-- 2. Find duplicate-by-gh_user_id (R3 rename casualties or worse)
SELECT gh_user_id, count(*) AS dup_count, array_agg(cntrb_login ORDER BY data_collection_date DESC)
FROM aveloxis_data.contributors
WHERE gh_user_id IS NOT NULL
GROUP BY gh_user_id
HAVING count(*) > 1
ORDER BY dup_count DESC
LIMIT 20;
-- 3. Identity rows pointing at multiple cntrb_ids (FK-orphan risk)
SELECT platform_id, platform_user_id, count(DISTINCT cntrb_id) AS distinct_parents
FROM aveloxis_data.contributor_identities
GROUP BY platform_id, platform_user_id
HAVING count(DISTINCT cntrb_id) > 1;
-- 4. Aliases pointing at multiple cntrb_ids (data-quality bug)
SELECT alias_email, count(DISTINCT cntrb_id)
FROM aveloxis_data.contributors_aliases
GROUP BY alias_email
HAVING count(DISTINCT cntrb_id) > 1;
-- 5. Enrichment progress: how many rows past cooldown still need a pass?
SELECT
count(*) FILTER (WHERE cntrb_last_enriched_at IS NULL) AS never_enriched,
count(*) FILTER (WHERE cntrb_last_enriched_at < NOW() - INTERVAL '30 days') AS past_cooldown,
count(*) FILTER (WHERE cntrb_last_enriched_at >= NOW() - INTERVAL '30 days') AS within_cooldown
FROM aveloxis_data.contributors
WHERE cntrb_login != '' AND cntrb_company = '' AND cntrb_location = '';
-- 6. Search-resolve queue depth
SELECT count(*) AS candidates
FROM aveloxis_data.contributors
WHERE cntrb_email != ''
AND gh_user_id IS NULL
AND cntrb_email NOT LIKE '%noreply%'
AND (cntrb_last_search_attempted_at IS NULL
OR cntrb_last_search_attempted_at < NOW() - INTERVAL '30 days');
-- 7. Commits with unresolved authors per repo (Layer 2 progress)
SELECT r.repo_owner || '/' || r.repo_name AS repo,
count(*) AS unresolved
FROM aveloxis_data.commits c
JOIN aveloxis_data.repos r ON r.repo_id = c.repo_id
WHERE c.cmt_author_platform_username IS NULL OR c.cmt_author_platform_username = ''
GROUP BY r.repo_owner, r.repo_name
ORDER BY unresolved DESC
LIMIT 20;
-- 8. FK integrity check: child rows referencing missing contributors
-- (Should always return 0 if R10 holds.)
SELECT 'commits' AS tbl, count(*)
FROM aveloxis_data.commits c
LEFT JOIN aveloxis_data.contributors ct ON ct.cntrb_id = c.cmt_ght_author_id
WHERE c.cmt_ght_author_id IS NOT NULL AND ct.cntrb_id IS NULL
UNION ALL
SELECT 'issues', count(*)
FROM aveloxis_data.issues i
LEFT JOIN aveloxis_data.contributors ct ON ct.cntrb_id = i.reporter_id
WHERE i.reporter_id IS NOT NULL AND ct.cntrb_id IS NULL
UNION ALL
SELECT 'pull_requests', count(*)
FROM aveloxis_data.pull_requests p
LEFT JOIN aveloxis_data.contributors ct ON ct.cntrb_id = p.author_id
WHERE p.author_id IS NOT NULL AND ct.cntrb_id IS NULL;
Intentional limitations
These are accepted trade-offs, NOT bugs. Operators should be aware before filing issues.
The rename edge case (largely closed by v0.20.2)
Per R3, a platform user who renames between observations historically produced two contributor rows when the new login was already observed under a different cntrb_id. v0.20.2 added a logical-merge path in LinkContributorToGitHubUser that resolves these duplicates the next time runSearchResolve processes the relevant email — see R3 for the full description.
The remaining residual case: a duplicate where the loser’s email never gets re-observed by search-resolve (e.g., the email is private, or the user has set noreply). Those duplicates persist until either the email surfaces or an operator manually invokes the merge. For this rare case, a coalesce query is still useful:
-- Coalesce duplicate rows by gh_user_id, picking the most recently collected
SELECT DISTINCT ON (gh_user_id) cntrb_id, gh_user_id, cntrb_login, ...
FROM aveloxis_data.contributors
WHERE gh_user_id IS NOT NULL
ORDER BY gh_user_id, data_collection_date DESC;
Random-UUID rows for email-only contributors
Per R1, commit authors with no resolvable platform account get a random cntrb_id. Per R2, this UUID is not later migrated even if gh_user_id is discovered. Operators should not rely on cntrb_id byte-equality for joins across databases for the email-only subset.
GitLab parity gaps
GitLab does not expose all the fields GitHub does. CLAUDE.md’s “GitHub/GitLab Parity Gaps — Closure Plan” section enumerates the accepted limitations:
Watcher count — no GitLab equivalent;
star_countis the closest analog but semantically different.GraphQL node IDs (
gh_node_id) — GitLab uses numeric project IDs only;SrcRepoIDnumeric serves the same purpose.Contributor identity URL fields (
gh_followers_url,gh_starred_url, etc.) — GitHub-specific denormalized fields with no GitLab equivalent.Search-by-email on GitLab — admin-only on gitlab.com;
runSearchResolveproduces no GitLab hits in practice.
Cooldown windows are 30 days, not configurable
Per R6, cntrb_last_enriched_at and cntrb_last_search_attempted_at use a hard-coded 30-day cooldown. A user who unlocks their public profile within the cooldown window will not have their cntrb_company / cntrb_location updated until the window expires. The cooldown was chosen empirically to balance freshness against API quota.
Search-resolve is GitHub-only, low-yield
runSearchResolve only uses GitHub’s /search/users?q=email+in:email because GitLab’s equivalent endpoint is admin-only. Even on GitHub, success rate is moderate (~20–40% depending on the email cohort) because users frequently set email to private. The task is intentionally low-rate (100 candidates/hour) to stay within the 30/min/token search-API budget.
The cache is per-job, not fleet-wide (today)
Per R7, the ContributorResolver cache is scoped to one repo’s collection. Two workers processing different repos that share contributors each pay the cache-miss SELECT. Phase C of the refactor plan lifts the cache to fleet scope — track that work in summary/04-refactoring-plan.md.
Next steps
Facade Commits — how git log data feeds Layer 2.
Staged Pipeline — how staging feeds Layer 1.
Overview — system architecture.