Collection Pipeline
Aveloxis has two collection pipelines: the staged pipeline (used by aveloxis serve) for production workloads, and the direct pipeline (used by aveloxis collect) for ad-hoc runs. Both pipelines collect the same data through the same phases, but differ in how they write to the database.
Overview
Staged Pipeline (serve) Direct Pipeline (collect)
────────────────────── ────────────────────────
Prelim ────────────> Phase 1: Stage to JSONB (writes directly to tables)
Phase 2: Process to tables
│
v
Phase 3: Facade (bare clone + git log)
Phase 4: Analysis (deps, libyear, scc)
Phase 4c: ScanCode (licenses, every 30d)
Phase 4b: Scorecard (local, reuses clone)
│
v
Phase 5: Commit Author Resolution
Phase 6: Canonical Email Enrichment
The staged pipeline is designed for 400K+ repos. It eliminates database contention on the contributors table by decoupling API collection from relational persistence.
Collection order
Repo info and metadata are collected first (Phase 0) so that the commit count is known before the heavy phases. For repos with 10,000 or more commits, issues, PRs, and events are collected in parallel across 3 goroutines, each with its own staging writer. Messages are collected after the parallel phase completes. Repos under the threshold use sequential collection.
Parallel collection
When CommitCount >= 10,000 (from repo_info metadata):
Phase 0: Repo info, releases, clone stats (sequential)
Phase 1: Contributors (sequential)
Phase 2: Issues | PRs | Events (3 parallel goroutines)
─── wait ───
Phase 3: Messages (sequential, after parallel phase)
A 404 from /releases is treated as zero releases (non-fatal). Not every repo has cut a release. Legacy rows with a stray .git in their slug also used to cause this 404; such rows are now prevented at write time by model.NormalizeRepoName() in db.UpsertRepo. The staged collector logs “no releases endpoint (404) — treating as zero releases” and continues. See the troubleshooting guide for the underlying fix.
The 3 extra goroutines claim parallel slots tracked by an atomic counter. The scheduler’s fillWorkerSlots pauses new job starts while the total active count (semaphore + parallel slots) exceeds the configured worker limit.
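The slot accounting described above can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: the slotTracker type and its method names are hypothetical, and only the pause rule (active jobs plus parallel slots exceeding the worker limit) comes from the description.

```go
package main

import "sync/atomic"

// slotTracker is a hypothetical stand-in for the scheduler's bookkeeping.
type slotTracker struct {
	workerLimit   int64
	activeJobs    atomic.Int64 // jobs currently holding a semaphore slot
	parallelSlots atomic.Int64 // extra goroutines claimed by large repos
}

// claimParallel records n extra goroutines started for a large repo.
func (s *slotTracker) claimParallel(n int64) { s.parallelSlots.Add(n) }

// releaseParallel returns n slots once the parallel phase has finished.
func (s *slotTracker) releaseParallel(n int64) { s.parallelSlots.Add(-n) }

// shouldPause reports whether new job starts should be held back because
// the combined count exceeds the configured worker limit.
func (s *slotTracker) shouldPause() bool {
	return s.activeJobs.Load()+s.parallelSlots.Load() > s.workerLimit
}
```

A scheduler loop in this sketch would check shouldPause() before starting each new job, so a few very large repos temporarily reduce how many other repos can start.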
The direct pipeline is simpler – it writes directly to relational tables with inline contributor resolution. Best for testing or collecting a small number of repos.
Prelim phase
Before any data collection, each repo’s URL is checked with an HTTP HEAD request to detect renames, transfers, and dead repos.
Redirect detection
If the URL redirects (repo was renamed or transferred):
New URL already in database: The old repo is marked as a duplicate and dequeued. This prevents collecting the same repo twice under different URLs.
New URL not yet in database: The old repo’s URL is updated to the canonical URL, and all stored URLs in issues, PRs, reviews, and releases are bulk-updated via SQL REPLACE() to reflect the new org/repo path.
Dead repo sidelining
If the URL returns 404 or 410 (deleted, made private, or DMCA’d):
The repo is marked repo_archived = TRUE and removed from the queue
All previously collected data is preserved in the database
No further API calls are wasted on this repo
Duplicate checking
If a redirect resolves to a URL that already exists in aveloxis_data.repos, the duplicate entry is dequeued. Only one copy of each repo is collected.
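As a rough sketch of the prelim check, the following Go code issues a HEAD request without following redirects and classifies the response. The type and function names (prelimResult, classifyPrelim, checkRepoURL) are hypothetical; treating any 3xx status as a redirect is an assumption, and the real collector may follow redirects and compare final URLs instead.

```go
package main

import "net/http"

// prelimResult is a hypothetical outcome type for the prelim check.
type prelimResult int

const (
	repoOK prelimResult = iota
	repoRedirected
	repoDead
)

// classifyPrelim maps a HEAD response status to a prelim outcome:
// 404/410 means dead, any 3xx means renamed or transferred.
func classifyPrelim(status int) prelimResult {
	switch {
	case status == http.StatusNotFound || status == http.StatusGone:
		return repoDead
	case status >= 300 && status < 400:
		return repoRedirected
	default:
		return repoOK
	}
}

// checkRepoURL sends a HEAD request without following redirects and
// returns the outcome plus the Location header (empty if none).
func checkRepoURL(client *http.Client, url string) (prelimResult, string, error) {
	c := *client // copy so the caller's redirect policy is left untouched
	c.CheckRedirect = func(*http.Request, []*http.Request) error {
		return http.ErrUseLastResponse // surface the 3xx instead of following it
	}
	resp, err := c.Head(url)
	if err != nil {
		return repoOK, "", err
	}
	defer resp.Body.Close()
	return classifyPrelim(resp.StatusCode), resp.Header.Get("Location"), nil
}
```

A caller would then look up the returned Location in aveloxis_data.repos to decide between the duplicate and rename paths.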
Phase 1: Staging (staged pipeline only)
Raw API responses are written to a JSONB staging table (aveloxis_ops.staging). No FK lookups, no contributor resolution. Multiple workers can write concurrently with zero contention on any relational table.
Staging order
Data is collected and staged in this order:
Contributors – seeded from member/contributor lists
Issues – with labels and assignees bundled per issue
Pull requests – with all children bundled per PR
Events – issue events and PR events
Messages – issue comments, PR comments, inline review comments
Metadata – repo info, releases, clone/traffic stats
Envelope types
Issues and PRs are staged as envelope types that bundle the parent entity with all its children in a single JSONB row:
stagedIssue – contains the issue plus its labels and assignees
stagedPR – contains the pull request plus labels, assignees, reviewers, reviews, commits, files, and head/base metadata
This bundling means that when a PR is processed, all its children can be inserted atomically using the parent’s database ID, without needing a second pass.
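To make the envelope idea concrete, here is a minimal sketch of what a stagedIssue might look like when serialized into one JSONB payload. The field names, JSON keys, and helper functions are assumptions for illustration; only the bundling of an issue with its labels and assignees comes from the description above.

```go
package main

import "encoding/json"

// stagedIssue bundles an issue with its children in one payload.
// Field names and JSON keys are assumptions, not the real schema.
type stagedIssue struct {
	Issue     json.RawMessage   `json:"issue"`
	Labels    []json.RawMessage `json:"labels"`
	Assignees []json.RawMessage `json:"assignees"`
}

// rawAll converts JSON strings to raw messages without re-parsing them.
func rawAll(items []string) []json.RawMessage {
	out := make([]json.RawMessage, 0, len(items))
	for _, it := range items {
		out = append(out, json.RawMessage(it))
	}
	return out
}

// encodeStagedIssue builds the single-row payload for the staging table.
func encodeStagedIssue(issue string, labels, assignees []string) (string, error) {
	env := stagedIssue{
		Issue:     json.RawMessage(issue),
		Labels:    rawAll(labels),
		Assignees: rawAll(assignees),
	}
	b, err := json.Marshal(env)
	return string(b), err
}

// decodeStagedIssue restores the envelope when the row is processed.
func decodeStagedIssue(payload string) (stagedIssue, error) {
	var env stagedIssue
	err := json.Unmarshal([]byte(payload), &env)
	return env, err
}
```

Keeping the raw API JSON inside the envelope defers all parsing and foreign-key work to the processing phase, which is the point of staging.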
Batch flushing
Data is flushed to the staging table in batches (default: 1000 rows per batch, configurable via collection.batch_size). Each batch is a single INSERT with multiple values.
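A batch flush of this kind can be sketched as building one multi-row INSERT per batch. The column names (repo_id, entity_type, payload) and the stagedRow type are assumptions; only the table name aveloxis_ops.staging and the 1000-row default come from the text.

```go
package main

import (
	"fmt"
	"strings"
)

// stagedRow is a hypothetical staging row; column names are assumed.
type stagedRow struct {
	RepoID     int64
	EntityType string
	Payload    string // raw JSON text
}

// buildBatchInsert returns a single INSERT with one VALUES tuple per row,
// plus the positional arguments in matching order.
func buildBatchInsert(rows []stagedRow) (string, []any) {
	var sb strings.Builder
	sb.WriteString("INSERT INTO aveloxis_ops.staging (repo_id, entity_type, payload) VALUES ")
	args := make([]any, 0, len(rows)*3)
	for i, r := range rows {
		if i > 0 {
			sb.WriteString(", ")
		}
		n := i * 3
		fmt.Fprintf(&sb, "($%d, $%d, $%d::jsonb)", n+1, n+2, n+3)
		args = append(args, r.RepoID, r.EntityType, r.Payload)
	}
	return sb.String(), args
}

// chunk splits rows into batches of at most size rows each.
func chunk(rows []stagedRow, size int) [][]stagedRow {
	var out [][]stagedRow
	for size > 0 && len(rows) > 0 {
		n := size
		if len(rows) < n {
			n = len(rows)
		}
		out = append(out, rows[:n])
		rows = rows[n:]
	}
	return out
}
```

With a batch size of 1000 and three columns, each statement carries 3000 parameters, well under PostgreSQL's limit of 65535 bind parameters per statement.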
Phase 2: Processing (staged pipeline only)
Staged data is drained in 500-row batches by entity type, in dependency order.
Processing order
Entities are processed in this order to satisfy foreign key constraints:
Contributors – resolved first so all other entities can reference cntrb_id
Issues – upserted with resolved reporter_id and closed_by_id
Pull requests – upserted with resolved author_id
Events – issue and PR events with resolved cntrb_id
Messages – issue comments, PR comments, review comments with resolved cntrb_id
Metadata – repo info, releases, clone stats
Contributor resolution
Contributors are resolved in bulk with an in-memory write-through cache:
Cache lookup – platform user ID to cntrb_id (avoids DB round-trips)
Database lookup – contributor_identities table
Create new – insert into contributors + contributor_identities
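The three-step lookup above can be sketched as a small write-through cache. The identityStore interface, the resolver type, and the string-typed cntrb_id are hypothetical; the real implementation resolves contributors in bulk, whereas this sketch resolves one ID at a time to keep the control flow visible.

```go
package main

import (
	"fmt"
	"sync"
)

// identityStore abstracts the two database steps (hypothetical interface).
type identityStore interface {
	Lookup(platformUserID int64) (cntrbID string, found bool, err error)
	Create(platformUserID int64) (cntrbID string, err error)
}

// resolver keeps a write-through cache in front of the store.
type resolver struct {
	mu    sync.Mutex
	cache map[int64]string
	store identityStore
}

func newResolver(store identityStore) *resolver {
	return &resolver{cache: make(map[int64]string), store: store}
}

// Resolve tries the cache, then the database, then creates a new row.
// Every successful result is written back to the cache.
func (r *resolver) Resolve(platformUserID int64) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := r.cache[platformUserID]; ok {
		return id, nil
	}
	id, found, err := r.store.Lookup(platformUserID)
	if err != nil {
		return "", err
	}
	if !found {
		if id, err = r.store.Create(platformUserID); err != nil {
			return "", err
		}
	}
	r.cache[platformUserID] = id
	return id, nil
}

// memStore is an in-memory stand-in for the database, for illustration.
type memStore struct {
	rows             map[int64]string
	lookups, creates int
}

func newMemStore() *memStore { return &memStore{rows: make(map[int64]string)} }

func (m *memStore) Lookup(id int64) (string, bool, error) {
	m.lookups++
	c, ok := m.rows[id]
	return c, ok, nil
}

func (m *memStore) Create(id int64) (string, error) {
	m.creates++
	c := fmt.Sprintf("cntrb-%d", id)
	m.rows[id] = c
	return c, nil
}
```

Resolving the same platform user ID twice touches the store only once in this sketch; the second call is answered from the cache.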
Envelope processing
When an envelope (bundled issue or PR) is processed:
The parent entity is upserted first to obtain its database ID
All bundled children are upserted using that ID
Each child upsert failure logs a warning but does not abort the parent
Error isolation
A failed upsert for one issue, PR, or message logs a warning but does not abort collection for the entire repo. This per-entity error isolation ensures that a single malformed record does not prevent thousands of good records from being stored.
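The parent-first upsert and per-entity error isolation might look roughly like the sketch below. All types and function names here are hypothetical; the demo upsert functions stand in for real database calls.

```go
package main

import (
	"errors"
	"log"
)

// child and envelope are simplified stand-ins for bundled entities.
type child struct {
	Kind  string
	Valid bool
}

type envelope struct {
	Number   int
	Children []child
}

// processEnvelope upserts the parent first, then each child with the
// parent's ID. A failing child is logged and skipped, not fatal.
func processEnvelope(
	env envelope,
	upsertParent func(envelope) (int64, error),
	upsertChild func(parentID int64, c child) error,
) (int, error) {
	parentID, err := upsertParent(env)
	if err != nil {
		return 0, err
	}
	stored := 0
	for _, c := range env.Children {
		if err := upsertChild(parentID, c); err != nil {
			log.Printf("warning: %s for #%d skipped: %v", c.Kind, env.Number, err)
			continue
		}
		stored++
	}
	return stored, nil
}

// demoUpsertParent pretends to insert the parent and return its ID.
func demoUpsertParent(env envelope) (int64, error) {
	return int64(1000 + env.Number), nil
}

// demoUpsertChild fails for children flagged as invalid.
func demoUpsertChild(parentID int64, c child) error {
	if !c.Valid {
		return errors.New("malformed " + c.Kind)
	}
	return nil
}
```

With three children of which one is malformed, this sketch stores two and logs one warning, leaving the parent and the good children in place.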
Phase 3: Facade (git)
After API data is processed, the facade phase handles git-level data.
Bare clone
The repo is cloned as a bare repo (or fetched if a clone already exists):
git clone --bare <url> <path> # first time
git fetch --all # subsequent runs
Bare clones are permanent and stored in the repo_clone_dir directory.
Git log parsing
git log --all --numstat is run with a custom format string using field separators and record separators to reliably parse multi-line output. For each commit:
Per-file rows are inserted into commits (one row per file touched per commit, matching Augur’s data model)
Parent-child relationships are inserted into commit_parents
Commit messages are inserted into commit_messages (deduplicated per repo + hash)
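The separator-based parsing can be illustrated with the sketch below. The exact format string Aveloxis uses is not given here, so gitLogFormat is an assumption: it uses the ASCII record separator (0x1e) before each commit and the unit separator (0x1f) between fields, with a trailing separator so multi-line messages end unambiguously before the numstat lines.

```go
package main

import (
	"strconv"
	"strings"
)

const (
	recordSep = "\x1e" // starts each commit record
	fieldSep  = "\x1f" // separates fields within a record
)

// gitLogFormat is an assumed --format value: hash, author email, full
// message, then a closing separator before the numstat block.
const gitLogFormat = "%x1e%H%x1f%ae%x1f%B%x1f"

type fileStat struct {
	Path           string
	Added, Deleted int
}

type commit struct {
	Hash, AuthorEmail, Message string
	Files                      []fileStat
}

// parseGitLog splits raw `git log --numstat` output into commits.
func parseGitLog(out string) []commit {
	var commits []commit
	for _, rec := range strings.Split(out, recordSep) {
		fields := strings.SplitN(rec, fieldSep, 4)
		if len(fields) != 4 {
			continue // empty leading chunk or malformed record
		}
		c := commit{
			Hash:        fields[0],
			AuthorEmail: fields[1],
			Message:     strings.TrimSpace(fields[2]),
		}
		for _, line := range strings.Split(fields[3], "\n") {
			parts := strings.SplitN(line, "\t", 3)
			if len(parts) != 3 {
				continue // blank line between header and numstat
			}
			// Binary files report "-" for both counts; Atoi fails and
			// the count stays 0.
			added, _ := strconv.Atoi(parts[0])
			deleted, _ := strconv.Atoi(parts[1])
			c.Files = append(c.Files, fileStat{Path: parts[2], Added: added, Deleted: deleted})
		}
		commits = append(commits, c)
	}
	return commits
}
```

Because the separators are control characters that do not normally appear in commit messages, newlines inside a message no longer confuse the record boundaries.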
Affiliation resolution
Email domains from commit authors and committers are matched against the contributor_affiliations table:
Exact domain match first (e.g., user@redhat.com matches redhat.com)
Parent domain fallback (e.g., user@mail.google.com matches google.com)
Populates cmt_author_affiliation and cmt_committer_affiliation on every commit row
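The matching order above can be sketched as a lookup that walks from the full email domain up through its parent domains. The resolveAffiliation function and the in-memory map standing in for contributor_affiliations are hypothetical; whether the real code walks more than one level up, or stops before the bare top-level domain as this sketch does, is an assumption.

```go
package main

import "strings"

// resolveAffiliation returns the organization for an email address by
// trying the exact domain first and then each parent domain.
func resolveAffiliation(email string, affiliations map[string]string) (string, bool) {
	at := strings.LastIndex(email, "@")
	if at < 0 || at == len(email)-1 {
		return "", false // not an email address
	}
	domain := strings.ToLower(email[at+1:])
	for {
		if org, ok := affiliations[domain]; ok {
			return org, true
		}
		dot := strings.Index(domain, ".")
		if dot < 0 {
			return "", false
		}
		parent := domain[dot+1:]
		if !strings.Contains(parent, ".") {
			return "", false // never match a bare top-level domain
		}
		domain = parent
	}
}
```

In this sketch, user@mail.google.com misses on mail.google.com and then matches google.com, while an address at an unlisted domain yields no affiliation.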
Facade aggregates
Aggregate tables are refreshed by SQL aggregation over the commits table:
dm_repo_annual – annual commit stats per contributor per repo
dm_repo_monthly – monthly stats
dm_repo_weekly – weekly stats
dm_repo_group_annual, dm_repo_group_monthly, dm_repo_group_weekly – group-level aggregates
Cadence (v0.16.5+): aggregates are refreshed in bulk on the configured matview rebuild day (collection.matview_rebuild_day, default Saturday). The scheduler calls store.RefreshAllRepoAggregates while collection workers are paused, alongside the materialized view refresh. Previously the facade recomputed these tables after every single repo collection, which on a fleet of thousands of repos amounted to tens of thousands of redundant single-repo aggregations per cycle — the matview-day bulk pass supersedes that work.
The per-repo helpers RefreshRepoAggregates(repoID) and RefreshRepoGroupAggregates(repoID) remain in internal/db/aggregates.go for manual/ops usage (e.g., recalculating a single repo after a correction). They are simply no longer invoked automatically from the facade.
Phase 5: Contributor enrichment and canonical emails
After staged collection, EnrichThinContributors calls GET /users/{login} for contributors with missing profile data (empty company and location). This populates company, location, email, name, created_at, and sets cntrb_canonical from the public email (filtering noreply addresses).
Token efficiency (v0.14.4+): Contributors are tracked via cntrb_last_enriched_at to prevent re-enriching users with genuinely empty GitHub profiles on every collection pass. They are retried after 30 days. A separate ResolveEmailsToCanonical pass handles the remaining contributors discovered during commit resolution, limited to 500 per pass.
Phase 6: Analysis
After facade, a temporary full checkout is created from the bare clone (local copy, no network request). Three analyses run against it, then the checkout is deleted.
Dependency scanning
Walks the checkout for manifest files across 12 ecosystems:
| Manifest | Ecosystem |
|---|---|
| | npm |
| | Python (pip) |
| | Go |
| | Rust (Cargo) |
| | Ruby (Bundler) |
| | Java (Maven) |
| | Python (PEP 621) |
| | Python (setuptools) |
| | Java (Gradle) |
| | PHP (Composer) |
| | Swift (SPM) |
| | .NET (NuGet) |
Results are stored in repo_dependencies.
Libyear calculation
For each versioned dependency, queries its package registry to compare the current version against the latest:
| Registry | URL |
|---|---|
| npm | |
| PyPI | |
| Go proxy | |
| crates.io | |
| RubyGems | |
Libyear is calculated as:
libyear = (latest_release_date - current_release_date) / 365
Results are stored in repo_deps_libyear.
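As a worked example of the formula, the sketch below computes libyear from two release dates. The function and helper names are hypothetical, and the sketch does not clamp negative values (a current version released after the "latest" one); how Aveloxis treats that case is not stated here.

```go
package main

import "time"

// libyear returns how many years the current release lags the latest,
// using the 365-day year from the formula above.
func libyear(current, latest time.Time) float64 {
	days := latest.Sub(current).Hours() / 24
	return days / 365
}

// mustDate parses a YYYY-MM-DD date (UTC) and panics on bad input;
// it exists only to keep the example short.
func mustDate(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}
```

For instance, a dependency pinned to a version released on 2022-01-01 whose latest version was released on 2023-01-01 is 365 days behind, which is a libyear of 1.0.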
Code complexity (scc)
If scc is installed, runs scc -f json --by-file against the checkout. Per-file metrics are stored in repo_labor:
Programming language
Total lines, code lines, comment lines, blank lines
Cyclomatic complexity
If scc is not installed, this phase is silently skipped.
ScanCode Toolkit (license and copyright detection)
After SCC, ScanCode Toolkit runs against the temporary checkout to detect per-file licenses, copyrights, and packages. ScanCode is a Python tool that provides precise, line-level attribution of licenses and copyright holders.
Invocation:
scancode -clpi --only-findings --json <output-file> --quiet --timeout 300 <path>
| Flag | Purpose |
|---|---|
| -c | Detect copyrights and holders |
| -l | Detect license expressions (SPDX) |
| -p | Detect package manifests |
| -i | Collect file info (type, language, hashes) |
| --only-findings | Omit files with no detections (reduces output) |
| --quiet | Suppress progress output |
| --timeout 300 | 5-minute per-file timeout for pathological files |
30-day interval: ScanCode only runs once every 30 days per repo. License and copyright data changes infrequently, so re-scanning on every collection pass would waste time. The last-run timestamp is checked via ScancodeLastRun before invoking the tool.
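The interval gate can be sketched as a simple timestamp comparison. The scanDue function and the day helper are hypothetical, and treating a missing timestamp as "due" and an exactly 30-day-old scan as "due" are assumptions; only the 30-day interval comes from the text.

```go
package main

import "time"

// scanInterval is the 30-day gap between ScanCode runs per repo.
const scanInterval = 30 * 24 * time.Hour

// scanDue reports whether a repo should be scanned now. A zero lastRun
// means the repo has never been scanned.
func scanDue(lastRun, now time.Time) bool {
	return lastRun.IsZero() || now.Sub(lastRun) >= scanInterval
}

// day parses YYYY-MM-DD (UTC); an empty string yields the zero time.
// It exists only to keep the example short.
func day(s string) time.Time {
	if s == "" {
		return time.Time{}
	}
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}
```

A repo last scanned 15 days ago is skipped in this sketch, while one last scanned 60 days ago, or never scanned, is picked up.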
Results are stored in the aveloxis_scan schema (separate from aveloxis_data):
| Table | Contents |
|---|---|
| | Scan metadata: scancode version, duration, files scanned, files with findings |
| | Per-file: SPDX license expression, copyrights, holders, license detections, package data (all as JSONB) |
| | Previous scan results rotated before each new scan |
ScanCode data enriches other features:
SBOMs: CycloneDX includes evidence.licenses and evidence.copyright on the root component. SPDX uses the aggregated SPDX expression for licenseConcluded (vs. licenseDeclared from the registry) and includes copyrightText.
Web dashboard: The repo detail page shows a “Source Code Licenses” section with per-license file counts, OSI compliance, and a copyright holders list.
If ScanCode is not installed, this phase is silently skipped. Install it with aveloxis install-tools or pipx install scancode-toolkit-mini.
OpenSSF Scorecard (local execution)
After dependency scanning, libyear, and SCC complete, the OpenSSF Scorecard tool runs against the same temporary checkout in local mode (--local). This is significantly faster than remote mode because:
No redundant clone: Scorecard reuses the existing checkout instead of cloning the repo again.
Local checks run offline: Checks like Binary-Artifacts, Pinned-Dependencies, Dangerous-Workflow, and Token-Permissions evaluate files locally without any API calls.
Fewer API calls: Only API-dependent checks (Code-Review, Maintained, Branch-Protection) hit GitHub, making ~20-50 API calls instead of ~150-300 in remote mode.
Before running scorecard, the checkout’s git remote origin is updated from the bare repo path to the actual GitHub/GitLab URL, so scorecard can resolve the remote for API-dependent checks.
Results are stored in repo_deps_scorecard with one row per check, including the check name, score (0-10), reason, and full details as JSONB. Previous results are rotated to repo_deps_scorecard_history.
The temporary checkout is deleted after scorecard completes. If scorecard is not installed, this phase is silently skipped. Install it with aveloxis install-tools.
Token management: After each scorecard run, the used API token is marked as partially depleted (MarkDepleted) so the key pool rotates past it. No concurrency semaphore is needed — local mode is mostly disk I/O, and the small number of remaining API calls is handled by the token rotation.
Periodic tasks
In addition to per-repo collection, aveloxis serve runs these periodic tasks:
| Task | Interval | Description |
|---|---|---|
| Org refresh | Every 4 hours | Re-fetches organization membership lists to discover new repos |
| Contributor breadth | Every 6 hours | Calls |
| Materialized view rebuild | Weekly (Saturday) | Pauses all collection workers, refreshes all 19 matviews, resumes collection |
Next steps
Monitoring – track collection progress
Staged Pipeline Architecture – deeper technical details
Contributor Resolution Architecture – how identities are resolved