# Collection Pipeline

Aveloxis has two collection pipelines: the **staged pipeline** (used by `aveloxis serve`) for production workloads, and the **direct pipeline** (used by `aveloxis collect`) for ad-hoc runs. Both pipelines collect the same data through the same phases, but differ in how they write to the database.

---

## Overview

```
                        Staged Pipeline (serve)              Direct Pipeline (collect)
                        ──────────────────────               ────────────────────────
  Prelim ────────────>  Phase 1: Stage to JSONB              (writes directly to tables)
                        Phase 2: Process to tables
                              │
                              v
                        Phase 3: Facade (bare clone + git log)
                        Phase 4: Analysis (deps, libyear, scc)
                        Phase 4c: ScanCode (licenses, every 30d)
                        Phase 4b: Scorecard (local, reuses clone)
                              │
                              v
                        Phase 5: Commit Author Resolution
                        Phase 6: Canonical Email Enrichment
```

The staged pipeline is designed for 400K+ repos. It eliminates database contention on the `contributors` table by decoupling API collection from relational persistence.

### Collection order

Repo info and metadata are collected **first** (Phase 0) to provide commit count before the heavy phases. For repos with more than 10,000 commits, issues, PRs, and events are collected in **parallel** across 3 goroutines, each with its own staging writer. Messages are collected after the parallel phase completes. Repos under the threshold use sequential collection.

### Parallel collection

When `CommitCount >= 10,000` (from repo_info metadata):

```
Phase 0: Repo info, releases, clone stats (sequential)
Phase 1: Contributors (sequential)
Phase 2: Issues | PRs | Events (3 parallel goroutines)
         ─── wait ───
Phase 3: Messages (sequential, after parallel phase)
```

A 404 from `/releases` is treated as *zero releases* (non-fatal). Not every repo has cut a release, and legacy rows with a stray `.git` in their slug — though now prevented at write time by `model.NormalizeRepoName()` in `db.UpsertRepo` — used to cause this 404. The staged collector logs `no releases endpoint (404) — treating as zero releases` and continues. See the troubleshooting guide for the underlying fix.

The 3 extra goroutines claim parallel slots tracked by an atomic counter. The scheduler's `fillWorkerSlots` pauses new job starts while the total active count (semaphore + parallel slots) exceeds the configured worker limit.

The direct pipeline is simpler -- it writes directly to relational tables with inline contributor resolution. Best for testing or collecting a small number of repos.

---

## Prelim phase

Before any data collection, each repo's URL is checked with an HTTP HEAD request to detect renames, transfers, and dead repos.

### Redirect detection

If the URL redirects (repo was renamed or transferred):

- **New URL already in database:** The old repo is marked as a duplicate and dequeued. This prevents collecting the same repo twice under different URLs.
- **New URL is new:** The old repo's URL is updated to the canonical URL, and all stored URLs in issues, PRs, reviews, and releases are bulk-updated via SQL `REPLACE()` to reflect the new org/repo path.

### Dead repo sidelining

If the URL returns 404 or 410 (deleted, made private, or DMCA'd):

- The repo is marked `repo_archived = TRUE` and removed from the queue
- All previously collected data is preserved in the database
- No further API calls are wasted on this repo

### Duplicate checking

If a redirect resolves to a URL that already exists in `aveloxis_data.repos`, the duplicate entry is dequeued. Only one copy of each repo is collected.

---

## Phase 1: Staging (staged pipeline only)

Raw API responses are written to a JSONB staging table (`aveloxis_ops.staging`). No FK lookups, no contributor resolution. Multiple workers can write concurrently with zero contention on any relational table.

### Staging order

Data is collected and staged in this order:

1. **Contributors** -- seeded from member/contributor lists
2. **Issues** -- with labels and assignees bundled per issue
3. **Pull requests** -- with all children bundled per PR
4. **Events** -- issue events and PR events
5. **Messages** -- issue comments, PR comments, inline review comments
6. **Metadata** -- repo info, releases, clone/traffic stats

### Envelope types

Issues and PRs are staged as **envelope types** that bundle the parent entity with all its children in a single JSONB row:

- **`stagedIssue`** -- contains the issue plus its labels and assignees
- **`stagedPR`** -- contains the pull request plus labels, assignees, reviewers, reviews, commits, files, and head/base metadata

This bundling means that when a PR is processed, all its children can be inserted atomically using the parent's database ID, without needing a second pass.

### Batch flushing

Data is flushed to the staging table in batches (default: 1000 rows per batch, configurable via `collection.batch_size`). Each batch is a single `INSERT` with multiple values.

---

## Phase 2: Processing (staged pipeline only)

Staged data is drained in 500-row batches by entity type, in dependency order.

### Processing order

Entities are processed in this order to satisfy foreign key constraints:

1. **Contributors** -- resolved first so all other entities can reference `cntrb_id`
2. **Issues** -- upserted with resolved `reporter_id` and `closed_by_id`
3. **Pull requests** -- upserted with resolved `author_id`
4. **Events** -- issue and PR events with resolved `cntrb_id`
5. **Messages** -- issue comments, PR comments, review comments with resolved `cntrb_id`
6. **Metadata** -- repo info, releases, clone stats

### Contributor resolution

Contributors are resolved in bulk with an in-memory write-through cache:

1. **Cache lookup** -- platform user ID to `cntrb_id` (avoids DB round-trips)
2. **Database lookup** -- `contributor_identities` table
3. **Create new** -- insert into `contributors` + `contributor_identities`

### Envelope processing

When an envelope (bundled issue or PR) is processed:

1. The parent entity is upserted first to obtain its database ID
2. All bundled children are upserted using that ID
3. Each child upsert failure logs a warning but does not abort the parent

### Error isolation

A failed upsert for one issue, PR, or message logs a warning but does not abort collection for the entire repo. This per-entity error isolation ensures that a single malformed record does not prevent thousands of good records from being stored.

---

## Phase 3: Facade (git)

After API data is processed, the facade phase handles git-level data.

### Bare clone

The repo is cloned as a bare repo (or fetched if a clone already exists):

```bash
git clone --bare <url> <path>     # first time
git fetch --all                   # subsequent runs
```

Bare clones are permanent and stored in the `repo_clone_dir` directory.

### Git log parsing

`git log --all --numstat` is run with a custom format string using field separators and record separators to reliably parse multi-line output. For each commit:

- **Per-file rows** are inserted into `commits` (one row per file touched per commit, matching Augur's data model)
- **Parent-child relationships** are inserted into `commit_parents`
- **Commit messages** are inserted into `commit_messages` (deduplicated per repo + hash)

### Affiliation resolution

Email domains from commit authors and committers are matched against the `contributor_affiliations` table:

- Exact domain match first (e.g., `user@redhat.com` matches `redhat.com`)
- Parent domain fallback (e.g., `user@mail.google.com` matches `google.com`)
- Populates `cmt_author_affiliation` and `cmt_committer_affiliation` on every commit row

### Facade aggregates

Aggregate tables are refreshed by SQL aggregation over the `commits` table:

- `dm_repo_annual` -- annual commit stats per contributor per repo
- `dm_repo_monthly` -- monthly stats
- `dm_repo_weekly` -- weekly stats
- `dm_repo_group_annual`, `dm_repo_group_monthly`, `dm_repo_group_weekly` -- group-level aggregates

**Cadence (v0.16.5+):** aggregates are refreshed **in bulk on the configured matview rebuild day** (`collection.matview_rebuild_day`, default Saturday). The scheduler calls `store.RefreshAllRepoAggregates` while collection workers are paused, alongside the materialized view refresh. Previously the facade recomputed these tables *after every single repo collection*, which on a fleet of thousands of repos amounted to tens of thousands of redundant single-repo aggregations per cycle — the matview-day bulk pass supersedes that work.

The per-repo helpers `RefreshRepoAggregates(repoID)` and `RefreshRepoGroupAggregates(repoID)` remain in `internal/db/aggregates.go` for manual/ops usage (e.g., recalculating a single repo after a correction). They are simply no longer invoked automatically from the facade.

---

## Phase 4: Commit author resolution

After facade completes, git commit author emails are resolved to GitHub user accounts. This is the Go implementation of the [augur-contributor-resolver](https://github.com/aveloxis/augur-contributor-resolver) scripts.

### Resolution strategy (cheapest first)

1. **Noreply email parse** (free, no API call) -- `12345+user@users.noreply.github.com` extracts the login and `gh_user_id` directly from the email format
2. **Database lookup** -- checks `contributors` (by `cntrb_email`, `cntrb_canonical`) and `contributors_aliases` (by `alias_email`)
3. **GitHub Commits API** -- `GET /repos/{owner}/{repo}/commits/{sha}` returns the linked GitHub user with all profile fields
4. **GitHub Search API** -- `GET /search/users?q=email+in:email` for remaining non-noreply emails

### For each resolved author

- `cmt_author_platform_username` is set on all commit rows with that hash
- The contributor row is created or updated with a deterministic **GithubUUID**
- All `gh_*` profile fields are backfilled
- Login renames are detected (same `gh_user_id`, different login) and updated
- An alias is created in `contributors_aliases` linking the commit email to the contributor
- After all commits are resolved, a bulk SQL backfill sets `cmt_ght_author_id` by joining `cmt_author_platform_username` to `contributors.gh_login`

---

## Phase 5: Contributor enrichment and canonical emails

After staged collection, `EnrichThinContributors` calls `GET /users/{login}` for contributors with missing profile data (empty company and location). This populates company, location, email, name, created_at, and sets `cntrb_canonical` from the public email (filtering noreply addresses).

**Token efficiency (v0.14.4+)**: Contributors are tracked via `cntrb_last_enriched_at` to prevent re-enriching users with genuinely empty GitHub profiles on every collection pass. They are retried after 30 days. A separate `ResolveEmailsToCanonical` pass handles the remaining contributors discovered during commit resolution, limited to 500 per pass.

---

## Phase 6: Analysis

After facade, a temporary full checkout is created from the bare clone (local copy, no network request). Three analyses run against it, then the checkout is deleted.

### Dependency scanning

Walks the checkout for manifest files across 12 ecosystems:

| Manifest | Ecosystem |
|---|---|
| `package.json` | npm |
| `requirements.txt` | Python (pip) |
| `go.mod` | Go |
| `Cargo.toml` | Rust (Cargo) |
| `Gemfile` | Ruby (Bundler) |
| `pom.xml` | Java (Maven) |
| `pyproject.toml` | Python (PEP 621) |
| `setup.py` | Python (setuptools) |
| `build.gradle` | Java (Gradle) |
| `composer.json` | PHP (Composer) |
| `Package.swift` | Swift (SPM) |
| `*.csproj` | .NET (NuGet) |

Results are stored in `repo_dependencies`.

### Libyear calculation

For each versioned dependency, queries its package registry to compare the current version against the latest:

| Registry | URL |
|---|---|
| npm | `https://registry.npmjs.org/{pkg}` |
| PyPI | `https://pypi.org/pypi/{pkg}/json` |
| Go proxy | `https://proxy.golang.org/{mod}/@v/list` |
| crates.io | `https://crates.io/api/v1/crates/{crate}` |
| RubyGems | `https://rubygems.org/api/v1/versions/{gem}.json` |

Libyear is calculated as:

```
libyear = (latest_release_date - current_release_date) / 365
```

Results are stored in `repo_deps_libyear`.

### Code complexity (scc)

If `scc` is installed, runs `scc -f json --by-file` against the checkout. Per-file metrics are stored in `repo_labor`:

- Programming language
- Total lines, code lines, comment lines, blank lines
- Cyclomatic complexity

If `scc` is not installed, this phase is silently skipped.

### ScanCode Toolkit (license and copyright detection)

After SCC, [ScanCode Toolkit](https://github.com/aboutcode-org/scancode-toolkit) runs against the temporary checkout to detect per-file licenses, copyrights, and packages. ScanCode is a Python tool that provides precise, line-level attribution of licenses and copyright holders.

**Invocation:**

```bash
scancode -clpi --only-findings --json <output-file> --quiet --timeout 300 <path>
```

| Flag | Purpose |
|---|---|
| `-c` | Detect copyrights and holders |
| `-l` | Detect license expressions (SPDX) |
| `-p` | Detect package manifests |
| `-i` | Collect file info (type, language, hashes) |
| `--only-findings` | Omit files with no detections (reduces output) |
| `--quiet` | Suppress progress output |
| `--timeout 300` | 5-minute per-file timeout for pathological files |

**30-day interval**: ScanCode only runs once every 30 days per repo. License and copyright data changes infrequently, so re-scanning on every collection pass would waste time. The last-run timestamp is checked via `ScancodeLastRun` before invoking the tool.

**Results** are stored in the `aveloxis_scan` schema (separate from `aveloxis_data`):

| Table | Contents |
|---|---|
| `scancode_scans` | Scan metadata: scancode version, duration, files scanned, files with findings |
| `scancode_file_results` | Per-file: SPDX license expression, copyrights, holders, license detections, package data (all as JSONB) |
| `*_history` | Previous scan results rotated before each new scan |

ScanCode data enriches other features:
- **SBOMs**: CycloneDX includes `evidence.licenses` and `evidence.copyright` on the root component. SPDX uses the aggregated SPDX expression for `licenseConcluded` (vs. `licenseDeclared` from the registry) and includes `copyrightText`.
- **Web dashboard**: The repo detail page shows a "Source Code Licenses" section with per-license file counts, OSI compliance, and a copyright holders list.

If ScanCode is not installed, this phase is silently skipped. Install it with `aveloxis install-tools` or `pipx install scancode-toolkit-mini`.

### OpenSSF Scorecard (local execution)

After dependency scanning, libyear, and SCC complete, the [OpenSSF Scorecard](https://github.com/ossf/scorecard) tool runs against the **same temporary checkout** in local mode (`--local`). This is significantly faster than remote mode because:

- **No redundant clone**: Scorecard reuses the existing checkout instead of cloning the repo again.
- **Local checks run offline**: Checks like Binary-Artifacts, Pinned-Dependencies, Dangerous-Workflow, and Token-Permissions evaluate files locally without any API calls.
- **Fewer API calls**: Only API-dependent checks (Code-Review, Maintained, Branch-Protection) hit GitHub, making ~20-50 API calls instead of ~150-300 in remote mode.

Before running scorecard, the checkout's git remote origin is updated from the bare repo path to the actual GitHub/GitLab URL, so scorecard can resolve the remote for API-dependent checks.

Results are stored in `repo_deps_scorecard` with one row per check, including the check name, score (0-10), reason, and full details as JSONB. Previous results are rotated to `repo_deps_scorecard_history`.

The temporary checkout is deleted after scorecard completes. If scorecard is not installed, this phase is silently skipped. Install it with `aveloxis install-tools`.

**Token management**: After each scorecard run, the used API token is marked as partially depleted (`MarkDepleted`) so the key pool rotates past it. No concurrency semaphore is needed — local mode is mostly disk I/O, and the small number of remaining API calls is handled by the token rotation.

---

## Periodic tasks

In addition to per-repo collection, `aveloxis serve` runs these periodic tasks:

| Task | Interval | Description |
|---|---|---|
| **Org refresh** | Every 4 hours | Re-fetches organization membership lists to discover new repos |
| **Contributor breadth** | Every 6 hours | Calls `GET /users/{login}/events` for up to 100 contributors to discover cross-repo activity. Results stored in `contributor_repo`. |
| **Materialized view rebuild** | Weekly (Saturday) | Pauses all collection workers, refreshes all 19 matviews, resumes collection |

---

## Next steps

- [Monitoring](monitoring.md) -- track collection progress
- [Staged Pipeline Architecture](../architecture/staged-pipeline.md) -- deeper technical details
- [Contributor Resolution Architecture](../architecture/contributor-resolution.md) -- how identities are resolved