# ScanCode Worker (v0.21.0+)

Per-file license + copyright + package detection is performed by ScanCode Toolkit ([aboutcode-org/scancode-toolkit](https://github.com/aboutcode-org/scancode-toolkit)) in a dedicated worker pool inside `aveloxis serve`, decoupled from the per-repo collection pipeline.

This document covers the design and operation of that worker. Operator-facing tuning lives in [`docs/getting-started/configuration.md`](../getting-started/configuration.md#scancode-worker-v0210); this file explains *why* the system looks the way it does.

## 1. What scancode does (and doesn't)

Scancode walks a working-tree checkout of a repository and emits per-file findings:

- License expression(s) detected — both raw and normalized SPDX.
- Copyright statements and the holder names extracted from them.
- Package manifests detected and their declared dependencies.
- Detection metadata (line numbers, match percentage, etc.).

Results land in `aveloxis_scan.scancode_scans` (one row per scan run) and `aveloxis_scan.scancode_file_results` (one row per file with at least one detection). Previous rows are rotated to `*_history` tables before each new scan.

**Scancode does NOT generate SBOMs.** SPDX and CycloneDX SBOMs are produced by Phase 6 of the per-repo collection pipeline (`internal/collector/sbom.go`) and refreshed on every collection cycle. SBOM generation uses scancode's license data as enrichment when available, but the SBOM artifact itself is regenerated independently.

## 2. Why the worker was decoupled (the 2026-05-14 incident)

Pre-v0.21.0 scancode ran inline in `AnalysisCollector.AnalyzeRepo` as Phase 4 of the per-repo collection pipeline, gated by a package-level 2-slot semaphore in `internal/collector/scancode.go`.

At fleet scale this shape doesn't work. On 2026-05-14, operator-side investigation of a 180-worker production fleet found 177 of 180 worker goroutines parked at `scancode.go:114` (the semaphore acquire) for 7+ hours. The two slot-holders were Linux-kernel-scale repos whose scans legitimately took hours; the other 177 worker goroutines were holding their `collection_queue` row locks the whole time, blocking other operations on those repos.

This is the same architectural anti-pattern v0.19.7 fixed for `PopulateAffiliations`: doing work-that-doesn't-fit-the-per-job-budget inside per-job code paths inevitably stalls the worker pool at scale. The fix follows the same pattern — move the work to a dedicated periodic ticker / worker pool.

The v0.21.0 worker:

- Runs in goroutines independent of the main collection pool. A slow scancode run can't park collection workers.
- Has its own concurrency cap (`collection.scancode_workers`, default 2) so operators can dial it independently of the collection worker count.
- Re-clones each repo (`git clone --depth 1`) instead of sharing the facade's bare clone, eliminating cross-worker filesystem-lock hazards.
- Defaults to a 6-month cadence (was 30 days inline) because per-file license headers in source code change rarely on that timescale.

## 3. Architecture

### 3.1 Components

```
                              aveloxis_data.repos
                              ├── scancode_last_run
                              ├── scancode_version
                              ├── scancode_locked_at
                              ├── scancode_locked_pid
                              ├── scancode_locked_boot_id
                              └── scancode_output_path
                                       ▲
                                       │
              ┌────────────────────────┼───────────────────────┐
              │                        │                       │
              ▼                        ▼                       ▼
       claim query              record lock state      mark complete /
   (FOR UPDATE SKIP LOCKED)     (after cmd.Start)        clear lock
              │                        │                       ▲
              │                        │                       │
              ▼                        ▼                       │
        ┌───────────┐             ┌─────────┐             ┌────────┐
        │dispatcher │──jobs chan─▶│ runner  │──exec──────▶│scancode│
        │ (90s tic) │             │  pool   │             │  proc  │
        └───────────┘             │ (N=2)   │             └────────┘
                                  └─────────┘
              ▲
              │
              │ on Run() startup
        ┌─────────────┐
        │ recover     │  reads ListLockedScancodeRows,
        │ Orphans     │  applies 4-state decision (§5)
        └─────────────┘
```

### 3.2 The lifecycle of a single scan

1. **Dispatcher claim** (paced by `scancode_start_interval_s`, default 90s, applied as a minimum gap between successful starts; see §3.3 for the v0.21.3 design): the dispatcher calls `ClaimNextScancodeRepo(ctx, cadence)`. The SQL uses `FOR UPDATE SKIP LOCKED` against `aveloxis_data.repos` filtered by:
   - `collection_queue.last_collected IS NOT NULL` — the repo has been collected at least once. Newly-added repos collect basic metrics first; scancode runs against them only after the first collection completes.
   - `repo_archived = FALSE` (or NULL).
   - `scancode_last_run IS NULL OR < NOW() - cadence` — cadence gate.
   - `scancode_locked_at IS NULL OR < NOW() - 12h` — stale-lock fallback (the silent-corpse safety net; the explicit `recoverOrphans` pass is the primary recovery path).
2. **Lock acquired**: the same SQL statement sets `scancode_locked_at = NOW()` on the candidate row and returns the (repo_id, owner, name, git_url) tuple. The row is now claimed atomically.
3. **Job sent to runner channel**: dispatcher writes the job; if no runner slot is free the dispatcher blocks until one is — this is the documented concurrency cap.
4. **Runner picks up job**:
   - Creates a fresh shallow clone: `git clone --depth 1 <repo_git> <scancode_clone_dir>/repo_<id>_<unix_ts>`. The shallow clone is enough for scancode (it walks current file state, not history).
   - Spawns scancode via `exec.CommandContext(ctx, "scancode", "-clpi", "--json", outputPath, ...)`.
   - Calls `cmd.Start()` (NOT `cmd.Run()`) so the OS PID is available *immediately*, and only afterwards calls `cmd.Wait()` to block until the scan completes.
   - Between `Start` and `Wait`, calls `store.RecordScancodeLockState(repoID, pid, bootID, outputPath)`. **This is the critical step** that makes crash recovery work: if aveloxis is killed before `cmd.Wait()` returns, the next aveloxis startup has the (pid, boot_id, output_path) tuple in the DB and can recover (see §5).
5. **Scan completes**: runner parses the JSON output, calls `ingestScancodeOutput()` to write `scancode_scans` + `scancode_file_results` (with history rotation), and calls `store.MarkScancodeComplete(repoID, version)`. That UPDATE atomically:
   - Sets `scancode_last_run = NOW()` and `scancode_version = $version`.
   - Clears all four lock columns (`scancode_locked_at`, `scancode_locked_pid`, `scancode_locked_boot_id`, `scancode_output_path`).
6. **Cleanup**: the runner's deferred `os.RemoveAll(tempDir)` removes the clone directory.

On any failure path (clone error, scancode crash, JSON parse error, ingest error), the runner calls `store.ClearScancodeLock(repoID)` to release the lock without setting `scancode_last_run`. The row becomes eligible for re-claim on the next dispatcher tick.
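The Start / record / Wait ordering from step 4, together with the lock handling on each exit path, can be sketched as follows. This is a minimal illustration, not the code in `scancode_worker.go`: `lockHooks` and `runScan` are hypothetical names, the hooks stand in for the store methods named above, and killing the subprocess when the lock state cannot be recorded is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os/exec"
)

// lockHooks are hypothetical stand-ins for the store calls named above.
type lockHooks struct {
	Record   func(pid int, outputPath string) error // RecordScancodeLockState
	Complete func() error                           // MarkScancodeComplete
	Clear    func() error                           // ClearScancodeLock
}

// runScan starts the command, persists its PID while it is still running,
// waits for it to exit, and clears the lock on every failure path.
func runScan(h lockHooks, outputPath, name string, args ...string) error {
	fail := func(err error) error {
		if cerr := h.Clear(); cerr != nil {
			return fmt.Errorf("%w (clearing the lock also failed: %v)", err, cerr)
		}
		return err
	}

	// exec.Command (no context) keeps this sketch independent of any
	// context's lifetime.
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return fail(fmt.Errorf("start %s: %w", name, err))
	}

	// Start has returned, so the PID is known before Wait blocks.
	if err := h.Record(cmd.Process.Pid, outputPath); err != nil {
		// Assumption: an untracked scan is not worth keeping, so stop it.
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
		return fail(fmt.Errorf("record lock state: %w", err))
	}

	if err := cmd.Wait(); err != nil {
		return fail(fmt.Errorf("wait for %s: %w", name, err))
	}
	// Parsing and ingesting the JSON output would happen here.
	return h.Complete()
}
```

Because `Record` runs while the subprocess is still alive, a crash at any point after it leaves a PID and output path in the database for the recovery pass to inspect.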

### 3.3 Dispatcher pacing (v0.21.3): minimum-gap, not throughput cap

The pacing semantic for `scancode_start_interval_s` changed between v0.21.0 and v0.21.3. The change is invisible to operators in steady state but materially affects first-pass throughput.

**Pre-v0.21.3 design (broken at scale)**: the dispatcher was driven by `time.NewTicker(startInterval)`, making one claim attempt per tick regardless of how many workers were idle. With a 90 s tick the fleet-wide claim rate capped at 3600 / 90 = 40 claims/hour, while 7 workers at a ~3-min average scan time had capacity for ~140. On a 40K-repo fleet this produced ~42-day first-pass estimates when actual capacity was ~12 days. At 40 claims/hour × 3 min per scan, only about 2 of the 7 workers were busy on average; the other 5 sat idle.

**v0.21.3 design (correct)**: the dispatcher maintains a `nextStartAllowed time.Time` deadline that's stamped *after* each successful start. It then loops as fast as the runtime allows, gating each claim on `time.Now() >= nextStartAllowed`. The unbuffered jobs channel provides back-pressure — when all N workers are busy, the dispatcher's send blocks naturally and no over-claiming happens.

Operational effect:
- **Steady-state with idle workers**: claims happen at intervals of exactly `startInterval` between successful starts. Same behavior as before.
- **Steady-state with busy workers**: dispatcher pauses on the unbuffered send. When a runner frees up, the next claim happens after the `startInterval` window — same as before.
- **Burst on restart (the throughput-critical case)**: dispatcher claims one repo per `startInterval` seconds until all N worker slots are full. At 90 s × 7 workers, the pool takes about 630 seconds (~10 min) to saturate. Same as before.
- **First-pass on a large fleet (the regression case)**: workers complete scans in single-digit-to-low-double-digit minutes; the dispatcher refills slots at `startInterval` cadence, so 7 workers stay nearly always busy. Throughput is now bounded by worker capacity, not dispatcher pacing.
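A minimal sketch of the deadline-plus-unbuffered-channel shape described above. The names `minGapGate`, `dispatch`, `sleep`, and `idlePoll` are hypothetical, `claim` stands in for `ClaimNextScancodeRepo`, and the idle back-off and sleeping until the deadline (rather than spinning) are assumptions of this sketch, not a description of the shipped dispatcher.

```go
package main

import (
	"context"
	"time"
)

// minGapGate enforces a minimum interval between successful starts.
type minGapGate struct {
	interval         time.Duration
	nextStartAllowed time.Time // zero value: the first start is allowed at once
}

// Ready reports whether a start is permitted at now.
func (g *minGapGate) Ready(now time.Time) bool {
	return !now.Before(g.nextStartAllowed)
}

// Stamp records a successful start at now and moves the deadline forward.
func (g *minGapGate) Stamp(now time.Time) {
	g.nextStartAllowed = now.Add(g.interval)
}

// idlePoll is a hypothetical back-off used when no repo is eligible.
const idlePoll = 5 * time.Second

// sleep waits for d or until ctx is done; it reports whether ctx is still live.
func sleep(ctx context.Context, d time.Duration) bool {
	if d <= 0 {
		return ctx.Err() == nil
	}
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return false
	case <-t.C:
		return true
	}
}

// dispatch claims repos and hands them to runners over an unbuffered channel.
// claim returns ok=false when no repo is eligible.
func dispatch(ctx context.Context, gate *minGapGate, claim func(context.Context) (repoID int64, ok bool), jobs chan<- int64) {
	defer close(jobs) // lets idle runners' range loops end on shutdown
	for {
		if !sleep(ctx, time.Until(gate.nextStartAllowed)) {
			return
		}
		repoID, ok := claim(ctx)
		if !ok {
			if !sleep(ctx, idlePoll) {
				return
			}
			continue
		}
		select {
		case <-ctx.Done():
			return // a real dispatcher would also release the claimed row here
		case jobs <- repoID: // blocks while every runner is busy (back-pressure)
			gate.Stamp(time.Now())
		}
	}
}
```

The deadline is stamped only after the send succeeds, so time spent blocked on a full pool does not count toward the gap.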

For a 40K-repo fleet with `workers=7` and ~3-min average scan time:
- Worker capacity: 7 × (60 / 3) = ~140 repos/hour
- 40,000 ÷ 140 = ~286 hours ≈ **~12 days first-pass**

(Pre-v0.21.3 same configuration: ~42 days, dispatcher-bound.)
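The two estimates can be reproduced with a few lines of arithmetic. The function names below are hypothetical helpers written for this document, not part of aveloxis.

```go
package main

// tickerBoundPerHour is the claim rate when exactly one claim is
// attempted per tick of intervalSec seconds.
func tickerBoundPerHour(intervalSec float64) float64 {
	return 3600 / intervalSec
}

// workerBoundPerHour is the scan rate the runner pool can sustain.
func workerBoundPerHour(workers int, avgScanMin float64) float64 {
	return float64(workers) * 60 / avgScanMin
}

// firstPassDays is the time needed to scan every repo once at a given rate.
func firstPassDays(repos int, reposPerHour float64) float64 {
	return float64(repos) / reposPerHour / 24
}
```

With the figures above, `firstPassDays(40000, workerBoundPerHour(7, 3))` comes to about 11.9 days and `firstPassDays(40000, tickerBoundPerHour(90))` to about 41.7 days.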

If you want to push further, raise `scancode_workers`. The `scancode_start_interval_s` rarely needs tuning unless you want denser starts on a very high-bandwidth network — the 90-second default works fine for most fleets.

## 4. Cadence rationale (180 days default)

Per-file license + copyright headers in source files are near-immutable on the timescale that matters. A 6-month cadence catches:

- New files added to the repo since the last scan.
- Wholesale license changes (rare — but they happen, e.g. a project changes from GPL to MIT).
- Scancode version improvements (newer scancode versions detect licenses older versions missed).

It does NOT catch dependency-license changes, but those don't flow through scancode anyway — they're handled by Phase 4 dependency scanning + Phase 6 SBOM generation, both per-cycle.

Pre-v0.21.0 the inline cadence was 30 days, which produced one full re-scan of the whole fleet every month. At fleet scale (100K+ repos), that's a large continuous load for almost no fresh data. 180 days is a reasonable default; operators can dial via `collection.scancode_cadence_days`.

## 5. Crash recovery — the four-state table

On `aveloxis serve` startup, `ScancodeWorker.Run()` calls `recoverOrphans(ctx)` *before* the dispatcher starts claiming new jobs. The recovery pass examines every row with `scancode_locked_at IS NOT NULL` and applies one of four decisions:

| State | Detection | Action |
|---|---|---|
| **Reboot survivor** | stored `boot_id` ≠ current `/proc/sys/kernel/random/boot_id` | Scancode subprocess is definitively dead (the kernel that hosted it no longer exists). Clear all lock columns. |
| **Live orphan** | boot_id matches AND `kill(pid, 0)` succeeds (signal 0 checks that the process exists without delivering a signal) | Subprocess survived a previous aveloxis crash and is now an orphan of init. Spawn a monitor goroutine that polls every 30s; when the PID dies, attempt to ingest the output file if present. |
| **Recoverable corpse** | boot_id matches, PID is dead, output file exists and parses | Scan finished but aveloxis crashed before ingest. Ingest the orphaned output, then clear the lock. |
| **Lost run** | boot_id matches, PID dead, no usable output | Scan died mid-flight. Clear the lock; because `scancode_last_run` was never updated, the row is eligible for re-claim on the next dispatcher tick. |

The boot_id check is what makes the PID check reliable. Linux PIDs are reused — a stored PID of 12345 from before a reboot could legitimately match an unrelated process after the reboot. The boot_id (kernel-generated UUID, changes on every boot) lets the recovery pass decide unambiguously.

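On non-Linux dev machines (e.g. macOS) the `/proc` path is absent and `readBootID()` returns an empty string. The recovery pass treats empty boot_id as "unknown" and falls through to the PID check. This is acceptable for development use: PID reuse is rare within a single boot, although without a boot_id a reboot cannot be detected, so a reused PID could in principle be mistaken for a live orphan.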
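The table reduces to a small pure function once the three probes (boot_id comparison, PID liveness, output usability) have been made. The sketch below is an illustration under assumptions: `recoveryAction` and `decide` are hypothetical names, the `readBootID` body is a guess at the behavior described above rather than the real implementation, and treating an empty stored *or* current boot_id as "unknown" is an assumption.

```go
package main

import (
	"os"
	"strings"
)

type recoveryAction string

const (
	rebootSurvivor    recoveryAction = "reboot survivor: clear lock"
	liveOrphan        recoveryAction = "live orphan: spawn monitor"
	recoverableCorpse recoveryAction = "recoverable corpse: ingest output, clear lock"
	lostRun           recoveryAction = "lost run: clear lock"
)

// readBootID returns the kernel boot UUID, or "" where /proc is absent.
func readBootID() string {
	b, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(b))
}

// decide maps the probe results for one locked row onto the four states.
// pidAlive is the result of a signal-0 probe; outputUsable means the
// output file exists and parses.
func decide(storedBootID, currentBootID string, pidAlive, outputUsable bool) recoveryAction {
	bootKnown := storedBootID != "" && currentBootID != ""
	switch {
	case bootKnown && storedBootID != currentBootID:
		return rebootSurvivor // the PID is meaningless after a reboot
	case pidAlive:
		return liveOrphan
	case outputUsable:
		return recoverableCorpse
	default:
		return lostRun
	}
}
```

The boot_id comparison comes first so that a reused PID after a reboot can never be classified as a live orphan.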

## 6. Graceful shutdown

When the scheduler's context is cancelled (`aveloxis stop serve`):

1. The dispatcher exits immediately on its `<-ctx.Done()` arm. No new claims happen.
2. The dispatcher closes the jobs channel.
3. Runners that were idle return immediately (their `range jobs` loop terminates).
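4. Runners that were mid-scan keep going. The runner's `cmd.Wait()` is blocking on the scancode subprocess, which is NOT killed by the scheduler's cancellation. Note that Go's `exec.CommandContext` kills the subprocess as soon as the context passed to it is done (garbage collection plays no part), so this behavior depends on the `ctx` handed to `exec.CommandContext` in §3.2 being detached from the scheduler's cancellable context; otherwise the subprocess stops only on an explicit `cmd.Process.Kill()`.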
5. `Run()` waits up to `collection.scancode_shutdown_grace_minutes` (default 30 min) for all runners to finish.
6. If the grace expires with runners still active, `Run()` returns. The outstanding subprocesses become orphans — but they're tracked in the DB via the `(pid, boot_id, output_path)` triple recorded in step 4 of §3.2, so the next aveloxis startup's `recoverOrphans` pass will adopt them as live orphans (case 2 of §5).

The grace bound exists because Linux-kernel-sized scans can legitimately run for hours; without a bound, `aveloxis stop` would wait indefinitely on the slowest scancode. The trade-off: on grace expiry the in-flight scans are handed to orphan recovery on the next startup (and are lost if the host reboots first or the output turns out unusable) vs. waiting hours on stop. Operators who want a different balance can dial the grace.
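Steps 5 and 6 amount to a bounded wait on the runner goroutines. A minimal sketch, assuming the runners are tracked with a `sync.WaitGroup`; `runnerPool` and its methods are hypothetical names, not the types in `scancode_worker.go`.

```go
package main

import (
	"sync"
	"time"
)

// runnerPool tracks in-flight runner goroutines.
type runnerPool struct {
	wg sync.WaitGroup
}

// Go runs fn as a tracked runner.
func (p *runnerPool) Go(fn func()) {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		fn()
	}()
}

// WaitWithGrace blocks until every runner has returned or grace elapses.
// It reports true when all runners finished in time.
func (p *runnerPool) WaitWithGrace(grace time.Duration) bool {
	done := make(chan struct{})
	go func() {
		p.wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(grace)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false // runners still active; their scans are left to recovery
	}
}
```

A `false` return corresponds to the grace-expired case: the caller returns anyway and relies on the lock state already recorded in the database.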

## 7. Force-rerun cookbook

Cadence is enforced at claim time via the `scancode_last_run` column. To force a rescan, clear that column:

```sql
-- Single repo
UPDATE aveloxis_data.repos SET scancode_last_run = NULL
WHERE repo_owner = 'apache' AND repo_name = 'doris';

-- All repos that ran on a specific scancode version (e.g. after upgrade)
UPDATE aveloxis_data.repos SET scancode_last_run = NULL
WHERE scancode_version = '32.5.0';

-- Whole fleet
UPDATE aveloxis_data.repos SET scancode_last_run = NULL;
```

The claim query orders by `scancode_last_run NULLS FIRST`, so cleared repos move to the head of the queue and get claimed on subsequent dispatcher ticks. The order between cleared repos is `repo_id ASC` (stable).

## 8. Configuration reference

All five knobs live under the `collection` block in `aveloxis.json`:

| Key | Default | Purpose |
|---|---|---|
| `scancode_workers` | `2` | Max concurrent scancode subprocesses. Raise on machines with spare CPU cores. |
| `scancode_start_interval_s` | `90` | Minimum seconds between *successful* claim starts. As of v0.21.3 this is a minimum-gap pacing primitive, NOT a throughput cap — the dispatcher claims as fast as workers free up, with this interval enforced only between consecutive starts. Bounds clone-bandwidth bursts on restart. See §3.3. |
| `scancode_cadence_days` | `180` | Minimum days between successive scans on the same repo. Per-file licenses change rarely. |
| `scancode_clone_dir` | `/tmp/aveloxis-scancode` | Parent directory for per-run shallow clones. Size for ~50 MB × workers peak. |
| `scancode_shutdown_grace_minutes` | `30` | Wait budget for in-flight scans on `aveloxis stop`. Outstanding scans become live-orphans (see §5) if not finished. |

See [`docs/getting-started/configuration.md`](../getting-started/configuration.md#scancode-worker-v0210) for the tuning rationale per knob.
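To illustrate how the five keys nest under `collection` and how the defaults in the table apply when a key is omitted, here is a hedged sketch. The struct and function names are hypothetical and are not the real `CollectionConfig`; only the JSON key names and default values come from the table above.

```go
package main

import "encoding/json"

// scancodeConfig is a hypothetical mirror of the five knobs.
type scancodeConfig struct {
	Workers              int    `json:"scancode_workers"`
	StartIntervalS       int    `json:"scancode_start_interval_s"`
	CadenceDays          int    `json:"scancode_cadence_days"`
	CloneDir             string `json:"scancode_clone_dir"`
	ShutdownGraceMinutes int    `json:"scancode_shutdown_grace_minutes"`
}

// fileConfig models only the part of aveloxis.json this sketch needs.
type fileConfig struct {
	Collection scancodeConfig `json:"collection"`
}

// exampleConfig overrides one knob and leaves the rest at their defaults.
const exampleConfig = `{
  "collection": {
    "scancode_workers": 7
  }
}`

// loadScancodeConfig seeds the documented defaults, then lets the JSON
// overwrite only the keys that are present.
func loadScancodeConfig(raw []byte) (scancodeConfig, error) {
	cfg := fileConfig{Collection: scancodeConfig{
		Workers:              2,
		StartIntervalS:       90,
		CadenceDays:          180,
		CloneDir:             "/tmp/aveloxis-scancode",
		ShutdownGraceMinutes: 30,
	}}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return scancodeConfig{}, err
	}
	return cfg.Collection, nil
}
```

Parsing `exampleConfig` yields 7 workers with every other knob at its table default.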

## 9. UX: the "last run" signal

The repo detail page in the web GUI shows:

> **Last run:** 2026-04-08 (scancode 32.5.0)

Or, for never-scanned repos:

> **Last run:** *not yet run — will run in the next scancode worker cycle*

The data flows from `aveloxis_data.repos.scancode_last_run` (written by `MarkScancodeComplete`) through the `/api/v1/repos/{id}/scancode-licenses` endpoint's `last_run` field to the rendered HTML.

## 10. Observability — what to grep in `aveloxis.log`

| Log line | When it fires | Meaning |
|---|---|---|
| `scancode worker started workers=N start_interval=...` | Once at startup | The pool is alive with N runners. If absent, scancode is disabled (binary not installed or `mkdir scancode_clone_dir` failed). |
| `scancode binary not installed; ScancodeWorker disabled` | Startup | Install with `pipx install scancode-toolkit` then restart serve. |
| `scancode recoverOrphans: examining locked rows count=...` | Startup | Recovery pass found stale locks. Followed by per-row decisions. |
| `scancode recover: reboot survivor — clearing lock` | Startup | Case 1 of §5. |
| `scancode recover: live orphan detected — spawning monitor` | Startup | Case 2 of §5. Monitor goroutine will log on completion. |
| `scancode recover: ingested orphaned scancode result` | Startup or monitor | Case 3 of §5 — orphaned data recovered. |
| `running ScanCode repo_id=... owner=... pid=...` | Per scan | Scan started; PID is the subprocess we're tracking. |
| `scancode worker complete repo_id=... version=...` | Per scan | Scan succeeded and was ingested. |
| `scancode runOne: scancode subprocess failed ... pid=...` | Per scan failure | Lock will be cleared. |
| `scancode worker shutdown grace expired ...` | On stop | Outstanding scans become live-orphans on next startup. |

## 11. Code map

| File | Purpose |
|---|---|
| `internal/collector/scancode_worker.go` | The worker itself: `ScancodeWorker` struct, `Run`, `dispatcher`, `runner`, `runOne`, `recoverOrphans`, `monitorOrphan`. |
| `internal/collector/scancode.go` | JSON parsing + ingest (`ingestScancodeOutput`). Shared between the worker and any future direct caller. |
| `internal/db/scancode_worker_store.go` | Store methods: `ClaimNextScancodeRepo`, `RecordScancodeLockState`, `MarkScancodeComplete`, `ClearScancodeLock`, `ListLockedScancodeRows`. |
| `internal/db/scancode_store.go` | Pre-existing scancode data access (file results, freshness reads). `ScancodeFreshness` is new in v0.21.0. |
| `internal/db/migrate.go` | Six column additions + partial index + backfill from `aveloxis_scan.scancode_scans`. |
| `internal/scheduler/scheduler.go` | `Config.Scancode*` fields, default values, goroutine spawn in `Run()`. |
| `internal/config/config.go` | `CollectionConfig.Scancode*` fields and accessor methods. |
| `internal/api/server.go` | `handleScancodeLicenses` returns `last_run` + `scancode_version` in addition to licenses + copyrights. |
| `internal/web/templates.go` | Repo detail page renders the freshness signal above the source-code-licenses table. |

## 12. Regression guards

The v0.21.0 work added these tests as architectural pins. A future refactor that breaks any of them fails the build before it ships:

| Test | Pins |
|---|---|
| `TestAnalyzeRepoNoLongerInvokesScancode` | `AnalyzeRepo` does NOT call `scanScanCode`. |
| `TestScancodeSemaphoreNoLongerExists` | `scancode.go` does NOT declare `scancodeSem`. |
| `TestScancodeNoLongerHas30DaySkipCheck` | `scancode.go` does NOT contain the inline cadence check. |
| `TestRunOneSplitsStartFromWait` | `runOne` calls `cmd.Start()` + `cmd.Wait()`, NOT `cmd.Run()`. |
| `TestRunOnePersistsPidAndBootId` | `runOne` calls `RecordScancodeLockState` between Start and Wait. |
| `TestRunOneClearsLockOnSuccess` / `TestRunOneClearsLockOnFailure` | Lock-clear paths exist on both branches. |
| `TestClaimUsesForUpdateSkipLocked` | Claim SQL uses `FOR UPDATE SKIP LOCKED`. |
| `TestClaimGatesOnLastCollected` | Claim filters on `last_collected IS NOT NULL`. |
| `TestClaimGatesOnCadenceAndStaleLock` | Claim respects both the cadence config and the 12h stale-lock fallback. |
| `TestClaimExcludesArchivedRepos` | Claim WHERE clause matches the partial-index predicate. |
| `TestScancodeWorkerCallsRecoverBeforeDispatcher` | Recovery runs before any new claim. |
| `TestSchedulerRunStartsScancodeWorker` | Scheduler spawns the worker. |
| `TestMainWiresScancodeConfig` | `cmd/aveloxis/main.go` reads the config knobs. |
| `TestScancodeLicensesEndpointReturnsFreshness` | API surfaces the freshness signal. |