ScanCode Worker (v0.21.0+)
Per-file license + copyright + package detection is performed by ScanCode Toolkit (aboutcode-org/scancode-toolkit) in a dedicated worker pool inside aveloxis serve, decoupled from the per-repo collection pipeline.
This document covers the design and operation of that worker. Operator-facing tuning lives in docs/getting-started/configuration.md; this file explains why the system looks the way it does.
1. What scancode does (and doesn’t)
Scancode walks a working-tree checkout of a repository and emits per-file findings:
License expression(s) detected — both raw and normalized SPDX.
Copyright statements and the holder names extracted from them.
Package manifests detected and their declared dependencies.
Detection metadata (line numbers, match percentage, etc.).
Results land in aveloxis_scan.scancode_scans (one row per scan run) and aveloxis_scan.scancode_file_results (one row per file with at least one detection). Previous rows are rotated to *_history tables before each new scan.
Scancode does NOT generate SBOMs. SPDX and CycloneDX SBOMs are produced by Phase 6 of the per-repo collection pipeline (internal/collector/sbom.go) and refreshed on every collection cycle. SBOM generation uses scancode’s license data as enrichment when available, but the SBOM artifact itself is regenerated independently.
2. Why the worker was decoupled (the 2026-05-14 incident)
Pre-v0.21.0 scancode ran inline in AnalysisCollector.AnalyzeRepo as Phase 4 of the per-repo collection pipeline, gated by a package-level 2-slot semaphore in internal/collector/scancode.go.
At fleet scale this shape doesn’t work. On 2026-05-14, operator-side investigation of a 180-worker production fleet found 177 of 180 worker goroutines parked at scancode.go:114 (the semaphore acquire) for 7+ hours. The two slot-holders were Linux-kernel-scale repos whose scans legitimately took hours; the other 177 worker goroutines were holding their collection_queue row locks the whole time, blocking other operations on those repos.
This is the same architectural anti-pattern v0.19.7 fixed for PopulateAffiliations: doing work-that-doesn’t-fit-the-per-job-budget inside per-job code paths inevitably stalls the worker pool at scale. The fix follows the same pattern — move the work to a dedicated periodic ticker / worker pool.
The v0.21.0 worker:
Runs in goroutines independent of the main collection pool. A slow scancode run can’t park collection workers.
Has its own concurrency cap (collection.scancode_workers, default 2) so operators can dial it independently of the collection worker count.
Re-clones each repo (git clone --depth 1) instead of sharing the facade’s bare clone, eliminating cross-worker filesystem-lock hazards.
Defaults to a 6-month cadence (was 30 days inline) because per-file license headers in source code change rarely on that timescale.
3. Architecture
3.1 Components
                                aveloxis_data.repos
                                ├── scancode_last_run
                                ├── scancode_version
                                ├── scancode_locked_at
                                ├── scancode_locked_pid
                                ├── scancode_locked_boot_id
                                └── scancode_output_path
                                         ▲
                                         │
                ┌────────────────────────┼───────────────────────┐
                │                        │                       │
                ▼                        ▼                       ▼
           claim query           record lock state        mark complete /
    (FOR UPDATE SKIP LOCKED)     (after cmd.Start)          clear lock
                │                        │                       ▲
                │                        │                       │
                ▼                        ▼                       │
          ┌───────────┐             ┌─────────┐             ┌────────┐
          │dispatcher │──jobs chan─▶│ runner  │──exec──────▶│scancode│
          │ (90s tic) │             │  pool   │             │  proc  │
          └───────────┘             │  (N=2)  │             └────────┘
                                    └─────────┘
                ▲
                │
                │ on Run() startup
         ┌─────────────┐
         │ recover     │  reads ListLockedScancodeRows,
         │ Orphans     │  applies 4-state decision (§5)
         └─────────────┘
3.2 The lifecycle of a single scan
1. Dispatcher claim (paced by scancode_start_interval_s, default 90s, as MINIMUM GAP between successful starts; see §3.3 for the v0.21.3 design): the dispatcher calls ClaimNextScancodeRepo(ctx, cadence). The SQL uses FOR UPDATE SKIP LOCKED against aveloxis_data.repos filtered by:
   - collection_queue.last_collected IS NOT NULL (the repo has been collected at least once). Newly-added repos collect basic metrics first; scancode runs against them only after the first collection completes.
   - repo_archived = FALSE (or NULL).
   - scancode_last_run IS NULL OR < NOW() - cadence (the cadence gate).
   - scancode_locked_at IS NULL OR < NOW() - 12h (the stale-lock fallback: the silent-corpse safety net; the explicit recoverOrphans pass is the primary recovery path).
2. Lock acquired: the same SQL statement sets scancode_locked_at = NOW() on the candidate row and returns the (repo_id, owner, name, git_url) tuple. The row is now claimed atomically.
3. Job sent to runner channel: dispatcher writes the job; if no runner slot is free the dispatcher blocks until one is. This is the documented concurrency cap.
4. Runner picks up job:
   - Creates a fresh shallow clone: git clone --depth 1 <repo_git> <scancode_clone_dir>/repo_<id>_<unix_ts>. The shallow clone is enough for scancode (it walks current file state, not history).
   - Spawns scancode via exec.CommandContext(ctx, "scancode", "-clpi", "--json", outputPath, ...).
   - Calls cmd.Start() (NOT cmd.Run()) so the OS PID is available immediately. Calls cmd.Wait() next to actually wait for completion.
   - Between Start and Wait, calls store.RecordScancodeLockState(repoID, pid, bootID, outputPath). This is the critical step that makes crash recovery work: if aveloxis is killed before cmd.Wait() returns, the next aveloxis startup has the (pid, boot_id, output_path) tuple in the DB and can recover (see §5).
5. Scan completes: runner parses the JSON output, calls ingestScancodeOutput() to write scancode_scans + scancode_file_results (with history rotation), and calls store.MarkScancodeComplete(repoID, version). That UPDATE atomically:
   - Sets scancode_last_run = NOW() and scancode_version = $version.
   - Clears all four lock columns (scancode_locked_at, scancode_locked_pid, scancode_locked_boot_id, scancode_output_path).
6. Cleanup: the runner’s deferred os.RemoveAll(tempDir) removes the clone directory.
On any failure path (clone error, scancode crash, JSON parse error, ingest error), the runner calls store.ClearScancodeLock(repoID) to release the lock without setting scancode_last_run. The row becomes eligible for re-claim on the next dispatcher tick.
3.3 Dispatcher pacing (v0.21.3): minimum-gap, not throughput cap
The pacing semantic for scancode_start_interval_s changed between v0.21.0 and v0.21.3. The change is invisible to operators in steady state but materially affects first-pass throughput.
Pre-v0.21.3 design (broken at scale): the dispatcher was driven by time.NewTicker(startInterval) — one claim attempt per tick, regardless of how many workers were idle. At 90 s/tick × 7 workers × ~3-min average scan time, the fleet-wide claim rate capped at 40 claims/hour while runners had capacity for ~140. On a 40K-repo fleet this produced ~42-day first-pass estimates when actual capacity was ~12 days. 6 of 7 workers sat idle on average.
v0.21.3 design (correct): the dispatcher maintains a nextStartAllowed time.Time deadline that’s stamped after each successful start. It then loops as fast as the runtime allows, gating each claim on time.Now() >= nextStartAllowed. The unbuffered jobs channel provides back-pressure — when all N workers are busy, the dispatcher’s send blocks naturally and no over-claiming happens.
Operational effect:
Steady-state with idle workers: claims happen at intervals of exactly startInterval between successful starts. Same behavior as before.
Steady-state with busy workers: dispatcher pauses on the unbuffered send. When a runner frees up, the next claim happens after the startInterval window, same as before.
Burst on restart (the throughput-critical case): dispatcher claims one repo per startInterval seconds until all N worker slots are full. At 90 s × 7 workers = 630 seconds (~10 min) to saturate the pool. Same as before.
First-pass on a large fleet (the regression case): workers complete scans in single-digit-to-low-double-digit minutes; the dispatcher refills slots at startInterval cadence, so 7 workers stay nearly always busy. Throughput is now bounded by worker capacity, not dispatcher pacing.
For a 40K-repo fleet with workers=7 and ~3-min average scan time:
Worker capacity: 7 × (60 / 3) = ~140 repos/hour
40,000 ÷ 140 = ~286 hours ≈ ~12 days first-pass
(Pre-v0.21.3 same configuration: ~42 days, dispatcher-bound.)
If you want to push further, raise scancode_workers. The scancode_start_interval_s rarely needs tuning unless you want denser starts on a very high-bandwidth network — the 90-second default works fine for most fleets.
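The first-pass figures quoted in this section can be reproduced with a few lines of arithmetic. The function names below are hypothetical and exist only for this illustration; the inputs (90 s interval, 7 workers, ~3-minute scans, 40,000 repos) are the ones used in the text.

```go
package main

import "fmt"

// tickerClaimsPerHour is the claim rate when one claim is attempted per tick.
func tickerClaimsPerHour(startIntervalSec float64) float64 {
	return 3600 / startIntervalSec
}

// workerCapacityPerHour is the scan rate when every worker stays busy.
func workerCapacityPerHour(workers int, avgScanMin float64) float64 {
	return float64(workers) * 60 / avgScanMin
}

// firstPassDays converts a fleet size and an hourly rate into days.
func firstPassDays(repos int, perHour float64) float64 {
	return float64(repos) / perHour / 24
}

func main() {
	tickerBound := tickerClaimsPerHour(90)
	workerBound := workerCapacityPerHour(7, 3)
	fmt.Printf("dispatcher-bound: %.0f repos/hour, %.1f days\n", tickerBound, firstPassDays(40000, tickerBound))
	fmt.Printf("worker-bound:     %.0f repos/hour, %.1f days\n", workerBound, firstPassDays(40000, workerBound))
	// dispatcher-bound: 40 repos/hour, 41.7 days
	// worker-bound:     140 repos/hour, 11.9 days
}
```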
4. Cadence rationale (180 days default)
Per-file license + copyright headers in source files are near-immutable on the timescale that matters. A 6-month cadence catches:
New files added to the repo since the last scan.
Wholesale license changes (rare — but they happen, e.g. a project changes from GPL to MIT).
Scancode version improvements (newer scancode versions detect licenses older versions missed).
It does NOT catch dependency-license changes, but those don’t flow through scancode anyway — they’re handled by Phase 4 dependency scanning + Phase 6 SBOM generation, both per-cycle.
Pre-v0.21.0 the inline cadence was 30 days, which produced one full re-scan of the whole fleet every month. At fleet scale (100K+ repos), that’s a large continuous load for almost no fresh data. 180 days is a reasonable default; operators can dial via collection.scancode_cadence_days.
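To put a number on the load difference, a rough sketch of the steady-state rescan rate for the two cadences. The 100,000-repo fleet size is the illustrative figure from the paragraph above, and scansPerDay is a hypothetical helper that assumes scans are spread evenly across the cadence window.

```go
package main

import "fmt"

// scansPerDay is the steady-state number of rescans per day, assuming
// scans are spread evenly across the cadence window.
func scansPerDay(fleet int, cadenceDays float64) float64 {
	return float64(fleet) / cadenceDays
}

func main() {
	fmt.Printf("30-day cadence:  %.0f scans/day\n", scansPerDay(100000, 30))
	fmt.Printf("180-day cadence: %.0f scans/day\n", scansPerDay(100000, 180))
	// 30-day cadence:  3333 scans/day
	// 180-day cadence: 556 scans/day
}
```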
5. Crash recovery — the four-state table
On aveloxis serve startup, ScancodeWorker.Run() calls recoverOrphans(ctx) before the dispatcher starts claiming new jobs. The recovery pass examines every row with scancode_locked_at IS NOT NULL and applies one of four decisions:
| State | Detection | Action |
|---|---|---|
| Reboot survivor | stored | Scancode subprocess is definitively dead (the kernel that hosted it no longer exists). Clear all lock columns. |
| Live orphan | boot_id matches AND | Subprocess survived a previous aveloxis crash and is now an orphan of init. Spawn a monitor goroutine that polls every 30s; when the PID dies, attempt to ingest the output file if present. |
| Recoverable corpse | boot_id matches, PID is dead, output file exists and parses | Scan finished but aveloxis crashed before ingest. Ingest the orphaned output, then clear the lock. |
| Lost run | boot_id matches, PID dead, no usable output | Scan died mid-flight. Clear the lock; the row will re-run on the next cadence tick. |
The boot_id check is what makes the PID check reliable. Linux PIDs are reused — a stored PID of 12345 from before a reboot could legitimately match an unrelated process after the reboot. The boot_id (kernel-generated UUID, changes on every boot) lets the recovery pass decide unambiguously.
On non-Linux dev machines (e.g. macOS) the /proc path is absent and readBootID() returns an empty string. The recovery pass treats empty boot_id as “unknown” and falls through to the PID check; correctness is preserved (PID reuse is rare on a single boot).
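The four-state decision can be sketched as a pure function. This is an assumption-laden illustration, not the recovery code itself: classify and orphanState are hypothetical names, the reboot-survivor test is taken to be "stored boot_id differs from the current one", the live-orphan test is taken to be "PID is alive", and an empty boot_id on either side skips the boot check as described for non-Linux machines.

```go
package main

import "fmt"

// orphanState is a hypothetical enum for the four recovery decisions.
type orphanState int

const (
	rebootSurvivor orphanState = iota
	liveOrphan
	recoverableCorpse
	lostRun
)

func (s orphanState) String() string {
	return [...]string{"reboot survivor", "live orphan", "recoverable corpse", "lost run"}[s]
}

// classify picks a recovery decision for one locked row.
// pidAlive and outputUsable are assumed to be computed by the caller.
func classify(storedBootID, currentBootID string, pidAlive, outputUsable bool) orphanState {
	// An empty boot ID means "unknown": skip the boot check, rely on the PID.
	bootKnown := storedBootID != "" && currentBootID != ""
	switch {
	case bootKnown && storedBootID != currentBootID:
		return rebootSurvivor // the PID belongs to a previous boot; ignore it
	case pidAlive:
		return liveOrphan
	case outputUsable:
		return recoverableCorpse
	default:
		return lostRun
	}
}

func main() {
	fmt.Println(classify("boot-a", "boot-b", true, true))   // reboot survivor
	fmt.Println(classify("boot-a", "boot-a", true, false))  // live orphan
	fmt.Println(classify("boot-a", "boot-a", false, true))  // recoverable corpse
	fmt.Println(classify("boot-a", "boot-a", false, false)) // lost run
	fmt.Println(classify("", "", true, false))              // live orphan
}
```

Checking the boot ID first is what keeps a reused PID from a previous boot from being mistaken for a live scan.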
6. Graceful shutdown
When the scheduler’s context is cancelled (aveloxis stop serve):
The dispatcher exits immediately on its <-ctx.Done() arm. No new claims happen.
The dispatcher closes the jobs channel.
Runners that were idle return immediately (their range jobs loop terminates).
Runners that were mid-scan keep going. The runner’s cmd.Wait() is blocking on the scancode subprocess, which is NOT killed by the scheduler’s ctx cancel. (Go’s exec.CommandContext kills the subprocess as soon as the context passed to it is done, so the subprocess survives only if it is started with a context that shutdown does not cancel.)
Run() waits up to collection.scancode_shutdown_grace_minutes (default 30 min) for all runners to finish.
If the grace expires with runners still active, Run() returns. The outstanding subprocesses become orphans, but they’re tracked in the DB via the (pid, boot_id, output_path) triple recorded in step 4 of §3.2, so the next aveloxis startup’s recoverOrphans pass will adopt them as live orphans (case 2 of §5).
The grace bound exists because Linux-kernel-sized scans can legitimately run for hours; without a bound, aveloxis stop would wait indefinitely on the slowest scancode. The trade-off is clear: lose the in-flight scan data on grace expiry vs. wait hours on stop. Operators who want a different balance can dial the grace.
7. Force-rerun cookbook
Cadence is enforced at claim time via the scancode_last_run column. To force a rescan, clear that column:
-- Single repo
UPDATE aveloxis_data.repos SET scancode_last_run = NULL
WHERE repo_owner = 'apache' AND repo_name = 'doris';
-- All repos that ran on a specific scancode version (e.g. after upgrade)
UPDATE aveloxis_data.repos SET scancode_last_run = NULL
WHERE scancode_version = '32.5.0';
-- Whole fleet
UPDATE aveloxis_data.repos SET scancode_last_run = NULL;
The claim query orders by scancode_last_run NULLS FIRST, so cleared repos move to the head of the queue and get claimed on subsequent dispatcher ticks. The order between cleared repos is repo_id ASC (stable).
8. Configuration reference
All five knobs live under the collection block in aveloxis.json:
| Key | Default | Purpose |
|---|---|---|
| scancode_workers | 2 | Max concurrent scancode subprocesses. Raise on machines with spare CPU cores. |
| scancode_start_interval_s | 90 | Minimum seconds between successful claim starts. As of v0.21.3 this is a minimum-gap pacing primitive, NOT a throughput cap: the dispatcher claims as fast as workers free up, with this interval enforced only between consecutive starts. Bounds clone-bandwidth bursts on restart. See §3.3. |
| scancode_cadence_days | 180 | Minimum days between successive scans on the same repo. Per-file licenses change rarely. |
| | | Parent directory for per-run shallow clones. Size for ~50 MB × workers peak. |
| scancode_shutdown_grace_minutes | 30 | Wait budget for in-flight scans on |
See docs/getting-started/configuration.md for the tuning rationale per knob.
9. UX: the “last run” signal
The repo detail page in the web GUI shows:
Last run: 2026-04-08 (scancode 32.5.0)
Or, for never-scanned repos:
Last run: not yet run — will run in the next scancode worker cycle
The data flows from aveloxis_data.repos.scancode_last_run (written by MarkScancodeComplete) through the /api/v1/repos/{id}/scancode-licenses endpoint’s last_run field to the rendered HTML.
10. Observability — what to grep in aveloxis.log
| Log line | When it fires | Meaning |
|---|---|---|
| | Once at startup | The pool is alive with N runners. If absent, scancode is disabled (binary not installed or |
| | Startup | Install with |
| | Startup | Recovery pass found stale locks. Followed by per-row decisions. |
| | Startup | Case 1 of §5. |
| | Startup | Case 2 of §5. Monitor goroutine will log on completion. |
| | Startup or monitor | Case 3 of §5: orphaned data recovered. |
| | Per scan | Scan started; PID is the subprocess we’re tracking. |
| | Per scan | Scan succeeded and was ingested. |
| | Per scan failure | Lock will be cleared. |
| | On stop | Outstanding scans become live-orphans on next startup. |
11. Code map
| File | Purpose |
|---|---|
| | The worker itself: |
| | JSON parsing + ingest ( |
| | Store methods: |
| | Pre-existing scancode data access (file results, freshness reads). |
| | Six column additions + partial index + backfill from |
| | |
| | |
| | |
| | Repo detail page renders the freshness signal above the source-code-licenses table. |
12. Regression guards
The v0.21.0 work added these tests as architectural pins. A future refactor that breaks any of them fails the build before it ships:
| Test | Pins |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | Lock-clear paths exist on both branches. |
| | Claim SQL uses |
| | Claim filters on |
| | Claim respects both the cadence config and the 12h stale-lock fallback. |
| | Claim WHERE clause matches the partial-index predicate. |
| | Recovery runs before any new claim. |
| | Scheduler spawns the worker. |
| | |
| | API surfaces the freshness signal. |