ScanCode Worker (v0.21.0+)
Per-file license + copyright + package detection is performed by ScanCode Toolkit (aboutcode-org/scancode-toolkit) in a dedicated worker pool inside aveloxis serve, decoupled from the per-repo collection pipeline.
This document covers the design and operation of that worker. Operator-facing tuning lives in docs/getting-started/configuration.md; this file explains why the system looks the way it does.
1. What scancode does (and doesn’t)
Scancode walks a working-tree checkout of a repository and emits per-file findings:
License expression(s) detected — both raw and normalized SPDX.
Copyright statements and the holder names extracted from them.
Package manifests detected and their declared dependencies.
Detection metadata (line numbers, match percentage, etc.).
Results land in aveloxis_scan.scancode_scans (one row per scan run) and aveloxis_scan.scancode_file_results (one row per file with at least one detection). Previous rows are rotated to *_history tables before each new scan.
Scancode does NOT generate SBOMs. SPDX and CycloneDX SBOMs are produced by Phase 6 of the per-repo collection pipeline (internal/collector/sbom.go) and refreshed on every collection cycle. SBOM generation uses scancode’s license data as enrichment when available, but the SBOM artifact itself is regenerated independently.
2. Why the worker was decoupled (the 2026-05-14 incident)
Pre-v0.21.0, scancode ran inline in AnalysisCollector.AnalyzeRepo as Phase 4 of the per-repo collection pipeline, gated by a package-level 2-slot semaphore in internal/collector/scancode.go.
At fleet scale this shape doesn’t work. On 2026-05-14, operator-side investigation of a 180-worker production fleet found 177 of 180 worker goroutines parked at scancode.go:114 (the semaphore acquire) for 7+ hours. The two slot-holders were Linux-kernel-scale repos whose scans legitimately took hours; the other 177 worker goroutines were holding their collection_queue row locks the whole time, blocking other operations on those repos.
This is the same architectural anti-pattern v0.19.7 fixed for PopulateAffiliations: doing work-that-doesn’t-fit-the-per-job-budget inside per-job code paths inevitably stalls the worker pool at scale. The fix follows the same pattern — move the work to a dedicated periodic ticker / worker pool.
The v0.21.0 worker:
Runs in goroutines independent of the main collection pool. A slow scancode run can’t park collection workers.
Has its own concurrency cap (collection.scancode_workers, default 2) so operators can dial it independently of the collection worker count.
Re-clones each repo (git clone --depth 1) instead of sharing the facade’s bare clone, eliminating cross-worker filesystem-lock hazards.
Defaults to a 6-month cadence (was 30 days inline) because per-file license headers in source code rarely change on that timescale.
3. Architecture
3.1 Components
                        aveloxis_data.repos
                        ├── scancode_last_run
                        ├── scancode_version
                        ├── scancode_locked_at
                        ├── scancode_locked_pid
                        ├── scancode_locked_boot_id
                        └── scancode_output_path
                                 ▲
                                 │
        ┌────────────────────────┼───────────────────────┐
        │                        │                       │
        ▼                        ▼                       ▼
   claim query           record lock state        mark complete /
(FOR UPDATE SKIP LOCKED) (after cmd.Start)          clear lock
        │                        │                       ▲
        │                        │                       │
        ▼                        ▼                       │
  ┌───────────┐             ┌─────────┐             ┌────────┐
  │dispatcher │──jobs chan─▶│ runner  │──exec──────▶│scancode│
  │ (90s tic) │             │  pool   │             │  proc  │
  └───────────┘             │  (N=2)  │             └────────┘
                            └─────────┘
        ▲
        │
        │ on Run() startup
 ┌─────────────┐
 │   recover   │  reads ListLockedScancodeRows,
 │   Orphans   │  applies 4-state decision (§5)
 └─────────────┘
3.2 The lifecycle of a single scan
1. Dispatcher claim (paced by scancode_start_interval_s, default 90s, as a MINIMUM GAP between successful starts; see §3.3 for the v0.21.3 design): the dispatcher calls ClaimNextScancodeRepo(ctx, cadence). The SQL uses FOR UPDATE SKIP LOCKED against aveloxis_data.repos filtered by:
   - collection_queue.last_collected IS NOT NULL: the repo has been collected at least once. Newly-added repos collect basic metrics first; scancode runs against them only after the first collection completes.
   - repo_archived = FALSE (or NULL).
   - scancode_last_run IS NULL OR < NOW() - cadence: the cadence gate.
   - scancode_locked_at IS NULL OR < NOW() - 12h: the stale-lock fallback (the silent-corpse safety net; the explicit recoverOrphans pass is the primary recovery path).
2. Lock acquired: the same SQL statement sets scancode_locked_at = NOW() on the candidate row and returns the (repo_id, owner, name, git_url) tuple. The row is now claimed atomically.
3. Job sent to runner channel: the dispatcher writes the job; if no runner slot is free, the dispatcher blocks until one is. This is the documented concurrency cap.
4. Runner picks up job:
   - Creates a fresh shallow clone: git clone --depth 1 <repo_git> <scancode_clone_dir>/repo_<id>_<unix_ts>. The shallow clone is enough for scancode (it walks current file state, not history).
   - Spawns scancode via exec.CommandContext(ctx, "scancode", "-clpi", "--json", outputPath, ...).
   - Calls cmd.Start() (NOT cmd.Run()) so the OS PID is available immediately. Calls cmd.Wait() next to actually wait for completion.
   - Between Start and Wait, calls store.RecordScancodeLockState(repoID, pid, bootID, outputPath). This is the critical step that makes crash recovery work: if aveloxis is killed before cmd.Wait() returns, the next aveloxis startup has the (pid, boot_id, output_path) tuple in the DB and can recover (see §5).
5. Scan completes: the runner parses the JSON output, calls ingestScancodeOutput() to write scancode_scans + scancode_file_results (with history rotation), and calls store.MarkScancodeComplete(repoID, version). That UPDATE atomically:
   - Sets scancode_last_run = NOW() and scancode_version = $version.
   - Clears all four lock columns (scancode_locked_at, scancode_locked_pid, scancode_locked_boot_id, scancode_output_path).
6. Cleanup: the runner’s deferred os.RemoveAll(tempDir) removes the clone directory.
On any failure path (clone error, scancode crash, JSON parse error, ingest error), the runner calls store.ClearScancodeLock(repoID) to release the lock without setting scancode_last_run. The row becomes eligible for re-claim on the next dispatcher tick.
3.3 Dispatcher pacing (v0.21.3): minimum-gap, not throughput cap
The pacing semantic for scancode_start_interval_s changed between v0.21.0 and v0.21.3. The change is invisible to operators in steady state but materially affects first-pass throughput.
Pre-v0.21.3 design (broken at scale): the dispatcher was driven by time.NewTicker(startInterval) — one claim attempt per tick, regardless of how many workers were idle. At 90 s/tick × 7 workers × ~3-min average scan time, the fleet-wide claim rate capped at 40 claims/hour while runners had capacity for ~140. On a 40K-repo fleet this produced ~42-day first-pass estimates when actual capacity was ~12 days. 6 of 7 workers sat idle on average.
v0.21.3 design (correct): the dispatcher maintains a nextStartAllowed time.Time deadline that’s stamped after each successful start. It then loops as fast as the runtime allows, gating each claim on time.Now() >= nextStartAllowed. The unbuffered jobs channel provides back-pressure — when all N workers are busy, the dispatcher’s send blocks naturally and no over-claiming happens.
Operational effect:
Steady-state with idle workers: claims happen at intervals of exactly startInterval between successful starts. Same behavior as before.
Steady-state with busy workers: dispatcher pauses on the unbuffered send. When a runner frees up, the next claim happens after the startInterval window. Same as before.
Burst on restart (the throughput-critical case): dispatcher claims one repo per startInterval seconds until all N worker slots are full. At 90 s × 7 workers, it takes 630 seconds (~10 min) to saturate the pool. Same as before.
First-pass on a large fleet (the regression case): workers complete scans in single-digit-to-low-double-digit minutes; the dispatcher refills slots at startInterval cadence, so 7 workers stay nearly always busy. Throughput is now bounded by worker capacity, not dispatcher pacing.
For a 40K-repo fleet with workers=7 and ~3-min average scan time:
Worker capacity: 7 × (60 / 3) = ~140 repos/hour
40,000 ÷ 140 = ~286 hours ≈ ~12 days first-pass
(Pre-v0.21.3 same configuration: ~42 days, dispatcher-bound.)
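The estimates above can be reproduced with a few lines of arithmetic. The helper names below are illustrative only; the inputs (40,000 repos, 7 workers, ~3-minute scans, 90-second ticks) are the ones quoted in this section.

```go
package main

import "fmt"

// workerCapacity is the hourly throughput when all workers stay busy.
func workerCapacity(workers int, avgScanMinutes float64) float64 {
	return float64(workers) * 60 / avgScanMinutes
}

// tickerRate is the hourly claim rate of a one-claim-per-tick dispatcher.
func tickerRate(tickSeconds float64) float64 {
	return 3600 / tickSeconds
}

// firstPassDays estimates the days needed to scan every repo once at a
// sustained rate given in repos per hour.
func firstPassDays(repos int, reposPerHour float64) float64 {
	return float64(repos) / reposPerHour / 24
}

func main() {
	capacity := workerCapacity(7, 3)
	ticker := tickerRate(90)
	// prints "worker-bound: 140 repos/hour, 11.9 days"
	fmt.Printf("worker-bound: %.0f repos/hour, %.1f days\n", capacity, firstPassDays(40000, capacity))
	// prints "ticker-bound: 40 repos/hour, 41.7 days"
	fmt.Printf("ticker-bound: %.0f repos/hour, %.1f days\n", ticker, firstPassDays(40000, ticker))
}
```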
If you want to push further, raise scancode_workers. The scancode_start_interval_s rarely needs tuning unless you want denser starts on a very high-bandwidth network — the 90-second default works fine for most fleets.
4. Cadence rationale (180 days default)
Per-file license + copyright headers in source files are near-immutable on the timescale that matters. A 6-month cadence catches:
New files added to the repo since the last scan.
Wholesale license changes (rare — but they happen, e.g. a project changes from GPL to MIT).
Scancode version improvements (newer scancode versions detect licenses older versions missed).
It does NOT catch dependency-license changes, but those don’t flow through scancode anyway — they’re handled by Phase 4 dependency scanning + Phase 6 SBOM generation, both per-cycle.
Pre-v0.21.0 the inline cadence was 30 days, which produced one full re-scan of the whole fleet every month. At fleet scale (100K+ repos), that’s a large continuous load for almost no fresh data. 180 days is a reasonable default; operators can dial via collection.scancode_cadence_days.
5. Crash recovery — the four-state table
On aveloxis serve startup, ScancodeWorker.Run() calls recoverOrphans(ctx) before the dispatcher starts claiming new jobs. The recovery pass examines every row with scancode_locked_at IS NOT NULL and applies one of four decisions:
| State | Detection | Action |
|---|---|---|
| Reboot survivor | stored boot_id differs from the current boot_id | Scancode subprocess is definitively dead (the kernel that hosted it no longer exists). Clear all lock columns. |
| Live orphan | boot_id matches AND the PID is still alive | Subprocess survived a previous aveloxis crash and is now an orphan of init. Spawn a monitor goroutine that polls every 30s; when the PID dies, attempt to ingest the output file if present. |
| Recoverable corpse | boot_id matches, PID is dead, output file exists and parses | Scan finished but aveloxis crashed before ingest. Ingest the orphaned output, then clear the lock. |
| Lost run | boot_id matches, PID dead, no usable output | Scan died mid-flight. Clear the lock; the row will re-run on the next cadence tick. |
The boot_id check is what makes the PID check reliable. Linux PIDs are reused — a stored PID of 12345 from before a reboot could legitimately match an unrelated process after the reboot. The boot_id (kernel-generated UUID, changes on every boot) lets the recovery pass decide unambiguously.
On non-Linux dev machines (e.g. macOS) the /proc path is absent and readBootID() returns an empty string. The recovery pass treats empty boot_id as “unknown” and falls through to the PID check; correctness is preserved (PID reuse is rare on a single boot).
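The four-state decision, including the empty-boot_id fallthrough, can be sketched as a pure function. This is an illustration under assumptions: classifyOrphan and the state names are hypothetical, the caller is assumed to have already computed whether the PID is alive and whether the output file parses, and treating an empty value on either side as "unknown" is one reading of the fallthrough described above.

```go
package main

import "fmt"

type orphanState string

const (
	rebootSurvivor    orphanState = "reboot survivor"
	liveOrphan        orphanState = "live orphan"
	recoverableCorpse orphanState = "recoverable corpse"
	lostRun           orphanState = "lost run"
)

// classifyOrphan applies the four-state table to one locked row.
// An empty boot ID on either side means "unknown", so the boot comparison is
// skipped and the decision falls through to the PID check.
func classifyOrphan(storedBootID, currentBootID string, pidAlive, outputUsable bool) orphanState {
	if storedBootID != "" && currentBootID != "" && storedBootID != currentBootID {
		// Different boot: the stored PID cannot refer to our subprocess.
		return rebootSurvivor
	}
	if pidAlive {
		return liveOrphan
	}
	if outputUsable {
		return recoverableCorpse
	}
	return lostRun
}

func main() {
	fmt.Println(classifyOrphan("boot-a", "boot-b", true, true))   // PID reuse after reboot is ignored
	fmt.Println(classifyOrphan("boot-a", "boot-a", true, false))  // still running
	fmt.Println(classifyOrphan("boot-a", "boot-a", false, true))  // finished, not ingested
	fmt.Println(classifyOrphan("boot-a", "boot-a", false, false)) // died mid-flight
	fmt.Println(classifyOrphan("", "", true, false))              // no /proc: PID check only
}
```

Checking the boot ID first is what keeps a reused PID from being mistaken for a live scan.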
6. Graceful shutdown
When the scheduler’s context is cancelled (aveloxis stop serve):
The dispatcher exits immediately on its <-ctx.Done() arm. No new claims happen.
The dispatcher closes the jobs channel.
Runners that were idle return immediately (their range jobs loop terminates).
Runners that were mid-scan keep going. The runner’s cmd.Wait() is blocking on the scancode subprocess, which is NOT killed by the scheduler’s ctx cancel. Note that Go’s exec.CommandContext does kill the subprocess once the context passed to it is done, so this holds only if the subprocess is started with a context that the shutdown signal does not cancel.
Run() waits up to collection.scancode_shutdown_grace_minutes (default 30 min) for all runners to finish.
If the grace expires with runners still active, Run() returns. The outstanding subprocesses become orphans, but they’re tracked in the DB via the (pid, boot_id, output_path) triple recorded in step 4 of §3.2, so the next aveloxis startup’s recoverOrphans pass will adopt them as live orphans (case 2 of §5).
The grace bound exists because Linux-kernel-sized scans can legitimately run for hours; without a bound, aveloxis stop would wait indefinitely on the slowest scancode. The trade-off is clear: lose the in-flight scan data on grace expiry vs. wait hours on stop. Operators who want a different balance can dial the grace.
7. Force-rerun cookbook
Cadence is enforced at claim time via the scancode_last_run column. To force a rescan, clear that column:
-- Single repo
UPDATE aveloxis_data.repos SET scancode_last_run = NULL
WHERE repo_owner = 'apache' AND repo_name = 'doris';
-- All repos that ran on a specific scancode version (e.g. after upgrade)
UPDATE aveloxis_data.repos SET scancode_last_run = NULL
WHERE scancode_version = '32.5.0';
-- Whole fleet
UPDATE aveloxis_data.repos SET scancode_last_run = NULL;
The claim query orders by scancode_last_run NULLS FIRST, so cleared repos move to the head of the queue and get claimed on subsequent dispatcher ticks. The order between cleared repos is repo_id ASC (stable).
8. Configuration reference
All five knobs live under the collection block in aveloxis.json:
| Key | Default | Purpose |
|---|---|---|
| scancode_workers | 2 | Max concurrent scancode subprocesses. Raise on machines with spare CPU cores. |
| scancode_start_interval_s | 90 | Minimum seconds between successful claim starts. As of v0.21.3 this is a minimum-gap pacing primitive, NOT a throughput cap: the dispatcher claims as fast as workers free up, with this interval enforced only between consecutive starts. Bounds clone-bandwidth bursts on restart. See §3.3. |
| scancode_cadence_days | 180 | Minimum days between successive scans on the same repo. Per-file licenses change rarely. |
| | | Parent directory for per-run shallow clones. Size for ~50 MB × workers peak. |
| scancode_shutdown_grace_minutes | 30 | Wait budget for in-flight scans on shutdown. |
See docs/getting-started/configuration.md for the tuning rationale per knob.
9. UX: the “last run” signal
The repo detail page in the web GUI shows:
Last run: 2026-04-08 (scancode 32.5.0)
Or, for never-scanned repos:
Last run: not yet run — will run in the next scancode worker cycle
The data flows from aveloxis_data.repos.scancode_last_run (written by MarkScancodeComplete) through the /api/v1/repos/{id}/scancode-licenses endpoint’s last_run field to the rendered HTML.
10. Observability — what to grep in aveloxis.log
| Log line | When it fires | Meaning |
|---|---|---|
| | Once at startup | The pool is alive with N runners. If absent, scancode is disabled (binary not installed or |
| | Once at startup | The §13 health check passed; the toolchain works. |
| | Once at startup (ERROR) | The §13 health check detected a systemic failure (corrupt libmagic, a repeated error, or no JSON). Read the |
| | Startup | Install with |
| | Startup | Recovery pass found stale locks. Followed by per-row decisions. |
| | Startup | Case 1 of §5. |
| | Startup | Case 2 of §5. Monitor goroutine will log on completion. |
| | Startup or monitor | Case 3 of §5: orphaned data recovered. |
| | Per scan | Scan started; PID is the subprocess we’re tracking. |
| | Per scan | Scan succeeded and was ingested. |
| | Per scan failure | Lock will be cleared. |
| | On stop | Outstanding scans become live-orphans on next startup. |
11. Code map
| File | Purpose |
|---|---|
| | The worker itself: |
| | JSON parsing + ingest ( |
| | Store methods: |
| | Pre-existing scancode data access (file results, freshness reads). |
| | Six column additions + partial index + backfill from |
| | |
| | |
| | |
| | Repo detail page renders the freshness signal above the source-code-licenses table. |
12. Regression guards
The v0.21.0 work added these tests as architectural pins. A future refactor that breaks any of them fails the build before it ships:
| Test | Pins |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | Lock-clear paths exist on both branches. |
| | Claim SQL uses |
| | Claim filters on |
| | Claim respects both the cadence config and the 12h stale-lock fallback. |
| | Claim WHERE clause matches the partial-index predicate. |
| | Recovery runs before any new claim. |
| | Scheduler spawns the worker. |
| | |
| | API surfaces the freshness signal. |
13. Startup health preflight + aveloxis_ops.aveloxis_status
A system-level scancode failure — one where every scan is doomed regardless of the repo — used to be invisible: the fleet just degraded. The 2026-06-09 aveloxis_large incident was the motivating case. On that Ubuntu 24.04 host the system libmagic database (/usr/share/misc/magic.mgc) was corrupt, so scancode (via typecode → python-magic → libmagic) emitted Warning: offset ... invalid at enormous volume — 14+ GB of stderr per large repo — bogging scans down until the wall-clock timeout killed them. With the adaptive timeout stretching doomed scans to 16–24 h, a handful of repos wedged all worker slots and the scanned-repo count crawled.
The preflight
On startup (ScancodeWorker.Run, before the dispatcher claims any work), the worker runs one scancode invocation against a tiny synthetic input and classifies the result:
- Bounded and safe. 90-second wall-clock timeout, process-group kill, and a capped 1 MB stderr capture: the health check itself can never hang the worker or buffer gigabytes (the very failure it detects).
- classifyScancodeHealth maps the outcome to a status:
  - broken if any of the following holds:
    - stderr carries the libmagic corruption fingerprint in volume: either the compiled-DB name magic.mgc, or the OS-independent magic…Warning…offset…invalid shape that libmagic’s C parser emits on Linux and macOS, repeated ≥ 50× (the wedging bug emits one warning per bad magic-DB entry at load time, saturating stderr; a repaired libmagic emitting a handful of benign warnings while scans complete is not flagged);
    - any single line repeats ≥ 50× (the generic “the toolchain is spamming” signal);
    - no valid JSON was produced.

    Volume, not mere presence, is the signal; see the 2026-06-10 false-positive note below.
  - not_installed if the scancode binary isn’t on PATH.
  - ok otherwise.
- On anything other than ok it logs ERROR "scancode preflight: SYSTEM-LEVEL FAILURE — scancode will not work until fixed" with a detail string that names the remediation.
It is awareness only — the preflight does not disable scancode (a deliberate scope decision; auto-pause is a possible follow-up). It records, logs, and lets the worker proceed.
Volume, not presence (2026-06-10). An early version of the libmagic check flagged broken on the mere presence of an offset invalid warning. That false-positives a working install: a repaired libmagic (e.g. after aveloxis upgrade-tools injects typecode-libmagic) can emit a handful of benign warnings while scans complete normally and produce valid data. The wedging bug is different in kind: the corrupt DB emits one warning per bad entry at load time, repeating the fingerprint thousands of times (it saturates the preflight’s 1 MB stderr cap). The check now requires the fingerprint to repeat past the systemic-spam threshold (≥ 50), so a few incidental warnings no longer read as broken.
The status table
The outcome is upserted into aveloxis_ops.aveloxis_status (one row per subsystem, keyed by status_name; see schema.md):
SELECT status_name, status, status_detail, tool_version, data_collection_date
FROM aveloxis_ops.aveloxis_status WHERE status_name = 'scancode';
A broken row’s status_detail for the libmagic case reads, in part: “system libmagic magic database appears corrupt … run aveloxis upgrade-tools to inject typecode-libmagic (works on any OS), …” followed by an OS-aware reinstall hint — brew reinstall libmagic on macOS, apt-get install --reinstall libmagic-mgc libmagic1 file on Linux (chosen via runtime.GOOS; libmagic-mgc is the package that actually ships /usr/share/misc/magic.mgc on Debian/Ubuntu). The table is generic by design — future subsystems record their own health under their own status_name, and the intent is to surface it to the operator (UI/API) over time.
Per-repo failure capture is bounded (v0.25.28)
The startup preflight catches a systemic libmagic failure once. But even with a broken host libmagic, individual large repos still get claimed and fail. Pre-v0.25.28, runOne captured the full subprocess stderr in an unbounded bytes.Buffer and wrote it to a per-repo repo_<id>_stderr.log on failure. On 2026-06-11 a corrupt host magic.mgc made large repos (aws/aws-sdk-cpp, Azure/azure-rest-api-specs, aws/lumberyard) emit 15+ GB of warning spam — buffered entirely in RAM (a multi-GB heap spike per failing repo) before being written as a 15 GB file, filling the scancode clone volume.
v0.25.28 replaces that buffer with a bounded headTailBuffer (internal/collector/tail_buffer.go): the first 1 MB (failure onset) plus the last 256 KB (exit context), with an elision marker reporting the true total byte count. RAM and disk are now fixed regardless of how much the subprocess spews. The failure log line also carries a likely_cause field when the captured stderr is libmagic-dominated, so a flood of large-repo failures reads as “the host magic DB is corrupt” rather than “these specific repos are broken.”
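To illustrate the head-plus-tail idea, here is a minimal bounded writer. It is a sketch of the technique, not the contents of tail_buffer.go: the type name headTail, its fields, and the wording of the elision marker are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// headTail keeps the first headMax bytes and the last tailMax bytes written
// to it and counts every byte seen. It satisfies io.Writer, so it can be
// assigned to exec.Cmd.Stderr.
type headTail struct {
	headMax, tailMax int
	head, tail       []byte
	total            int64
}

func (b *headTail) Write(p []byte) (int, error) {
	n := len(p)
	b.total += int64(n)
	// Fill the head first.
	if room := b.headMax - len(b.head); room > 0 {
		take := room
		if take > len(p) {
			take = len(p)
		}
		b.head = append(b.head, p[:take]...)
		p = p[take:]
	}
	if len(p) == 0 || b.tailMax <= 0 {
		return n, nil
	}
	if len(p) >= b.tailMax {
		// This chunk alone covers the whole tail window.
		b.tail = append(b.tail[:0], p[len(p)-b.tailMax:]...)
		return n, nil
	}
	b.tail = append(b.tail, p...)
	if over := len(b.tail) - b.tailMax; over > 0 {
		kept := copy(b.tail, b.tail[over:]) // slide the window forward
		b.tail = b.tail[:kept]
	}
	return n, nil
}

// String joins head and tail, marking how much was dropped in between.
func (b *headTail) String() string {
	elided := b.total - int64(len(b.head)) - int64(len(b.tail))
	if elided <= 0 {
		return string(b.head) + string(b.tail)
	}
	return fmt.Sprintf("%s\n... [%d bytes elided, %d bytes total] ...\n%s",
		b.head, elided, b.total, b.tail)
}

func main() {
	buf := &headTail{headMax: 1 << 20, tailMax: 256 << 10}
	chunk := []byte(strings.Repeat("warning spam\n", 1000))
	for i := 0; i < 500; i++ {
		buf.Write(chunk)
	}
	fmt.Println("bytes written:", buf.total)
	fmt.Println("bytes retained:", len(buf.head)+len(buf.tail))
}
```

Memory use is capped at roughly headMax plus tailMax (plus one incoming chunk during the append), independent of how much the subprocess writes.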
Code: internal/collector/scancode_preflight.go (preflight + classifyScancodeHealth), internal/collector/scancode_worker.go + internal/collector/tail_buffer.go (headTailBuffer), internal/db/aveloxis_status_store.go (SetAveloxisStatus / GetAveloxisStatus), internal/db/schema.sql (table).