ScanCode Worker (v0.21.0+)

Per-file license + copyright + package detection is performed by ScanCode Toolkit (aboutcode-org/scancode-toolkit) in a dedicated worker pool inside aveloxis serve, decoupled from the per-repo collection pipeline.

This document covers the design and operation of that worker. Operator-facing tuning lives in docs/getting-started/configuration.md; this file explains why the system looks the way it does.

1. What scancode does (and doesn’t)

Scancode walks a working-tree checkout of a repository and emits per-file findings:

  • License expression(s) detected — both raw and normalized SPDX.

  • Copyright statements and the holder names extracted from them.

  • Package manifests detected and their declared dependencies.

  • Detection metadata (line numbers, match percentage, etc.).

Results land in aveloxis_scan.scancode_scans (one row per scan run) and aveloxis_scan.scancode_file_results (one row per file with at least one detection). Previous rows are rotated to *_history tables before each new scan.

Scancode does NOT generate SBOMs. SPDX and CycloneDX SBOMs are produced by Phase 6 of the per-repo collection pipeline (internal/collector/sbom.go) and refreshed on every collection cycle. SBOM generation uses scancode’s license data as enrichment when available, but the SBOM artifact itself is regenerated independently.

2. Why the worker was decoupled (the 2026-05-14 incident)

Pre-v0.21.0 scancode ran inline in AnalysisCollector.AnalyzeRepo as Phase 4 of the per-repo collection pipeline, gated by a package-level 2-slot semaphore in internal/collector/scancode.go.

At fleet scale this shape doesn’t work. On 2026-05-14, operator-side investigation of a 180-worker production fleet found 177 of 180 worker goroutines parked at scancode.go:114 (the semaphore acquire) for 7+ hours. The two slot-holders were Linux-kernel-scale repos whose scans legitimately took hours; the other 177 worker goroutines were holding their collection_queue row locks the whole time, blocking other operations on those repos.

This is the same architectural anti-pattern v0.19.7 fixed for PopulateAffiliations: work that does not fit the per-job time budget, when run inside per-job code paths, stalls the worker pool at scale. The fix follows the same pattern: move the work to a dedicated periodic ticker / worker pool.

The v0.21.0 worker:

  • Runs in goroutines independent of the main collection pool. A slow scancode run can’t park collection workers.

  • Has its own concurrency cap (collection.scancode_workers, default 2) so operators can dial it independently of the collection worker count.

  • Re-clones each repo (git clone --depth 1) instead of sharing the facade’s bare clone, eliminating cross-worker filesystem-lock hazards.

  • Defaults to a 6-month cadence (was 30 days inline) because per-file license headers in source code change rarely on that timescale.

3. Architecture

3.1 Components

                              aveloxis_data.repos
                              ├── scancode_last_run
                              ├── scancode_version
                              ├── scancode_locked_at
                              ├── scancode_locked_pid
                              ├── scancode_locked_boot_id
                              └── scancode_output_path
                                       ▲
                                       │
              ┌────────────────────────┼───────────────────────┐
              │                        │                       │
              ▼                        ▼                       ▼
       claim query              record lock state      mark complete /
   (FOR UPDATE SKIP LOCKED)     (after cmd.Start)        clear lock
              │                        │                       ▲
              │                        │                       │
              ▼                        ▼                       │
        ┌───────────┐             ┌─────────┐             ┌────────┐
        │dispatcher │──jobs chan─▶│ runner  │──exec──────▶│scancode│
        │ (90s tic) │             │  pool   │             │  proc  │
        └───────────┘             │ (N=2)   │             └────────┘
                                  └─────────┘
              ▲
              │
              │ on Run() startup
        ┌─────────────┐
        │ recover     │  reads ListLockedScancodeRows,
        │ Orphans     │  applies 4-state decision (§5)
        └─────────────┘

3.2 The lifecycle of a single scan

  1. Dispatcher claim: the dispatcher calls ClaimNextScancodeRepo(ctx, cadence). Claims are paced by scancode_start_interval_s (default 90s), enforced as a MINIMUM GAP between successful starts (see §3.3 for the v0.21.3 design). The SQL uses FOR UPDATE SKIP LOCKED against aveloxis_data.repos filtered by:

    • collection_queue.last_collected IS NOT NULL — the repo has been collected at least once. Newly-added repos collect basic metrics first; scancode runs against them only after the first collection completes.

    • repo_archived = FALSE (or NULL).

    • scancode_last_run IS NULL OR < NOW() - cadence — cadence gate.

    • scancode_locked_at IS NULL OR < NOW() - 12h — stale-lock fallback (the silent-corpse safety net; the explicit recoverOrphans pass is the primary recovery path).

  2. Lock acquired: the same SQL statement sets scancode_locked_at = NOW() on the candidate row and returns the (repo_id, owner, name, git_url) tuple. The row is now claimed atomically.

  3. Job sent to runner channel: dispatcher writes the job; if no runner slot is free the dispatcher blocks until one is — this is the documented concurrency cap.

  4. Runner picks up job:

    • Creates a fresh shallow clone: git clone --depth 1 <repo_git> <scancode_clone_dir>/repo_<id>_<unix_ts>. The shallow clone is enough for scancode (it walks current file state, not history).

    • Spawns scancode via exec.CommandContext(ctx, "scancode", "-clpi", "--json", outputPath, ...).

    • Calls cmd.Start() (NOT cmd.Run()) so the OS PID is available immediately. Calls cmd.Wait() next to actually wait for completion.

    • Between Start and Wait, calls store.RecordScancodeLockState(repoID, pid, bootID, outputPath). This is the critical step that makes crash recovery work — if aveloxis is killed before cmd.Wait() returns, the next aveloxis startup has the (pid, boot_id, output_path) tuple in the DB and can recover (see §5).

  5. Scan completes: runner parses the JSON output, calls ingestScancodeOutput() to write scancode_scans + scancode_file_results (with history rotation), and calls store.MarkScancodeComplete(repoID, version). That UPDATE atomically:

    • Sets scancode_last_run = NOW() and scancode_version = $version.

    • Clears all four lock columns (scancode_locked_at, scancode_locked_pid, scancode_locked_boot_id, scancode_output_path).

  6. Cleanup: the runner’s deferred os.RemoveAll(tempDir) removes the clone directory.
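The four claim filters from step 1 can be restated as an in-memory predicate. This is only a sketch for reasoning about eligibility: the real check lives in the claim SQL, and the repoRow type, its field names, and the eligible function are hypothetical stand-ins, with nil pointers modelling SQL NULL.

```go
package main

import (
	"fmt"
	"time"
)

// repoRow is a hypothetical in-memory stand-in for the columns the claim
// query reads. Nil pointers model SQL NULL.
type repoRow struct {
	LastCollected    *time.Time // collection_queue.last_collected
	Archived         *bool      // repo_archived
	ScancodeLastRun  *time.Time // scancode_last_run
	ScancodeLockedAt *time.Time // scancode_locked_at
}

// staleLockAge mirrors the 12h stale-lock fallback from the text.
const staleLockAge = 12 * time.Hour

// eligible reports whether a row passes all four claim filters at time now.
func eligible(r repoRow, now time.Time, cadence time.Duration) bool {
	if r.LastCollected == nil {
		return false // never collected yet
	}
	if r.Archived != nil && *r.Archived {
		return false // archived repos are skipped
	}
	if r.ScancodeLastRun != nil && !r.ScancodeLastRun.Before(now.Add(-cadence)) {
		return false // scanned within the cadence window
	}
	if r.ScancodeLockedAt != nil && !r.ScancodeLockedAt.Before(now.Add(-staleLockAge)) {
		return false // a fresh lock is held by another claim
	}
	return true
}

// demoEligibility evaluates a few illustrative rows against a 180-day cadence.
func demoEligibility() []bool {
	now := time.Date(2026, 5, 14, 12, 0, 0, 0, time.UTC)
	ago := func(d time.Duration) *time.Time {
		t := now.Add(-d)
		return &t
	}
	day := 24 * time.Hour
	yes := true
	rows := []repoRow{
		{},                        // never collected
		{LastCollected: ago(day)}, // collected, never scanned
		{LastCollected: ago(day), ScancodeLastRun: ago(10 * day)},
		{LastCollected: ago(day), ScancodeLastRun: ago(200 * day), ScancodeLockedAt: ago(time.Hour)},
		{LastCollected: ago(day), ScancodeLastRun: ago(200 * day), ScancodeLockedAt: ago(13 * time.Hour)},
		{LastCollected: ago(day), Archived: &yes},
	}
	out := make([]bool, len(rows))
	for i, r := range rows {
		out[i] = eligible(r, now, 180*day)
	}
	return out
}

func main() {
	fmt.Println(demoEligibility()) // [false true false false true false]
}
```

The fifth row shows the stale-lock fallback: a lock older than 12 hours no longer blocks a claim.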

On any failure path (clone error, scancode crash, JSON parse error, ingest error), the runner calls store.ClearScancodeLock(repoID) to release the lock without setting scancode_last_run. The row becomes eligible for re-claim on the next dispatcher tick.
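A minimal sketch of the Start/Wait split from step 4, together with the failure-path lock release, might look like the following. It is not the actual runOne: the lockStore interface, its shortened method names, memStore, and runScript are hypothetical, and a shell one-liner stands in for the scancode binary (so the demo assumes sh is on PATH).

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// lockStore is a hypothetical subset of the store; the method names are
// shortened stand-ins for RecordScancodeLockState and ClearScancodeLock.
type lockStore interface {
	RecordLockState(repoID int64, pid int, bootID, outputPath string) error
	ClearLock(repoID int64) error
}

// memStore is an in-memory lockStore used only for this demonstration.
type memStore struct {
	recordedPID int
	cleared     int
}

func (m *memStore) RecordLockState(_ int64, pid int, _, _ string) error {
	m.recordedPID = pid
	return nil
}

func (m *memStore) ClearLock(_ int64) error {
	m.cleared++
	return nil
}

// runTracked starts the command, persists its PID before waiting, and clears
// the lock on any failure. On success the caller would ingest the output and
// mark the scan complete.
func runTracked(ctx context.Context, s lockStore, repoID int64, bootID, outputPath, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	if err := cmd.Start(); err != nil {
		_ = s.ClearLock(repoID)
		return fmt.Errorf("start: %w", err)
	}
	// The PID is known as soon as Start returns, before the process exits.
	if err := s.RecordLockState(repoID, cmd.Process.Pid, bootID, outputPath); err != nil {
		// Assumption: a failed record is reported and the scan continues;
		// only crash recovery for this run is degraded.
		fmt.Println("record lock state:", err)
	}
	if err := cmd.Wait(); err != nil {
		_ = s.ClearLock(repoID)
		return fmt.Errorf("wait: %w", err)
	}
	return nil
}

// runScript runs a shell one-liner in place of scancode (requires sh).
func runScript(script string) (*memStore, error) {
	s := &memStore{}
	err := runTracked(context.Background(), s, 1, "boot-id", "/tmp/out.json", "sh", "-c", script)
	return s, err
}

func main() {
	s, err := runScript("exit 0")
	fmt.Println(s.recordedPID > 0, s.cleared, err) // true 0 <nil>
	s, err = runScript("exit 3")
	fmt.Println(s.recordedPID > 0, s.cleared, err) // true 1 wait: exit status 3
}
```

With cmd.Run() there would be no point between start and exit at which the PID could be persisted, which is why the split matters for crash recovery.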

3.3 Dispatcher pacing (v0.21.3): minimum-gap, not throughput cap

The pacing semantic for scancode_start_interval_s changed between v0.21.0 and v0.21.3. The change is invisible to operators in steady state but materially affects first-pass throughput.

Pre-v0.21.3 design (broken at scale): the dispatcher was driven by time.NewTicker(startInterval), making one claim attempt per tick regardless of how many workers were idle. With a 90 s tick, 7 workers, and a ~3-min average scan time, the fleet-wide claim rate capped at 40 claims/hour (3600 s ÷ 90 s) while runners had capacity for ~140. On a 40K-repo fleet this produced ~42-day first-pass estimates when actual capacity was ~12 days. At 40 claims/hour and ~3 minutes per scan, only about 2 of the 7 workers were busy on average; the rest sat idle.

v0.21.3 design (correct): the dispatcher maintains a nextStartAllowed time.Time deadline that’s stamped after each successful start. It then loops as fast as the runtime allows, gating each claim on time.Now() >= nextStartAllowed. The unbuffered jobs channel provides back-pressure — when all N workers are busy, the dispatcher’s send blocks naturally and no over-claiming happens.
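The minimum-gap loop can be sketched as follows. This is an illustration under assumptions, not the shipped dispatcher: the claim callback is a hypothetical stand-in for ClaimNextScancodeRepo, the retry-after-one-gap behavior when nothing is eligible is a guess, and demoDispatch exists only to exercise the loop.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// dispatch hands claimed repo IDs to runners over an unbuffered channel and
// enforces a minimum gap between successful starts. claim is a hypothetical
// stand-in for ClaimNextScancodeRepo; ok=false means nothing is eligible.
func dispatch(ctx context.Context, gap time.Duration, claim func() (id int, ok bool), jobs chan<- int) {
	defer close(jobs)
	var nextStartAllowed time.Time // zero value: the first start is immediate
	for {
		if wait := time.Until(nextStartAllowed); wait > 0 {
			select {
			case <-ctx.Done():
				return
			case <-time.After(wait):
			}
		}
		id, ok := claim()
		if !ok {
			// Assumption: with nothing eligible, retry after one gap.
			nextStartAllowed = time.Now().Add(gap)
			continue
		}
		select {
		case <-ctx.Done():
			return // a real dispatcher would also release the claimed row
		case jobs <- id: // blocks while every runner is busy
			nextStartAllowed = time.Now().Add(gap)
		}
	}
}

// demoDispatch feeds n jobs through dispatch and reports how many were
// received and whether every pair of consecutive claims was at least one
// gap apart.
func demoDispatch(n, gapMs int) (received int, gapsRespected bool) {
	gap := time.Duration(gapMs) * time.Millisecond
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var claimTimes []time.Time // written only by the dispatcher goroutine
	claim := func() (int, bool) {
		if len(claimTimes) >= n {
			return 0, false
		}
		claimTimes = append(claimTimes, time.Now())
		return len(claimTimes), true
	}

	jobs := make(chan int) // unbuffered: the send is the back-pressure point
	go dispatch(ctx, gap, claim, jobs)
	for range jobs {
		received++
		if received == n {
			cancel()
		}
	}
	// jobs is closed, so the dispatcher has exited and claimTimes is safe to read.
	gapsRespected = true
	for i := 1; i < len(claimTimes); i++ {
		if claimTimes[i].Sub(claimTimes[i-1]) < gap {
			gapsRespected = false
		}
	}
	return received, gapsRespected
}

func main() {
	fmt.Println(demoDispatch(3, 20)) // 3 true
}
```

Because the deadline is stamped only after a successful send, time spent blocked on a full pool does not accumulate as owed ticks.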

Operational effect:

  • Steady-state with idle workers: claims happen at intervals of exactly startInterval between successful starts. Same behavior as before.

  • Steady-state with busy workers: dispatcher pauses on the unbuffered send. When a runner frees up, the next claim happens after the startInterval window — same as before.

  • Burst on restart (the throughput-critical case): dispatcher claims one repo per startInterval seconds until all N worker slots are full. At 90 s per start and 7 workers, the pool takes about 630 seconds (~10 min) to saturate. Same as before.

  • First-pass on a large fleet (the regression case): workers complete scans in single-digit-to-low-double-digit minutes; the dispatcher refills slots at startInterval cadence, so 7 workers stay nearly always busy. Throughput is now bounded by worker capacity, not dispatcher pacing.

For a 40K-repo fleet with workers=7 and ~3-min average scan time:

  • Worker capacity: 7 × (60 / 3) = ~140 repos/hour

  • 40,000 ÷ 140 = ~286 hours ≈ ~12 days first-pass

(Pre-v0.21.3 same configuration: ~42 days, dispatcher-bound.)
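The figures above can be reproduced with a few lines of arithmetic. The function names below are illustrative only; the inputs are the ones quoted in this section (40,000 repos, 7 workers, ~3-minute scans, 90-second interval).

```go
package main

import "fmt"

// tickerClaimsPerHour is the claim rate of a one-claim-per-tick dispatcher.
func tickerClaimsPerHour(startIntervalS float64) float64 {
	return 3600 / startIntervalS
}

// workerCapacityPerHour is the number of scans the runner pool can finish
// per hour if every worker stays busy.
func workerCapacityPerHour(workers int, avgScanMin float64) float64 {
	return float64(workers) * 60 / avgScanMin
}

// firstPassDays is the time to scan every repo once at a given hourly rate.
func firstPassDays(repos int, perHour float64) float64 {
	return float64(repos) / perHour / 24
}

func main() {
	ticker := tickerClaimsPerHour(90)
	capacity := workerCapacityPerHour(7, 3)
	fmt.Printf("ticker-bound: %.0f claims/hour, %.1f days\n", ticker, firstPassDays(40000, ticker))
	fmt.Printf("worker-bound: %.0f repos/hour, %.1f days\n", capacity, firstPassDays(40000, capacity))
	// ticker-bound: 40 claims/hour, 41.7 days
	// worker-bound: 140 repos/hour, 11.9 days
}
```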

If you want to push further, raise scancode_workers. The scancode_start_interval_s rarely needs tuning unless you want denser starts on a very high-bandwidth network — the 90-second default works fine for most fleets.

4. Cadence rationale (180 days default)

Per-file license + copyright headers in source files are near-immutable on the timescale that matters. A 6-month cadence catches:

  • New files added to the repo since the last scan.

  • Wholesale license changes (rare — but they happen, e.g. a project changes from GPL to MIT).

  • Scancode version improvements (newer scancode versions detect licenses older versions missed).

It does NOT catch dependency-license changes, but those don’t flow through scancode anyway — they’re handled by Phase 4 dependency scanning + Phase 6 SBOM generation, both per-cycle.

Pre-v0.21.0 the inline cadence was 30 days, which produced one full re-scan of the whole fleet every month. At fleet scale (100K+ repos), that’s a large continuous load for almost no fresh data. 180 days is a reasonable default; operators can dial via collection.scancode_cadence_days.

5. Crash recovery — the four-state table

On aveloxis serve startup, ScancodeWorker.Run() calls recoverOrphans(ctx) before the dispatcher starts claiming new jobs. The recovery pass examines every row with scancode_locked_at IS NOT NULL and applies one of four decisions:

| State | Detection | Action |
|---|---|---|
| Reboot survivor | stored boot_id ≠ current /proc/sys/kernel/random/boot_id | Scancode subprocess is definitively dead (the kernel that hosted it no longer exists). Clear all lock columns. |
| Live orphan | boot_id matches AND kill(pid, 0) succeeds | Subprocess survived a previous aveloxis crash and is now an orphan of init. Spawn a monitor goroutine that polls every 30s; when the PID dies, attempt to ingest the output file if present. |
| Recoverable corpse | boot_id matches, PID is dead, output file exists and parses | Scan finished but aveloxis crashed before ingest. Ingest the orphaned output, then clear the lock. |
| Lost run | boot_id matches, PID dead, no usable output | Scan died mid-flight. Clear the lock; the row becomes eligible for re-claim by the dispatcher. |

The boot_id check is what makes the PID check reliable. Linux PIDs are reused — a stored PID of 12345 from before a reboot could legitimately match an unrelated process after the reboot. The boot_id (kernel-generated UUID, changes on every boot) lets the recovery pass decide unambiguously.

On non-Linux dev machines (e.g. macOS) the /proc path is absent and readBootID() returns an empty string. The recovery pass treats empty boot_id as “unknown” and falls through to the PID check; correctness is preserved (PID reuse is rare on a single boot).
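The four-state decision, including the empty-boot_id fall-through, can be sketched as a pure function plus two small probes. The type and function names other than readBootID are hypothetical, the probes use syscall.Kill with signal 0 (available on Unix-like systems, not Windows), and treating EPERM as "alive" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

type recoveryState int

const (
	rebootSurvivor recoveryState = iota
	liveOrphan
	recoverableCorpse
	lostRun
)

func (s recoveryState) String() string {
	return [...]string{"reboot survivor", "live orphan", "recoverable corpse", "lost run"}[s]
}

// readBootID returns the kernel boot UUID, or "" where /proc is unavailable.
func readBootID() string {
	b, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(b))
}

// pidAlive probes a PID with signal 0, which checks existence without
// delivering a signal. EPERM means the process exists but is not ours.
func pidAlive(pid int) bool {
	if pid <= 0 {
		return false // 0 and negative values address process groups
	}
	err := syscall.Kill(pid, syscall.Signal(0))
	return err == nil || err == syscall.EPERM
}

// classify applies the four-state table. An empty boot ID on either side is
// treated as unknown and falls through to the PID check.
func classify(storedBootID, currentBootID string, alive, outputUsable bool) recoveryState {
	if storedBootID != "" && currentBootID != "" && storedBootID != currentBootID {
		return rebootSurvivor
	}
	if alive {
		return liveOrphan
	}
	if outputUsable {
		return recoverableCorpse
	}
	return lostRun
}

func main() {
	cur := readBootID()
	fmt.Println(classify("old-boot", "new-boot", true, true))       // reboot survivor
	fmt.Println(classify(cur, cur, pidAlive(os.Getpid()), false))   // live orphan
	fmt.Println(classify("", "", false, true))                      // recoverable corpse
	fmt.Println(classify("same-boot", "same-boot", false, false))   // lost run
}
```

Note that the first call returns "reboot survivor" even though the PID probe says alive: after a reboot the stored PID can only belong to an unrelated process.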

6. Graceful shutdown

When the scheduler’s context is cancelled (aveloxis stop serve):

  1. The dispatcher exits immediately on its <-ctx.Done() arm. No new claims happen.

  2. The dispatcher closes the jobs channel.

  3. Runners that were idle return immediately (their range jobs loop terminates).

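  4. Runners that were mid-scan keep going. The runner’s cmd.Wait() is blocking on the scancode subprocess, which is NOT killed by the scheduler’s ctx cancel. Note that Go’s exec.CommandContext kills the subprocess (via os.Process.Kill) as soon as the context passed to it is done, so this behavior depends on the subprocess being started with a context that is not cancelled on shutdown; any forced termination then has to go through an explicit cmd.Process.Kill().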

  5. Run() waits up to collection.scancode_shutdown_grace_minutes (default 30 min) for all runners to finish.

  6. If the grace expires with runners still active, Run() returns. The outstanding subprocesses become orphans — but they’re tracked in the DB via the (pid, boot_id, output_path) triple recorded in step 4 of §3.2, so the next aveloxis startup’s recoverOrphans pass will adopt them as live orphans (case 2 of §5).

The grace bound exists because Linux-kernel-sized scans can legitimately run for hours; without a bound, aveloxis stop would wait indefinitely on the slowest scancode. The trade-off: on grace expiry, in-flight scans are left to orphan recovery (and their data is lost if the subprocess dies without usable output, for example across a reboot) rather than holding up stop for hours. Operators who want a different balance can dial the grace.
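The bounded wait in steps 5 and 6 can be sketched with a sync.WaitGroup and a timer. This is an assumed shape, not the actual Run() code; waitWithGrace and demoGrace are hypothetical names, and milliseconds stand in for minutes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitWithGrace reports true if all runners finished before the grace
// expired, and false if the caller should return with runners still active.
func waitWithGrace(wg *sync.WaitGroup, grace time.Duration) bool {
	done := make(chan struct{})
	go func() {
		// This helper goroutine outlives an expired grace and exits only
		// once the runners finish.
		wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(grace)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

// demoGrace simulates one in-flight scan of scanMs against a grace of graceMs.
func demoGrace(scanMs, graceMs int) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(scanMs) * time.Millisecond)
	}()
	return waitWithGrace(&wg, time.Duration(graceMs)*time.Millisecond)
}

func main() {
	fmt.Println(demoGrace(10, 500)) // true: scan finished inside the grace
	fmt.Println(demoGrace(500, 10)) // false: grace expired first
}
```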

7. Force-rerun cookbook

Cadence is enforced at claim time via the scancode_last_run column. To force a rescan, clear that column:

-- Single repo
UPDATE aveloxis_data.repos SET scancode_last_run = NULL
WHERE repo_owner = 'apache' AND repo_name = 'doris';

-- All repos that ran on a specific scancode version (e.g. after upgrade)
UPDATE aveloxis_data.repos SET scancode_last_run = NULL
WHERE scancode_version = '32.5.0';

-- Whole fleet
UPDATE aveloxis_data.repos SET scancode_last_run = NULL;

The claim query orders by scancode_last_run NULLS FIRST, so cleared repos move to the head of the queue and get claimed on subsequent dispatcher ticks. The order between cleared repos is repo_id ASC (stable).
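The effect of that ordering can be illustrated in memory. The candidate type and sortClaimOrder are hypothetical; the sketch also assumes that non-NULL scancode_last_run values sort ascending (oldest scan first), which the text does not state explicitly.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// candidate is a hypothetical projection of a claimable row; a nil LastRun
// models scancode_last_run IS NULL.
type candidate struct {
	RepoID  int64
	LastRun *time.Time
}

// sortClaimOrder orders rows NULLS FIRST on LastRun, then oldest LastRun
// first (an assumption), with repo_id ASC as the tie-breaker.
func sortClaimOrder(cs []candidate) {
	sort.SliceStable(cs, func(i, j int) bool {
		a, b := cs[i], cs[j]
		switch {
		case a.LastRun == nil && b.LastRun != nil:
			return true
		case a.LastRun != nil && b.LastRun == nil:
			return false
		case a.LastRun != nil && b.LastRun != nil && !a.LastRun.Equal(*b.LastRun):
			return a.LastRun.Before(*b.LastRun)
		}
		return a.RepoID < b.RepoID
	})
}

// demoOrder returns the repo IDs of a small sample in claim order.
func demoOrder() []int64 {
	at := func(y int, m time.Month, d int) *time.Time {
		t := time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
		return &t
	}
	cs := []candidate{
		{RepoID: 7, LastRun: at(2026, time.April, 8)},
		{RepoID: 3}, // cleared by a force-rerun UPDATE
		{RepoID: 9, LastRun: at(2025, time.January, 1)},
		{RepoID: 1}, // cleared by a force-rerun UPDATE
	}
	sortClaimOrder(cs)
	ids := make([]int64, len(cs))
	for i, c := range cs {
		ids[i] = c.RepoID
	}
	return ids
}

func main() {
	fmt.Println(demoOrder()) // [1 3 9 7]
}
```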

8. Configuration reference

All five knobs live under the collection block in aveloxis.json:

| Key | Default | Purpose |
|---|---|---|
| scancode_workers | 2 | Max concurrent scancode subprocesses. Raise on machines with spare CPU cores. |
| scancode_start_interval_s | 90 | Minimum seconds between successful claim starts. As of v0.21.3 this is a minimum-gap pacing primitive, NOT a throughput cap: the dispatcher claims as fast as workers free up, with this interval enforced only between consecutive starts. Bounds clone-bandwidth bursts on restart. See §3.3. |
| scancode_cadence_days | 180 | Minimum days between successive scans on the same repo. Per-file licenses change rarely. |
| scancode_clone_dir | /tmp/aveloxis-scancode | Parent directory for per-run shallow clones. Size for ~50 MB × workers peak. |
| scancode_shutdown_grace_minutes | 30 | Wait budget for in-flight scans on aveloxis stop. Outstanding scans become live-orphans (see §5) if not finished. |

See docs/getting-started/configuration.md for the tuning rationale per knob.
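As an illustration of how the five keys and their defaults fit together, the sketch below decodes a collection block over pre-filled defaults. The scancodeKnobs struct and both functions are hypothetical and are not the real CollectionConfig; only the JSON key names and default values come from the table above. How the real loader treats explicit zero values is not covered here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scancodeKnobs is a hypothetical struct mirroring the five keys in the
// table; it is not the project's CollectionConfig.
type scancodeKnobs struct {
	Workers              int    `json:"scancode_workers"`
	StartIntervalS       int    `json:"scancode_start_interval_s"`
	CadenceDays          int    `json:"scancode_cadence_days"`
	CloneDir             string `json:"scancode_clone_dir"`
	ShutdownGraceMinutes int    `json:"scancode_shutdown_grace_minutes"`
}

// defaultKnobs returns the documented defaults.
func defaultKnobs() scancodeKnobs {
	return scancodeKnobs{
		Workers:              2,
		StartIntervalS:       90,
		CadenceDays:          180,
		CloneDir:             "/tmp/aveloxis-scancode",
		ShutdownGraceMinutes: 30,
	}
}

// parseKnobs decodes the collection block over the defaults, so keys that
// are absent from the JSON keep their default values.
func parseKnobs(raw string) (scancodeKnobs, error) {
	cfg := struct {
		Collection scancodeKnobs `json:"collection"`
	}{Collection: defaultKnobs()}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return scancodeKnobs{}, err
	}
	return cfg.Collection, nil
}

func main() {
	k, err := parseKnobs(`{"collection": {"scancode_workers": 7}}`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(k.Workers, k.StartIntervalS, k.CadenceDays, k.CloneDir, k.ShutdownGraceMinutes)
	// 7 90 180 /tmp/aveloxis-scancode 30
}
```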

9. UX: the “last run” signal

The repo detail page in the web GUI shows:

Last run: 2026-04-08 (scancode 32.5.0)

Or, for never-scanned repos:

Last run: not yet run — will run in the next scancode worker cycle

The data flows from aveloxis_data.repos.scancode_last_run (written by MarkScancodeComplete) through the /api/v1/repos/{id}/scancode-licenses endpoint’s last_run field to the rendered HTML.

10. Observability — what to grep in aveloxis.log

| Log line | When it fires | Meaning |
|---|---|---|
| scancode worker started workers=N start_interval=... | Once at startup | The pool is alive with N runners. If absent, scancode is disabled (binary not installed or mkdir scancode_clone_dir failed). |
| scancode binary not installed; ScancodeWorker disabled | Startup | Install with pipx install scancode-toolkit then restart serve. |
| scancode recoverOrphans: examining locked rows count=... | Startup | Recovery pass found stale locks. Followed by per-row decisions. |
| scancode recover: reboot survivor clearing lock | Startup | Case 1 of §5. |
| scancode recover: live orphan detected spawning monitor | Startup | Case 2 of §5. Monitor goroutine will log on completion. |
| scancode recover: ingested orphaned scancode result | Startup or monitor | Case 3 of §5: orphaned data recovered. |
| running ScanCode repo_id=... owner=... pid=... | Per scan | Scan started; PID is the subprocess we’re tracking. |
| scancode worker complete repo_id=... version=... | Per scan | Scan succeeded and was ingested. |
| scancode runOne: scancode subprocess failed ... pid=... | Per scan failure | Lock will be cleared. |
| scancode worker shutdown grace expired ... | On stop | Outstanding scans become live-orphans on next startup. |

11. Code map

| File | Purpose |
|---|---|
| internal/collector/scancode_worker.go | The worker itself: ScancodeWorker struct, Run, dispatcher, runner, runOne, recoverOrphans, monitorOrphan. |
| internal/collector/scancode.go | JSON parsing + ingest (ingestScancodeOutput). Shared between the worker and any future direct caller. |
| internal/db/scancode_worker_store.go | Store methods: ClaimNextScancodeRepo, RecordScancodeLockState, MarkScancodeComplete, ClearScancodeLock, ListLockedScancodeRows. |
| internal/db/scancode_store.go | Pre-existing scancode data access (file results, freshness reads). ScancodeFreshness is new in v0.21.0. |
| internal/db/migrate.go | Six column additions + partial index + backfill from aveloxis_scan.scancode_scans. |
| internal/scheduler/scheduler.go | Config.Scancode* fields, default values, goroutine spawn in Run(). |
| internal/config/config.go | CollectionConfig.Scancode* fields and accessor methods. |
| internal/api/server.go | handleScancodeLicenses returns last_run + scancode_version in addition to licenses + copyrights. |
| internal/web/templates.go | Repo detail page renders the freshness signal above the source-code-licenses table. |

12. Regression guards

The v0.21.0 work added these tests as architectural pins. A future refactor that breaks any of them fails the build before it ships:

| Test | Pins |
|---|---|
| TestAnalyzeRepoNoLongerInvokesScancode | AnalyzeRepo does NOT call scanScanCode. |
| TestScancodeSemaphoreNoLongerExists | scancode.go does NOT declare scancodeSem. |
| TestScancodeNoLongerHas30DaySkipCheck | scancode.go does NOT contain the inline cadence check. |
| TestRunOneSplitsStartFromWait | runOne calls cmd.Start() + cmd.Wait(), NOT cmd.Run(). |
| TestRunOnePersistsPidAndBootId | runOne calls RecordScancodeLockState between Start and Wait. |
| TestRunOneClearsLockOnSuccess / TestRunOneClearsLockOnFailure | Lock-clear paths exist on both branches. |
| TestClaimUsesForUpdateSkipLocked | Claim SQL uses FOR UPDATE SKIP LOCKED. |
| TestClaimGatesOnLastCollected | Claim filters on last_collected IS NOT NULL. |
| TestClaimGatesOnCadenceAndStaleLock | Claim respects both the cadence config and the 12h stale-lock fallback. |
| TestClaimExcludesArchivedRepos | Claim WHERE clause matches the partial-index predicate. |
| TestScancodeWorkerCallsRecoverBeforeDispatcher | Recovery runs before any new claim. |
| TestSchedulerRunStartsScancodeWorker | Scheduler spawns the worker. |
| TestMainWiresScancodeConfig | cmd/aveloxis/main.go reads the config knobs. |
| TestScancodeLicensesEndpointReturnsFreshness | API surfaces the freshness signal. |