Aveloxis Documentation
Aveloxis is a high-performance, open-source platform for collecting community health data, written in Go. It collects data from GitHub and GitLab with equal completeness, storing it in a shared PostgreSQL schema for cross-platform analysis. It is designed as a companion to (and eventual replacement for) the Augur collection pipeline.
Key Features
Full GitHub + GitLab parity — same data types collected from both platforms, including MR discussion review comments
Staged collection pipeline — JSONB staging decouples API speed from DB write contention at 400K+ repos
Postgres-backed queue — no Redis, RabbitMQ, or Celery. Multiple instances share the same queue via SKIP LOCKED
Git commit analysis — bare clones + git log --numstat for per-file commit data, parent tracking, and Facade aggregates
Contributor resolution — resolves git commit emails to GitHub users via noreply parsing, Commits API, and Search API
Dependency & complexity analysis — scans 15 ecosystems, calculates libyear across 12 package registries, runs scc for code complexity
Vulnerability scanning — OSV.dev batch API for CVE/GHSA lookup across all dependencies
SBOM generation — CycloneDX 1.5 + SPDX 2.3 with license capture from 12 registries
Interactive visualizations — weekly time-series charts, cross-project comparison with Z-score normalization, dependency license analysis
REST API — JSON endpoints for stats, time series, licenses, SBOM download, and repo search
19 materialized views — 8Knot-compatible analytics views, rebuilt weekly
Dead repo sidelining — permanently archives 404’d repos while preserving data
Deterministic contributor IDs — Augur-compatible GithubUUID scheme
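As an illustration of the noreply parsing step mentioned under contributor resolution, here is a minimal sketch in Go. It is not taken from the Aveloxis source: the function name parseNoreply and its return shape are hypothetical, and the Commits API and Search API fallbacks are not shown. It assumes the two address forms GitHub issues for privacy-preserving commit emails, ID+login@users.noreply.github.com and the older login@users.noreply.github.com.

```go
package main

import (
	"fmt"
	"strings"
)

// noreplySuffix is the domain GitHub uses for privacy-preserving commit emails.
const noreplySuffix = "@users.noreply.github.com"

// isDigits reports whether s is non-empty and consists only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// parseNoreply (hypothetical helper) extracts the GitHub login and, when
// present, the numeric user ID from a noreply commit email.
// The address is lowercased first, so the returned login is lowercase.
// ok is false when the address is not a recognizable noreply address.
func parseNoreply(email string) (login, id string, ok bool) {
	addr := strings.ToLower(strings.TrimSpace(email))
	if !strings.HasSuffix(addr, noreplySuffix) {
		return "", "", false
	}
	local := strings.TrimSuffix(addr, noreplySuffix)

	if i := strings.Index(local, "+"); i >= 0 {
		// ID-prefixed form: "<id>+<login>".
		id, login = local[:i], local[i+1:]
		if !isDigits(id) {
			return "", "", false
		}
	} else {
		// Legacy form: "<login>" only.
		login = local
	}
	if login == "" {
		return "", "", false
	}
	return login, id, true
}

func main() {
	for _, e := range []string{
		"12345+octocat@users.noreply.github.com",
		"octocat@users.noreply.github.com",
		"dev@example.com",
	} {
		login, id, ok := parseNoreply(e)
		fmt.Printf("%s -> login=%q id=%q ok=%v\n", e, login, id, ok)
	}
}
```

When the address carries a numeric ID, a resolver can match on the ID rather than the login, which stays stable across account renames. Addresses that fail this parse would fall through to the API-based lookups.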
Getting Started
User Guide
- Commands Reference
- Web GUI
- Prerequisites
- Configuration
- Starting the Web GUI
- Login Flow
- Creating Groups
- Adding Individual Repos to a Group
- Adding an Entire GitHub Org or GitLab Group
- Navigation and Breadcrumbs
- Comparing Repositories
- Searching and Pagination
- How Org Tracking Works
- How Repos Get Queued for Collection
- Session Management
- Running Alongside aveloxis serve
- Security Considerations
- REST API
- Visualizations
- Collection Pipeline
- Monitoring
- CI/CD Pipelines
- Schema-change verification with aveloxis data-test
- Scaling
- Troubleshooting
- Commits not collected, but Issues and Pull Requests are collected
- Monitor dashboard renders slowly on a large fleet
- Search keystrokes freeze the dashboard at 100K repos
- /api/queue endpoint slow or returns huge JSON
- Token invalidation (401 vs 403)
- FK constraint violations
- “No API keys configured” / Startup failure
- “Commit resolution FAILED”
- “No data collected”
- unsupported Unicode escape sequence (SQLSTATE 22P05)
- “Pull requests / contributors / events: not found” or “forbidden”
- Gap-filled historical issues/PRs have no comments
- Gap-fill failure on large repos persists across multiple cycles
- Metadata shows issues/PRs but gathered count stays at 0
- Repeated “unexpected status 301” retries on moved/renamed repos
- HTTP 410 Gone on individual issues / PRs
- GitLab repo_info.commit_count is 0 but facade found commits
- Release collection “not found” errors
- Git clone exit status 128
- Garbage timestamps (year 0001 BC)
- Schema version mismatch warning
- Null byte errors in text fields
- Restart procedure
- Checking queue status
- Checking status of jobs when you think things are stuck
- Changed days_until_recollect is being ignored
- Checking collection status
- Re-running a failed repo
- GraphQL PR batch errors on large repos
- Force-recollect a single repository
- Dead repo sidelining and un-sidelining
- Gateway errors (502/503/504)
- Deadlock errors
- prepared statement "stmtcache_..." does not exist (SQLSTATE 26000)
- Restart appears to take days before collection resumes
- All API tokens exhausted within minutes of restart
- Repeated duplicate key value violates unique constraint "contributors_pkey" in Postgres logs
- Orphaned postgres backend after aveloxis stop serve
- Next steps
Architecture
- Architecture Overview
- Staged Pipeline Architecture
- Contributor Resolution
- What a contributor record represents
- Where contributor data comes from
- Contract rules
- Resolution flow
- GithubUUID / PlatformUUID
- Rename handling: which columns track renames, which don’t
- Data-quality FAQ
- Diagnostic queries
- Intentional limitations
- GitLab vs GitHub: column-by-column parity matrix (v0.20.3)
- Related code
- Next steps
- Facade Commits
- Analysis
- ScanCode Worker (v0.21.0+)
- 1. What scancode does (and doesn’t)
- 2. Why the worker was decoupled (the 2026-05-14 incident)
- 3. Architecture
- 4. Cadence rationale (180 days default)
- 5. Crash recovery — the four-state table
- 6. Graceful shutdown
- 7. Force-rerun cookbook
- 8. Configuration reference
- 9. UX: the “last run” signal
- 10. Observability — what to grep in aveloxis.log
- 11. Code map
- 12. Regression guards
- Distribution Worker (v0.24.0+)
- 1. The question this subsystem answers
- 2. Why a dedicated worker (the v0.24.0 design call)
- 3. The five evidence sources
- 4. The headline analysis query
- 5. Schema
- 6. Worker architecture
- 7. GitHub-only for v1
- 8. Operator CLI
- 9. Config knobs
- 10. What the subsystem does NOT do
- 11. Cross-references
- 12. v0.25.x escape hatches (ephemeral)
- Mailing-List Ingestion (v0.25.7+)
- 1. The question this subsystem answers
- 2. Why a dedicated worker
- 3. email_message — the first-class entity
- 4. The two archive backends
- 5. Classification
- 6. Sender and signaled-repo resolution
- 7. Schema (v0.25.7)
- 8. Worker architecture
- 9. Mirror handling
- 10. Operator CLI
- 11. Config knobs
- 11c. Layer 2 projection — mailing-list → canonical entities (Phase 3)
- 11d. Forge-less PR-equivalents — the special case (Phase C)
- 12. Verification (Phase 4) and the collection-ordering caveat
- 13. What the subsystem does NOT do
- 14. Cross-references
- Materialized Views
- Column Name Mapping (Augur to Aveloxis)
- Platform Abstraction Layer
- Database Package (internal/db)
Contributing
- Contributing to Aveloxis
- Local development setup
- Code conventions
- SPDX headers (mandatory)
- Package layout
- Imports
- Naming
- Error handling
- Logging (slog)
- Version bumping (mandatory)
- Commit messages
- CLAUDE.md is the canonical record
- Don’t add features beyond what the task requires
- Don’t write comments that restate the code
- Don’t write to files in ways that surprise
- CLI command style
- What NOT to do
- Testing
- Schema migrations
- The two sources of truth
- The fail-closed contract (v0.19.4)
- Adding a column
- Adding an index
- Adding a backfill
- Adding a foreign key
- Migration ordering
- Materialized views
- The aveloxis migrate --skip-views flag
- Version-bump checklist for a schema-touching change
- What v0.21.5 made explicit: who can run migrations
- When NOT to write a migration
- Adding a new platform
- Adding a REST endpoint
- Adding a collection phase
- Adding a visualization