# Architecture Overview

Aveloxis is an open source community health data collection pipeline written in Go. It collects from GitHub and GitLab with equal completeness and stores everything in PostgreSQL.

---

## System diagram

```
                               GitHub API
                                    |
                                    v
┌─────────────┐    ┌─────────────────────────────────┐
│     CLI     │    │       Aveloxis Scheduler        │
│  add-repo   │───>│                                 │
│  add-key    │    │  ┌───────────┐  ┌─────────────┐ │
│  prioritize │    │  │ Worker 1  │  │  Worker 2   │ │
│  collect    │    │  │           │  │             │ │
└─────────────┘    │  └─────┬─────┘  └──────┬──────┘ │
                   │        │               │        │
                   │        v               v        │
                   │  ┌───────────────────────────┐  │
                   │  │      Staged Pipeline      │  │
                   │  │ 1. Prelim (URL check)     │  │
                   │  │ 2. Stage (JSONB)          │  │
                   │  │ 3. Process (relational)   │  │
                   │  │ 4. Facade (git log)       │  │
                   │  │ 5. Commit Resolution      │  │
                   │  │ 6. Analysis               │  │
                   │  └─────────────┬─────────────┘  │
                   │                │                │
                   │  ┌─────────────v─────────────┐  │
                   │  │      Periodic Tasks       │  │
                   │  │ - Org refresh (4h)        │  │
                   │  │ - Contributor breadth (6h)│  │
                   │  │ - Matview rebuild (Sat)   │  │
                   │  └───────────────────────────┘  │
                   └────────────────┬────────────────┘
                                    │
                                    v
┌─────────────────────────────────────────────────────┐
│                     PostgreSQL                      │
│                                                     │
│  ┌───────────────────┐  ┌────────────────────────┐  │
│  │ aveloxis_data     │  │ aveloxis_ops           │  │
│  │ 84 tables         │  │ 24 tables              │  │
│  │ 22 matviews       │  │ - collection_queue     │  │
│  │ - repos           │  │ - staging (JSONB)      │  │
│  │ - issues          │  │ - collection_status    │  │
│  │ - pull_requests   │  │ - worker_oauth         │  │
│  │ - commits         │  │ - users/sessions       │  │
│  │ - contributors    │  │ - config               │  │
│  │ - messages        │  │ - worker_history       │  │
│  │ - releases        │  │                        │  │
│  │ - dependencies    │  │                        │  │
│  │ - repo_labor      │  │                        │  │
│  │ - aggregates      │  │                        │  │
│  └───────────────────┘  └────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                           |
                           v
             ┌───────────────────────────┐
             │  8Knot / Analytics Tools  │
             │     (reads matviews)      │
             └───────────────────────────┘
```

---

## Three schemas

Aveloxis uses three PostgreSQL schemas to separate collected data, operational state, and Augur compatibility.

### `aveloxis_data` (84 tables + 22 materialized views)

All collected open source community health data:

| Category | Tables | Examples |
|---|---|---|
| Core | 4 | `repos`, `repo_groups`, `platforms`, `repo_groups_list_serve` |
| Contributors | 6 | `contributors`, `contributor_identities`, `contributors_aliases`, `contributor_affiliations`, `contributor_repo`, `unresolved_commit_emails` |
| Issues | 5 | `issues`, `issue_labels`, `issue_assignees`, `issue_events`, `issue_message_ref` |
| Pull Requests | 12 | `pull_requests`, `pull_request_labels`, `pull_request_assignees`, `pull_request_reviewers`, `pull_request_reviews`, `pull_request_commits`, `pull_request_files`, `pull_request_events`, `pull_request_meta`, `pull_request_repo`, `pull_request_message_ref`, `pull_request_review_message_ref` |
| Messages | 3 | `messages`, `review_comments`, `pull_request_teams` |
| Commits | 3 | `commits`, `commit_parents`, `commit_messages` |
| Releases | 1 | `releases` |
| Repo metadata | 6 | `repo_info`, `repo_clones`, `repo_badging`, `dei_badging`, `repo_insights`, `repo_insights_records` |
| Dependencies | 5 | `repo_dependencies`, `repo_deps_libyear`, `repo_deps_scorecard`, `repo_sbom_scans`, `libraries` |
| Aggregates | 6 | `dm_repo_annual`, `dm_repo_monthly`, `dm_repo_weekly`, `dm_repo_group_annual`, `dm_repo_group_monthly`, `dm_repo_group_weekly` |
| Code complexity | 4 | `repo_labor`, `repo_meta`, `repo_stats`, `repo_test_coverage` |
| Analysis/ML | 8 | `message_analysis`, `message_analysis_summary`, `message_sentiment`, `message_sentiment_summary`, `discourse_insights`, `lstm_anomaly_models`, `lstm_anomaly_results`, `pull_request_analysis` |
| CHAOSS | 4 | `chaoss_metric_status`, `chaoss_user`, `repo_group_insights`, `commit_comment_ref` |

Plus 22 materialized views for 8Knot compatibility.

### `aveloxis_ops` (24 tables)

Operational and orchestration tables:

| Category | Tables | Purpose |
|---|---|---|
| Queue | `collection_queue` | Postgres-backed priority queue with `SKIP LOCKED` |
| Staging | `staging` | JSONB staging store for the staged pipeline |
| Status | `collection_status` | Tracks core/secondary/facade/ML phases per repo |
| Credentials | `worker_oauth` | API key storage |
| Users | `users`, `user_sessions`, `user_repos` | User accounts and auth |
| Config | `config` | Runtime configuration |
| Workers | `worker_history`, `worker_job` | Worker run history |

### `aveloxis_augur_data` (6 views)

Augur compatibility layer for [8Knot](https://github.com/oss-aspen/8Knot) and other Augur-era analytics tools. It contains views that alias Aveloxis column names to Augur conventions. Only tables with column name differences need views here; tables with identical columns resolve via the `search_path` fallback to `aveloxis_data`.

| View | Augur column aliases |
|---|---|
| `repo` | `repos` table (singular name) + `primary_language` → `repo_language` |
| `repo_info` | `star_count` → `stars_count`, `watcher_count` → `watchers_count` |
| `issues` | `issue_number` → `gh_issue_number`, `platform_issue_id` → `gh_issue_id`, `closed_by_id` → `cntrb_id` |
| `pull_requests` | `pr_number` → `pr_src_number`, `author_id` → `pr_augur_contributor_id`, `created_at` → `pr_created_at`, `closed_at` → `pr_closed_at`, `merged_at` → `pr_merged_at` |
| `releases` | `created_at` → `release_created_at`, `published_at` → `release_published_at`, `updated_at` → `release_updated_at` |
| `message` | Alias for `messages` (Augur uses singular) |

**Usage:** Set `AUGUR_SCHEMA=aveloxis_augur_data,aveloxis_data` (no space after comma) in 8Knot's `.env`. PostgreSQL checks `aveloxis_augur_data` first (finding the aliased views), then falls through to `aveloxis_data` for all other tables. For existing Augur databases, use `AUGUR_SCHEMA=augur_data`; the compatibility schema is not needed.

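
The fall-through behaviour of the two-schema search path can be modelled in a few lines of Go. This is only an illustrative sketch: `schemaContents` is a hand-written, partial listing of relations (not read from a real database), and `resolveRelation` is a hypothetical helper that imitates how PostgreSQL picks the first schema on the search path that contains an unqualified name.

```go
package main

import "fmt"

// schemaContents is an illustrative, partial listing of relations per schema.
// It is hard-coded for the example and not queried from PostgreSQL.
var schemaContents = map[string]map[string]bool{
	"aveloxis_augur_data": {
		"repo": true, "repo_info": true, "issues": true,
		"pull_requests": true, "releases": true, "message": true,
	},
	"aveloxis_data": {
		"repo_info": true, "issues": true, "pull_requests": true,
		"releases": true, "commits": true, "contributors": true,
	},
}

// resolveRelation walks the search path in order and returns the first
// schema that contains the unqualified relation name.
func resolveRelation(searchPath []string, name string) (string, bool) {
	for _, schema := range searchPath {
		if schemaContents[schema][name] {
			return schema, true
		}
	}
	return "", false
}

func main() {
	path := []string{"aveloxis_augur_data", "aveloxis_data"}
	for _, name := range []string{"repo", "issues", "commits"} {
		schema, ok := resolveRelation(path, name)
		fmt.Println(name, schema, ok)
	}
}
```

In this model `issues` resolves to the aliased view because the compatibility schema is listed first, while `commits` has no view there and falls through to `aveloxis_data`.
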
---

## Collection flow

The full collection flow for a single repo:

```
URL Check (prelim)
    |
    v
API Collection (phase 1)
    |-- Contributors (member lists)
    |-- Issues + labels + assignees
    |-- Pull requests + all children
    |-- Events (issue + PR)
    |-- Messages (comments)
    |-- Metadata (repo info, releases, clone stats)
    |
    v
Staging -> Processing (phase 2)
    |-- Contributors resolved (cache -> DB -> create)
    |-- Entities upserted in FK order
    |
    v
┌──────────────────────────────────────┐
│          Parallel execution          │
├──────────────────┬───────────────────┤
│ Facade (phase 3) │ Analysis (phase 4)│
│ git clone/fetch  │ Dependency scan   │
│ git log parse    │ Libyear (5 reg.)  │
│ Commit parents   │ Code complexity   │
│ Affiliations     │   (scc)           │
│ Aggregates       │                   │
└──────────────────┴───────────────────┘
    |
    v
Commit Resolution (phase 5)
    |-- Noreply parse
    |-- DB lookup
    |-- Commits API
    |-- Search API
    |-- Alias creation
    |-- Backfill cmt_ght_author_id
    |
    v
Canonical Email Enrichment (phase 6)
    |
    v
Done -> repo re-queued with new due time
```

---

## Key design decisions vs Augur

### Postgres queue instead of Celery/Redis/RabbitMQ

Augur uses Celery with RabbitMQ and Redis for job queueing. Aveloxis uses a single PostgreSQL table with `FOR UPDATE SKIP LOCKED`. This eliminates three infrastructure dependencies and makes queue state fully transparent and queryable with plain SQL.

### JSONB staging instead of direct writes

Augur writes to relational tables during API collection, causing contention on the `contributors` table when many workers collect simultaneously. Aveloxis stages raw API data as JSONB, then processes it single-threaded per repo. This eliminates contributor table contention at scale (400K+ repos).

### Deterministic contributor IDs

Augur generates random UUIDs for contributor IDs, then runs post-hoc fix scripts. Aveloxis generates deterministic UUIDs from the platform user ID, ensuring the same user always gets the same UUID and enabling byte-compatible cross-system joins.

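
As a rough illustration of deterministic IDs, the sketch below packs a platform identifier and a platform user ID into 16 bytes and prints them in UUID form. The byte layout (byte 0 for the platform, bytes 1-8 for the user ID, zeros elsewhere) and the function name `deterministicContributorID` are assumptions made for this example; the actual layout lives in `github_uuid.go` and may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// deterministicContributorID derives a UUID-shaped string from a platform ID
// and a platform user ID. Hypothetical layout, for illustration only:
// byte 0 = platform, bytes 1-8 = user ID (big-endian), remaining bytes zero.
func deterministicContributorID(platformID uint8, userID uint64) string {
	var b [16]byte
	b[0] = platformID
	binary.BigEndian.PutUint64(b[1:9], userID)
	// Format as the usual 8-4-4-4-12 hex groups.
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	// The same inputs always yield the same ID, with no database lookup.
	fmt.Println(deterministicContributorID(1, 583231))
	fmt.Println(deterministicContributorID(1, 583231))
	// A different platform yields a different ID for the same numeric user ID.
	fmt.Println(deterministicContributorID(2, 583231))
}
```

Because the ID is a pure function of its inputs, two independent installations that use the same layout would assign identical IDs to the same user, which is what makes cross-system joins possible.
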
### Bare clones instead of full clones

Aveloxis uses bare clones (permanent, smaller) for the facade phase and creates temporary full checkouts only for analysis. This reduces disk usage and avoids the overhead of maintaining working trees.

### Built-in monitoring

Augur relies on Flower (a separate Celery monitoring service). Aveloxis includes a built-in HTTP dashboard and REST API.

### Platform abstraction layer

Both GitHub and GitLab implement the same `platform.Client` interface with 7 sub-interfaces, ensuring feature parity. All methods use Go 1.23 iterators (`iter.Seq2`) for memory-efficient streaming pagination.

#### Known GitLab API limitations

The following data is available from GitHub but not from GitLab due to platform API constraints:

- **Community profile files** (CHANGELOG, CONTRIBUTING, CODE_OF_CONDUCT, SECURITY): not yet fetched for GitLab, but this gap could be closed using the `/repository/tree` and `/repository/files` endpoints.
- **Watcher count**: GitLab has no public watchers API (`star_count` is captured instead).
- **Clone statistics**: GitLab exposes these only via admin-only endpoints.
- **GraphQL node IDs**: GitLab uses numeric project/user IDs rather than GitHub-style GraphQL node IDs. These are stored in `SrcRepoID` (numeric) instead of `SrcNodeID`.
- **Contributor URL fields**: GitHub returns 10+ URL fields per user (followers, gists, starred, etc.) that GitLab's API does not provide.
- **Contributor type**: GitHub distinguishes User/Bot/Organization; GitLab does not expose this distinction.

---

## Project structure

```
aveloxis/
  cmd/aveloxis/             # CLI entry point (cobra commands)
  internal/
    collector/              # Collection orchestration
      collector.go          # Direct pipeline
      staged.go             # Staged pipeline
      facade.go             # Git clone + log parsing
      commit_resolver.go    # Git email -> GitHub user resolution
      breadth.go            # Contributor breadth worker
      analysis.go           # Dependencies, libyear, scc
      noreply.go            # GitHub noreply email parser
      prelim.go             # Redirect detection and duplicate checking
    config/                 # JSON config loading with defaults
    db/                     # Database layer
      postgres.go           # All upsert methods
      staging.go            # JSONB staging writer and processor
      migrate.go            # Schema migration
      schema.sql            # Full DDL (108 tables)
      matviews.sql          # 22 materialized views
      contributors.go       # Contributor resolver with cache
      affiliations.go       # Email domain -> org resolver
      aggregates.go         # Facade aggregate refresh
      github_uuid.go        # Deterministic UUID generation
      queue.go              # Priority queue operations
    model/                  # Platform-agnostic data types
    monitor/                # HTTP dashboard and API
    platform/               # Platform abstraction
      github/               # GitHub REST API client
      gitlab/               # GitLab API v4 client
    scheduler/              # Queue polling, job dispatch
```

---

## Scheduler internals

The scheduler (`internal/scheduler/`) is the long-running loop that drives all collection. It polls the Postgres-backed priority queue and dispatches collection workers.

### Job dispatch

The scheduler uses a semaphore (buffered channel) sized to `Workers` to limit concurrency. Each poll tick attempts to acquire a semaphore slot and dequeue a job via `SELECT ... FOR UPDATE SKIP LOCKED`. If no slot is available or no job is due, the tick is skipped.

### Phase execution within a job

Each job runs six phases. After the sequential API collection and processing phases (1-2), facade and analysis run **in parallel** since they operate on independent data (bare clone vs. temporary checkout). Commit resolution runs after both complete because it needs facade's commit data.

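
The phase ordering described above can be sketched with a `sync.WaitGroup`. This is a simplified illustration rather than the scheduler's actual code: `runJob` and the phase names are hypothetical, and each phase only records that it ran.

```go
package main

import (
	"fmt"
	"sync"
)

// runJob runs placeholder phases in the order described above and returns
// the order in which they completed. Phases 1-2 are sequential, facade and
// analysis run concurrently, and the remaining phases wait for both.
func runJob() []string {
	var (
		mu   sync.Mutex
		done []string
	)
	record := func(phase string) {
		mu.Lock()
		defer mu.Unlock()
		done = append(done, phase)
	}

	record("collect") // phase 1: API collection into staging
	record("process") // phase 2: staging -> relational tables

	// Phases 3 and 4 work on independent data, so they can overlap.
	var wg sync.WaitGroup
	for _, phase := range []string{"facade", "analysis"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			record(name)
		}(phase)
	}
	wg.Wait() // commit resolution needs facade's commit data

	record("commit_resolution")          // phase 5
	record("canonical_email_enrichment") // phase 6
	return done
}

func main() {
	for i, phase := range runJob() {
		fmt.Println(i+1, phase)
	}
}
```

The relative order of `facade` and `analysis` in the output may vary between runs; the only guarantee is that both finish before commit resolution starts.
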
### Periodic background tasks

| Task | Interval | Notes |
|---|---|---|
| Stale lock recovery | 5 min | Reclaims jobs from crashed workers via `StaleLockTimeout` |
| Org refresh | Configurable (default 4h) | Scans GitHub orgs and GitLab groups for new/renamed repos |
| User org refresh | Same as org refresh | Scans user-requested org additions |
| Contributor breadth | 6h | Discovers cross-repo activity via GitHub Events API |
| Matview rebuild | Weekly (Saturday) | Drains all workers, rebuilds 22 materialized views, resumes |

### Graceful shutdown

On context cancellation, the scheduler:

1. Drains the semaphore (waits for all active workers to finish).
2. Releases all queue locks held by this worker instance (repos return to `queued` immediately).
3. Leaves any data that is already staged but not yet processed in place, so it is processed on the next startup.

### Startup recovery

On startup, before entering the poll loop, the scheduler:

1. Processes any leftover unprocessed staging rows from a previous interrupted run.
2. Recovers stale locks from any crashed worker instances.
3. Releases any locks held by its own worker ID (from a previous unclean shutdown).

---

## Next steps

- [Staged Pipeline](staged-pipeline.md) -- why staging matters and how it works
- [Contributor Resolution](contributor-resolution.md) -- identity resolution across platforms
- [Facade Commits](facade-commits.md) -- git log parsing and commit data
- [Analysis](analysis.md) -- dependency scanning, libyear, code complexity
- [Materialized Views](materialized-views.md) -- 8Knot-compatible analytics views