Architecture Overview

Aveloxis is an open source community health data collection pipeline written in Go. It collects from GitHub and GitLab with feature parity wherever the platform APIs allow (see Known GitLab API limitations) and stores everything in PostgreSQL.

System diagram

                                    GitHub API
                                        |
                                        v
  ┌─────────────┐    ┌─────────────────────────────────┐
  │  CLI        │    │  Aveloxis Scheduler              │
  │  add-repo   │───>│                                  │
  │  add-key    │    │  ┌───────────┐  ┌─────────────┐ │
  │  prioritize │    │  │ Worker 1  │  │ Worker 2    │ │
  │  collect    │    │  │           │  │             │ │
  └─────────────┘    │  └─────┬─────┘  └──────┬──────┘ │
                     │        │               │        │
                     │        v               v        │
                     │  ┌─────────────────────────┐    │
                     │  │ Staged Pipeline          │    │
                     │  │ 1. Prelim (URL check)    │    │
                     │  │ 2. Stage (JSONB)         │    │
                     │  │ 3. Process (relational)  │    │
                     │  │ 4. Facade (git log)      │    │
                     │  │ 5. Commit Resolution     │    │
                     │  │ 6. Analysis              │    │
                     │  └────────────┬────────────┘    │
                     │               │                  │
                     │  ┌────────────v────────────┐    │
                     │  │ Periodic Tasks           │    │
                     │  │ - Org refresh (4h)       │    │
                     │  │ - Contributor breadth(6h)│    │
                     │  │ - Matview rebuild (Sat)  │    │
                     │  └─────────────────────────┘    │
                     └──────────────┬──────────────────┘
                                    │
                                    v
  ┌─────────────────────────────────────────────────────┐
  │  PostgreSQL                                         │
  │                                                     │
  │  ┌───────────────────┐  ┌────────────────────────┐ │
  │  │ aveloxis_data     │  │ aveloxis_ops           │ │
  │  │ 84 tables         │  │ 24 tables              │ │
  │  │ 22 matviews       │  │ - collection_queue     │ │
  │  │ - repos           │  │ - staging (JSONB)      │ │
  │  │ - issues          │  │ - collection_status    │ │
  │  │ - pull_requests   │  │ - worker_oauth         │ │
  │  │ - commits         │  │ - users/sessions       │ │
  │  │ - contributors    │  │ - config               │ │
  │  │ - messages        │  │ - worker_history       │ │
  │  │ - releases        │  │                        │ │
  │  │ - dependencies    │  │                        │ │
  │  │ - repo_labor      │  │                        │ │
  │  │ - aggregates      │  │                        │ │
  │  └───────────────────┘  └────────────────────────┘ │
  └─────────────────────────────────────────────────────┘
                                    |
                                    v
                     ┌───────────────────────────┐
                     │  8Knot / Analytics Tools   │
                     │  (reads matviews)          │
                     └───────────────────────────┘

Three schemas

Aveloxis uses three PostgreSQL schemas to separate collected data, operational state, and Augur compatibility.

`aveloxis_data` (84 tables + 22 materialized views)

All collected open source community health data:

Category	Tables	Examples
Core	4	`repos`, `repo_groups`, `platforms`, `repo_groups_list_serve`
Contributors	6	`contributors`, `contributor_identities`, `contributors_aliases`, `contributor_affiliations`, `contributor_repo`, `unresolved_commit_emails`
Issues	5	`issues`, `issue_labels`, `issue_assignees`, `issue_events`, `issue_message_ref`
Pull Requests	12	`pull_requests`, `pull_request_labels`, `pull_request_assignees`, `pull_request_reviewers`, `pull_request_reviews`, `pull_request_commits`, `pull_request_files`, `pull_request_events`, `pull_request_meta`, `pull_request_repo`, `pull_request_message_ref`, `pull_request_review_message_ref`
Messages	3	`messages`, `review_comments`, `pull_request_teams`
Commits	3	`commits`, `commit_parents`, `commit_messages`
Releases	1	`releases`
Repo metadata	6	`repo_info`, `repo_clones`, `repo_badging`, `dei_badging`, `repo_insights`, `repo_insights_records`
Dependencies	5	`repo_dependencies`, `repo_deps_libyear`, `repo_deps_scorecard`, `repo_sbom_scans`, `libraries`
Aggregates	6	`dm_repo_annual`, `dm_repo_monthly`, `dm_repo_weekly`, `dm_repo_group_annual`, `dm_repo_group_monthly`, `dm_repo_group_weekly`
Code complexity	4	`repo_labor`, `repo_meta`, `repo_stats`, `repo_test_coverage`
Analysis/ML	8	`message_analysis`, `message_analysis_summary`, `message_sentiment`, `message_sentiment_summary`, `discourse_insights`, `lstm_anomaly_models`, `lstm_anomaly_results`, `pull_request_analysis`
CHAOSS	4	`chaoss_metric_status`, `chaoss_user`, `repo_group_insights`, `commit_comment_ref`

Plus 22 materialized views for 8Knot compatibility.

`aveloxis_ops` (24 tables)

Operational and orchestration tables:

Category	Tables	Purpose
Queue	`collection_queue`	Postgres-backed priority queue with `SKIP LOCKED`
Staging	`staging`	JSONB staging store for the staged pipeline
Status	`collection_status`	Tracks core/secondary/facade/ML phases per repo
Credentials	`worker_oauth`	API key storage
Users	`users`, `user_sessions`, `user_repos`	User accounts and auth
Config	`config`	Runtime configuration
Workers	`worker_history`, `worker_job`	Worker run history

`aveloxis_augur_data` (6 views)

Augur compatibility layer for 8Knot and other Augur-era analytics tools. Contains views that alias Aveloxis column names to Augur conventions. Only tables with column name differences need views here — tables with identical columns resolve via the search_path fallback to aveloxis_data.

View	Augur column aliases
`repo`	`repos` table (singular name) + `primary_language` → `repo_language`
`repo_info`	`star_count` → `stars_count`, `watcher_count` → `watchers_count`
`issues`	`issue_number` → `gh_issue_number`, `platform_issue_id` → `gh_issue_id`, `closed_by_id` → `cntrb_id`
`pull_requests`	`pr_number` → `pr_src_number`, `author_id` → `pr_augur_contributor_id`, `created_at` → `pr_created_at`, `closed_at` → `pr_closed_at`, `merged_at` → `pr_merged_at`
`releases`	`created_at` → `release_created_at`, `published_at` → `release_published_at`, `updated_at` → `release_updated_at`
`message`	Alias for `messages` (Augur uses singular)

Usage: Set AUGUR_SCHEMA=aveloxis_augur_data,aveloxis_data (no space after comma) in 8Knot’s .env. PostgreSQL checks aveloxis_augur_data first (finding the aliased views), then falls through to aveloxis_data for all other tables. For existing Augur databases, use AUGUR_SCHEMA=augur_data — the compatibility schema is not needed.

Collection flow

The full collection flow for a single repo:

URL Check (prelim)
    |
    v
API Collection (phase 1)
    |-- Contributors (member lists)
    |-- Issues + labels + assignees
    |-- Pull requests + all children
    |-- Events (issue + PR)
    |-- Messages (comments)
    |-- Metadata (repo info, releases, clone stats)
    |
    v
Staging -> Processing (phase 2)
    |-- Contributors resolved (cache -> DB -> create)
    |-- Entities upserted in FK order
    |
    v
┌──────────────────────────────────────┐
│ Parallel execution                   │
├──────────────────┬───────────────────┤
│ Facade (phase 3) │ Analysis (phase 4)│
│  git clone/fetch │  Dependency scan  │
│  git log parse   │  Libyear (5 reg.) │
│  Commit parents  │  Code complexity  │
│  Affiliations    │  (scc)            │
│  Aggregates      │                   │
└──────────────────┴───────────────────┘
    |
    v
Commit Resolution (phase 5)
    |-- Noreply parse
    |-- DB lookup
    |-- Commits API
    |-- Search API
    |-- Alias creation
    |-- Backfill cmt_ght_author_id
    |
    v
Canonical Email Enrichment (phase 6)
    |
    v
Done -> repo re-queued with new due time
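
The "Noreply parse" step of commit resolution can be sketched as follows. GitHub's privacy-preserving commit addresses take the form ID+login@users.noreply.github.com (or the older login@users.noreply.github.com), so the login and often the numeric user ID can be recovered without any API call. The function name and return shape below are hypothetical and are not taken from noreply.go.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// noreplySuffix is the domain GitHub uses for privacy-preserving commit emails.
const noreplySuffix = "@users.noreply.github.com"

// parseNoreply extracts the login and, when present, the numeric user ID
// from a GitHub noreply address. ok is false for any other address.
// Hypothetical helper for illustration only.
func parseNoreply(email string) (id int64, login string, ok bool) {
	// Logins are case-insensitive on GitHub, so normalise to lower case.
	email = strings.ToLower(strings.TrimSpace(email))
	local, found := strings.CutSuffix(email, noreplySuffix)
	if !found || local == "" {
		return 0, "", false
	}
	// Newer form: "<id>+<login>"; older form: "<login>" only.
	if idPart, loginPart, hasPlus := strings.Cut(local, "+"); hasPlus {
		n, err := strconv.ParseInt(idPart, 10, 64)
		if err != nil || loginPart == "" {
			return 0, "", false
		}
		return n, loginPart, true
	}
	return 0, local, true
}

func main() {
	for _, e := range []string{
		"12345+octocat@users.noreply.github.com",
		"octocat@users.noreply.github.com",
		"dev@example.com",
	} {
		id, login, ok := parseNoreply(e)
		fmt.Println(e, id, login, ok)
	}
}
```

Addresses that do not match fall through to the later steps in the flow (DB lookup, Commits API, Search API).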

Key design decisions vs Augur

Postgres queue instead of Celery/Redis/RabbitMQ

Augur uses Celery with RabbitMQ and Redis for job queueing. Aveloxis uses a single PostgreSQL table with FOR UPDATE SKIP LOCKED. This eliminates three infrastructure dependencies and makes queue state fully transparent and queryable with plain SQL.

JSONB staging instead of direct writes

Augur writes to relational tables during API collection, causing contention on the contributors table when many workers collect simultaneously. Aveloxis stages raw API data as JSONB, then processes it single-threaded per repo. This eliminates contributor table contention at scale (400K+ repos).

Deterministic contributor IDs

Augur generates random UUIDs for contributor IDs, then runs post-hoc fix scripts. Aveloxis generates deterministic UUIDs from the platform user ID, ensuring the same user always gets the same UUID and enabling byte-compatible cross-system joins.
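
One way to get this property is a name-based (version 5 style) UUID derived from the platform and the platform user ID. The sketch below is not the scheme in github_uuid.go, whose byte layout is not described here. The namespace constant, the platform byte, and the contributorUUID function are assumptions chosen only to show that identical input always yields an identical ID.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// namespace is a fixed seed for hashing. The RFC 4122 DNS namespace is used
// here purely as a convenient constant (assumption).
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// contributorUUID derives a UUID from a platform code and a platform user ID.
// The same inputs always produce the same UUID on any machine.
func contributorUUID(platform byte, userID int64) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte{platform})
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(userID))
	h.Write(buf[:])

	var u [16]byte
	copy(u[:], h.Sum(nil)[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5 (name-based, SHA-1)
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant

	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

func main() {
	const github = 1 // hypothetical platform code
	a := contributorUUID(github, 583231)
	b := contributorUUID(github, 583231)
	fmt.Println(a, a == b)
}
```

Since no random state is involved, two independent installations that collect the same user arrive at the same key, which is what makes joins across systems possible.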

Bare clones instead of full clones

Aveloxis uses bare clones (permanent, smaller) for the facade phase and creates temporary full checkouts only for analysis. This reduces disk usage and avoids the overhead of maintaining working trees.

Built-in monitoring

Augur relies on Flower (a separate Celery monitoring service). Aveloxis includes a built-in HTTP dashboard and REST API.

Platform abstraction layer

Both GitHub and GitLab implement the same platform.Client interface with 7 sub-interfaces, ensuring feature parity. All methods use Go 1.23 iterators (iter.Seq2) for memory-efficient streaming pagination.

Known GitLab API limitations

The following data is available from GitHub but not from GitLab due to platform API constraints:

Community profile files (CHANGELOG, CONTRIBUTING, CODE_OF_CONDUCT, SECURITY): not yet fetched for GitLab. Unlike the other items in this list, this gap is not a platform constraint and could be closed using the /repository/tree and /repository/files endpoints.
Watcher count: GitLab has no public watchers API (star_count is captured instead).
Clone statistics: GitLab exposes these only via admin-only endpoints.
GraphQL node IDs: GitLab uses numeric project/user IDs rather than GitHub-style GraphQL node IDs. The numeric ID is stored in SrcRepoID instead of SrcNodeID.
Contributor URL fields: GitHub returns 10+ URL fields per user (followers, gists, starred, etc.) that GitLab’s API does not provide.
Contributor type: GitHub distinguishes User/Bot/Organization; GitLab does not expose this distinction.

Project structure

aveloxis/
  cmd/aveloxis/           # CLI entry point (cobra commands)
  internal/
    collector/            # Collection orchestration
      collector.go        # Direct pipeline
      staged.go           # Staged pipeline
      facade.go           # Git clone + log parsing
      commit_resolver.go  # Git email -> GitHub user resolution
      breadth.go          # Contributor breadth worker
      analysis.go         # Dependencies, libyear, scc
      noreply.go          # GitHub noreply email parser
      prelim.go           # Redirect detection and duplicate checking
    config/               # JSON config loading with defaults
    db/                   # Database layer
      postgres.go         # All upsert methods
      staging.go          # JSONB staging writer and processor
      migrate.go          # Schema migration
      schema.sql          # Full DDL (108 tables)
      matviews.sql        # 22 materialized views
      contributors.go     # Contributor resolver with cache
      affiliations.go     # Email domain -> org resolver
      aggregates.go       # Facade aggregate refresh
      github_uuid.go      # Deterministic UUID generation
      queue.go            # Priority queue operations
    model/                # Platform-agnostic data types
    monitor/              # HTTP dashboard and API
    platform/             # Platform abstraction
      github/             # GitHub REST API client
      gitlab/             # GitLab API v4 client
    scheduler/            # Queue polling, job dispatch

Scheduler internals

The scheduler (internal/scheduler/) is the long-running loop that drives all collection. It polls the Postgres-backed priority queue and dispatches collection workers.

Job dispatch

The scheduler uses a semaphore (buffered channel) sized to Workers to limit concurrency. Each poll tick attempts to acquire a semaphore slot and dequeue a job via SELECT ... FOR UPDATE SKIP LOCKED. If no slot is available or no job is due, the tick is skipped.
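
The dispatch rule above can be sketched with a buffered channel and a non-blocking send. The scheduler type, the tick method, and the dequeue callback are hypothetical names. In the real system the dequeue step would be the SKIP LOCKED query.

```go
package main

import (
	"fmt"
	"sync"
)

// scheduler limits concurrent jobs with a buffered channel used as a semaphore.
type scheduler struct {
	sem chan struct{}
	wg  sync.WaitGroup
}

func newScheduler(workers int) *scheduler {
	return &scheduler{sem: make(chan struct{}, workers)}
}

// tick tries to start one job. It reports false when every slot is busy
// or when dequeue finds no due job.
func (s *scheduler) tick(dequeue func() (job func(), ok bool)) bool {
	select {
	case s.sem <- struct{}{}: // slot acquired
	default:
		return false // all workers busy: skip this tick
	}
	job, ok := dequeue()
	if !ok {
		<-s.sem // nothing due: give the slot back
		return false
	}
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		defer func() { <-s.sem }() // release the slot when the job ends
		job()
	}()
	return true
}

func main() {
	s := newScheduler(2)
	release := make(chan struct{})
	dequeue := func() (func(), bool) {
		return func() { <-release }, true // job blocks until released
	}
	// Two slots, three ticks: the third is skipped because both workers are busy.
	fmt.Println(s.tick(dequeue), s.tick(dequeue), s.tick(dequeue))
	close(release)
	s.wg.Wait()
}
```

Acquiring the slot before dequeuing means a job is never locked in the queue unless a worker is actually free to run it.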

Phase execution within a job

Each job runs six phases. After the sequential API collection and processing phases (1-2), facade and analysis run in parallel since they operate on independent data (bare clone vs. temporary checkout). Commit resolution runs after both complete because it needs facade’s commit data.

Periodic background tasks

Task	Interval	Notes
Stale lock recovery	5 min	Reclaims jobs from crashed workers via `StaleLockTimeout`
Org refresh	Configurable (default 4h)	Scans GitHub orgs and GitLab groups for new/renamed repos
User org refresh	Same as org refresh	Scans user-requested org additions
Contributor breadth	6h	Discovers cross-repo activity via GitHub Events API
Matview rebuild	Weekly (Saturday)	Drains all workers, rebuilds 22 materialized views, resumes

Graceful shutdown

On context cancellation, the scheduler:

Drains the semaphore (waits for all active workers to finish)
Releases all queue locks held by this worker instance (repos return to queued immediately)
Preserves any data that was staged but not yet processed, so that it can be processed on the next startup

Startup recovery

On startup, before entering the poll loop:

Processes any leftover unprocessed staging rows from a previous interrupted run
Recovers stale locks from any crashed worker instances
Releases any locks held by our own worker ID (from a previous unclean shutdown)

Next steps

Staged Pipeline – why staging matters and how it works
Contributor Resolution – identity resolution across platforms
Facade Commits – git log parsing and commit data
Analysis – dependency scanning, libyear, code complexity
Materialized Views – 8Knot-compatible analytics views