Architecture Overview

Aveloxis is a Go-based open source community health data collection pipeline that collects from GitHub and GitLab with equal completeness, storing everything in PostgreSQL.


System diagram

                                    GitHub API
                                        |
                                        v
  ┌─────────────┐    ┌─────────────────────────────────┐
  │  CLI        │    │  Aveloxis Scheduler              │
  │  add-repo   │───>│                                  │
  │  add-key    │    │  ┌───────────┐  ┌─────────────┐ │
  │  prioritize │    │  │ Worker 1  │  │ Worker 2    │ │
  │  collect    │    │  │           │  │             │ │
  └─────────────┘    │  └─────┬─────┘  └──────┬──────┘ │
                     │        │               │        │
                     │        v               v        │
                     │  ┌─────────────────────────┐    │
                     │  │ Staged Pipeline          │    │
                     │  │ 1. Prelim (URL check)    │    │
                     │  │ 2. Stage (JSONB)         │    │
                     │  │ 3. Process (relational)  │    │
                     │  │ 4. Facade (git log)      │    │
                     │  │ 5. Commit Resolution     │    │
                     │  │ 6. Analysis              │    │
                     │  └────────────┬────────────┘    │
                     │               │                  │
                     │  ┌────────────v────────────┐    │
                     │  │ Periodic Tasks           │    │
                     │  │ - Org refresh (4h)       │    │
                     │  │ - Contributor breadth(6h)│    │
                     │  │ - Matview rebuild (Sat)  │    │
                     │  └─────────────────────────┘    │
                     └──────────────┬──────────────────┘
                                    │
                                    v
  ┌─────────────────────────────────────────────────────┐
  │  PostgreSQL                                         │
  │                                                     │
  │  ┌───────────────────┐  ┌────────────────────────┐ │
  │  │ aveloxis_data     │  │ aveloxis_ops           │ │
  │  │ 84 tables         │  │ 24 tables              │ │
  │  │ 22 matviews       │  │ - collection_queue     │ │
  │  │ - repos           │  │ - staging (JSONB)      │ │
  │  │ - issues          │  │ - collection_status    │ │
  │  │ - pull_requests   │  │ - worker_oauth         │ │
  │  │ - commits         │  │ - users/sessions       │ │
  │  │ - contributors    │  │ - config               │ │
  │  │ - messages        │  │ - worker_history       │ │
  │  │ - releases        │  │                        │ │
  │  │ - dependencies    │  │                        │ │
  │  │ - repo_labor      │  │                        │ │
  │  │ - aggregates      │  │                        │ │
  │  └───────────────────┘  └────────────────────────┘ │
  └─────────────────────────────────────────────────────┘
                                    |
                                    v
                     ┌───────────────────────────┐
                     │  8Knot / Analytics Tools   │
                     │  (reads matviews)          │
                     └───────────────────────────┘

Three schemas

Aveloxis uses three PostgreSQL schemas to separate collected data, operational state, and Augur compatibility.

aveloxis_data (84 tables + 22 materialized views)

All collected open source community health data:

Category        | Tables | Examples
----------------|--------|---------
Core            | 4      | repos, repo_groups, platforms, repo_groups_list_serve
Contributors    | 6      | contributors, contributor_identities, contributors_aliases, contributor_affiliations, contributor_repo, unresolved_commit_emails
Issues          | 5      | issues, issue_labels, issue_assignees, issue_events, issue_message_ref
Pull Requests   | 12     | pull_requests, pull_request_labels, pull_request_assignees, pull_request_reviewers, pull_request_reviews, pull_request_commits, pull_request_files, pull_request_events, pull_request_meta, pull_request_repo, pull_request_message_ref, pull_request_review_message_ref
Messages        | 3      | messages, review_comments, pull_request_teams
Commits         | 3      | commits, commit_parents, commit_messages
Releases        | 1      | releases
Repo metadata   | 6      | repo_info, repo_clones, repo_badging, dei_badging, repo_insights, repo_insights_records
Dependencies    | 5      | repo_dependencies, repo_deps_libyear, repo_deps_scorecard, repo_sbom_scans, libraries
Aggregates      | 6      | dm_repo_annual, dm_repo_monthly, dm_repo_weekly, dm_repo_group_annual, dm_repo_group_monthly, dm_repo_group_weekly
Code complexity | 4      | repo_labor, repo_meta, repo_stats, repo_test_coverage
Analysis/ML     | 8      | message_analysis, message_analysis_summary, message_sentiment, message_sentiment_summary, discourse_insights, lstm_anomaly_models, lstm_anomaly_results, pull_request_analysis
CHAOSS          | 4      | chaoss_metric_status, chaoss_user, repo_group_insights, commit_comment_ref

Plus 22 materialized views for 8Knot compatibility.

aveloxis_ops (24 tables)

Operational and orchestration tables:

Category    | Tables                          | Purpose
------------|---------------------------------|--------
Queue       | collection_queue                | Postgres-backed priority queue with SKIP LOCKED
Staging     | staging                         | JSONB staging store for the staged pipeline
Status      | collection_status               | Tracks core/secondary/facade/ML phases per repo
Credentials | worker_oauth                    | API key storage
Users       | users, user_sessions, user_repos | User accounts and auth
Config      | config                          | Runtime configuration
Workers     | worker_history, worker_job      | Worker run history

aveloxis_augur_data (6 views)

Augur compatibility layer for 8Knot and other Augur-era analytics tools. Contains views that alias Aveloxis column names to Augur conventions. Only tables with column name differences need views here — tables with identical columns resolve via the search_path fallback to aveloxis_data.

View          | Augur column aliases (Aveloxis name -> Augur name)
--------------|---------------------------------------------------
repo          | repos table exposed under the singular name, plus primary_language -> repo_language
repo_info     | star_count -> stars_count, watcher_count -> watchers_count
issues        | issue_number -> gh_issue_number, platform_issue_id -> gh_issue_id, closed_by_id -> cntrb_id
pull_requests | pr_number -> pr_src_number, author_id -> pr_augur_contributor_id, created_at -> pr_created_at, closed_at -> pr_closed_at, merged_at -> pr_merged_at
releases      | created_at -> release_created_at, published_at -> release_published_at, updated_at -> release_updated_at
message       | Alias for messages (Augur uses singular)

Usage: Set AUGUR_SCHEMA=aveloxis_augur_data,aveloxis_data (no space after comma) in 8Knot’s .env. PostgreSQL checks aveloxis_augur_data first (finding the aliased views), then falls through to aveloxis_data for all other tables. For existing Augur databases, use AUGUR_SCHEMA=augur_data — the compatibility schema is not needed.


Collection flow

The full collection flow for a single repo:

URL Check (prelim)
    |
    v
API Collection (phase 1)
    |-- Contributors (member lists)
    |-- Issues + labels + assignees
    |-- Pull requests + all children
    |-- Events (issue + PR)
    |-- Messages (comments)
    |-- Metadata (repo info, releases, clone stats)
    |
    v
Staging -> Processing (phase 2)
    |-- Contributors resolved (cache -> DB -> create)
    |-- Entities upserted in FK order
    |
    v
┌──────────────────────────────────────┐
│ Parallel execution                   │
├──────────────────┬───────────────────┤
│ Facade (phase 3) │ Analysis (phase 4)│
│  git clone/fetch │  Dependency scan  │
│  git log parse   │  Libyear (5 reg.) │
│  Commit parents  │  Code complexity  │
│  Affiliations    │  (scc)            │
│  Aggregates      │                   │
└──────────────────┴───────────────────┘
    |
    v
Commit Resolution (phase 5)
    |-- Noreply parse
    |-- DB lookup
    |-- Commits API
    |-- Search API
    |-- Alias creation
    |-- Backfill cmt_ght_author_id
    |
    v
Canonical Email Enrichment (phase 6)
    |
    v
Done -> repo re-queued with new due time
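
The "Noreply parse" step in the flow above relies on the fact that GitHub privacy addresses embed the login, and in the newer form also the numeric user ID. The following is a minimal sketch of such a parser; the function name parseNoreply and its return shape are assumptions for illustration and are not taken from noreply.go.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// noreplySuffix is the domain GitHub uses for privacy-preserving commit emails.
const noreplySuffix = "@users.noreply.github.com"

// parseNoreply (hypothetical name) extracts the login and, when present, the
// numeric user ID from a GitHub noreply address. The login is lowercased
// because GitHub logins are case-insensitive.
func parseNoreply(email string) (login string, id int64, ok bool) {
	addr := strings.ToLower(strings.TrimSpace(email))
	local, found := strings.CutSuffix(addr, noreplySuffix)
	if !found || local == "" {
		return "", 0, false
	}
	// Newer addresses look like "12345+octocat"; older ones are just "octocat".
	if idPart, name, hasPlus := strings.Cut(local, "+"); hasPlus {
		n, err := strconv.ParseInt(idPart, 10, 64)
		if err != nil || name == "" {
			return "", 0, false
		}
		return name, n, true
	}
	return local, 0, true
}

func main() {
	for _, e := range []string{
		"12345+octocat@users.noreply.github.com",
		"octocat@users.noreply.github.com",
		"dev@example.com",
	} {
		login, id, ok := parseNoreply(e)
		fmt.Printf("%-42s login=%q id=%d ok=%v\n", e, login, id, ok)
	}
}
```

When the ID is present, a resolver could match the contributor without any API call; the login-only form would still need the DB lookup or API steps listed after it.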

Key design decisions vs Augur

Postgres queue instead of Celery/Redis/RabbitMQ

Augur uses Celery with RabbitMQ and Redis for job queueing. Aveloxis uses a single PostgreSQL table with FOR UPDATE SKIP LOCKED. This eliminates three infrastructure dependencies and makes queue state fully transparent and queryable with plain SQL.

JSONB staging instead of direct writes

Augur writes to relational tables during API collection, causing contention on the contributors table when many workers collect simultaneously. Aveloxis stages raw API data as JSONB, then processes it single-threaded per repo. This eliminates contributor table contention at scale (400K+ repos).

Deterministic contributor IDs

Augur generates random UUIDs for contributor IDs, then runs post-hoc fix scripts. Aveloxis generates deterministic UUIDs from the platform user ID, ensuring the same user always gets the same UUID and enabling byte-compatible cross-system joins.
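
To illustrate the determinism property only, the sketch below derives a UUID from a platform name and user ID using the standard name-based scheme (version 5, SHA-1). This is a swapped-in technique: the byte layout Aveloxis uses in github_uuid.go is not shown in this document, so IDs produced here are not byte-compatible with Aveloxis or Augur. The function name contributorID and the choice of namespace are assumptions.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// namespace is the RFC 4122 URL namespace UUID, chosen arbitrarily for this
// sketch; any fixed 16 bytes would do.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x11, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// contributorID (hypothetical name) maps the same platform and user ID to the
// same UUID on every call and on every machine.
func contributorID(platform string, userID int64) string {
	h := sha1.New()
	h.Write(namespace[:])
	fmt.Fprintf(h, "%s:%d", platform, userID)

	var u [16]byte
	copy(u[:], h.Sum(nil))  // first 16 of the 20 hash bytes
	u[6] = u[6]&0x0f | 0x50 // set version 5
	u[8] = u[8]&0x3f | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	a := contributorID("github", 583231)
	b := contributorID("github", 583231)
	c := contributorID("gitlab", 583231)
	fmt.Println(a)
	fmt.Println("stable across calls:", a == b)
	fmt.Println("distinct per platform:", a != c)
}
```

Because the ID depends only on its inputs, two independent collectors that see the same user produce the same key, so no post-hoc reconciliation is needed.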

Bare clones instead of full clones

Aveloxis uses bare clones (permanent, smaller) for the facade phase and creates temporary full checkouts only for analysis. This reduces disk usage and avoids the overhead of maintaining working trees.

Built-in monitoring

Augur relies on Flower (a separate Celery monitoring service). Aveloxis includes a built-in HTTP dashboard and REST API.

Platform abstraction layer

Both GitHub and GitLab implement the same platform.Client interface with 7 sub-interfaces, ensuring feature parity. All methods use Go 1.23 iterators (iter.Seq2) for memory-efficient streaming pagination.

Known GitLab API limitations

The following data is collected from GitHub but not from GitLab, in most cases because of platform API constraints:

  • Community profile files (CHANGELOG, CONTRIBUTING, CODE_OF_CONDUCT, SECURITY) are not yet fetched for GitLab; this gap can be closed using the /repository/tree and /repository/files endpoints.

  • Watcher count — GitLab has no public watchers API (star_count is captured instead).

  • Clone statistics — GitLab exposes these only via admin-only endpoints.

  • GraphQL node IDs — GitLab uses numeric project/user IDs rather than GitHub-style GraphQL node IDs. Stored in SrcRepoID (numeric) instead of SrcNodeID.

  • Contributor URL fields — GitHub returns 10+ URL fields per user (followers, gists, starred, etc.) that GitLab’s API does not provide.

  • Contributor type — GitHub distinguishes User/Bot/Organization; GitLab does not expose this distinction.


Project structure

aveloxis/
  cmd/aveloxis/           # CLI entry point (cobra commands)
  internal/
    collector/            # Collection orchestration
      collector.go        # Direct pipeline
      staged.go           # Staged pipeline
      facade.go           # Git clone + log parsing
      commit_resolver.go  # Git email -> GitHub user resolution
      breadth.go          # Contributor breadth worker
      analysis.go         # Dependencies, libyear, scc
      noreply.go          # GitHub noreply email parser
      prelim.go           # Redirect detection and duplicate checking
    config/               # JSON config loading with defaults
    db/                   # Database layer
      postgres.go         # All upsert methods
      staging.go          # JSONB staging writer and processor
      migrate.go          # Schema migration
      schema.sql          # Full DDL (108 tables)
      matviews.sql        # 22 materialized views
      contributors.go     # Contributor resolver with cache
      affiliations.go     # Email domain -> org resolver
      aggregates.go       # Facade aggregate refresh
      github_uuid.go      # Deterministic UUID generation
      queue.go            # Priority queue operations
    model/                # Platform-agnostic data types
    monitor/              # HTTP dashboard and API
    platform/             # Platform abstraction
      github/             # GitHub REST API client
      gitlab/             # GitLab API v4 client
    scheduler/            # Queue polling, job dispatch

Scheduler internals

The scheduler (internal/scheduler/) is the long-running loop that drives all collection. It polls the Postgres-backed priority queue and dispatches collection workers.

Job dispatch

The scheduler uses a semaphore (buffered channel) sized to Workers to limit concurrency. Each poll tick attempts to acquire a semaphore slot and dequeue a job via SELECT ... FOR UPDATE SKIP LOCKED. If no slot is available or no job is due, the tick is skipped.

Phase execution within a job

Each job runs six phases. After the sequential API collection and processing phases (1-2), facade and analysis run in parallel since they operate on independent data (bare clone vs. temporary checkout). Commit resolution runs after both complete because it needs facade’s commit data.

Periodic background tasks

Task

Interval

Notes

Stale lock recovery

5 min

Reclaims jobs from crashed workers via StaleLockTimeout

Org refresh

Configurable (default 4h)

Scans GitHub orgs and GitLab groups for new/renamed repos

User org refresh

Same as org refresh

Scans user-requested org additions

Contributor breadth

6h

Discovers cross-repo activity via GitHub Events API

Matview rebuild

Weekly (Saturday)

Drains all workers, rebuilds 22 materialized views, resumes

Graceful shutdown

On context cancellation, the scheduler:

  1. Drains the semaphore (waits for all active workers to finish)

  2. Releases all queue locks held by this worker instance (repos return to queued immediately)

  3. Preserves any data already staged but not yet processed, so that it is processed on the next startup

Startup recovery

On startup, before entering the poll loop:

  1. Processes any leftover unprocessed staging rows from a previous interrupted run

  2. Recovers stale locks from any crashed worker instances

  3. Releases any locks held by our own worker ID (from a previous unclean shutdown)


Next steps