Aveloxis Schema Documentation
Reference documentation for the Aveloxis database schema – the Go-based open source community health metrics collection pipeline.
Overview
Aveloxis uses two PostgreSQL schemas to separate collected data from operational state:
aveloxis_data– Collected open source community health data. Contains tables for repositories, contributors, issues, pull requests, commits, releases, messages, dependency information, and aggregated data mart views. These tables hold the output of collection workers that talk to GitHub, GitLab, and local git clones.aveloxis_ops– Operational and orchestration tables. Contains the collection queue, staging area, API credentials, user accounts, worker state, and configuration. These tables drive the collection pipeline itself.
Both schemas maintain full parity with Augur’s augur_data and augur_operations schemas. All CREATE TABLE statements use IF NOT EXISTS and inserts use ON CONFLICT DO NOTHING for idempotent migrations.
Metadata Columns
Most tables in aveloxis_data include four provenance columns. They are documented once here and referenced as “Standard metadata columns” in the per-table documentation below.
Column |
Type |
Default |
Description |
|---|---|---|---|
|
TEXT |
|
Identifies which component created the row. Values: |
|
TEXT |
|
Version of the tool that created the row. Currently empty; reserved for future use. |
|
TEXT |
|
Where the raw data came from. Values: |
|
TIMESTAMPTZ |
|
Timestamp of when this row was inserted or last updated. Auto-set to |
aveloxis_data Schema
Core Tables
platforms
Lookup table for supported forge platforms. Seeded on schema creation with GitHub (1), GitLab (2), and Mailing List (6 — the transport platform for mailing-list-sourced rows; see Mailing-list ingestion).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
SMALLINT (PK) |
Seeded |
Primary key. |
|
TEXT NOT NULL UNIQUE |
Seeded |
Human-readable platform name. |
repo_groups
Logical groupings of repositories, typically representing a project, organization, or research cohort.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT NOT NULL |
User input |
Name of the repo group (e.g., “CHAOSS”, “Linux Foundation”). |
|
TEXT |
User input |
Free-text description of the group. |
|
TEXT |
User input |
URL for the group’s website. |
|
SMALLINT |
User input |
Flag indicating whether the group should be re-cached. |
|
TIMESTAMPTZ |
Auto-generated |
Timestamp of last modification. Defaults to |
|
TEXT |
User input |
Classification of the group type. |
Standard metadata columns |
repos
Central repository table. Every collected repository has exactly one row here. All entity tables (issues, PRs, commits, etc.) reference repo_id.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key, referenced by nearly every other table. |
|
BIGINT (FK -> repo_groups) |
User input |
The group this repo belongs to. |
|
SMALLINT NOT NULL (FK -> platforms) |
User input |
|
|
TEXT NOT NULL UNIQUE |
User input |
Git clone URL. Serves as the natural unique key. |
|
TEXT |
GitHub REST: |
Short repository name (e.g., |
|
TEXT |
GitHub REST: |
Owner or namespace (e.g., |
|
TEXT |
Computed |
Local filesystem path to the cloned repo (used by facade). |
|
TEXT |
GitHub REST: |
Repository description from the forge. |
|
TEXT |
GitHub REST: |
Primary programming language. |
|
TEXT |
GitHub REST: |
URL of the parent repo if this is a fork. |
|
BOOLEAN |
GitHub REST: |
Whether the repo is archived on the forge. |
|
TEXT |
GitHub REST: |
The forge’s numeric ID for this repo, stored as text for cross-platform compatibility. |
|
TIMESTAMPTZ |
GitHub REST: |
When the repo was created on the forge. |
|
TIMESTAMPTZ |
GitHub REST: |
When the repo was last updated on the forge. |
Standard metadata columns |
Unique constraint: (repo_git)
repo_groups_list_serve
Mailing list registry + per-list collection state. Each row is a list the MailingListWorker collects; the mlls_* columns (v0.25.7) carry the claim/checkpoint/lock state. See Mailing-list ingestion.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repo_groups) |
User input |
The repo group this mailing list belongs to. |
|
TEXT |
User input |
Mailing list name. |
|
TEXT |
User input |
Description of the mailing list. |
|
TEXT |
User input |
Organization sponsoring the list. |
|
TEXT |
User input |
List address (e.g. |
|
TEXT |
Registration |
Archive system definition that applies ( |
|
TEXT |
Worker |
|
|
BOOLEAN |
Worker |
Partial-scan flag; FALSE → re-eligible immediately. |
|
INTEGER |
Worker |
Consecutive failure counter (quadratic backoff; sideline at 10). |
|
TIMESTAMPTZ |
Worker |
Backoff gate input. |
|
TIMESTAMPTZ |
Worker |
Last successful tail-refresh. |
|
TIMESTAMPTZ / INTEGER / TEXT |
Worker |
|
Standard metadata columns |
Unique constraint: (repo_group_id, rgls_email)
email_message
A mailing-list email as a first-class entity (v0.25.7) — peer to issues / pull_requests / pull_request_reviews. The body lives in messages (linked via email_message_ref); this table carries the classification (Axis A) and repo association (Axis B). Declared after issues/pull_requests/messages in schema.sql because it FK-references all three. See Mailing-list ingestion.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FKs) |
Worker |
The repo / group / list this message came from. |
|
SMALLINT NOT NULL (FK -> platforms) |
Worker |
Always |
|
TEXT |
Worker |
Archive system ( |
|
TEXT NOT NULL UNIQUE |
RFC-822 Message-ID (idempotency key). |
|
|
TEXT |
List the message came from. |
|
|
TEXT / TEXT / TIMESTAMPTZ |
Parsed header fields. |
|
|
TEXT |
Threading: In-Reply-To, References, resolved thread root. |
|
|
BOOLEAN |
Classifier |
Body contains a patch. |
|
TEXT |
Classifier |
Axis A: |
|
TEXT |
Classifier |
Which rule fired (subject_regex / body_url / sender / list_id / list_address). |
|
BOOLEAN / TEXT |
Classifier |
Mirror-list mail that echoes GitHub activity. |
|
TEXT |
Classifier |
Axis B: canonical repo URL the message signals (captured even if not in catalog). |
|
BIGINT (FK -> repos, ON DELETE SET NULL) |
Resolver |
Resolved repo FK (NULL until the URL matches a loaded repo). |
|
BIGINT (FKs) |
Router |
Routed target when the message is an issue/PR event. |
|
TEXT |
Router |
Jira/Bugzilla key / commit the message references. |
Standard metadata columns ( |
email_message_ref
Bridge from email_message to the shared messages table (the body), mirroring issue_message_ref / pull_request_message_ref.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> email_message) |
Worker |
The email entity. |
|
BIGINT NOT NULL (FK -> messages) |
Worker |
The body row in |
|
BIGINT (FK -> repo_groups) |
Worker |
Group context. |
Standard metadata columns |
Unique constraint: (email_message_id, msg_id)
Contributors
contributors
Platform-agnostic contributor identity. Each unique person across GitHub and GitLab maps to one row. Contains both canonical fields and platform-specific profile data.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
UUID (PK) |
Auto-generated ( |
Primary key. Referenced by issues, PRs, commits, etc. |
|
TEXT NOT NULL |
GitHub REST: |
Platform login/username. |
|
TEXT |
GitHub REST: |
Email address (may be empty if private). |
|
TEXT |
GitHub REST: |
Full display name. |
|
TEXT |
GitHub REST: |
Company affiliation from profile. |
|
TEXT |
GitHub REST: |
Location string from profile. |
|
TEXT |
Computed / |
Canonical email used for identity merging. |
|
TEXT |
GitHub REST: |
Account type (e.g., |
|
SMALLINT |
Computed |
Flag for synthetic/placeholder contributors. |
|
SMALLINT |
Computed |
Soft-delete flag. |
|
NUMERIC(11,8) |
Computed |
Longitude from geocoded location. |
|
NUMERIC(10,8) |
Computed |
Latitude from geocoded location. |
|
CHAR(3) |
Computed |
ISO 3166 country code from geocoded location. |
|
TEXT |
Computed |
State/province from geocoded location. |
|
TEXT |
Computed |
City from geocoded location. |
|
TIMESTAMPTZ |
Computed |
Timestamp of most recent activity by this contributor. |
|
BIGINT |
GitHub REST: |
GitHub’s numeric user ID. |
|
TEXT |
GitHub REST: |
GitHub login name. |
|
TEXT |
GitHub REST: |
GitHub API URL for this user. |
|
TEXT |
GitHub REST: |
GitHub profile URL. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT |
GitHub REST: |
GitHub avatar image URL. |
|
TEXT |
GitHub REST: |
Gravatar ID (legacy). |
|
TEXT |
GitHub REST: |
API URL for followers list. |
|
TEXT |
GitHub REST: |
API URL for following list. |
|
TEXT |
GitHub REST: |
API URL for gists. |
|
TEXT |
GitHub REST: |
API URL for starred repos. |
|
TEXT |
GitHub REST: |
API URL for subscriptions. |
|
TEXT |
GitHub REST: |
API URL for orgs membership. |
|
TEXT |
GitHub REST: |
API URL for user’s repos. |
|
TEXT |
GitHub REST: |
API URL for user’s events. |
|
TEXT |
GitHub REST: |
API URL for received events. |
|
TEXT |
GitHub REST: |
GitHub account type string. |
|
TEXT |
GitHub REST: |
Whether user is a GitHub site admin. |
|
TEXT |
GitLab API v4: |
GitLab profile web URL. |
|
TEXT |
GitLab API v4: |
GitLab avatar URL. |
|
TEXT |
GitLab API v4: |
GitLab account state (e.g., |
|
TEXT |
GitLab API v4: |
GitLab username. |
|
TEXT |
GitLab API v4: |
GitLab display name. |
|
BIGINT |
GitLab API v4: |
GitLab numeric user ID. |
|
TIMESTAMPTZ |
GitHub REST: |
When the account was created on the forge. |
Standard metadata columns |
Unique index: (cntrb_login) WHERE cntrb_login != ''
contributor_identities
Maps a contributor to per-platform identities. One cntrb_id may have identities on both GitHub and GitLab.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
UUID NOT NULL (FK -> contributors) |
Computed |
The canonical contributor this identity belongs to. |
|
SMALLINT NOT NULL (FK -> platforms) |
Computed |
Which platform this identity is on. |
|
BIGINT NOT NULL |
GitHub REST: |
Numeric user ID on the platform. |
|
TEXT NOT NULL |
GitHub REST: |
Username on the platform. |
|
TEXT |
GitHub REST: |
Display name. |
|
TEXT |
GitHub REST: |
Email address. |
|
TEXT |
GitHub REST: |
Avatar image URL. |
|
TEXT |
GitHub REST: |
Profile page URL. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID (empty for GitLab). |
|
TEXT |
GitHub REST: |
Account type. Default |
|
BOOLEAN |
GitHub REST: |
Whether the user is a site/instance admin. |
Unique constraint: (platform_id, platform_user_id)
contributors_aliases
Maps alternate email addresses to a contributor’s canonical email. Used by the commit resolver to unify git commit emails with API identities.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
UUID NOT NULL (FK -> contributors) |
|
The contributor this alias belongs to. |
|
TEXT NOT NULL |
|
The contributor’s canonical email. |
|
TEXT NOT NULL UNIQUE |
|
An alternate email that maps to this contributor. |
|
SMALLINT NOT NULL |
|
Whether this alias is active. |
|
TIMESTAMPTZ |
Auto-generated |
Last modification timestamp. |
Standard metadata columns |
contributor_affiliations
Maps email domains to organizational affiliations. Used to attribute commits and activity to companies/organizations.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT NOT NULL UNIQUE |
User input / Augur import |
Email domain (e.g., |
|
DATE |
User input |
Date the affiliation began. Default |
|
TIMESTAMPTZ |
Computed |
When this affiliation was last matched. |
|
TEXT |
User input |
Organization name (e.g., |
|
SMALLINT |
User input |
Whether this mapping is active. |
Standard metadata columns |
contributor_repo
Records contributor activity events tied to specific repositories. Tracks which contributors interact with which repos and how.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
UUID NOT NULL (FK -> contributors) |
GitHub REST: |
The contributor involved. |
|
TEXT NOT NULL |
GitHub REST: |
Git URL of the repo. |
|
TEXT NOT NULL |
GitHub REST: |
Repository name. |
|
BIGINT NOT NULL |
GitHub REST: |
GitHub’s numeric repository ID. |
|
TEXT |
Computed |
Category of the contribution (e.g., event type). |
|
BIGINT |
GitHub REST: |
Platform event ID that triggered this record. |
|
TIMESTAMPTZ |
GitHub REST: |
Timestamp of the event. |
Standard metadata columns |
Unique constraint: (event_id, tool_version)
contributors_old
Legacy backup table for contributor data. Holds a snapshot of contributor records before a migration or schema change. Structure mirrors the contributors table.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
UUID (PK) |
Augur import |
Primary key (not auto-generated; copied from contributors). |
|
TEXT |
Augur import |
Platform login. |
|
TEXT |
Augur import |
Email address. |
|
TEXT |
Augur import |
Full display name. |
|
TEXT |
Augur import |
Company affiliation. |
|
TIMESTAMPTZ |
Augur import |
Account creation timestamp. |
|
TEXT |
Augur import |
Account type. |
|
SMALLINT |
Augur import |
Fake flag. |
|
SMALLINT |
Augur import |
Soft-delete flag. |
|
NUMERIC(11,8) |
Augur import |
Longitude. |
|
NUMERIC(10,8) |
Augur import |
Latitude. |
|
CHAR(3) |
Augur import |
ISO country code. |
|
TEXT |
Augur import |
State/province. |
|
TEXT |
Augur import |
City. |
|
TEXT |
Augur import |
Location string. |
|
TEXT |
Augur import |
Canonical email. |
|
TIMESTAMPTZ |
Augur import |
Last activity timestamp. |
|
BIGINT |
Augur import |
GitHub user ID. |
|
TEXT |
Augur import |
GitHub login. |
|
TEXT |
Augur import |
GitHub API URL. |
|
TEXT |
Augur import |
GitHub profile URL. |
|
TEXT |
Augur import |
GitHub node ID. |
|
TEXT |
Augur import |
GitHub avatar URL. |
|
TEXT |
Augur import |
Gravatar ID. |
|
TEXT |
Augur import |
Followers API URL. |
|
TEXT |
Augur import |
Following API URL. |
|
TEXT |
Augur import |
Gists API URL. |
|
TEXT |
Augur import |
Starred API URL. |
|
TEXT |
Augur import |
Subscriptions API URL. |
|
TEXT |
Augur import |
Organizations API URL. |
|
TEXT |
Augur import |
Repos API URL. |
|
TEXT |
Augur import |
Events API URL. |
|
TEXT |
Augur import |
Received events API URL. |
|
TEXT |
Augur import |
GitHub account type. |
|
TEXT |
Augur import |
Site admin flag. |
|
TEXT |
Augur import |
GitLab web URL. |
|
TEXT |
Augur import |
GitLab avatar URL. |
|
TEXT |
Augur import |
GitLab account state. |
|
TEXT |
Augur import |
GitLab username. |
|
TEXT |
Augur import |
GitLab full name. |
|
BIGINT |
Augur import |
GitLab user ID. |
Standard metadata columns |
unresolved_commit_emails
Holds email addresses found in git commits that could not be resolved to a known contributor. The commit resolver processes these to attempt matching.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT NOT NULL |
Git: |
The unresolved email from a commit. |
|
TEXT |
Git: |
The name associated with the email in the commit. |
Standard metadata columns |
Issues
issues
Issue tracker records from GitHub Issues or GitLab Issues. Each row represents one issue in one repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository this issue belongs to. |
|
BIGINT NOT NULL |
GitHub REST: |
The platform’s numeric ID for this issue. |
|
INT NOT NULL |
GitHub REST: |
Human-readable issue number (e.g., |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID (empty for GitLab). |
|
TEXT |
GitHub REST: |
Issue title. |
|
TEXT |
GitHub REST: |
Issue body/description text. |
|
TEXT |
GitHub REST: |
Current state: |
|
TEXT |
GitHub REST: |
API URL for the issue. |
|
TEXT |
GitHub REST: |
Web URL for the issue. |
|
UUID (FK -> contributors) |
GitHub REST: |
Contributor who opened the issue. |
|
UUID (FK -> contributors) |
GitHub REST: |
Contributor who closed the issue (null if open). |
|
BIGINT |
GitHub REST: |
Non-null if this issue is actually a PR (GitHub conflates the two). |
|
BIGINT |
Computed |
Foreign key to |
|
TIMESTAMPTZ |
GitHub REST: |
When the issue was created. |
|
TIMESTAMPTZ |
GitHub REST: |
When the issue was last updated. |
|
TIMESTAMPTZ |
GitHub REST: |
When the issue was closed (null if open). |
|
TIMESTAMPTZ |
GitHub REST: |
Due date from milestone. |
|
INT |
GitHub REST: |
Number of comments on the issue. |
|
TEXT |
|
Bracketed |
Standard metadata columns |
Unique constraint: (repo_id, platform_issue_id)
issue_labels
Labels attached to issues.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> issues) |
Computed |
The issue this label is on. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository for denormalized querying. |
|
BIGINT |
GitHub REST: |
The platform’s numeric ID for this label. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT NOT NULL |
GitHub REST: |
Label name text (e.g., |
|
TEXT |
GitHub REST: |
Label description. |
|
TEXT |
GitHub REST: |
Hex color code (e.g., |
Standard metadata columns |
Unique constraint: (issue_id, label_text)
issue_assignees
Users assigned to an issue.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> issues) |
Computed |
The issue. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository for denormalized querying. |
|
UUID (FK -> contributors) |
GitHub REST: |
The assigned contributor. |
|
BIGINT |
GitHub REST: |
Platform’s numeric ID for the assignee. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
Standard metadata columns |
Unique constraint: (issue_id, platform_assignee_id)
issue_events
Timeline events on issues (labeled, closed, assigned, referenced, etc.).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> issues) |
Computed |
The issue this event occurred on. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository for denormalized querying. |
|
UUID (FK -> contributors) |
GitHub REST: |
The contributor who triggered the event. |
|
SMALLINT NOT NULL (FK -> platforms) |
Computed |
Platform where the event occurred. |
|
BIGINT NOT NULL |
GitHub REST: |
Platform’s numeric event ID. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT NOT NULL |
GitHub REST: |
Event type (e.g., |
|
TEXT |
GitHub REST: |
Commit SHA if the event references a commit. |
|
TIMESTAMPTZ |
GitHub REST: |
When the event occurred. |
Standard metadata columns |
Unique constraint: (repo_id, platform_event_id)
issue_message_ref
Join table linking issues to messages (comments). Each row maps one comment to one issue.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> issues) |
Computed |
The issue. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository for denormalized querying. |
|
BIGINT NOT NULL (FK -> messages) |
Computed |
The message/comment. |
|
BIGINT |
GitHub (REST |
Platform’s comment ID. |
|
TEXT |
GitHub (REST |
GitHub GraphQL node ID. |
Standard metadata columns |
Unique constraint: (issue_id, msg_id)
Pull Requests
pull_requests
Pull requests (GitHub) or merge requests (GitLab). Each row represents one PR/MR in one repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository this PR belongs to. |
|
BIGINT NOT NULL |
GitHub REST: |
Platform’s numeric PR/MR ID. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
INT NOT NULL |
GitHub REST: |
Human-readable PR number. |
|
TEXT |
GitHub REST: |
API URL. |
|
TEXT |
GitHub REST: |
Web URL. |
|
TEXT |
GitHub REST: |
URL to the diff. |
|
TEXT |
GitHub REST: |
PR title. |
|
TEXT |
GitHub REST: |
PR body/description. |
|
TEXT |
GitHub REST: |
State: |
|
BOOLEAN |
GitHub REST: |
Whether the PR conversation is locked. |
|
TIMESTAMPTZ |
GitHub REST: |
Creation timestamp. |
|
TIMESTAMPTZ |
GitHub REST: |
Last update timestamp. |
|
TIMESTAMPTZ |
GitHub REST: |
Close timestamp (null if open). |
|
TIMESTAMPTZ |
GitHub REST: |
Merge timestamp (null if not merged). |
|
TEXT |
GitHub REST: |
SHA of the merge commit. |
|
UUID (FK -> contributors) |
GitHub REST: |
PR author. |
|
TEXT |
GitHub REST: |
Author’s association to the repo (e.g., |
|
BIGINT |
Computed |
FK to |
|
BIGINT |
Computed |
FK to |
Standard metadata columns |
Unique constraint: (repo_id, platform_pr_id)
pull_request_labels
Labels attached to pull requests.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR this label is on. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository for denormalized querying. |
|
BIGINT |
GitHub REST: |
Platform’s label ID. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT NOT NULL |
GitHub REST: |
Label name text. |
|
TEXT |
GitHub REST: |
Label description. |
|
TEXT |
GitHub REST: |
Hex color code. |
|
BOOLEAN |
GitHub REST: |
Whether this is a default label. |
Standard metadata columns |
Unique constraint: (pull_request_id, label_name)
pull_request_assignees
Users assigned to a pull request.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
UUID (FK -> contributors) |
GitHub REST: |
The assigned contributor. |
|
BIGINT |
GitHub REST: |
Platform’s assignee ID. |
Standard metadata columns |
Unique constraint: (pull_request_id, platform_assignee_id)
pull_request_reviewers
Requested reviewers on a pull request.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
UUID (FK -> contributors) |
GitHub REST: |
The reviewer. |
|
BIGINT |
GitHub REST: |
Platform’s reviewer ID. |
Standard metadata columns |
Unique constraint: (pull_request_id, platform_reviewer_id)
pull_request_reviews
Submitted reviews on a pull request (approved, changes requested, commented).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR being reviewed. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
UUID (FK -> contributors) |
GitHub REST: |
The reviewer. |
|
SMALLINT NOT NULL (FK -> platforms) |
Computed |
Platform. |
|
BIGINT NOT NULL |
GitHub REST: |
Platform’s review ID. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT |
GitHub REST: |
State: |
|
TEXT |
GitHub REST: |
Review body text. |
|
TIMESTAMPTZ |
GitHub REST: |
When the review was submitted. |
|
TEXT |
GitHub REST: |
Reviewer’s association to the repo. |
|
TEXT |
GitHub REST: |
SHA of the commit the review was made against. |
|
TEXT |
GitHub REST: |
Web URL of the review. |
Standard metadata columns |
Unique constraint: (pull_request_id, platform_review_id)
pull_request_meta
Branch metadata for the head and base of a pull request.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
UUID (FK -> contributors) |
GitHub REST: |
Owner of the head/base repo. |
|
TEXT NOT NULL |
Computed |
|
|
TEXT |
GitHub REST: |
Label string (e.g., |
|
TEXT |
GitHub REST: |
Branch name. |
|
TEXT |
GitHub REST: |
Commit SHA at the tip of the branch. |
Standard metadata columns |
Unique constraint: (pull_request_id, head_or_base)
pull_request_commits
Commits included in a pull request.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
UUID (FK -> contributors) |
GitHub REST: |
Commit author. |
|
TEXT NOT NULL |
GitHub REST: |
Commit SHA. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT |
GitHub REST: |
Commit message. |
|
TEXT |
GitHub REST: |
Author email from the commit. |
|
TIMESTAMPTZ |
GitHub REST: |
Commit timestamp. |
Standard metadata columns |
Unique constraint: (pull_request_id, pr_cmt_sha)
pull_request_files
Files changed in a pull request.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
TEXT NOT NULL |
GitHub REST: |
Path of the changed file. |
|
INT |
GitHub REST: |
Lines added. |
|
INT |
GitHub REST: |
Lines removed. |
Standard metadata columns |
Unique constraint: (pull_request_id, pr_file_path)
pull_request_events
Timeline events on pull requests (labeled, closed, merged, review_requested, etc.).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
UUID (FK -> contributors) |
GitHub REST: |
Contributor who triggered the event. |
|
SMALLINT NOT NULL (FK -> platforms) |
Computed |
Platform. |
|
BIGINT NOT NULL |
GitHub REST: |
Platform’s event ID. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT NOT NULL |
GitHub REST: |
Event type (e.g., |
|
TEXT |
GitHub REST: |
Commit SHA if the event references a commit. |
|
TIMESTAMPTZ |
GitHub REST: |
When the event occurred. |
Standard metadata columns |
Unique constraint: (repo_id, platform_event_id)
pull_request_repo
Fork repository metadata referenced in pull requests. Records source repos for PR head/base branches that point to forks.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT |
Computed |
FK to |
|
TEXT |
Computed |
|
|
BIGINT |
GitHub REST: |
Platform’s numeric ID for the source repo. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID of the source repo. |
|
TEXT |
GitHub REST: |
Short name of the fork repo. |
|
TEXT |
GitHub REST: |
Full name (e.g., |
|
BOOLEAN |
GitHub REST: |
Whether the fork is private. |
|
UUID (FK -> contributors) |
GitHub REST: |
Owner of the fork repo. |
Standard metadata columns |
pull_request_review_message_ref
Bridge table linking a pull request review to its review-body message in the messages table. One row per review whose author submitted body text alongside their Approve / Request Changes / Comment action. Reviews submitted without body text (the common “Approved” with no comment case) do NOT create a row here.
This is NOT the inline-diff-comment table — those live in review_comments (below) and have their own messages-table row each. The schema retains a number of diff-position columns (pr_review_msg_diff_hunk, pr_review_msg_path, pr_review_msg_position, etc.) from the Augur-era schema, but Aveloxis writes ONLY pr_review_id, repo_id, msg_id, pr_review_src_id, and pr_review_msg_node_id. The diff-position columns are kept for schema parity / 8Knot compatibility but are not populated.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_request_reviews) |
Computed |
The review whose body is linked. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
BIGINT NOT NULL |
Computed |
FK to messages table; carries the review body text. |
|
BIGINT |
GitHub REST: |
Platform’s review ID. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID for the review. |
|
(various) |
Not populated by Aveloxis |
Vestigial Augur-era columns. Use |
Standard metadata columns |
pull_request_teams
Teams requested to review a pull request.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> pull_requests) |
Computed |
The PR. |
|
BIGINT |
GitHub REST: |
Platform’s team ID. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT |
GitHub REST: |
API URL for the team. |
|
TEXT |
GitHub REST: |
Team name. |
|
TEXT |
GitHub REST: |
Team slug (URL-safe name). |
|
TEXT |
GitHub REST: |
Team description. |
|
TEXT |
GitHub REST: |
Privacy level (e.g., |
|
TEXT |
GitHub REST: |
Permission level (e.g., |
|
TEXT |
GitHub REST: |
API URL for team members. |
|
TEXT |
GitHub REST: |
API URL for team repos. |
|
BIGINT |
GitHub REST: |
Parent team ID (for nested teams). |
Standard metadata columns |
pull_request_analysis
ML-based merge prediction results for pull requests. Schema parity table; not yet populated by Aveloxis workers.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> pull_requests) |
Not yet populated |
The PR analyzed. |
|
NUMERIC(256,250) |
Not yet populated |
Predicted probability of merge. |
|
TEXT |
Not yet populated |
ML model or heuristic used. |
Standard metadata columns |
Messages and Comments
messages
Unified comment/message table shared by issues and pull requests. Each row is one piece of text. The four bridge tables (issue_message_ref, pull_request_message_ref, pull_request_review_message_ref, review_comments) link messages to their semantic origin, enabling cross-type text analysis. The four kinds of text stored here:
Issue conversation comments — comments under an issue. Linked via
issue_message_ref.PR conversation comments — comments under a PR (the IssueComment-shaped nodes GitHub serves under
/issues/{n}/commentson REST, even for PRs). Linked viapull_request_message_ref.PR review bodies — text submitted alongside an Approve / Request Changes / Comment review action, when non-empty. Linked via
pull_request_review_message_ref.Inline PR review comments — diff-anchored comments attached to a specific file/line. Linked via
review_comments(carries the diff position metadata).
Source endpoints by mode (GitHub):
pr_child_mode=restANDlisting_mode=rest: comments fetched via/issues/comments(covers both issue and PR conversation comments),/pulls/comments(inline review comments),/pulls/{n}/reviews(review bodies).pr_child_mode=graphqlANDlisting_mode=graphql(v0.18.5+ Phase 4): issue and PR conversation comments arrive inline via the GraphQL listing + PR-batch queries./issues/commentsis skipped entirely by the staged collector’sfullGraphQLMode()gate. Review bodies are inline viareviewsconnection on PullRequest. The/pulls/commentsREST iterator continues to run regardless of mode because GraphQL’s PullRequestReviewComment type does not exposeside/startSide.
GitLab: all four kinds come from /projects/{id}/merge_requests/{n}/notes and /projects/{id}/merge_requests/{n}/discussions (REST composition; no GraphQL path on GitLab).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
BIGINT |
Computed |
Optional FK to |
|
BIGINT NOT NULL |
GitHub (REST or GraphQL inline, see modes above) ; GitLab |
Platform’s comment / review ID. For review bodies this is the platform review ID. |
|
SMALLINT NOT NULL (FK -> platforms) |
Computed |
Platform. |
|
TEXT |
GitHub (REST or GraphQL inline) |
GitHub GraphQL node ID. |
|
TEXT |
GitHub (REST or GraphQL inline) ; GitLab |
Comment / review body text. |
|
TIMESTAMPTZ |
GitHub (REST or GraphQL inline) ; GitLab |
When the message was posted (or review submitted). |
|
TEXT |
Computed |
Email of the comment author (resolved from contributor). |
|
TEXT |
Not yet populated |
Message header (for mailing list messages). |
|
UUID (FK -> contributors) |
Computed (resolved from platform user data) |
Comment / review author. |
Standard metadata columns |
Unique constraint: (platform_msg_id, platform_id)
pull_request_message_ref
Join table linking pull requests to their conversation comments in the messages table. These are the IssueComment-shaped nodes GitHub serves under /issues/{n}/comments on REST (even for PRs); they are NOT diff-anchored inline review comments — those live in review_comments.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> pull_requests) |
Computed |
The PR. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
BIGINT NOT NULL (FK -> messages) |
Computed |
The message/comment. |
|
BIGINT |
GitHub (REST |
Platform’s comment ID. |
|
TEXT |
GitHub (REST or GraphQL inline) |
GitHub GraphQL node ID. |
Standard metadata columns |
Unique constraint: (pull_request_id, msg_id)
review_comments
Inline code review comments with full diff positioning. One row per diff-anchored review comment. Each row links to both a pull_request_reviews row (the parent review) and a messages row (the comment body text).
Always populated via REST /repos/{o}/{r}/pulls/comments regardless of pr_child_mode / listing_mode configuration. GitHub’s GraphQL PullRequestReviewComment type does not expose the side / startSide fields aveloxis requires for diff-side fidelity, so the v0.18.5 Phase 4 GraphQL-inline-comments work explicitly carved this endpoint out and kept the REST iterator running. See CLAUDE.md’s “Inline REVIEW comments NOT fetched via GraphQL — deliberate” subsection for the schema-validation history.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> pull_request_reviews) |
Computed |
The parent review. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
BIGINT NOT NULL (FK -> messages) |
Computed |
The message body. |
|
BIGINT |
GitHub REST: |
Platform’s comment ID. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
TEXT |
GitHub REST: |
Diff hunk context. |
|
TEXT |
GitHub REST: |
File path the comment is on. |
|
INT |
GitHub REST: |
Line position in the diff. |
|
INT |
GitHub REST: |
Original line position. |
|
TEXT |
GitHub REST: |
Commit SHA the comment is on. |
|
TEXT |
GitHub REST: |
Original commit SHA. |
|
INT |
GitHub REST: |
End line. |
|
INT |
GitHub REST: |
Original end line. |
|
TEXT |
GitHub REST: |
Diff side ( |
|
INT |
GitHub REST: |
Multi-line start. |
|
INT |
GitHub REST: |
Original multi-line start. |
|
TEXT |
GitHub REST: |
Diff side for start line. |
|
TEXT |
GitHub REST: |
Author’s association to the repo. |
|
TEXT |
GitHub REST: |
Web URL. |
|
TIMESTAMPTZ |
GitHub REST: |
Last update timestamp. |
Unique constraint: (repo_id, platform_src_id)
Commits (Git/Facade)
commits
Commit data from git log --numstat on the repo’s default branch only (resolved via git symbolic-ref HEAD). One row per file per commit (the same commit hash appears multiple times, once for each file touched). Populated by the facade worker.
Note (v0.16.4+): earlier versions of aveloxis walked
git log --all --numstatacross every ref. That over-counted commits relative to GitHub’s metadatacommit_count, which reflects only the default branch. The facade was changed in v0.16.4 to use the default branch exclusively so the gathered count matches the metadata. Commits on side branches that never merged to default are NOT in this table by design.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
TEXT NOT NULL |
Git: |
Full SHA-1 hash. |
|
TEXT NOT NULL |
Git: |
Author name from the commit. |
|
TEXT NOT NULL |
Git: |
Author email exactly as it appears in the commit. |
|
TEXT NOT NULL |
|
Resolved/canonical author email. |
|
TEXT NOT NULL |
Git: |
Author date string. |
|
TEXT |
|
Resolved organizational affiliation of the author. |
|
TEXT NOT NULL |
Git: |
Committer name. |
|
TEXT NOT NULL |
Git: |
Committer email exactly as in the commit. |
|
TEXT NOT NULL |
|
Resolved/canonical committer email. |
|
TEXT NOT NULL |
Git: |
Committer date string. |
|
TEXT |
|
Resolved organizational affiliation of the committer. |
|
INT NOT NULL |
Git: |
Lines added in this file. |
|
INT NOT NULL |
Git: |
Lines removed in this file. |
|
INT NOT NULL |
Git: |
Whitespace-only changes in this file. |
|
TEXT NOT NULL |
Git: |
Path of the file changed. |
|
TIMESTAMPTZ NOT NULL |
Auto-generated |
When this row was first processed. |
|
INT |
Augur import |
Legacy GHTorrent committer ID. |
|
TIMESTAMPTZ |
Augur import |
Legacy GHTorrent commit timestamp. |
|
TIMESTAMPTZ |
Git: |
Parsed committer timestamp. |
|
TIMESTAMPTZ |
Git: |
Parsed author timestamp. |
|
TEXT |
|
Platform username resolved from the commit email. |
|
UUID |
|
FK to contributors (resolved author). |
Standard metadata columns |
Indexes: (repo_id, cmt_commit_hash), (cmt_author_email), (cmt_author_raw_email), (cmt_committer_raw_email), (cmt_author_affiliation)
commit_parents
Parent-child relationships between commits. Used to reconstruct commit DAGs and identify merge commits.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL (PK part 1) |
Git: |
The child commit ID. |
|
BIGSERIAL NOT NULL (PK part 2) |
Git: |
Auto-incrementing parent ordinal. |
Standard metadata columns |
Primary key: (cmt_id, parent_id)
commit_messages
Stores the full commit message text, deduplicated per repo and commit hash.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
TEXT NOT NULL |
Git: |
Full commit message text. |
|
TEXT NOT NULL |
Git: |
Commit SHA. |
Standard metadata columns |
Unique constraint: (repo_id, cmt_hash)
commit_comment_ref
Comments made on specific commits (not PR review comments, but commit-level comments from the forge).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL |
Computed |
The commit this comment is on. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
BIGINT NOT NULL |
Computed |
FK to messages table. |
|
BIGINT NOT NULL |
GitHub REST: |
Platform user ID of the commenter. |
|
TEXT |
GitHub REST: |
Comment body text. |
|
BIGINT |
GitHub REST: |
Line number in the file. |
|
BIGINT |
GitHub REST: |
Position in the diff. |
|
TEXT |
GitHub REST: |
GitHub GraphQL node ID. |
|
BIGINT NOT NULL UNIQUE |
GitHub REST: |
Platform’s comment ID. |
|
TIMESTAMPTZ |
GitHub REST: |
When the comment was created. |
Standard metadata columns |
Releases
releases
Software releases and tags from the forge.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
TEXT NOT NULL (PK part 1) |
GitHub REST: |
Platform’s release identifier. |
|
BIGINT NOT NULL (PK part 2, FK -> repos) |
Computed |
Repository. |
|
TEXT |
GitHub REST: |
Release title. |
|
TEXT |
GitHub REST: |
Release notes/body. |
|
TEXT |
GitHub REST: |
Author login. |
|
TEXT |
GitHub REST: |
Git tag name (e.g., |
|
TEXT |
GitHub REST: |
Web URL. |
|
TIMESTAMPTZ |
GitHub REST: |
Creation timestamp. |
|
TIMESTAMPTZ |
GitHub REST: |
Publication timestamp. |
|
TIMESTAMPTZ |
GitHub REST: |
Last update timestamp. |
|
BOOLEAN |
GitHub REST: |
Whether the release is a draft. |
|
BOOLEAN |
GitHub REST: |
Whether the release is a pre-release. |
|
BOOLEAN |
Computed |
|
Standard metadata columns |
Primary key: (repo_id, release_id)
Repository Metadata
repo_info
Point-in-time snapshots of repository metadata and statistics. A new row is inserted on each collection run, creating a time series.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
TIMESTAMPTZ |
GitHub REST: |
When the repo was last pushed to. |
|
TEXT |
GitHub REST: |
Whether issues are enabled. |
|
TEXT |
GitHub REST: |
Whether PRs/MRs are enabled. |
|
TEXT |
GitHub REST: |
Whether the wiki is enabled. |
|
TEXT |
GitHub REST: |
Whether GitHub Pages is enabled. |
|
INT |
GitHub REST: |
Number of forks. |
|
INT |
GitHub REST: |
Number of stars. |
|
INT |
GitHub REST: |
Number of watchers. |
|
INT |
GitHub REST: |
Number of open issues (GitHub includes PRs). |
|
INT |
Computed |
Distinct committers. |
|
BIGINT |
GitLab: |
Total commits. |
|
BIGINT |
Computed |
Total issues. |
|
BIGINT |
Computed |
Closed issues. |
|
BIGINT |
Computed |
Total PRs. |
|
BIGINT |
Computed |
Open PRs. |
|
BIGINT |
Computed |
Closed (not merged) PRs. |
|
BIGINT |
Computed |
Merged PRs. |
|
TEXT |
GitHub REST: |
Default branch name (e.g., |
|
TEXT |
GitHub REST: |
SPDX license identifier. |
|
TEXT |
Computed |
Distinct issue contributors. |
|
TEXT |
Computed |
Whether a CHANGELOG file exists. |
|
TEXT |
Computed |
Whether a CONTRIBUTING file exists. |
|
TEXT |
Computed |
Whether a LICENSE file exists. |
|
TEXT |
Computed |
Whether a CODE_OF_CONDUCT file exists. |
|
TEXT |
Computed |
Whether a SECURITY file exists. |
|
TEXT |
Computed |
Whether a security audit file exists. |
|
TEXT |
Computed |
Repository health status. |
|
TEXT |
GitHub REST: |
Repository topics/tags. |
Standard metadata columns |
repo_clones
Clone traffic data. Only available for repos where you have push access (GitHub-only feature).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
TIMESTAMPTZ NOT NULL |
GitHub REST: |
Date of the clone data point. |
|
INT |
GitHub REST: |
Total clones. |
|
INT |
GitHub REST: |
Unique cloners. |
Standard metadata columns |
Unique constraint: (repo_id, clone_timestamp)
repo_badging
CII Best Practices badge data stored as raw JSON. Each row is a snapshot from the CII badge API.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
TIMESTAMPTZ |
Auto-generated |
When this snapshot was taken. |
|
JSONB |
Not yet populated |
Raw badge data JSON. |
Standard metadata columns |
dei_badging
DEI (Diversity, Equity, and Inclusion) badging levels for repositories.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
SERIAL NOT NULL (PK part 1) |
Auto-generated |
Primary key part 1. |
|
INT NOT NULL |
Not yet populated |
External badging system ID. |
|
TEXT NOT NULL |
Not yet populated |
Badge level achieved. |
|
BIGINT NOT NULL (PK part 2, FK -> repos) |
Computed |
Repository. |
Primary key: (id, repo_id)
repo_insights
Computed insight records for repositories. Each row is one metric observation at one point in time.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
TEXT |
Computed |
Metric name (e.g., |
|
TEXT |
Computed |
Metric value. |
|
TIMESTAMPTZ |
Computed |
Date of the observation. |
|
BOOLEAN |
Computed |
Whether this insight is current. |
|
NUMERIC |
Computed |
Numeric score for ranking. |
|
TEXT |
Computed |
Field name within the metric. |
|
TEXT |
Computed |
How the insight was detected (e.g., anomaly detection). |
Standard metadata columns |
repo_insights_records
Archived insight records. Same structure as repo_insights but for historical data.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
TEXT |
Computed |
Metric name. |
|
TEXT |
Computed |
Field name. |
|
TEXT |
Computed |
Metric value. |
|
TIMESTAMPTZ |
Computed |
Observation date. |
|
FLOAT |
Computed |
Numeric score. |
|
TEXT |
Computed |
Detection method. |
Standard metadata columns |
repo_group_insights
Aggregated insights at the repo-group level.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repo_groups) |
Computed |
Repo group. |
|
TEXT |
Computed |
Metric name. |
|
TEXT |
Computed |
Metric value. |
|
BIGINT |
Computed |
FK to |
|
BOOLEAN |
Computed |
Whether this insight is current. |
Standard metadata columns |
Dependencies and SBOM
repo_dependencies
High-level dependency counts per language for a repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
TEXT |
Not yet populated |
Dependency name. |
|
INT |
Not yet populated |
Number of times this dependency appears. |
|
TEXT |
Not yet populated |
Language of the dependency. |
Standard metadata columns |
repo_deps_libyear
Libyear analysis results. Measures how out-of-date each dependency is by comparing current vs. latest versions.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
TEXT |
Not yet populated |
Dependency name. |
|
TEXT |
Not yet populated |
Version requirement string. |
|
TEXT |
Not yet populated |
Dependency type (e.g., |
|
TEXT |
Not yet populated |
Package manager (e.g., |
|
TEXT |
Not yet populated |
Currently used version. |
|
TEXT |
Not yet populated |
Latest available version. |
|
TEXT |
Not yet populated |
Release date of current version. |
|
TEXT |
Not yet populated |
Release date of latest version. |
|
FLOAT |
Not yet populated |
Libyear score (years between current and latest). |
Standard metadata columns |
repo_deps_scorecard
OpenSSF Scorecard check results for a repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
TEXT |
Not yet populated |
Scorecard check name (e.g., |
|
TEXT |
Not yet populated |
Check score. |
|
JSONB |
Not yet populated |
Detailed check results as JSON. |
Standard metadata columns |
repo_sbom_scans
Raw SBOM (Software Bill of Materials) scan results.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
JSON |
Not yet populated |
Raw SBOM scan output as JSON. |
Libraries
libraries
Metadata about software libraries/packages associated with a repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
TEXT |
Not yet populated |
Package registry platform (e.g., |
|
TEXT |
Not yet populated |
Package name. |
|
TIMESTAMPTZ |
Not yet populated |
When the package was first published. |
|
TIMESTAMPTZ |
Not yet populated |
Last update timestamp. |
|
TEXT |
Not yet populated |
Package description. |
|
TEXT |
Not yet populated |
Package keywords. |
|
TEXT |
Not yet populated |
Homepage URL. |
|
TEXT |
Not yet populated |
License identifier. |
|
INT |
Not yet populated |
Total number of versions published. |
|
TEXT |
Not yet populated |
When the latest version was released. |
|
TEXT |
Not yet populated |
Latest version number. |
|
TEXT |
Not yet populated |
ID on the package manager. |
|
INT |
Not yet populated |
Number of dependencies this library has. |
|
INT |
Not yet populated |
Number of libraries that depend on this one. |
|
TEXT |
Not yet populated |
Primary language. |
Standard metadata columns |
library_dependencies
Manifest-level dependency declarations for libraries.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> libraries) |
Computed |
The library. |
|
TEXT |
Not yet populated |
Platform of the manifest. |
|
TEXT |
Not yet populated |
Path to the manifest file. |
|
TEXT |
Not yet populated |
Kind of manifest (e.g., |
|
TEXT NOT NULL |
Not yet populated |
Repository ID and branch identifier. |
Standard metadata columns |
library_version
Published versions of a library.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> libraries) |
Computed |
The library. |
|
TEXT |
Not yet populated |
Package registry platform. |
|
TEXT |
Not yet populated |
Version string (e.g., |
|
TIMESTAMPTZ |
Not yet populated |
When this version was published. |
Standard metadata columns |
Facade Aggregates (Data Mart)
These tables contain pre-aggregated commit statistics computed from the commits table. They exist at annual, monthly, and weekly granularity for both per-repo and per-repo-group levels. All are populated by SQL aggregation over the commits table.
dm_repo_annual
Annual commit statistics per contributor per repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Repository. |
|
TEXT NOT NULL |
Computed: SQL aggregation |
Contributor email. |
|
TEXT |
Computed: SQL aggregation |
Organizational affiliation. |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
Calendar year. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines added. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines removed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total whitespace changes. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Distinct files changed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Number of commits/patches. |
Standard metadata columns |
Index: (repo_id, affiliation)
dm_repo_monthly
Monthly commit statistics per contributor per repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Repository. |
|
TEXT NOT NULL |
Computed: SQL aggregation |
Contributor email. |
|
TEXT |
Computed: SQL aggregation |
Organizational affiliation. |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
Calendar month (1-12). |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
Calendar year. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines added. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines removed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total whitespace changes. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Distinct files changed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Number of commits/patches. |
Standard metadata columns |
dm_repo_weekly
Weekly commit statistics per contributor per repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Repository. |
|
TEXT NOT NULL |
Computed: SQL aggregation |
Contributor email. |
|
TEXT |
Computed: SQL aggregation |
Organizational affiliation. |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
ISO week number (1-53). |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
Calendar year. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines added. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines removed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total whitespace changes. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Distinct files changed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Number of commits/patches. |
Standard metadata columns |
dm_repo_group_annual
Annual commit statistics per contributor per repo group.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Repo group. |
|
TEXT NOT NULL |
Computed: SQL aggregation |
Contributor email. |
|
TEXT |
Computed: SQL aggregation |
Organizational affiliation. |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
Calendar year. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines added. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines removed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total whitespace changes. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Distinct files changed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Number of commits/patches. |
Standard metadata columns |
dm_repo_group_monthly
Monthly commit statistics per contributor per repo group.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Repo group. |
|
TEXT NOT NULL |
Computed: SQL aggregation |
Contributor email. |
|
TEXT |
Computed: SQL aggregation |
Organizational affiliation. |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
Calendar month (1-12). |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
Calendar year. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines added. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines removed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total whitespace changes. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Distinct files changed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Number of commits/patches. |
Standard metadata columns |
dm_repo_group_weekly
Weekly commit statistics per contributor per repo group.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Repo group. |
|
TEXT NOT NULL |
Computed: SQL aggregation |
Contributor email. |
|
TEXT |
Computed: SQL aggregation |
Organizational affiliation. |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
ISO week number (1-53). |
|
SMALLINT NOT NULL |
Computed: SQL aggregation |
Calendar year. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines added. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total lines removed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Total whitespace changes. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Distinct files changed. |
|
BIGINT NOT NULL |
Computed: SQL aggregation |
Number of commits/patches. |
Standard metadata columns |
Code Complexity
repo_labor
Per-file code complexity and line count metrics, typically from scc (Sloc Cloc and Code) analysis of cloned repositories.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Computed |
Repository. |
|
TIMESTAMPTZ |
Computed |
When the repo was cloned for analysis. |
|
TIMESTAMPTZ |
Computed |
When the analysis was run. |
|
TEXT |
Computed |
Language of the file. |
|
TEXT |
Computed |
Path within the repo. |
|
TEXT |
Computed |
File name. |
|
INT |
Computed |
Total lines in the file. |
|
INT |
Computed |
Lines of code (excluding comments and blanks). |
|
INT |
Computed |
Lines of comments. |
|
INT |
Computed |
Blank lines. |
|
INT |
Computed |
Cyclomatic complexity score. |
|
TEXT |
Computed |
Git URL of the repo. |
Standard metadata columns |
repo_meta
Key-value metadata store for arbitrary repository attributes.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
TEXT |
Computed |
Metadata key name. |
|
TEXT |
Computed |
Metadata value. Default |
Standard metadata columns |
repo_stats
Numeric statistics for repositories, stored as key-value pairs.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository. |
|
TEXT |
Computed |
Statistic name. |
|
BIGINT |
Computed |
Statistic value. |
Standard metadata columns |
repo_test_coverage
Per-file test coverage data.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key (also serves as repo reference). |
|
TIMESTAMPTZ |
Not yet populated |
When the repo was cloned. |
|
TIMESTAMPTZ |
Not yet populated |
When coverage analysis was run. |
|
TEXT |
Not yet populated |
Language of the file. |
|
TEXT |
Not yet populated |
File path. |
|
TEXT |
Not yet populated |
File name. |
|
TEXT |
Not yet populated |
Test framework used. |
|
BIGINT |
Not yet populated |
Total statements in the file. |
|
BIGINT |
Not yet populated |
Total subroutines. |
|
BIGINT |
Not yet populated |
Statements covered by tests. |
|
BIGINT |
Not yet populated |
Subroutines covered by tests. |
Standard metadata columns |
CHAOSS Metrics
chaoss_metric_status
Registry of CHAOSS metrics and their implementation status in Aveloxis.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT |
User input / Augur import |
Metric group. |
|
TEXT |
User input / Augur import |
Data source for the metric. |
|
TEXT |
User input / Augur import |
Metric type. |
|
TEXT |
User input / Augur import |
Backend implementation status. |
|
TEXT |
User input / Augur import |
Frontend implementation status. |
|
BOOLEAN |
User input / Augur import |
Whether the metric is formally defined by CHAOSS. |
|
TEXT |
User input / Augur import |
API endpoint for per-repo queries. |
|
TEXT |
User input / Augur import |
API endpoint for per-repo-group queries. |
|
TEXT |
User input / Augur import |
Metric name. |
|
TEXT |
User input / Augur import |
CHAOSS working group. |
|
JSON |
User input / Augur import |
Additional metric information as JSON. |
|
TEXT |
User input / Augur import |
Focus area within the working group. |
Standard metadata columns |
chaoss_user
User accounts for the CHAOSS metrics interface.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT |
User input |
Login name. |
|
TEXT |
User input |
Hashed password. |
|
TEXT UNIQUE |
User input |
Email address. |
|
TEXT |
User input |
Phone number for text notifications. |
|
TEXT |
User input |
First name. |
|
TEXT |
User input |
Last name. |
Standard metadata columns |
Analysis and ML (schema parity, not yet populated)
message_analysis
Per-message sentiment analysis results from ML workers.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT |
Not yet populated |
FK to messages table. |
|
BIGINT |
Not yet populated |
ID of the worker run that produced this result. |
|
FLOAT |
Not yet populated |
Sentiment score (-1.0 to 1.0). |
|
FLOAT |
Not yet populated |
Autoencoder reconstruction error (anomaly score). |
|
BOOLEAN |
Not yet populated |
Whether the message was flagged as novel. |
|
BOOLEAN |
Not yet populated |
Whether the message was flagged as feedback. |
Standard metadata columns |
message_analysis_summary
Aggregated sentiment analysis summaries per repository per time period.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Not yet populated |
Repository. |
|
BIGINT |
Not yet populated |
Worker run ID. |
|
FLOAT |
Not yet populated |
Ratio of positive messages. |
|
FLOAT |
Not yet populated |
Ratio of negative messages. |
|
BIGINT |
Not yet populated |
Count of novel messages. |
|
TIMESTAMPTZ |
Not yet populated |
Time period for the summary. |
Standard metadata columns |
message_sentiment
Duplicate of message_analysis for schema parity with Augur. Not yet populated.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT |
Not yet populated |
FK to messages. |
|
BIGINT |
Not yet populated |
Worker run ID. |
|
FLOAT |
Not yet populated |
Sentiment score. |
|
FLOAT |
Not yet populated |
Autoencoder reconstruction error. |
|
BOOLEAN |
Not yet populated |
Novelty flag. |
|
BOOLEAN |
Not yet populated |
Feedback flag. |
Standard metadata columns |
message_sentiment_summary
Aggregated sentiment summaries. Duplicate of message_analysis_summary for schema parity.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Not yet populated |
Repository. |
|
BIGINT |
Not yet populated |
Worker run ID. |
|
FLOAT |
Not yet populated |
Positive message ratio. |
|
FLOAT |
Not yet populated |
Negative message ratio. |
|
BIGINT |
Not yet populated |
Novel message count. |
|
TIMESTAMPTZ |
Not yet populated |
Summary time period. |
Standard metadata columns |
discourse_insights
Discourse act classification for messages. Categorizes messages by their communicative function.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT |
Not yet populated |
FK to messages. |
|
TEXT |
Not yet populated |
Discourse act label (e.g., |
Standard metadata columns |
lstm_anomaly_models
LSTM neural network model definitions for anomaly detection.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT |
Not yet populated |
Model name. |
|
TEXT |
Not yet populated |
Model description. |
|
BIGINT |
Not yet populated |
Number of days of history the model considers. |
|
BIGINT |
Not yet populated |
Number of days of data used for training. |
|
BIGINT |
Not yet populated |
Training batch size. |
|
TEXT |
Not yet populated |
Which metric this model targets. |
Standard metadata columns |
lstm_anomaly_results
Results from LSTM anomaly detection runs.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Not yet populated |
Repository. |
|
TEXT |
Not yet populated |
Category of the repository. |
|
BIGINT (FK -> lstm_anomaly_models) |
Not yet populated |
Model used. |
|
TEXT |
Not yet populated |
Metric analyzed. |
|
FLOAT |
Not yet populated |
Anomaly detection contamination factor. |
|
FLOAT |
Not yet populated |
MAE of the model predictions. |
|
TEXT |
Not yet populated |
Human-readable remarks. |
|
TEXT |
Not yet populated |
Specific field within the metric. |
|
FLOAT |
Not yet populated |
Mean of actual values. |
|
FLOAT |
Not yet populated |
Mean of predicted values. |
Standard metadata columns |
Topic Modeling (schema parity, not yet populated)
topic_model_meta
Metadata for trained topic models. Stores model configuration, quality metrics, and file paths.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
UUID (PK) |
Auto-generated ( |
Primary key. |
|
BIGINT (FK -> repos) |
Not yet populated |
Repository the model was trained on. |
|
TEXT NOT NULL |
Not yet populated |
Topic modeling method (e.g., |
|
INT NOT NULL |
Not yet populated |
Number of topics. |
|
INT NOT NULL |
Not yet populated |
Words per topic. |
|
JSON NOT NULL |
Not yet populated |
Full training parameters as JSON. |
|
JSON NOT NULL |
Not yet populated |
Paths to saved model files. |
|
TEXT NOT NULL |
Not yet populated |
Hash of parameters for deduplication. |
|
FLOAT NOT NULL |
Not yet populated |
Topic coherence score. |
|
FLOAT NOT NULL |
Not yet populated |
Perplexity score. |
|
FLOAT NOT NULL |
Not yet populated |
Topic diversity score. |
|
JSON NOT NULL |
Not yet populated |
Quality metrics as JSON. |
|
BIGINT NOT NULL |
Not yet populated |
Number of messages used for training. |
|
JSON NOT NULL |
Not yet populated |
Fingerprint of the training data. |
|
JSON |
Not yet populated |
Pre-computed visualization data. |
|
TIMESTAMPTZ NOT NULL |
Not yet populated |
When training started. |
|
TIMESTAMPTZ NOT NULL |
Not yet populated |
When training finished. |
Standard metadata columns |
topic_model_event
Event log for topic model training and inference runs.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TIMESTAMPTZ NOT NULL |
Auto-generated |
Event timestamp. Default |
|
INT |
Not yet populated |
Repository. |
|
UUID |
Not yet populated |
FK to topic_model_meta. |
|
TEXT NOT NULL |
Not yet populated |
Event description. |
|
TEXT NOT NULL |
Not yet populated |
Log level (e.g., |
|
JSONB NOT NULL |
Not yet populated |
Event payload as JSON. |
topic_words
Words and their probabilities within discovered topics.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT |
Not yet populated |
Topic number. |
|
TEXT |
Not yet populated |
Word. |
|
FLOAT |
Not yet populated |
Probability of this word in the topic. |
Standard metadata columns |
repo_cluster_messages
Cluster assignments for repository messages (content and mechanism clusters).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Not yet populated |
Repository. |
|
INT |
Not yet populated |
Content-based cluster assignment. |
|
INT |
Not yet populated |
Mechanism-based cluster assignment. |
Standard metadata columns |
repo_topic
Topic distribution for repositories (which topics are associated with a repo).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT (FK -> repos) |
Not yet populated |
Repository. |
|
INT |
Not yet populated |
Topic number. |
|
FLOAT |
Not yet populated |
Probability of this topic for the repo. |
Standard metadata columns |
Network Analysis (schema parity, not yet populated)
network_beyond_augur
Cross-repository contributor activity network data. Records how contributors act across multiple repos.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
UUID |
Not yet populated |
Contributor. |
|
TEXT |
Not yet populated |
Git URL of the repo. |
|
TEXT |
Not yet populated |
Repository name. |
|
TEXT |
Not yet populated |
Action type (e.g., |
|
FLOAT |
Not yet populated |
Year of the action. |
|
NUMERIC |
Not yet populated |
Quarter of the action (1-4). |
|
BIGINT |
Not yet populated |
Count of actions. |
No primary key defined.
network_beyond_augur_dependencies
Cross-repository dependency network data. Tracks how repos depend on each other via contributor overlap.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
UUID |
Not yet populated |
Contributor. |
|
TEXT |
Not yet populated |
Git URL. |
|
TEXT |
Not yet populated |
Repository name. |
|
TEXT |
Not yet populated |
Action type. |
|
FLOAT |
Not yet populated |
Year. |
|
NUMERIC |
Not yet populated |
Quarter (1-4). |
|
BIGINT |
Not yet populated |
Count. |
No primary key defined.
Miscellaneous
exclude
Exclusion rules for filtering out certain email addresses or domains from analysis.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
INT (PK) |
User input |
Primary key. |
|
INT NOT NULL |
User input |
Project/repo group to apply the exclusion to. |
|
TEXT |
User input |
Email address to exclude. |
|
TEXT |
User input |
Email domain to exclude. |
historical_repo_urls
Tracks URL changes for repositories over time (e.g., when a repo is renamed or transferred).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL (PK part 1) |
Computed |
Repository. |
|
TEXT NOT NULL (PK part 2) |
Computed |
A historical git URL for this repo. |
|
TIMESTAMPTZ |
Auto-generated |
When this URL was recorded. |
Primary key: (repo_id, git_url)
repos_fetch_log
Log of repository fetch attempts and their outcomes.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
INT NOT NULL |
Computed |
Repository ID. |
|
TEXT NOT NULL |
Computed |
Fetch status (e.g., |
|
TIMESTAMPTZ NOT NULL |
Auto-generated |
When the fetch was attempted. |
No primary key defined.
settings
Application settings for the data schema (separate from ops config).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
INT (PK) |
User input |
Primary key. |
|
TEXT NOT NULL |
User input |
Setting name. |
|
TEXT NOT NULL |
User input |
Setting value. |
|
TIMESTAMPTZ NOT NULL |
Auto-generated |
Last modification timestamp. |
unknown_cache
Cache for contributor emails that could not be resolved to an affiliation. Prevents repeated lookups.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
TEXT NOT NULL |
Computed |
Type of unknown entry. |
|
INT NOT NULL |
Computed |
Repo group context. |
|
TEXT NOT NULL |
Computed |
Unresolved email. |
|
TEXT |
Computed |
Email domain. |
|
BIGINT NOT NULL |
Computed |
Lines added by this email. |
Standard metadata columns |
No primary key defined.
utility_log
General-purpose utility log for miscellaneous operations.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT NOT NULL |
Computed |
Log level. |
|
TEXT NOT NULL |
Computed |
Status message. |
|
TIMESTAMPTZ NOT NULL |
Auto-generated |
When the operation was attempted. |
working_commits
Tracks the current working commit for facade processing per repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
INT NOT NULL |
Computed |
Repository ID. |
|
TEXT |
Git: |
The commit hash currently being processed. |
No primary key defined.
aveloxis_ops Schema
Collection Pipeline
staging
Raw API responses land here before being parsed and inserted into aveloxis_data tables. Acts as a durable inbox for the ETL pipeline.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT NOT NULL (FK -> repos) |
Computed |
Repository this payload is for. |
|
SMALLINT NOT NULL (FK -> platforms) |
Computed |
Platform the data came from. |
|
TEXT NOT NULL |
Computed |
Type of entity (e.g., |
|
JSONB NOT NULL |
GitHub REST API / GitLab API v4 |
Raw API response body. |
|
TIMESTAMPTZ |
Auto-generated |
When the payload was staged. |
|
BOOLEAN |
Computed |
Whether the payload has been parsed into data tables. Default |
Index: (repo_id, entity_type) WHERE NOT processed
collection_queue
Postgres-backed priority queue that drives the collection pipeline. Each repo has at most one row. Workers claim repos by locking rows.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT (PK, FK -> repos) |
Computed |
Repository. |
|
INT NOT NULL |
Computed |
Priority (lower = higher priority). Default |
|
TEXT NOT NULL |
Computed |
Queue status: |
|
TIMESTAMPTZ NOT NULL |
Computed |
When this repo is next due for collection. |
|
TEXT |
Computed |
Worker instance ID that holds the lock. |
|
TIMESTAMPTZ |
Computed |
When the lock was acquired. |
|
TIMESTAMPTZ |
Computed |
When collection last completed. |
|
TEXT |
Computed |
Error message from last failed run. |
|
INT |
Computed |
Issues collected in the last run. |
|
INT |
Computed |
PRs collected in the last run. |
|
INT |
Computed |
Messages collected in the last run. |
|
INT |
Computed |
Events collected in the last run. |
|
INT |
Computed |
Releases collected in the last run. |
|
INT |
Computed |
Contributors collected in the last run. |
|
BIGINT |
Computed |
Duration of the last collection run in milliseconds. |
|
TIMESTAMPTZ |
Auto-generated |
Last update timestamp. |
Index: (priority, due_at) WHERE status = 'queued'
collection_status
Tracks the overall collection status for each repo across multiple pipeline stages (core, secondary, facade, ML).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT (PK, FK -> repos) |
Computed |
Repository. |
|
TEXT |
Computed |
Status of core data collection (issues, PRs). Default |
|
TEXT |
Computed |
Task ID for the core collection worker. |
|
TIMESTAMPTZ |
Computed |
When core data was last collected. |
|
BIGINT |
Computed |
Weight/size of core data workload. |
|
TEXT |
Computed |
Status of secondary data collection (reviews, files). Default |
|
TEXT |
Computed |
Task ID for the secondary worker. |
|
TIMESTAMPTZ |
Computed |
When secondary data was last collected. |
|
BIGINT |
Computed |
Weight of secondary data workload. |
|
TEXT |
Computed |
Status of facade/git log collection. Default |
|
TEXT |
Computed |
Task ID for the facade worker. |
|
TIMESTAMPTZ |
Computed |
When facade data was last collected. |
|
BIGINT |
Computed |
Weight of facade workload. |
|
TIMESTAMPTZ |
Computed |
When events were last collected. |
|
BIGINT |
Computed |
Total issues + PRs (used for workload estimation). |
|
BIGINT |
Computed |
Total commits (used for workload estimation). |
|
TEXT |
Computed |
Status of ML pipeline. Default |
|
TEXT |
Computed |
Task ID for the ML worker. |
|
TIMESTAMPTZ |
Computed |
When ML data was last produced. |
|
BIGINT |
Computed |
Weight of ML workload. |
|
TIMESTAMPTZ |
Auto-generated |
Last update timestamp. |
API Credentials
worker_oauth
OAuth tokens and API keys used by collection workers to authenticate with GitHub and GitLab APIs.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT NOT NULL |
User input |
Descriptive name for this credential. |
|
TEXT NOT NULL |
User input |
OAuth consumer key (for OAuth 1.0 flows). |
|
TEXT NOT NULL |
User input |
OAuth consumer secret. |
|
TEXT NOT NULL |
User input |
Access token (PAT or OAuth token). |
|
TEXT NOT NULL |
User input |
OAuth access token secret. |
|
TEXT |
User input |
Local directory for git operations using this credential. |
|
TEXT NOT NULL |
User input |
Platform name: |
|
INT |
User input |
Rate limit for this token. Default |
|
TIMESTAMPTZ |
Auto-generated |
When this credential was added. |
Unique constraint: (access_token, platform)
Users and Auth
users
User accounts for the Aveloxis web interface and API.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
SERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT NOT NULL UNIQUE |
User input |
Username. |
|
TEXT NOT NULL |
User input |
Hashed password. |
|
TEXT NOT NULL UNIQUE |
User input |
Email address. |
|
TEXT UNIQUE |
User input |
Phone number for notifications. |
|
TEXT NOT NULL |
User input |
First name. |
|
TEXT NOT NULL |
User input |
Last name. |
|
BOOLEAN NOT NULL |
User input |
Whether this user is an admin. Default |
|
BOOLEAN NOT NULL |
Computed |
Whether the email has been verified. Default |
Standard metadata columns |
user_groups
Named groups of repositories curated by users. Users can organize repos into groups for easier management.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
INT NOT NULL (FK -> users) |
Computed |
Owning user. |
|
TEXT NOT NULL |
User input |
Group name. |
|
BOOLEAN NOT NULL |
User input |
Whether this group is favorited. Default |
Unique constraint: (user_id, name)
user_repos
Join table linking repos to user groups.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT NOT NULL (PK part 1) |
User input |
Repository. |
|
BIGINT NOT NULL (PK part 2, FK -> user_groups) |
User input |
User group. |
Primary key: (group_id, repo_id)
client_applications
Registered API client applications (for OAuth flows).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
TEXT (PK) |
Auto-generated |
Application ID. |
|
TEXT NOT NULL |
Auto-generated |
API key for the application. |
|
INT NOT NULL (FK -> users) |
Computed |
User who owns this application. |
|
TEXT NOT NULL |
User input |
Application name. |
|
TEXT NOT NULL |
User input |
OAuth redirect URL. |
user_session_tokens
Active session tokens for authenticated users.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
TEXT (PK) |
Auto-generated |
Session token string. |
|
INT NOT NULL (FK -> users) |
Computed |
User this session belongs to. |
|
BIGINT |
Auto-generated |
Creation timestamp (Unix epoch). |
|
BIGINT |
Computed |
Expiration timestamp (Unix epoch). |
|
TEXT (FK -> client_applications) |
Computed |
Client application that initiated this session. |
refresh_tokens
Refresh tokens for renewing expired session tokens.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
TEXT (PK) |
Auto-generated |
Refresh token string. |
|
TEXT NOT NULL UNIQUE (FK -> user_session_tokens) |
Computed |
The session token this refresh token can renew. |
subscription_types
Types of event subscriptions available (for webhook/notification systems).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT NOT NULL UNIQUE |
User input |
Subscription type name. |
subscriptions
Links client applications to the event types they are subscribed to.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
TEXT NOT NULL (PK part 1, FK -> client_applications) |
Computed |
Client application. |
|
BIGINT NOT NULL (PK part 2, FK -> subscription_types) |
Computed |
Subscription type. |
Primary key: (application_id, type_id)
Configuration
augur_settings
Legacy settings store, maintained for compatibility with Augur tooling.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT |
User input / Augur import |
Setting name. |
|
TEXT |
User input / Augur import |
Setting value. |
|
TIMESTAMPTZ |
Auto-generated |
Last modification timestamp. |
config
Structured configuration store with section/setting hierarchy.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
SMALLSERIAL (PK) |
Auto-generated |
Primary key. |
|
TEXT NOT NULL |
User input |
Configuration section (e.g., |
|
TEXT NOT NULL |
User input |
Setting name within the section. |
|
TEXT |
User input |
Setting value. |
|
TEXT |
User input |
Value type hint (e.g., |
Unique constraint: (section_name, setting_name)
GitHub Users (Affiliation Data)
github_users
Cached GitHub user data used for affiliation resolution. Populated by scanning commit authors and looking up their GitHub profiles.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
TEXT |
GitHub REST: |
GitHub login. |
|
TEXT |
GitHub REST: |
Email address. |
|
TEXT |
|
Resolved organizational affiliation. |
|
TEXT |
|
How the affiliation was determined. |
|
TEXT |
Computed |
Commit count. |
|
TEXT |
GitHub REST: |
Profile location. |
|
TEXT |
Computed |
Resolved country code. |
No primary key defined.
Network Weighted Tables
Pre-computed weighted contributor-action matrices used for network analysis.
network_weighted_commits
Weighted commit activity per contributor per repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT |
Computed |
Repository. |
|
UUID |
Computed |
Contributor. |
|
FLOAT |
Computed |
Weight/score of the activity. |
|
TEXT |
Computed |
Type of commit action. |
|
TEXT |
Computed |
Collection method used. |
|
TIMESTAMPTZ |
Auto-generated |
When this record was computed. |
No primary key defined.
network_weighted_issues
Weighted issue activity per contributor per repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT |
Computed |
Repository. |
|
UUID |
Computed |
Contributor. |
|
FLOAT |
Computed |
Weight/score of the activity. |
|
TEXT |
Computed |
Type of issue action. |
|
TEXT |
Computed |
Collection method used. |
|
TIMESTAMPTZ |
Auto-generated |
When this record was computed. |
No primary key defined.
network_weighted_pr_reviews
Weighted PR review activity per contributor per repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT |
Computed |
Repository. |
|
UUID |
Computed |
Contributor. |
|
FLOAT |
Computed |
Weight/score of the activity. |
|
TEXT |
Computed |
Type of review action. |
|
TEXT |
Computed |
Collection method used. |
|
TIMESTAMPTZ |
Auto-generated |
When this record was computed. |
No primary key defined.
network_weighted_prs
Weighted pull request activity per contributor per repository.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGINT |
Computed |
Repository. |
|
UUID |
Computed |
Contributor. |
|
FLOAT |
Computed |
Weight/score of the activity. |
|
TEXT |
Computed |
Type of PR action. |
|
TEXT |
Computed |
Collection method used. |
|
TIMESTAMPTZ |
Auto-generated |
When this record was computed. |
No primary key defined.
Worker Management
worker_history
Audit log of worker task executions. One row per worker run.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
BIGSERIAL (PK) |
Auto-generated |
Primary key. |
|
BIGINT |
Computed |
Repository that was processed. |
|
TEXT NOT NULL |
Computed |
Worker name/identifier. |
|
TEXT NOT NULL |
Computed |
Job model name. |
|
INT |
Computed |
OAuth credential used. |
|
TIMESTAMPTZ NOT NULL |
Auto-generated |
When the worker ran. |
|
TEXT NOT NULL |
Computed |
Outcome status (e.g., |
|
INT |
Computed |
Number of records processed. |
worker_job
Persistent state for each worker job model. Tracks progress across restarts.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
TEXT (PK) |
Computed |
Job model name (e.g., |
|
INT NOT NULL |
Computed |
Current state code. Default |
|
INT |
Computed |
For zombie detection: expected head position. |
|
TEXT NOT NULL |
Computed |
Cursor/since-ID for incremental collection. Default |
|
TEXT |
Computed |
Human-readable description. |
|
INT |
Computed |
Records processed in the last run. |
|
TIMESTAMPTZ |
Computed |
When the job last ran. |
|
INT |
Computed |
State of any post-collection analysis. |
|
INT NOT NULL |
Computed |
Default OAuth credential for this job. |
worker_settings_facade
Configuration settings specific to the facade (git log) worker.
Column |
Type |
Source |
Description |
|---|---|---|---|
|
INT (PK) |
User input |
Primary key. |
|
TEXT NOT NULL |
User input |
Setting name. |
|
TEXT NOT NULL |
User input |
Setting value. |
|
TIMESTAMPTZ NOT NULL |
Auto-generated |
Last modification timestamp. |
Fetch Log and Working Commits (ops)
repos_fetch_log (aveloxis_ops)
Operational fetch log (mirrors aveloxis_data.repos_fetch_log for the ops schema).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
INT NOT NULL |
Computed |
Repository ID. |
|
TEXT NOT NULL |
Computed |
Fetch status. |
|
TIMESTAMPTZ NOT NULL |
Auto-generated |
Fetch attempt timestamp. |
No primary key defined.
working_commits (aveloxis_ops)
Operational working commit tracker (mirrors aveloxis_data.working_commits for the ops schema).
Column |
Type |
Source |
Description |
|---|---|---|---|
|
INT NOT NULL |
Computed |
Repository ID. |
|
TEXT |
Git: |
Current working commit hash. |
No primary key defined.
Indexes
The schema defines the following additional indexes for query performance:
aveloxis_data indexes
Index |
Table |
Columns |
Condition |
|---|---|---|---|
|
contributors |
|
|
|
commits |
|
|
|
commits |
|
|
|
commits |
|
|
|
commits |
|
|
|
commits |
|
|
|
dm_repo_annual |
|
|
|
issues |
|
|
|
issues |
|
|
|
pull_requests |
|
|
|
pull_requests |
|
|
|
messages |
|
|
|
issue_events |
|
|
|
pull_request_events |
|
|
|
releases |
|
|
|
repo_info |
|
|
|
contributor_identities |
|
|
|
commit_parents |
|
|
|
repo_labor |
|
|
|
repo_deps_libyear |
|
|
|
repo_deps_scorecard |
|
|
|
repo_dependencies |
|
aveloxis_ops indexes
Index |
Table |
Columns |
Condition |
|---|---|---|---|
|
staging |
|
|
|
collection_queue |
|
|