Mailing-List Ingestion (v0.25.7+)
Mailing-list archives are collected by the MailingListWorker, a dedicated decoupled worker pool inside aveloxis serve. It ingests email from project mailing-list archives into the canonical Aveloxis tables, classifies each message, and — where the email corresponds to an issue, pull request, or review — projects it onto those entities. Off by default; opt in via collection.mailing_list_enabled = true.
This document explains why the system looks the way it does, what the data is good for, and how to interpret it. Operator-facing tuning lives in docs/getting-started/configuration.md; the CLI is in docs/guide/commands.md. The full design rationale (sampling, migration patterns, decision log) is in summary/10-apache-history-ingestion.md and summary/11-apache-mailing-list-implementation-plan.md.
1. The question this subsystem answers
A great deal of open-source project activity never appears in the GitHub/GitLab API: design discussion, release votes, patch review on lists that predate (or replace) pull requests, and the Jira/Bugzilla issue history of projects that migrated to GitHub issues. The MailingListWorker recovers that activity.
Every email becomes a first-class email_message entity (a peer to issues, pull_requests, and pull_request_reviews), with its body stored in the shared messages table via an email_message_ref bridge — the same unified-message architecture every other text source uses. Classification then routes the message onto the canonical home it belongs in:
- an issue-tracker notification → linked to an issues row;
- a patch submission (kernel-style [PATCH]) → a pull-request-equivalent;
- a Reviewed-by: / review reply → a pull-request-review-equivalent;
- everything else (votes, announcements, discussion, support) → stays a standalone mailing_list_only message.
1.1 Two orthogonal axes
The design keeps two questions separate (this is the single most important thing to understand about the subsystem):
- Axis A, message class: what kind of email is this? (issue notification, patch, review, vote, announcement, discussion, …). Stored in email_message.msg_class.
- Axis B, repo association: which repo/PMC does it concern? Stored as the signaled_repo_url / signaled_repo_id pair (see §6).
A message can be classified (Axis A) without resolving to a known repo (Axis B), and vice-versa. The leftovers on each axis are explicit: mailing_list_only / unclassified on Axis A, an unresolved signaled_repo_id on Axis B. Nothing is silently dropped.
2. Why a dedicated worker
The work is independent of every per-repo collection phase, and follows the same decoupled-pool pattern v0.21.0 applied to scancode and v0.24.0 to the DistributionWorker:
- No API tokens. Apache Pony Mail is unauthenticated; the kernel public-inbox path is a git clone. The optional polite_email only sets a contact header.
- Its own cadence. Lists are tail-refreshed on a 30-day default cadence (much slower than main collection), and a one-time backfill runs when a list has no checkpoint (full history when mailing_list_backfill_months <= 0, otherwise the recent N-month window).
- Its own claim queue. Lists are claimed from aveloxis_data.repo_groups_list_serve (the mlls_* columns), entirely separate from the repo collection_queue. Enabling the subsystem does not slow per-repo collection.
3. email_message — the first-class entity
email_message ──email_message_ref──▶ messages (body, platform_id = 6)
│
├─ signaled_repo_id ──▶ repos (which repo it concerns, Axis B)
├─ linked_issue_id ──▶ issues (routed: issue-tracker mail)
├─ linked_pull_request_id──▶ pull_requests(routed: patch / PR-equivalent)
└─ thread_root_id (threading: In-Reply-To / References)
platform_id = 6 ('Mailing List') tags every message-table row sourced from a list. Metadata convention on each row: data_source = the specific list address (e.g. dev@kafka.apache.org), tool_source = "Aveloxis Mailing List Collector", tool_version = the release, data_collection_date = load time.
4. The two archive backends
Backends implement a common ArchiveSource interface (Name, EnumerateLists, FirstMonth, FetchMonth), so adding a third archive system is a config + one-file job. The worker spawns one runner pool per registered system.
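A minimal sketch of what such an interface could look like. Only the four method names come from the text above; every signature, the Month and Message types, and the memorySource backend are assumptions made for illustration, not the Aveloxis definitions.

```go
package main

import "fmt"

// Month identifies one calendar month of an archive (hypothetical type).
type Month struct {
	Year int
	Mon  int // 1..12
}

// Message is a pared-down email envelope (hypothetical type).
type Message struct {
	MessageID string
	Subject   string
}

// ArchiveSource uses the four method names listed above.
// The signatures are assumed, not copied from the real source.
type ArchiveSource interface {
	Name() string
	EnumerateLists() ([]string, error)
	FirstMonth(list string) (Month, error)
	FetchMonth(list string, m Month) ([]Message, error)
}

// memorySource is a toy in-memory backend used only to exercise the interface.
type memorySource struct {
	data map[string]map[Month][]Message
}

func (s memorySource) Name() string { return "memory" }

func (s memorySource) EnumerateLists() ([]string, error) {
	lists := make([]string, 0, len(s.data))
	for l := range s.data {
		lists = append(lists, l)
	}
	return lists, nil
}

// FirstMonth returns the earliest month that holds any mail for the list.
func (s memorySource) FirstMonth(list string) (Month, error) {
	months, ok := s.data[list]
	if !ok || len(months) == 0 {
		return Month{}, fmt.Errorf("no archive for list %q", list)
	}
	var first Month
	found := false
	for m := range months {
		if !found || m.Year < first.Year || (m.Year == first.Year && m.Mon < first.Mon) {
			first, found = m, true
		}
	}
	return first, nil
}

// FetchMonth treats a month with no mail as an empty result, not an error.
func (s memorySource) FetchMonth(list string, m Month) ([]Message, error) {
	return s.data[list][m], nil
}

// demoSource builds a small fixture behind the interface.
func demoSource() ArchiveSource {
	return memorySource{data: map[string]map[Month][]Message{
		"dev@example.org": {
			{Year: 2020, Mon: 3}:  {{MessageID: "<a@x>", Subject: "hello"}},
			{Year: 2019, Mon: 11}: {{MessageID: "<b@x>", Subject: "first"}},
		},
	}}
}

func main() {
	src := demoSource()
	first, err := src.FirstMonth("dev@example.org")
	fmt.Println(src.Name(), first.Year, first.Mon, err)
}
```

Because the worker only sees the interface, a third backend in this shape would be one new type plus its registration.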
4.1 Apache Pony Mail (apache_ponymail)
The Foal API at lists.apache.org:
| Endpoint | Use |
|---|---|
|  | bulk monthly mbox download (the data path) |
|  | per-domain list catalog (enumeration) |
|  | list-level |
The mbox stream is parsed as mboxrd, with MIME multipart and quoted-printable/base64 decoding. A 404 is a clean empty-month miss; 429 → rate-limited (feeds the Pacer); 5xx / transport → transient (feeds the Breaker).
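The status handling described above can be sketched as a single classification function. The Outcome type, its constant names, and the catch-all "permanent" bucket are assumptions for illustration; only the 404 / 429 / 5xx-or-transport mapping comes from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome is a hypothetical label for how a fetch result is treated.
type Outcome string

const (
	OutcomeOK          Outcome = "ok"
	OutcomeEmptyMonth  Outcome = "empty_month"  // 404: nothing archived that month
	OutcomeRateLimited Outcome = "rate_limited" // 429: would feed the Pacer
	OutcomeTransient   Outcome = "transient"    // 5xx or transport error: would feed the Breaker
	OutcomePermanent   Outcome = "permanent"    // anything else (assumed bucket)
)

// errDemoTransport stands in for a network-level failure.
var errDemoTransport = errors.New("connection reset")

// classifyFetch maps an HTTP status (or transport error) to an Outcome.
func classifyFetch(status int, transportErr error) Outcome {
	switch {
	case transportErr != nil:
		return OutcomeTransient
	case status == 404:
		return OutcomeEmptyMonth
	case status == 429:
		return OutcomeRateLimited
	case status >= 500 && status <= 599:
		return OutcomeTransient
	case status >= 200 && status <= 299:
		return OutcomeOK
	default:
		return OutcomePermanent
	}
}

func main() {
	for _, s := range []int{200, 404, 429, 503} {
		fmt.Println(s, classifyFetch(s, nil))
	}
	fmt.Println("transport", classifyFetch(0, errDemoTransport))
}
```

Keeping 404 out of the error path matters for old lists with quiet months: an empty month can then be checkpointed like any other.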
stats.lua window matters. firstYear / firstMonth are list-level metadata returned for any date window. FirstMonth therefore uses the cheapest window (d=lte=1d, ~1 s / ~50 KB). An earlier d=lte=30y forced Pony Mail to aggregate the list’s entire history and stream back every message (~18 MB / ~35 s on a busy list), which timed out the worker. Regression tripwire: TestPonyMailFirstMonthUsesCheapWindow.
4.2 lore.kernel.org public-inbox (lore_public_inbox)
lore’s HTTP surface is Anubis-gated, so the sanctioned bulk path is a bare git clone of the per-list public-inbox archive (https://lore.kernel.org/<list>/git/0.git). FetchMonth walks the archive’s commits within the month window and reads each message from the m blob (git cat-file -p <hash>:m). Enumeration returns nil (the catalog isn’t machine-listable under Anubis); kernel lists are curated via register-mailing-list.
5. Classification
internal/mailinglist/systems.yaml defines per-system ordered rules (subject regex, body URL, sender, List-Id, list-address) compiled at load. System.Classify(msg) returns the first match as a (class, source, captures) triple. The eleven classes:
| Class | Routes to | Typical source |
|---|---|---|
|  |  |  |
|  |  | kernel |
|  |  |  |
|  | linked issue/PR (mirror) |  |
|  | (metadata) |  |
|  |  | human discussion, votes, releases, Q&A |
captures carries structured data the rule extracted — e.g. {external_key: "KAFKA-20167"} (the Jira key, used to bridge to an issue) or {repo: "arrow-rs"} (the repo signal, used for Axis B).
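A rough sketch of first-match classification with named captures. The two rules, their regular expressions, and the source labels are invented for illustration (the real rules live in systems.yaml and also match on body URL, sender, List-Id, and list address); the fallback to mailing_list_only is likewise an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical compiled rule: a subject regex plus the class it assigns.
type rule struct {
	class   string
	source  string
	subject *regexp.Regexp
}

// Illustrative rules only, evaluated in order; the first match wins.
var rules = []rule{
	{"issue_event", "jira", regexp.MustCompile(`^\[jira\].*\((?P<external_key>[A-Z][A-Z0-9]*-[0-9]+)\)`)},
	{"patch_submission", "kernel", regexp.MustCompile(`^\[PATCH[^\]]*\]`)},
}

// classify returns the (class, source, captures) triple for the first matching rule.
func classify(subject string) (class, source string, captures map[string]string) {
	for _, r := range rules {
		m := r.subject.FindStringSubmatch(subject)
		if m == nil {
			continue
		}
		captures = map[string]string{}
		for i, name := range r.subject.SubexpNames() {
			if name != "" {
				captures[name] = m[i] // copy each named group into the captures map
			}
		}
		return r.class, r.source, captures
	}
	return "mailing_list_only", "", nil // assumed fallback when nothing matches
}

func main() {
	for _, s := range []string{
		"[jira] [Created] (KAFKA-20167) Fix flaky test",
		"[PATCH v2 1/3] pci: tidy probe path",
		"[VOTE] Release 1.0.0 RC1",
	} {
		class, source, caps := classify(s)
		fmt.Println(class, source, caps)
	}
}
```

Rule order is the whole contract here: a more specific rule has to sit above a broader one, or it never fires.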
6. Sender and signaled-repo resolution
6.1 Sender identity (§5d)
Sender resolution happens at drain time in the MailingListProcessor (not in the fetch worker; see §8.1). For the inline stamp on the messages row the Processor does a DB lookup (ResolveContributorIDByEmail), cached per list so a recurring sender is resolved once. Senders that don’t resolve from the DB keep their sender_email and are retried two ways as the contributors table fills: the existing BackfillMailingListSenderIDs ticker (hourly DB re-lookup), and the v0.25.x runMailingListSenderResolve ticker, which takes senders whose message count meets a threshold and runs them through the shared ResolveEmailToIdentity chain (noreply → bot → DB → GitHub Search → GitHub global commit-search), the same chain the commit resolver uses. Global commit-search is the load-bearing step: list senders are largely committers, so a sender who keeps their profile email private still resolves via a public commit they authored anywhere on GitHub. See Contributor Resolution → Shared email→identity resolution.
6.2 Signaled repo — two columns, never block
Which repo a message concerns is captured as a pair:
- signaled_repo_url: the canonical repo URL extracted from the message’s signal (a bot/mirror [repo] bracket, a github.com/owner/repo body URL, a GH-NNNNN key). Captured even if that repo isn’t in the catalog.
- signaled_repo_id: the FK to repos, filled in only once the URL resolves to a repo we hold.
Resolution is bidirectional and non-blocking: mail-side at write time via FindRepoByURL; repo-side when a new repo is created, by sweeping email_message (ResolveSignaledRepoForURL). An unresolved signaled_repo_id means “we captured a real signal pointing at a repo we don’t have loaded” (e.g. Arrow’s github@ list naming apache/arrow-rs when only apache/arrow is tracked) — not a defect. Tracking the whole org (load-foundation-orgs) drives resolution toward 100%.
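The "two columns, never block" idea can be illustrated with a small sketch. The regular expression, the lowercase canonicalization, the in-memory catalog map, and the use of 0 to stand in for a NULL signaled_repo_id are all assumptions; the real lookup is FindRepoByURL against the database.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// repoURLPattern finds a github.com/owner/repo reference in a message body.
var repoURLPattern = regexp.MustCompile(`github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)`)

// signal holds the Axis B pair. ID stays 0 (standing in for NULL) until the URL resolves.
type signal struct {
	URL string
	ID  int64
}

// extractSignal always records the URL it finds, and fills ID only if the
// catalog (a stand-in for the repos table) already holds that repo.
func extractSignal(body string, catalog map[string]int64) signal {
	m := repoURLPattern.FindStringSubmatch(body)
	if m == nil {
		return signal{}
	}
	repo := strings.TrimSuffix(strings.TrimRight(m[2], "."), ".git")
	url := "https://github.com/" + strings.ToLower(m[1]+"/"+repo)
	return signal{URL: url, ID: catalog[url]} // missing key yields 0: captured but unresolved
}

func main() {
	catalog := map[string]int64{"https://github.com/apache/arrow": 7}
	fmt.Println(extractSignal("See https://github.com/apache/arrow/issues/1", catalog))
	fmt.Println(extractSignal("See https://github.com/apache/arrow-rs/pull/12", catalog))
}
```

The second call shows the unresolved case from the paragraph above: the URL is kept, the ID is empty, and a later repo-side sweep can fill it in without re-reading the mail.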
7. Schema (v0.25.7)
-- platforms gains row 6
INSERT INTO aveloxis_data.platforms (platform_id, ...) VALUES (6, 'Mailing List', ...);
-- email_message: the first-class entity (declared AFTER issues/pull_requests/
-- messages in schema.sql — it FK-references all three; see the ordering tripwire)
CREATE TABLE aveloxis_data.email_message ( ... );
CREATE TABLE aveloxis_data.email_message_ref ( ... ); -- bridge to messages
-- issues gains an external key for Jira/Bugzilla import correlation
ALTER TABLE aveloxis_data.issues ADD COLUMN external_key TEXT DEFAULT '';
-- partial unique: (repo_id, external_key) WHERE external_key <> ''
-- repo_groups_list_serve gains the claim/checkpoint/lock columns
mlls_system, mlls_last_month, mlls_scan_complete,
mlls_failed_attempts, mlls_last_failed_at, mlls_last_run,
mlls_locked_at, mlls_locked_pid, mlls_locked_boot_id
-- + UNIQUE (repo_group_id, rgls_email)
Per-column documentation is in docs/schema.md. See docs/contributing/schema-migrations.md for the table-ordering rule that the v0.25.9 Phase 4 run surfaced (an FK-bearing table must be CREATEd after its referenced tables, since schema.sql runs as one transaction).
8. Worker architecture
aveloxis_data.repo_groups_list_serve.mlls_*
▲
┌───────────┼────────────┐
▼ ▼
ClaimNextList CheckpointListMonth /
(FOR UPDATE SKIP LOCKED) CompleteListScan / RecordListFailure
│ ▲
▼ │
┌───────────┐ per-system ┌─────────┐
│ dispatcher │──jobs chan─▶│ runner │
│ per system │ │ pool │
└───────────┘ │ (N=2) │
└─────────┘
│ claim→fetch→classify→STAGE→checkpoint
▼
ArchiveSource (Pony Mail | public-inbox)
- Claim: ClaimNextList(system, cadence, staleLock, pid, bootID) acquires a list with FOR UPDATE SKIP LOCKED, gated on cadence and on stale-lock recovery (MailingListStaleLock = 2h; a lock older than that is presumed dead, the v0.21.0 (pid, boot_id) recovery shape). RecoverStaleListLocks runs at startup.
- Checkpoint: each completed month stamps mlls_last_month via CheckpointListMonth, so an interrupted scan resumes from where it stopped rather than re-fetching.
- Months to scan: from mlls_last_month forward to the current month; for a never-scanned list, from FirstMonth (full history) when mailing_list_backfill_months <= 0, else the recent N-month window.
- Failure backoff (v0.21.4 quadratic, base 120s): RecordListFailure schedules 2m → 8m → 18m → … and sidelines the list after MailingListMaxFailures = 10 consecutive failures.
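The backoff schedule above is consistent with delay = base × n², where n is the consecutive-failure count. The sketch below assumes exactly that formula (it reproduces 2m, 8m, 18m from a 120 s base); the function and constant names are hypothetical, not the identifiers used in the worker.

```go
package main

import "fmt"

const (
	backoffBaseSeconds = 120 // base stated above
	maxFailures        = 10  // mirrors MailingListMaxFailures
)

// backoffSeconds returns the wait after the n-th consecutive failure (n >= 1),
// assuming a quadratic schedule: base * n^2.
func backoffSeconds(n int) int {
	if n < 1 {
		return 0
	}
	return backoffBaseSeconds * n * n
}

// sidelined reports whether a list has used up its retries.
func sidelined(failures int) bool {
	return failures >= maxFailures
}

func main() {
	for n := 1; n <= 4; n++ {
		fmt.Printf("failure %d -> wait %dm\n", n, backoffSeconds(n)/60)
	}
	fmt.Println("sidelined after 10 failures:", sidelined(10))
}
```

Under this formula the ninth consecutive failure waits 162 minutes, so a list that keeps failing backs away quickly before it is sidelined.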
8.1 Staging split (v0.25.x)
The pipeline is split into a fetch half and a resolve+write half across a staging table, for the same reason the API pipeline stages: doing per-message sender-resolution + hot-table writes inline (on every fetched message, across concurrent list runners) reproduces Augur’s lock contention on contributors / issues / pull_requests. The split keeps the fetchers off the hot tables.
- MailingListWorker (fetch half): claim → fetch a month → classify each message (cheap, no DB) → stage the classified envelope into aveloxis_ops.mailing_list_staging → checkpoint. It never touches the hot tables.
- MailingListProcessor (resolve+write half): drains mailing_list_staging one list at a time, single-threaded (mailing_list_processor_workers, default 1; a value >1 only fans out across distinct lists via an in-process per-list guard). Per drained message it resolves the repo (once per list, from the staged repo_group_id), resolves the sender (per-list cached), resolves mirror-links / signaled-repo, and writes email_message + (for non-mirror) messages + email_message_ref.
- Deferral: a list whose repo_group has no repo yet is left staged (messages.repo_id is NOT NULL); it drains automatically once load-foundation-orgs / DOAP-enrichment populates the group. aveloxis mailing-list-stats surfaces these stuck lists. The hourly staging sweep is processed-gated, so undrained rows are never purged.
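One way an in-process per-list guard could work is a mutex-protected set of lists currently being drained. This is a sketch under that assumption; the listGuard type and its methods are hypothetical and not taken from mailinglist_processor.go.

```go
package main

import (
	"fmt"
	"sync"
)

// listGuard lets at most one goroutine drain a given list at a time,
// while different lists can be drained concurrently.
type listGuard struct {
	mu     sync.Mutex
	active map[string]bool
}

func newListGuard() *listGuard {
	return &listGuard{active: map[string]bool{}}
}

// tryAcquire returns false if another worker already holds the list.
func (g *listGuard) tryAcquire(list string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.active[list] {
		return false
	}
	g.active[list] = true
	return true
}

// release frees the list so a later drain pass can pick it up.
func (g *listGuard) release(list string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.active, list)
}

func main() {
	g := newListGuard()
	fmt.Println(g.tryAcquire("dev@kafka.apache.org"))   // first worker gets the list
	fmt.Println(g.tryAcquire("dev@kafka.apache.org"))   // second worker is turned away
	fmt.Println(g.tryAcquire("users@kafka.apache.org")) // a different list is free
	g.release("dev@kafka.apache.org")
	fmt.Println(g.tryAcquire("dev@kafka.apache.org")) // available again after release
}
```

A guard of this kind keeps per-list caches (repo, sender) single-writer even when the processor worker count is raised above 1.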
9. Mirror handling
collection.mailing_list_mirror_handling controls what happens to mirror-class mail (github_mirror) — notification lists that merely echo GitHub activity Aveloxis already collects via the API:
| Value | Behavior |
|---|---|
|  | drop mirror mail entirely (the API copy is authoritative) |
|  | record the |
|  | store everything, including the body |
The default avoids wholesale-duplicating GitHub data into a second form while keeping the linkage and timeline.
10. Operator CLI
aveloxis load-foundation-core-repos # one core repo per project (was: import-foundations)
aveloxis load-foundation-orgs --yes # track the foundation's GitHub org(s) for repo discovery
aveloxis load-apache-lists # register per-PMC dev@/users@ lists via enumeration
# register one list (any system, e.g. the kernel)
aveloxis register-mailing-list \
  --system lore_public_inbox --list linux-pci@vger.kernel.org --repo https://github.com/torvalds/linux
aveloxis backfill-issue-external-keys # populate issues.external_key from [KEY-N] title prefixes (conflict-safe)
aveloxis backfill-mailing-list-projection # project existing issue_event mail → issues, in place (Phase 5)
aveloxis mailing-list-stats # coverage rollup (+ missed-LINK shadow guard)
aveloxis verify-mailing-list [--strict] # Phase 4 branch-coverage harness (§12)
See docs/guide/commands.md for full flag references. The REST rollup is GET /api/v1/mailing-list/stats (docs/guide/api.md).
11. Config knobs
All under collection in aveloxis.json:
{
"collection": {
"mailing_list_enabled": false, // master switch
"mailing_list_workers": 2, // concurrent list runners per system
"mailing_list_cadence_days": 30, // tail-refresh cadence
"mailing_list_backfill_months": 6, // history window when no checkpoint (<=0 = full history)
"mailing_list_polite_email": "", // contact in the User-Agent for archive admins
"mailing_list_mirror_handling": "metadata_only"
}
}
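As a rough illustration of how these knobs and their defaults fit together, the sketch below decodes a partial config on top of the defaults shown above. The struct and function names are hypothetical, and it uses the standard encoding/json package, which rejects // comments, so the annotated block above would need its comments stripped before being fed to a decoder like this one.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// collectionConfig carries the six keys listed above; the Go field names are assumed.
type collectionConfig struct {
	MailingListEnabled        bool   `json:"mailing_list_enabled"`
	MailingListWorkers        int    `json:"mailing_list_workers"`
	MailingListCadenceDays    int    `json:"mailing_list_cadence_days"`
	MailingListBackfillMonths int    `json:"mailing_list_backfill_months"`
	MailingListPoliteEmail    string `json:"mailing_list_polite_email"`
	MailingListMirrorHandling string `json:"mailing_list_mirror_handling"`
}

type fileConfig struct {
	Collection collectionConfig `json:"collection"`
}

// defaults returns the values shown in the block above.
func defaults() fileConfig {
	return fileConfig{Collection: collectionConfig{
		MailingListEnabled:        false,
		MailingListWorkers:        2,
		MailingListCadenceDays:    30,
		MailingListBackfillMonths: 6,
		MailingListPoliteEmail:    "",
		MailingListMirrorHandling: "metadata_only",
	}}
}

// load overlays the keys present in raw onto the defaults;
// keys absent from raw keep their default values.
func load(raw []byte) (fileConfig, error) {
	cfg := defaults()
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return fileConfig{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := load([]byte(`{"collection": {"mailing_list_enabled": true, "mailing_list_backfill_months": 0}}`))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg.Collection)
}
```

In this example the operator opts in and asks for full history (backfill months 0), while workers, cadence, and mirror handling stay at their defaults.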
11c. Layer 2 projection — mailing-list → canonical entities (Phase 3)
Layer 1 (every email → email_message + body + classification + threading) is universal and lossless. Layer 2 additionally projects a message onto a canonical entity (issues / pull_requests / pull_request_reviews) only where the mail maps cleanly to how that community operates — gated by the per-system projection_policy in systems.yaml (clean_fit for Apache; none for the forge-less kernel). The processor reads the policy via System.ProjectionClean().
Analytical purpose: before this subsystem, Apache projects’ issue data was absent (Apache tracks issues in Jira/Bugzilla, not GitHub Issues). Projected issues land under the PMC’s GitHub repo_id — issues has no platform_id column, the repo carries the platform — so they appear in that repo’s standard per-repo issue analytics exactly like native issues. Provenance lives in external_key + data_source ('JIRA') + tool_source.
Phase A (shipped) — issue_event → issues link-or-create (MailingListProcessor, drain-time):
- An issue_event message with a parsed external_key (e.g. KAFKA-123): LINK if an issue for that key already exists, matched by external_key OR by the bracketed [KEY] in a native issue’s title (the Apache Jira→GitHub import shape). LINK-by-title prevents the missed-LINK shadow: without it, projecting before backfill-issue-external-keys would mint a synthetic that squats the key (the UNIQUE index then blocks the native issue from getting it). Else → CREATE a synthetic issue (negative, deterministic platform_issue_id, idempotent on (repo_id, platform_issue_id)).
- Thread-inheritance (#1): once any message in a thread is projected onto an issue, the rest of the thread (human discussion, Re: replies, discuss-class mail that carries no key) inherits that issue (via thread_root_id, cache + FindIssueForThread). So the full email history attaches, not just the Jira-notification stream.
- Every projected email is bridged as a comment (issue_message_ref); issues.comment_count is recomputed so threads show in analytics.
- reporter_id is the resolved sender only when it is not the jira@ / bot sender; real-actor-from-body parsing is a follow-up.
- email_message.projected_kind records the outcome (issue / pr / review / mailing_list_only).
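The "negative, deterministic platform_issue_id" can be pictured as a hash of the key folded into the negative range. The sketch below uses FNV-1a over the repo ID and external key, which is purely an assumption; the text above does not say how Aveloxis derives the value, only that it is negative, deterministic, and idempotent on (repo_id, platform_issue_id).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// syntheticIssueID maps (repoID, externalKey) to a stable negative ID.
// Negative values cannot collide with real forge IDs, which are positive.
func syntheticIssueID(repoID int64, externalKey string) int64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d|%s", repoID, externalKey)
	v := int64(h.Sum64() >> 1) // drop the top bit so v fits in [0, MaxInt64]
	return -v - 1              // shift into [MinInt64, -1]; never zero or positive
}

func main() {
	a := syntheticIssueID(42, "KAFKA-123")
	b := syntheticIssueID(42, "KAFKA-123")
	fmt.Println(a < 0, a == b) // same input gives the same negative ID on every run
}
```

Determinism is what makes the CREATE path safe to repeat: projecting the same key twice computes the same ID, so an upsert on (repo_id, platform_issue_id) lands on the existing row instead of minting a second synthetic.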
Backfill (Phase 5): aveloxis backfill-mailing-list-projection runs the same projection over email_message rows collected before the projection code existed — in place, no re-collection. Three idempotent steps to convergence: keyed projection → thread-inheritance → mark-remaining. aveloxis mailing-list-stats surfaces any missed-LINK shadows (synthetic issue whose key sits in a native issue’s title) for remediation; the conflict-safe backfill-issue-external-keys no longer errors on them.
Sender attribution (Phases 2+4): senders the DB can’t resolve are run through the shared email→identity chain; direct-human senders that still don’t resolve get an email-only contributor (random cntrb_id, cntrb_email set) so they’re counted and ride the convergence ticker. Bot/relay senders (jira@, git@, CI) never become contributors.
Phase B (verified, NOT built) — PR/review synthesis from github_mirror mail. Verification (2026-06-04, summary/12 §3) settled it: pull_requests.platform_pr_id stores the GitHub PR databaseId, but mirror mail carries only the PR number — a synthesized PR keyed on the number would duplicate the API collector’s row rather than merge. Decision: don’t synthesize; the lever for full Apache PR data is org collection (load-foundation-orgs) + the existing github_mirror LINK path (which already covers collected PRs correctly). linked_pr_review_id remains in the schema should a future uncollectable-sibling case justify a number→databaseId resolution step.
For projection_policy: none (kernel): none of the above runs — a [PATCH] is not a PR; Layer 1 is the faithful record.
11d. Forge-less PR-equivalents — the special case (Phase C)
Some communities — most notably the Linux kernel (lore.kernel.org,
projection_policy: none) — do code review entirely by email. A [PATCH]
thread is the pull request; the Re: replies are the review. There is no
forge, and therefore no pull_requests / pull_request_reviews entity to
project onto.
The special case, stated plainly: Aveloxis deliberately does NOT
synthesize pull_requests rows for these. Fabricating a “PR” for a community
that doesn’t use one misrepresents how it works and would pollute the real PR
tables (the §1 governing principle: project only where it’s a clean, faithful
fit). The faithful record is the email_message rows themselves —
msg_class IN ('patch_submission','review') + thread_root_id grouping.
To make that ergonomic without a fake forge entity, Phase C ships a read-only
VIEW, aveloxis_data.mailing_list_pr_equivalents, that groups those mail
threads and presents each as a PR-equivalent:
| column | meaning |
|---|---|
|  | the patch-series identity (cover letter |
|  | the registered repo + list |
|  | from the series root patch (author resolved to a contributor when known) |
|  | first / last message in the thread |
|  | thread aggregates |
|  | always |
Two properties worth knowing:
- It is a plain VIEW: zero storage, never materialized/refreshed, and intentionally absent from the matview refresh list. Querying it always reflects current email_message data.
- It is empty until forge-less lists are collected. The filter msg_class IN ('patch_submission','review') is itself the forge-less gate: the 2026-06-03 survey found Apache produces zero of these classes (it is GitHub-PR-native), so the view contains only kernel-style mail and stays empty until lore/public-inbox lists (register-mailing-list) are actually collected. That’s by design, not a bug.
Analysts who want “PR-like” activity for forge-less projects query this view;
pull_requests stays exclusively real forge data.
12. Verification (Phase 4) and the collection-ordering caveat
aveloxis verify-mailing-list is the branch-coverage harness: it reports, per logic branch, whether the subsystem produced any rows — every msg_class, both backends, each routing outcome, threading, signaled-repo resolution, sender resolution, and external_key backfill — each marked PASS / EMPTY / DEFER. --strict exits non-zero if a required (mailing-list-native) branch is empty, so it can gate a verification collection.
Collection ordering matters. Five branches are cross-subsystem: bridged-to-issue, bridged-to-PR, mirror-linked, sender-resolved, and external_key. They resolve inline only when the linked repos’ GitHub issues/PRs/contributors are already present when the mail is written. In a fresh run that collects GitHub and mailing lists concurrently, those references stay NULL until the periodic backfills catch up, so the harness reports them as DEFER (informational, not gating). To exercise them in a verification run, collect the linked repos’ GitHub data first (or re-collect the lists afterward), then run backfill-issue-external-keys and let the sender-backfill ticker run.
13. What the subsystem does NOT do
- Does NOT re-import data GitHub already has. Mirror lists default to metadata_only; the API copy of a PR/issue is authoritative. The subsystem captures the linkage, not a second body.
- Does NOT enumerate lore.kernel.org. The public-inbox catalog is Anubis-gated; kernel lists are curated via register-mailing-list.
- Does NOT classify GitLab list mail specially. The classifier is system-driven; a GitLab-oriented system definition can be added to systems.yaml if needed.
- Does NOT block on unresolved signals. An unresolved signaled_repo_id or sender is retried later, never an error.
- Does NOT collect attachments. Patch bodies are parsed for classification (has_patch), but binary attachments aren’t stored.
14. Cross-references
- Architectural cousins: distribution.md and scancode.md, the same decoupled-pool pattern for different domains.
- Adding a backend: docs/contributing/adding-a-platform.md (the ArchiveSource interface follows the platform-extension shape).
- Source-of-truth files:
  - internal/mailinglist/: systems.yaml + classifier, defensive.go (Pacer/Breaker), archive.go (interface + mbox/RFC822 parse), ponymail.go, publicinbox.go
  - internal/collector/mailinglist_worker.go: fetch→classify→STAGE→checkpoint (the fetch half)
  - internal/collector/mailinglist_processor.go: drain staging → resolve sender/mirror/repo → project (Layer 2) → write email_message / messages / bridges (the resolve+write half)
  - internal/db/mailinglist_staging_store.go, internal/db/mailinglist_projection_store.go, internal/db/mailinglist_sender_resolve_store.go
  - internal/db/email_message_store.go, internal/db/mailinglist_state_store.go
  - cmd/aveloxis/{load_foundation_orgs,load_apache_lists,register_mailing_list,backfill_external_keys,mailing_list_stats,verify_mailing_list}.go
  - internal/api/server.go: handleMailingListStats
- Design archive: summary/10-apache-history-ingestion.md, summary/11-apache-mailing-list-implementation-plan.md, summary/12-mailing-list-projection.md.