Mailing-List Ingestion (v0.25.7+)
Mailing-list archives are collected by the MailingListWorker, a dedicated decoupled worker pool inside aveloxis serve. It ingests email from project mailing-list archives into the canonical Aveloxis tables, classifies each message, and — where the email corresponds to an issue, pull request, or review — projects it onto those entities. Off by default; opt in via collection.mailing_list_enabled = true.
This document explains why the system looks the way it does, what the data is good for, and how to interpret it. Operator-facing tuning lives in docs/getting-started/configuration.md; the CLI is in docs/guide/commands.md. The full design rationale (sampling, migration patterns, decision log) is in summary/10-apache-history-ingestion.md and summary/11-apache-mailing-list-implementation-plan.md.
1. The question this subsystem answers
A great deal of open-source project activity never appears in the GitHub/GitLab API: design discussion, release votes, patch review on lists that predate (or replace) pull requests, and the Jira/Bugzilla issue history of projects that migrated to GitHub issues. The MailingListWorker recovers that activity.
Every email becomes a first-class email_message entity (a peer to issues, pull_requests, and pull_request_reviews), with its body stored in the shared messages table via an email_message_ref bridge — the same unified-message architecture every other text source uses. Classification then routes the message onto the canonical home it belongs in:
- an issue-tracker notification → linked to an issues row;
- a patch submission (kernel-style [PATCH]) → a pull-request-equivalent;
- a Reviewed-by: / review reply → a pull-request-review-equivalent;
- everything else (votes, announcements, discussion, support) → stays a standalone mailing_list_only message.
1.1 Two orthogonal axes
The design keeps two questions separate (this is the single most important thing to understand about the subsystem):
- Axis A (message class): what kind of email is this? (issue notification, patch, review, vote, announcement, discussion, …). Stored in email_message.msg_class.
- Axis B (repo association): which repo/PMC does it concern? Stored as the signaled_repo_url / signaled_repo_id pair (see §6).
A message can be classified (Axis A) without resolving to a known repo (Axis B), and vice-versa. The leftovers on each axis are explicit: mailing_list_only / unclassified on Axis A, an unresolved signaled_repo_id on Axis B. Nothing is silently dropped.
2. Why a dedicated worker
The work is independent of every per-repo collection phase, and follows the same decoupled-pool pattern v0.21.0 applied to scancode and v0.24.0 to the DistributionWorker:
- No API tokens. Apache Pony Mail is unauthenticated; the kernel public-inbox path is a git clone. The optional polite_email only sets a contact header.
- Its own cadence. Lists are tail-refreshed on a 30-day default cadence, much slower than main collection, and a one-time full-history backfill runs when a list has no checkpoint.
- Its own claim queue. Lists are claimed from aveloxis_data.repo_groups_list_serve (the mlls_* columns), entirely separate from the repo collection_queue. Enabling the subsystem does not slow per-repo collection.
3. email_message — the first-class entity
email_message ──email_message_ref──▶ messages (body, platform_id = 6)
│
├─ signaled_repo_id ──▶ repos (which repo it concerns, Axis B)
├─ linked_issue_id ──▶ issues (routed: issue-tracker mail)
├─ linked_pull_request_id──▶ pull_requests(routed: patch / PR-equivalent)
└─ thread_root_id (threading: In-Reply-To / References)
platform_id = 6 ('Mailing List') tags every message-table row sourced from a list. Metadata convention on each row: data_source = the specific list address (e.g. dev@kafka.apache.org), tool_source = "Aveloxis Mailing List Collector", tool_version = the release, data_collection_date = load time.
4. The two archive backends
Backends implement a common ArchiveSource interface (Name, EnumerateLists, FirstMonth, FetchMonth), so adding a third archive system is a config + one-file job. The worker spawns one runner pool per registered system.
4.1 Apache Pony Mail (apache_ponymail)
The Foal API at lists.apache.org:
| Endpoint | Use |
|---|---|
| | bulk monthly mbox download (the data path) |
| | per-domain list catalog (enumeration) |
| | list-level |
The mbox stream is parsed as mboxrd, with MIME multipart and quoted-printable/base64 decoding. A 404 is a clean empty-month miss; a 429 is rate-limited (feeds the Pacer); a 5xx or transport error is transient (feeds the Breaker).
The stats.lua window matters. firstYear/firstMonth are list-level metadata returned for any date window. FirstMonth therefore uses the cheapest window (d=lte=1d, ~1 s / ~50 KB). An earlier d=lte=30y forced Pony Mail to aggregate the list’s entire history and stream back every message (~18 MB / ~35 s on a busy list), which timed out the worker. Regression tripwire: TestPonyMailFirstMonthUsesCheapWindow.
4.2 lore.kernel.org public-inbox (lore_public_inbox)
lore’s HTTP surface is Anubis-gated, so the sanctioned bulk path is a bare git clone of the per-list public-inbox archive (https://lore.kernel.org/<list>/git/0.git). FetchMonth walks the archive’s commits within the month window and reads each message from the m blob (git cat-file -p <hash>:m). Enumeration returns nil (the catalog isn’t machine-listable under Anubis); kernel lists are curated via register-mailing-list.
5. Classification
internal/mailinglist/systems.yaml defines per-system ordered rules (subject regex, body URL, sender, List-Id, list-address) compiled at load. System.Classify(msg) returns the first match as a (class, source, captures) triple. The eleven classes:
Class |
Routes to |
Typical source |
|---|---|---|
|
|
|
|
|
kernel |
|
|
|
|
linked issue/PR (mirror) |
|
|
(metadata) |
|
|
|
human discussion, votes, releases, Q&A |
captures carries structured data the rule extracted — e.g. {external_key: "KAFKA-20167"} (the Jira key, used to bridge to an issue) or {repo: "arrow-rs"} (the repo signal, used for Axis B).
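A first-match classifier with structured captures might look like the sketch below. It is illustrative only: the Rule and Result types, the issue_notification class name, and both subject patterns are assumptions rather than the contents of systems.yaml, and it matches on the subject alone although the real rules also use body URL, sender, List-Id, and list address. The capture keys external_key and repo, and the github_mirror and mailing_list_only class names, are taken from this document.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule is a hypothetical compiled rule: the first matching rule wins.
type Rule struct {
	Class   string
	Source  string
	Subject *regexp.Regexp
}

// Result mirrors the (class, source, captures) triple described above.
type Result struct {
	Class    string
	Source   string
	Captures map[string]string
}

// rules is ordered; the patterns are assumed for illustration.
var rules = []Rule{
	{Class: "issue_notification", Source: "subject_regex",
		Subject: regexp.MustCompile(`^\[jira\].*\((?P<external_key>[A-Z][A-Z0-9]*-\d+)\)`)},
	{Class: "github_mirror", Source: "subject_regex",
		Subject: regexp.MustCompile(`^\[GitHub\] \[(?P<repo>[A-Za-z0-9._-]+)\]`)},
}

// Classify returns the first matching rule's class plus its named
// capture groups. Falling back to mailing_list_only is an assumption.
func Classify(subject string) Result {
	for _, r := range rules {
		m := r.Subject.FindStringSubmatch(subject)
		if m == nil {
			continue
		}
		caps := map[string]string{}
		for i, name := range r.Subject.SubexpNames() {
			if name != "" {
				caps[name] = m[i]
			}
		}
		return Result{Class: r.Class, Source: r.Source, Captures: caps}
	}
	return Result{Class: "mailing_list_only", Captures: map[string]string{}}
}

func main() {
	r := Classify("[jira] [Created] (KAFKA-20167) Fix rebalance")
	fmt.Println(r.Class, r.Captures["external_key"])
}
```

Named capture groups keep the rule definition and the extracted data in one place, so a new key needs no code change outside the rule itself.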
6. Sender and signaled-repo resolution
6.1 Sender identity (§5d)
The sender email is resolved to a contributor via the same ResolveContributorIDByEmail chain the commit resolver uses, and stamped on the messages row. Unresolved senders keep their sender_email and are retried by a periodic BackfillMailingListSenderIDs ticker (hourly) as the contributors table fills from ongoing collection. List senders are largely committers, so resolution improves over time and after the linked repos’ GitHub collection completes.
6.2 Signaled repo — two columns, never block
Which repo a message concerns is captured as a pair:
- signaled_repo_url: the canonical repo URL extracted from the message’s signal (a bot/mirror [repo] bracket, a github.com/owner/repo body URL, a GH-NNNNN key). Captured even if that repo isn’t in the catalog.
- signaled_repo_id: the FK to repos, filled in only once the URL resolves to a repo we hold.
Resolution is bidirectional and non-blocking: mail-side at write time via FindRepoByURL; repo-side when a new repo is created, by sweeping email_message (ResolveSignaledRepoForURL). An unresolved signaled_repo_id means “we captured a real signal pointing at a repo we don’t have loaded” (e.g. Arrow’s github@ list naming apache/arrow-rs when only apache/arrow is tracked) — not a defect. Tracking the whole org (load-foundation-orgs) drives resolution toward 100%.
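The two-column, never-block behavior can be sketched as follows. This is a simplified illustration that handles only the github.com/owner/repo body-URL signal; the Signal type, the resolveSignal function, the lowercase canonical form, and the in-memory catalog map (standing in for FindRepoByURL) are all assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// repoURL matches the first github.com/owner/repo occurrence in a body.
var repoURL = regexp.MustCompile(`github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)`)

// Signal holds the pair described above: the URL is always kept,
// the ID stays nil until the repo is present in the catalog.
type Signal struct {
	URL    string
	RepoID *int64
}

// resolveSignal extracts a repo URL and looks it up in a catalog map.
// A miss is not an error: the URL is returned with a nil RepoID.
func resolveSignal(body string, catalog map[string]int64) (Signal, bool) {
	m := repoURL.FindStringSubmatch(body)
	if m == nil {
		return Signal{}, false
	}
	// Drop sentence punctuation and a trailing ".git" from the repo name.
	repo := strings.TrimSuffix(strings.TrimRight(m[2], "."), ".git")
	sig := Signal{URL: "https://github.com/" + strings.ToLower(m[1]+"/"+repo)}
	if id, ok := catalog[sig.URL]; ok {
		sig.RepoID = &id
	}
	return sig, true
}

func main() {
	catalog := map[string]int64{"https://github.com/apache/arrow": 1}
	sig, _ := resolveSignal("See https://github.com/apache/arrow-rs/pull/42", catalog)
	fmt.Println(sig.URL, sig.RepoID == nil)
}
```

Because the URL is stored regardless of the lookup result, a later repo-side sweep can fill in the ID without re-reading the mail.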
7. Schema (v0.25.7)
-- platforms gains row 6
INSERT INTO aveloxis_data.platforms (platform_id, ...) VALUES (6, 'Mailing List', ...);
-- email_message: the first-class entity (declared AFTER issues/pull_requests/
-- messages in schema.sql — it FK-references all three; see the ordering tripwire)
CREATE TABLE aveloxis_data.email_message ( ... );
CREATE TABLE aveloxis_data.email_message_ref ( ... ); -- bridge to messages
-- issues gains an external key for Jira/Bugzilla import correlation
ALTER TABLE aveloxis_data.issues ADD COLUMN external_key TEXT DEFAULT '';
-- partial unique: (repo_id, external_key) WHERE external_key <> ''
-- repo_groups_list_serve gains the claim/checkpoint/lock columns
mlls_system, mlls_last_month, mlls_scan_complete,
mlls_failed_attempts, mlls_last_failed_at, mlls_last_run,
mlls_locked_at, mlls_locked_pid, mlls_locked_boot_id
-- + UNIQUE (repo_group_id, rgls_email)
Per-column documentation is in docs/schema.md. See docs/contributing/schema-migrations.md for the table-ordering rule that the v0.25.9 Phase 4 run surfaced (an FK-bearing table must be CREATEd after its referenced tables, since schema.sql runs as one transaction).
8. Worker architecture
aveloxis_data.repo_groups_list_serve.mlls_*
▲
┌───────────┼────────────┐
▼ ▼
ClaimNextList CheckpointListMonth /
(FOR UPDATE SKIP LOCKED) CompleteListScan / RecordListFailure
│ ▲
▼ │
┌───────────┐ per-system ┌─────────┐
│ dispatcher │──jobs chan─▶│ runner │
│ per system │ │ pool │
└───────────┘ │ (N=2) │
└─────────┘
│ claim→fetch→classify→resolve→route→checkpoint
▼
ArchiveSource (Pony Mail | public-inbox)
- Claim: ClaimNextList(system, cadence, staleLock, pid, bootID) acquires a list with FOR UPDATE SKIP LOCKED, gated on cadence and on stale-lock recovery (MailingListStaleLock = 2h; a lock older than that is presumed dead, the v0.21.0 (pid, boot_id) recovery shape). RecoverStaleListLocks runs at startup.
- Checkpoint: each completed month stamps mlls_last_month via CheckpointListMonth, so an interrupted scan resumes from where it stopped rather than re-fetching.
- Months to scan: from mlls_last_month forward to the current month; for a never-scanned list, from FirstMonth (full history) when mailing_list_backfill_months <= 0, else the recent N-month window.
- Failure backoff (v0.21.4 quadratic, base 120s): RecordListFailure schedules 2m → 8m → 18m → … and sidelines the list after MailingListMaxFailures = 10 consecutive failures.
9. Mirror handling
collection.mailing_list_mirror_handling controls what happens to mirror-class mail (github_mirror) — notification lists that merely echo GitHub activity Aveloxis already collects via the API:
| Value | Behavior |
|---|---|
| | drop mirror mail entirely (the API copy is authoritative) |
| | record the |
| | store everything, including the body |
The default avoids wholesale-duplicating GitHub data into a second form while keeping the linkage and timeline.
10. Operator CLI
aveloxis load-foundation-core-repos # one core repo per project (was: import-foundations)
aveloxis load-foundation-orgs --yes # track the foundation's GitHub org(s) for repo discovery
aveloxis load-apache-lists # register per-PMC dev@/users@ lists via enumeration
aveloxis register-mailing-list \ # register one list (any system, e.g. the kernel)
--system lore_public_inbox --list linux-pci@vger.kernel.org --repo https://github.com/torvalds/linux
aveloxis backfill-issue-external-keys # populate issues.external_key from [KEY-N] title prefixes
aveloxis mailing-list-stats # coverage rollup
aveloxis verify-mailing-list [--strict] # Phase 4 branch-coverage harness (§12)
See docs/guide/commands.md for full flag references. The REST rollup is GET /api/v1/mailing-list/stats (docs/guide/api.md).
11. Config knobs
All under collection in aveloxis.json:
{
"collection": {
"mailing_list_enabled": false, // master switch
"mailing_list_workers": 2, // concurrent list runners per system
"mailing_list_cadence_days": 30, // tail-refresh cadence
"mailing_list_backfill_months": 6, // history window when no checkpoint (<=0 = full history)
"mailing_list_polite_email": "", // contact in the User-Agent for archive admins
"mailing_list_mirror_handling": "metadata_only"
}
}
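One way to read these keys into Go is sketched below. The JSON key names come from the block above; the struct names, the load and defaults helpers, and the choice to treat the values shown above as built-in defaults are assumptions. Go's standard encoding/json does not accept // comments, so the sample input in main is comment-free.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CollectionConfig holds only the mailing-list keys shown above.
type CollectionConfig struct {
	MailingListEnabled        bool   `json:"mailing_list_enabled"`
	MailingListWorkers        int    `json:"mailing_list_workers"`
	MailingListCadenceDays    int    `json:"mailing_list_cadence_days"`
	MailingListBackfillMonths int    `json:"mailing_list_backfill_months"`
	MailingListPoliteEmail    string `json:"mailing_list_polite_email"`
	MailingListMirrorHandling string `json:"mailing_list_mirror_handling"`
}

// Config is a hypothetical top-level wrapper for aveloxis.json.
type Config struct {
	Collection CollectionConfig `json:"collection"`
}

// defaults returns the values shown in the block above (assumed defaults).
func defaults() Config {
	return Config{Collection: CollectionConfig{
		MailingListEnabled:        false,
		MailingListWorkers:        2,
		MailingListCadenceDays:    30,
		MailingListBackfillMonths: 6,
		MailingListMirrorHandling: "metadata_only",
	}}
}

// load overlays the JSON onto the defaults, so omitted keys keep them.
func load(raw []byte) (Config, error) {
	cfg := defaults()
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return Config{}, err
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{"collection": {"mailing_list_enabled": true, "mailing_list_backfill_months": 0}}`)
	cfg, err := load(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg.Collection)
}
```

In this sketch a backfill value of 0 survives the overlay, which matters because <= 0 is the documented way to request full history.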
12. Verification (Phase 4) and the collection-ordering caveat
aveloxis verify-mailing-list is the branch-coverage harness: it reports, per logic branch, whether the subsystem produced any rows — every msg_class, both backends, each routing outcome, threading, signaled-repo resolution, sender resolution, and external_key backfill — each marked PASS / EMPTY / DEFER. --strict exits non-zero if a required (mailing-list-native) branch is empty, so it can gate a verification collection.
Collection ordering matters. Five branches are cross-subsystem: bridged-to-issue, bridged-to-PR, mirror-linked, sender-resolved, and external_key. They resolve inline only when the linked repos’ GitHub issues/PRs/contributors are already present when the mail is written. In a fresh run that collects GitHub and mailing lists concurrently, those references stay NULL until the periodic backfills catch up, so the harness reports them as DEFER (informational, not gating). To exercise them in a verification run, collect the linked repos’ GitHub data first (or re-collect the lists afterward), then run backfill-issue-external-keys and let the sender-backfill ticker run.
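The PASS / EMPTY / DEFER reporting and the --strict gate described in this section can be sketched as below. The Branch type, the status and exitCode functions, and the row counts are hypothetical; the sketch assumes a branch passes when it has at least one row, and that an empty cross-subsystem branch is always reported as DEFER.

```go
package main

import "fmt"

// Branch is a hypothetical record for one logic branch of the harness.
type Branch struct {
	Name           string
	Rows           int  // rows the subsystem produced for this branch
	CrossSubsystem bool // depends on GitHub data being present first
}

// status maps a branch to PASS, DEFER, or EMPTY.
func status(b Branch) string {
	switch {
	case b.Rows > 0:
		return "PASS"
	case b.CrossSubsystem:
		return "DEFER" // informational, never gating
	default:
		return "EMPTY"
	}
}

// exitCode is non-zero only in strict mode when a required
// (mailing-list-native) branch is empty.
func exitCode(branches []Branch, strict bool) int {
	if !strict {
		return 0
	}
	for _, b := range branches {
		if status(b) == "EMPTY" {
			return 1
		}
	}
	return 0
}

func main() {
	branches := []Branch{
		{Name: "threading", Rows: 120},
		{Name: "bridged-to-issue", Rows: 0, CrossSubsystem: true},
		{Name: "sender-resolved", Rows: 0, CrossSubsystem: true},
	}
	for _, b := range branches {
		fmt.Println(b.Name, status(b))
	}
	fmt.Println("strict exit code:", exitCode(branches, true))
}
```

Treating DEFER as non-gating keeps a fresh concurrent run from failing verification just because the backfills have not caught up yet.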
13. What the subsystem does NOT do
- Does NOT re-import data GitHub already has. Mirror lists default to metadata_only; the API copy of a PR/issue is authoritative. The subsystem captures the linkage, not a second body.
- Does NOT enumerate lore.kernel.org. The public-inbox catalog is Anubis-gated; kernel lists are curated via register-mailing-list.
- Does NOT classify GitLab list mail specially. The classifier is system-driven; a GitLab-oriented system definition can be added to systems.yaml if needed.
- Does NOT block on unresolved signals. An unresolved signaled_repo_id or sender is retried later, never an error.
- Does NOT collect attachments. Patch bodies are parsed for classification (has_patch), but binary attachments aren’t stored.
14. Cross-references
- Architectural cousins: distribution.md and scancode.md, which apply the same decoupled-pool pattern to different domains.
- Adding a backend: docs/contributing/adding-a-platform.md (the ArchiveSource interface follows the platform-extension shape).
- Source-of-truth files:
  - internal/mailinglist/: systems.yaml + classifier, defensive.go (Pacer/Breaker), archive.go (interface + mbox/RFC822 parse), ponymail.go, publicinbox.go
  - internal/collector/mailinglist_worker.go: claim→fetch→classify→resolve→route→checkpoint
  - internal/db/email_message_store.go, internal/db/mailinglist_state_store.go
  - cmd/aveloxis/{load_foundation_orgs,load_apache_lists,register_mailing_list,backfill_external_keys,mailing_list_stats,verify_mailing_list}.go
  - internal/api/server.go: handleMailingListStats
- Design archive: summary/10-apache-history-ingestion.md, summary/11-apache-mailing-list-implementation-plan.md.