Schema migrations
How to add, change, or backfill database state without losing operator data. The patterns here are the result of several production incidents; the rules are non-negotiable.
The two sources of truth
internal/db/schema.sql— declarative CREATE TABLE / CREATE INDEX / INSERT statements. This is what gets applied to a brand-new database viapg.pool.Exec(ctx, schemaSQL)at the top ofRunMigrations. Fresh installs land here directly.internal/db/migrate.go— imperativeaddColumnIfMissing,execMigrationStep,execCreateIndexConcurrentlycalls that bring existing deployments up to the current schema. Operators with a running database runaveloxis migrateafter deploying a new binary.
Both must reflect the same final state. Adding a column to schema.sql without a corresponding addColumnIfMissing in migrate.go means existing deployments will be missing that column. Adding it to migrate.go without schema.sql means fresh installs will be missing it.
The fail-closed contract (v0.19.4)
Every schema-mutating call in migrate.go must surface errors. The historical bug: addColumnIfMissing used the discard pattern _, _ = pg.pool.Exec(...). ALTER TABLE failures (including typos like the v0.21.0 incident’s aveloxis_scan.scancode_scans.created_at referencing a column that doesn’t exist) silently failed, migrate logged "schema migrations complete", and operators discovered the gap weeks later when a feature queried the missing column.
The post-v0.19.4 contract:
func RunMigrations(ctx context.Context, store *PostgresStore, logger *slog.Logger) error {
var errs []error
pg := store
// Base schema first.
if _, err := pg.pool.Exec(ctx, schemaSQL); err != nil {
errs = append(errs, fmt.Errorf("base schema: %w", err))
}
// Per-version idempotent steps. Each helper appends to `errs` on failure
// and continues to the next step — so a single broken step surfaces every
// OTHER step's failure too, instead of bailing on the first.
addColumnIfMissing(ctx, pg, logger, &errs, "aveloxis_data", "repos",
"scancode_failed_attempts", "INTEGER DEFAULT 0")
execMigrationStep(ctx, pg, logger, &errs,
"v0.23.5 example backfill",
`UPDATE aveloxis_data.foo SET bar = COALESCE(bar, 0)`)
execCreateIndexConcurrently(ctx, pg, logger, &errs,
"aveloxis_data", "idx_foo_bar",
`CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_foo_bar
ON aveloxis_data.foo (bar) WHERE bar > 0`)
if len(errs) > 0 {
return fmt.Errorf("schema migration had %d error(s): %w",
len(errs), errors.Join(errs...))
}
return nil
}
The aggregated error is what aveloxis serve returns from runServe → causes the process to exit non-zero. Operators see every failure at once, fix everything, re-run.
Never introduce a _, _ = pg.pool.Exec(...) in migrate.go. The pre-existing source-contract test TestRunMigrationsNoBareExecWithoutCheck fires the build if you do.
Adding a column
The pattern, idempotent across re-runs:
// internal/db/migrate.go — in RunMigrations
addColumnIfMissing(ctx, pg, logger, &errs, "aveloxis_data", "repos",
"languages", "JSONB DEFAULT '{}'::jsonb")
And internal/db/schema.sql:
CREATE TABLE IF NOT EXISTS aveloxis_data.repos (
repo_id BIGSERIAL PRIMARY KEY,
-- ... existing columns ...
languages JSONB DEFAULT '{}'::jsonb,
-- ...
);
addColumnIfMissing issues ALTER TABLE ... ADD COLUMN IF NOT EXISTS ... so it’s a no-op on databases that already have the column. Set a non-NULL default if existing rows need a sensible value at migration time.
Source-contract test
Pin both sides:
func TestSchemaHasLanguagesColumn(t *testing.T) {
schema, _ := os.ReadFile("schema.sql")
if !strings.Contains(string(schema), "languages") {
t.Error("schema.sql must declare repos.languages")
}
}
func TestMigrateAddsLanguagesColumn(t *testing.T) {
migrate, _ := os.ReadFile("migrate.go")
if !strings.Contains(string(migrate), `"languages"`) {
t.Error("migrate.go must add languages via addColumnIfMissing")
}
}
Integration test (the v0.21.1 lesson)
Source-contract tests verify the code says what you wrote, not that it’s CORRECT against the actual schema. Add an integration test that runs the migration against a real DB:
//go:build integration // gate on AVELOXIS_TEST_DB
func TestLanguagesColumnExistsAfterMigrate(t *testing.T) {
store := openTestStore(t) // applies RunMigrations
ctx := context.Background()
var dataType string
err := store.pool.QueryRow(ctx, `
SELECT data_type FROM information_schema.columns
WHERE table_schema = 'aveloxis_data'
AND table_name = 'repos'
AND column_name = 'languages'`).Scan(&dataType)
if err != nil {
t.Fatalf("languages column not found post-migrate: %v", err)
}
if dataType != "jsonb" {
t.Errorf("expected jsonb, got %s", dataType)
}
}
Adding an index
For indexes on existing populated tables, always use CREATE INDEX CONCURRENTLY. A plain CREATE INDEX acquires ACCESS EXCLUSIVE for the duration of the build — on fleet-scale tables (5M+ rows) that’s minutes of blocked writes, which stalls every collection worker. CONCURRENTLY trades ~2x build time for not blocking writes.
execCreateIndexConcurrently(ctx, pg, logger, &errs,
"aveloxis_data", "idx_repos_owner_name_trgm",
`CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_repos_owner_name_trgm
ON aveloxis_data.repos USING GIN ((repo_owner || '/' || repo_name) gin_trgm_ops)`)
The helper handles IF NOT EXISTS semantics AND the failure-recovery edge case where a prior interrupted CONCURRENTLY build left an INVALID index (cleaned up via DROP INDEX before retry). See v0.20.1.
For indexes in schema.sql (fresh installs), use CREATE INDEX IF NOT EXISTS ... — fresh installs run inside the base-schema transaction, so blocking-during-build isn’t a concern.
Adding a backfill
Backfill = a one-shot UPDATE that brings existing rows in line with new column semantics.
Pattern (idempotent — re-running is a no-op once the rows match):
execMigrationStep(ctx, pg, logger, &errs,
"v0.23.X backfill repos.languages from JSONB default",
`UPDATE aveloxis_data.repos
SET languages = '{}'::jsonb
WHERE languages IS NULL`)
Idempotency is mandatory
Migrations re-run on every aveloxis migrate invocation (operator may run it multiple times during a fix cycle). A backfill that ISN’T idempotent will either:
do duplicate work on each run (wasteful but harmless), OR
corrupt data on each run (catastrophic).
Make the WHERE clause exclude already-backfilled rows. Common patterns:
WHERE column IS NULL— only touch unset rows.WHERE column IS DISTINCT FROM target_value— handles NULL correctly (IS DISTINCT FROMreturns TRUE when one side is NULL and the other isn’t).WHERE column = old_default OR column = ''— only touch rows still at the pre-fix value.
The v0.21.2 backfill of cumulative last_commits is a good model:
UPDATE aveloxis_ops.collection_queue q
SET last_commits = sub.cnt
FROM (
SELECT repo_id, COUNT(DISTINCT cmt_commit_hash) AS cnt
FROM aveloxis_data.commits
GROUP BY repo_id
) sub
WHERE q.repo_id = sub.repo_id
AND q.last_commits IS DISTINCT FROM sub.cnt; -- the idempotency gate
Source-contract test
func TestMigrationBackfillsLanguages(t *testing.T) {
src, _ := os.ReadFile("migrate.go")
code := string(src)
if !strings.Contains(code, "v0.23.X backfill repos.languages") {
t.Error("backfill step missing from migrate.go")
}
if !strings.Contains(code, "IS NULL") {
t.Error("backfill must include an idempotency gate (WHERE ... IS NULL)")
}
}
Integration test
Test both the first-run effect and the idempotency:
func TestLanguagesBackfillIsIdempotent(t *testing.T) {
store := openTestStore(t) // applies migrations once
ctx := context.Background()
// Insert a row with NULL languages.
repoID := seedRepo(t, store, /* languages = */ nil)
// Run RunMigrations again — backfill should fire.
_ = RunMigrations(ctx, store, slog.New(slog.NewTextHandler(io.Discard, nil)))
var langs string
store.pool.QueryRow(ctx, `SELECT languages::text FROM aveloxis_data.repos WHERE repo_id = $1`, repoID).Scan(&langs)
if langs != "{}" {
t.Errorf("expected backfill to set languages to '{}', got %s", langs)
}
// Run a THIRD time. Should be a no-op.
_ = RunMigrations(ctx, store, slog.New(slog.NewTextHandler(io.Discard, nil)))
var langsAfter string
store.pool.QueryRow(ctx, `SELECT languages::text FROM aveloxis_data.repos WHERE repo_id = $1`, repoID).Scan(&langsAfter)
if langsAfter != "{}" {
t.Errorf("re-running migrations changed the value (should be idempotent), got %s", langsAfter)
}
}
Adding a foreign key
If the FK is on a column referencing aveloxis_data.contributors(cntrb_id), it needs to participate in the v0.22.1 cascade contract. See the existing cntrb_id_cascade.go and cntrb_id_fk_indexes.go — the test TestSchemaDeclaresDeferredOnCntrbIDFKs enforces the count (currently 17). Add your new FK to cntrbIDChildFKs AND bump the expected count in the test.
For other FK additions:
Use
DEFERRABLE INITIALLY DEFERREDif the inserts will happen in any order within a transaction (v0.22.7 made every FK deferrable for consistency).Add an index on the child column. Postgres doesn’t auto-index FK child columns; without an index, every DELETE on the parent triggers a full scan of the child table. v0.22.6 added 15 missing such indexes — don’t add the 16th.
Use
ON UPDATE CASCADE ON DELETE RESTRICTfor FKs to mutable identity keys (like cntrb_id which sometimes gets rewritten by the migration tools). Use plainON DELETE NO ACTIONfor FKs to immutable surrogate keys (most cases).
Migration ordering
Within RunMigrations, steps run in source order. Order matters when:
Step B depends on a column or table created by step A.
Step B’s backfill needs data shape established by step A’s column add.
The convention: order steps roughly by version (oldest at top, newest at bottom). Add a comment noting which version owns each step. The v0.22.x cntrb_id work is the canonical example — ensureOnUpdateCascadeOnCntrbIDFKs runs BEFORE ensureCntrbIDFKIndexes because cascade behavior is the load-bearing change, the indexes just make it fast (pinned by TestEnsureCntrbIDFKIndexesRunsAfterCascadeStep).
Materialized views
Materialized views are dm_repo_annual, dm_repo_monthly, dm_repo_weekly, plus group variants and Augur compatibility views. They’re declared in schema.sql AND have helper functions for refresh:
// Bulk refresh — used by the scheduler's weekly rebuild (default Saturday).
store.RefreshAllRepoAggregates(ctx)
// Per-repo — left in internal/db/aggregates.go for manual recalc, not on the hot path.
store.RefreshRepoAggregates(ctx, repoID)
If you add a new materialized view:
Declare it in
schema.sql(CREATE MATERIALIZED VIEW … WITH NO DATA so fresh installs land without blocking on a populated build).Add its name to the list in
CreateMaterializedViewsIfNotExist.Add it to
RefreshAllRepoAggregates.Test that
aveloxis refresh-viewsrebuilds it (integration test).
Operators rebuild matviews on the configured day (collection.matview_rebuild_day, default Saturday). Pre-existing matviews are NOT refreshed on every startup — that was a 2024 design decision to avoid multi-hour startup delays on fleet-scale databases.
The aveloxis migrate --skip-views flag
When iterating on a schema fix:
aveloxis migrate --skip-views # schema + columns + backfills, no matview rebuild
This was added in v0.19.5 specifically for the “iterate on a v0.19.4 schema-error fix” workflow — the matview rebuild adds significant wall-clock time per attempt, and the views are derived data anyway. Use this during dev; let operators decide when to refresh views in production.
Version-bump checklist for a schema-touching change
Add the column / index / FK / backfill to
schema.sql.Add the corresponding
addColumnIfMissing/execCreateIndexConcurrently/execMigrationSteptomigrate.go.Write source-contract tests for both files.
Write an integration test that runs RunMigrations + asserts the post-state.
Bump
internal/db/version.go.Add a
### Changes in vX.Y.Zsection toCLAUDE.mddocumenting:What was added and why.
What the migration does on existing deployments.
The operator command sequence to deploy (
aveloxis stop all; aveloxis migrate --skip-views; aveloxis start all).Any operational concern (long-running CONCURRENTLY index builds, fail-closed startup behavior, etc.).
Run
go test ./...AND the integration tier.(Optional but recommended) Run
aveloxis data-test --released-tag <prev> --repo <test-repo>to verify the new version doesn’t lose data vs the prior release. Seedocs/guide/data-test.md.
What v0.21.5 made explicit: who can run migrations
Only aveloxis serve (which migrates at startup) and the dedicated aveloxis migrate subcommand are allowed to call store.Migrate(ctx). Other CLIs (collect, add-repo, import-from-augur, add-key, etc.) used to defensively migrate; v0.21.5 removed those calls.
The reason: defensive migration meant aveloxis import-foundations --dry-run was triggering CREATE INDEX CONCURRENTLY rebuilds — surprising operators who expected dry-run to mean inspect-only. The current contract: if a column is missing, the operator must run aveloxis migrate first. Postgres’s “column does not exist” error is self-describing enough.
If you’re adding a new subcommand, do NOT call store.Migrate. The pre-existing test TestNonServerCommandsDoNotMigrate will catch it.
When NOT to write a migration
Don’t migrate when the change is purely in-process behavior:
Adding a new column to a Go struct that doesn’t map to a DB column.
Refactoring a function signature.
Adding a new config knob (those land in
aveloxis.json, not the DB).
But bump the version anyway — the version is the only way operators tell two binaries apart. See code-conventions.md.