Schema migrations

How to add, change, or backfill database state without losing operator data. The patterns here are the result of several production incidents; the rules are non-negotiable.

The two sources of truth

  1. internal/db/schema.sql — declarative CREATE TABLE / CREATE INDEX / INSERT statements. This is what gets applied to a brand-new database via pg.pool.Exec(ctx, schemaSQL) at the top of RunMigrations. Fresh installs land here directly.

  2. internal/db/migrate.go — imperative addColumnIfMissing, execMigrationStep, execCreateIndexConcurrently calls that bring existing deployments up to the current schema. Operators with a running database run aveloxis migrate after deploying a new binary.

Both must reflect the same final state. Adding a column to schema.sql without a corresponding addColumnIfMissing in migrate.go means existing deployments will be missing that column. Adding it to migrate.go without schema.sql means fresh installs will be missing it.

The fail-closed contract (v0.19.4)

Every schema-mutating call in migrate.go must surface errors. The historical bug: addColumnIfMissing used the discard pattern _, _ = pg.pool.Exec(...). ALTER TABLE failures (including typos like the v0.21.0 incident’s aveloxis_scan.scancode_scans.created_at referencing a column that doesn’t exist) silently failed, migrate logged "schema migrations complete", and operators discovered the gap weeks later when a feature queried the missing column.

The post-v0.19.4 contract:

func RunMigrations(ctx context.Context, store *PostgresStore, logger *slog.Logger) error {
    var errs []error
    pg := store

    // Base schema first.
    if _, err := pg.pool.Exec(ctx, schemaSQL); err != nil {
        errs = append(errs, fmt.Errorf("base schema: %w", err))
    }

    // Per-version idempotent steps. Each helper appends to `errs` on failure
    // and continues to the next step — so a single broken step surfaces every
    // OTHER step's failure too, instead of bailing on the first.
    addColumnIfMissing(ctx, pg, logger, &errs, "aveloxis_data", "repos",
        "scancode_failed_attempts", "INTEGER DEFAULT 0")
    execMigrationStep(ctx, pg, logger, &errs,
        "v0.23.5 example backfill",
        `UPDATE aveloxis_data.foo SET bar = COALESCE(bar, 0)`)
    execCreateIndexConcurrently(ctx, pg, logger, &errs,
        "aveloxis_data", "idx_foo_bar",
        `CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_foo_bar
            ON aveloxis_data.foo (bar) WHERE bar > 0`)

    if len(errs) > 0 {
        return fmt.Errorf("schema migration had %d error(s): %w",
            len(errs), errors.Join(errs...))
    }
    return nil
}

The aggregated error is what aveloxis serve returns from runServe → causes the process to exit non-zero. Operators see every failure at once, fix everything, re-run.

Never introduce a _, _ = pg.pool.Exec(...) in migrate.go. The pre-existing source-contract test TestRunMigrationsNoBareExecWithoutCheck fires the build if you do.

Adding a column

The pattern, idempotent across re-runs:

// internal/db/migrate.go — in RunMigrations
addColumnIfMissing(ctx, pg, logger, &errs, "aveloxis_data", "repos",
    "languages", "JSONB DEFAULT '{}'::jsonb")

And internal/db/schema.sql:

CREATE TABLE IF NOT EXISTS aveloxis_data.repos (
    repo_id          BIGSERIAL PRIMARY KEY,
    -- ... existing columns ...
    languages        JSONB DEFAULT '{}'::jsonb,
    -- ...
);

addColumnIfMissing issues ALTER TABLE ... ADD COLUMN IF NOT EXISTS ... so it’s a no-op on databases that already have the column. Set a non-NULL default if existing rows need a sensible value at migration time.

Source-contract test

Pin both sides:

func TestSchemaHasLanguagesColumn(t *testing.T) {
    schema, _ := os.ReadFile("schema.sql")
    if !strings.Contains(string(schema), "languages") {
        t.Error("schema.sql must declare repos.languages")
    }
}

func TestMigrateAddsLanguagesColumn(t *testing.T) {
    migrate, _ := os.ReadFile("migrate.go")
    if !strings.Contains(string(migrate), `"languages"`) {
        t.Error("migrate.go must add languages via addColumnIfMissing")
    }
}

Integration test (the v0.21.1 lesson)

Source-contract tests verify the code says what you wrote, not that it’s CORRECT against the actual schema. Add an integration test that runs the migration against a real DB:

//go:build integration  // gate on AVELOXIS_TEST_DB

func TestLanguagesColumnExistsAfterMigrate(t *testing.T) {
    store := openTestStore(t)  // applies RunMigrations
    ctx := context.Background()

    var dataType string
    err := store.pool.QueryRow(ctx, `
        SELECT data_type FROM information_schema.columns
        WHERE table_schema = 'aveloxis_data'
          AND table_name = 'repos'
          AND column_name = 'languages'`).Scan(&dataType)
    if err != nil {
        t.Fatalf("languages column not found post-migrate: %v", err)
    }
    if dataType != "jsonb" {
        t.Errorf("expected jsonb, got %s", dataType)
    }
}

Adding an index

For indexes on existing populated tables, always use CREATE INDEX CONCURRENTLY. A plain CREATE INDEX acquires ACCESS EXCLUSIVE for the duration of the build — on fleet-scale tables (5M+ rows) that’s minutes of blocked writes, which stalls every collection worker. CONCURRENTLY trades ~2x build time for not blocking writes.

execCreateIndexConcurrently(ctx, pg, logger, &errs,
    "aveloxis_data", "idx_repos_owner_name_trgm",
    `CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_repos_owner_name_trgm
        ON aveloxis_data.repos USING GIN ((repo_owner || '/' || repo_name) gin_trgm_ops)`)

The helper handles IF NOT EXISTS semantics AND the failure-recovery edge case where a prior interrupted CONCURRENTLY build left an INVALID index (cleaned up via DROP INDEX before retry). See v0.20.1.

For indexes in schema.sql (fresh installs), use CREATE INDEX IF NOT EXISTS ... — fresh installs run inside the base-schema transaction, so blocking-during-build isn’t a concern.

Adding a backfill

Backfill = a one-shot UPDATE that brings existing rows in line with new column semantics.

Pattern (idempotent — re-running is a no-op once the rows match):

execMigrationStep(ctx, pg, logger, &errs,
    "v0.23.X backfill repos.languages from JSONB default",
    `UPDATE aveloxis_data.repos
     SET languages = '{}'::jsonb
     WHERE languages IS NULL`)

Idempotency is mandatory

Migrations re-run on every aveloxis migrate invocation (operator may run it multiple times during a fix cycle). A backfill that ISN’T idempotent will either:

  • do duplicate work on each run (wasteful but harmless), OR

  • corrupt data on each run (catastrophic).

Make the WHERE clause exclude already-backfilled rows. Common patterns:

  • WHERE column IS NULL — only touch unset rows.

  • WHERE column IS DISTINCT FROM target_value — handles NULL correctly (IS DISTINCT FROM returns TRUE when one side is NULL and the other isn’t).

  • WHERE column = old_default OR column = '' — only touch rows still at the pre-fix value.

The v0.21.2 backfill of cumulative last_commits is a good model:

UPDATE aveloxis_ops.collection_queue q
SET last_commits = sub.cnt
FROM (
    SELECT repo_id, COUNT(DISTINCT cmt_commit_hash) AS cnt
    FROM aveloxis_data.commits
    GROUP BY repo_id
) sub
WHERE q.repo_id = sub.repo_id
  AND q.last_commits IS DISTINCT FROM sub.cnt;   -- the idempotency gate

Source-contract test

func TestMigrationBackfillsLanguages(t *testing.T) {
    src, _ := os.ReadFile("migrate.go")
    code := string(src)
    if !strings.Contains(code, "v0.23.X backfill repos.languages") {
        t.Error("backfill step missing from migrate.go")
    }
    if !strings.Contains(code, "IS NULL") {
        t.Error("backfill must include an idempotency gate (WHERE ... IS NULL)")
    }
}

Integration test

Test both the first-run effect and the idempotency:

func TestLanguagesBackfillIsIdempotent(t *testing.T) {
    store := openTestStore(t)  // applies migrations once
    ctx := context.Background()

    // Insert a row with NULL languages.
    repoID := seedRepo(t, store, /* languages = */ nil)

    // Run RunMigrations again — backfill should fire.
    _ = RunMigrations(ctx, store, slog.New(slog.NewTextHandler(io.Discard, nil)))

    var langs string
    store.pool.QueryRow(ctx, `SELECT languages::text FROM aveloxis_data.repos WHERE repo_id = $1`, repoID).Scan(&langs)
    if langs != "{}" {
        t.Errorf("expected backfill to set languages to '{}', got %s", langs)
    }

    // Run a THIRD time. Should be a no-op.
    _ = RunMigrations(ctx, store, slog.New(slog.NewTextHandler(io.Discard, nil)))

    var langsAfter string
    store.pool.QueryRow(ctx, `SELECT languages::text FROM aveloxis_data.repos WHERE repo_id = $1`, repoID).Scan(&langsAfter)
    if langsAfter != "{}" {
        t.Errorf("re-running migrations changed the value (should be idempotent), got %s", langsAfter)
    }
}

Adding a foreign key

If the FK is on a column referencing aveloxis_data.contributors(cntrb_id), it needs to participate in the v0.22.1 cascade contract. See the existing cntrb_id_cascade.go and cntrb_id_fk_indexes.go — the test TestSchemaDeclaresDeferredOnCntrbIDFKs enforces the count (currently 17). Add your new FK to cntrbIDChildFKs AND bump the expected count in the test.

For other FK additions:

  • Use DEFERRABLE INITIALLY DEFERRED if the inserts will happen in any order within a transaction (v0.22.7 made every FK deferrable for consistency).

  • Add an index on the child column. Postgres doesn’t auto-index FK child columns; without an index, every DELETE on the parent triggers a full scan of the child table. v0.22.6 added 15 missing such indexes — don’t add the 16th.

  • Use ON UPDATE CASCADE ON DELETE RESTRICT for FKs to mutable identity keys (like cntrb_id which sometimes gets rewritten by the migration tools). Use plain ON DELETE NO ACTION for FKs to immutable surrogate keys (most cases).

Migration ordering

Within RunMigrations, steps run in source order. Order matters when:

  • Step B depends on a column or table created by step A.

  • Step B’s backfill needs data shape established by step A’s column add.

The convention: order steps roughly by version (oldest at top, newest at bottom). Add a comment noting which version owns each step. The v0.22.x cntrb_id work is the canonical example — ensureOnUpdateCascadeOnCntrbIDFKs runs BEFORE ensureCntrbIDFKIndexes because cascade behavior is the load-bearing change, the indexes just make it fast (pinned by TestEnsureCntrbIDFKIndexesRunsAfterCascadeStep).

Materialized views

Materialized views are dm_repo_annual, dm_repo_monthly, dm_repo_weekly, plus group variants and Augur compatibility views. They’re declared in schema.sql AND have helper functions for refresh:

// Bulk refresh — used by the scheduler's weekly rebuild (default Saturday).
store.RefreshAllRepoAggregates(ctx)

// Per-repo — left in internal/db/aggregates.go for manual recalc, not on the hot path.
store.RefreshRepoAggregates(ctx, repoID)

If you add a new materialized view:

  1. Declare it in schema.sql (CREATE MATERIALIZED VIEW … WITH NO DATA so fresh installs land without blocking on a populated build).

  2. Add its name to the list in CreateMaterializedViewsIfNotExist.

  3. Add it to RefreshAllRepoAggregates.

  4. Test that aveloxis refresh-views rebuilds it (integration test).

Operators rebuild matviews on the configured day (collection.matview_rebuild_day, default Saturday). Pre-existing matviews are NOT refreshed on every startup — that was a 2024 design decision to avoid multi-hour startup delays on fleet-scale databases.

The aveloxis migrate --skip-views flag

When iterating on a schema fix:

aveloxis migrate --skip-views   # schema + columns + backfills, no matview rebuild

This was added in v0.19.5 specifically for the “iterate on a v0.19.4 schema-error fix” workflow — the matview rebuild adds significant wall-clock time per attempt, and the views are derived data anyway. Use this during dev; let operators decide when to refresh views in production.

Version-bump checklist for a schema-touching change

  1. Add the column / index / FK / backfill to schema.sql.

  2. Add the corresponding addColumnIfMissing / execCreateIndexConcurrently / execMigrationStep to migrate.go.

  3. Write source-contract tests for both files.

  4. Write an integration test that runs RunMigrations + asserts the post-state.

  5. Bump internal/db/version.go.

  6. Add a ### Changes in vX.Y.Z section to CLAUDE.md documenting:

    • What was added and why.

    • What the migration does on existing deployments.

    • The operator command sequence to deploy (aveloxis stop all; aveloxis migrate --skip-views; aveloxis start all).

    • Any operational concern (long-running CONCURRENTLY index builds, fail-closed startup behavior, etc.).

  7. Run go test ./... AND the integration tier.

  8. (Optional but recommended) Run aveloxis data-test --released-tag <prev> --repo <test-repo> to verify the new version doesn’t lose data vs the prior release. See docs/guide/data-test.md.

What v0.21.5 made explicit: who can run migrations

Only aveloxis serve (which migrates at startup) and the dedicated aveloxis migrate subcommand are allowed to call store.Migrate(ctx). Other CLIs (collect, add-repo, import-from-augur, add-key, etc.) used to defensively migrate; v0.21.5 removed those calls.

The reason: defensive migration meant aveloxis import-foundations --dry-run was triggering CREATE INDEX CONCURRENTLY rebuilds — surprising operators who expected dry-run to mean inspect-only. The current contract: if a column is missing, the operator must run aveloxis migrate first. Postgres’s “column does not exist” error is self-describing enough.

If you’re adding a new subcommand, do NOT call store.Migrate. The pre-existing test TestNonServerCommandsDoNotMigrate will catch it.

When NOT to write a migration

Don’t migrate when the change is purely in-process behavior:

  • Adding a new column to a Go struct that doesn’t map to a DB column.

  • Refactoring a function signature.

  • Adding a new config knob (those land in aveloxis.json, not the DB).

But bump the version anyway — the version is the only way operators tell two binaries apart. See code-conventions.md.