<!--
SPDX-FileCopyrightText: 2026 Sean Goggins, University of Missouri, Derek Howard
SPDX-License-Identifier: MIT
-->

# Schema migrations

How to add, change, or backfill database state without losing operator data. The patterns here are the result of several production incidents; the rules are non-negotiable.

## The two sources of truth

1. **`internal/db/schema.sql`** — declarative CREATE TABLE / CREATE INDEX / INSERT statements. This is what gets applied to a brand-new database via `pg.pool.Exec(ctx, schemaSQL)` at the top of `RunMigrations`. Fresh installs land here directly.

2. **`internal/db/migrate.go`** — imperative `addColumnIfMissing`, `execMigrationStep`, `execCreateIndexConcurrently` calls that bring existing deployments up to the current schema. Operators with a running database run `aveloxis migrate` after deploying a new binary.

**Both must reflect the same final state.** Adding a column to `schema.sql` without a corresponding `addColumnIfMissing` in `migrate.go` means existing deployments will be missing that column. Adding it to `migrate.go` without `schema.sql` means fresh installs will be missing it.

## The fail-closed contract (v0.19.4)

Every schema-mutating call in `migrate.go` must surface errors. The historical bug: `addColumnIfMissing` used the discard pattern `_, _ = pg.pool.Exec(...)`. ALTER TABLE failures (including typos like the v0.21.0 incident's `aveloxis_scan.scancode_scans.created_at` referencing a column that doesn't exist) silently failed, migrate logged `"schema migrations complete"`, and operators discovered the gap weeks later when a feature queried the missing column.

The post-v0.19.4 contract:

```go
func RunMigrations(ctx context.Context, store *PostgresStore, logger *slog.Logger) error {
    var errs []error
    pg := store

    // Base schema first.
    if _, err := pg.pool.Exec(ctx, schemaSQL); err != nil {
        errs = append(errs, fmt.Errorf("base schema: %w", err))
    }

    // Per-version idempotent steps. Each helper appends to `errs` on failure
    // and continues to the next step — so a single broken step surfaces every
    // OTHER step's failure too, instead of bailing on the first.
    addColumnIfMissing(ctx, pg, logger, &errs, "aveloxis_data", "repos",
        "scancode_failed_attempts", "INTEGER DEFAULT 0")
    execMigrationStep(ctx, pg, logger, &errs,
        "v0.23.5 example backfill",
        `UPDATE aveloxis_data.foo SET bar = COALESCE(bar, 0)`)
    execCreateIndexConcurrently(ctx, pg, logger, &errs,
        "aveloxis_data", "idx_foo_bar",
        `CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_foo_bar
            ON aveloxis_data.foo (bar) WHERE bar > 0`)

    if len(errs) > 0 {
        return fmt.Errorf("schema migration had %d error(s): %w",
            len(errs), errors.Join(errs...))
    }
    return nil
}
```

The aggregated error is what `aveloxis serve` returns from `runServe` → causes the process to exit non-zero. Operators see every failure at once, fix everything, re-run.

**Never** introduce a `_, _ = pg.pool.Exec(...)` in `migrate.go`. The pre-existing source-contract test `TestRunMigrationsNoBareExecWithoutCheck` fires the build if you do.

## Adding a column

The pattern, idempotent across re-runs:

```go
// internal/db/migrate.go — in RunMigrations
addColumnIfMissing(ctx, pg, logger, &errs, "aveloxis_data", "repos",
    "languages", "JSONB DEFAULT '{}'::jsonb")
```

And `internal/db/schema.sql`:

```sql
CREATE TABLE IF NOT EXISTS aveloxis_data.repos (
    repo_id          BIGSERIAL PRIMARY KEY,
    -- ... existing columns ...
    languages        JSONB DEFAULT '{}'::jsonb,
    -- ...
);
```

`addColumnIfMissing` issues `ALTER TABLE ... ADD COLUMN IF NOT EXISTS ...` so it's a no-op on databases that already have the column. Set a non-NULL default if existing rows need a sensible value at migration time.

### Source-contract test

Pin both sides:

```go
func TestSchemaHasLanguagesColumn(t *testing.T) {
    schema, _ := os.ReadFile("schema.sql")
    if !strings.Contains(string(schema), "languages") {
        t.Error("schema.sql must declare repos.languages")
    }
}

func TestMigrateAddsLanguagesColumn(t *testing.T) {
    migrate, _ := os.ReadFile("migrate.go")
    if !strings.Contains(string(migrate), `"languages"`) {
        t.Error("migrate.go must add languages via addColumnIfMissing")
    }
}
```

### Integration test (the v0.21.1 lesson)

Source-contract tests verify the code says what you wrote, not that it's CORRECT against the actual schema. Add an integration test that runs the migration against a real DB:

```go
//go:build integration  // gate on AVELOXIS_TEST_DB

func TestLanguagesColumnExistsAfterMigrate(t *testing.T) {
    store := openTestStore(t)  // applies RunMigrations
    ctx := context.Background()

    var dataType string
    err := store.pool.QueryRow(ctx, `
        SELECT data_type FROM information_schema.columns
        WHERE table_schema = 'aveloxis_data'
          AND table_name = 'repos'
          AND column_name = 'languages'`).Scan(&dataType)
    if err != nil {
        t.Fatalf("languages column not found post-migrate: %v", err)
    }
    if dataType != "jsonb" {
        t.Errorf("expected jsonb, got %s", dataType)
    }
}
```

## Adding an index

For indexes on existing populated tables, **always use `CREATE INDEX CONCURRENTLY`**. A plain `CREATE INDEX` acquires `ACCESS EXCLUSIVE` for the duration of the build — on fleet-scale tables (5M+ rows) that's minutes of blocked writes, which stalls every collection worker. CONCURRENTLY trades ~2x build time for not blocking writes.

```go
execCreateIndexConcurrently(ctx, pg, logger, &errs,
    "aveloxis_data", "idx_repos_owner_name_trgm",
    `CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_repos_owner_name_trgm
        ON aveloxis_data.repos USING GIN ((repo_owner || '/' || repo_name) gin_trgm_ops)`)
```

The helper handles `IF NOT EXISTS` semantics AND the failure-recovery edge case where a prior interrupted CONCURRENTLY build left an INVALID index (cleaned up via `DROP INDEX` before retry). See v0.20.1.

For indexes in `schema.sql` (fresh installs), use `CREATE INDEX IF NOT EXISTS ...` — fresh installs run inside the base-schema transaction, so blocking-during-build isn't a concern.

## Adding a backfill

Backfill = a one-shot UPDATE that brings existing rows in line with new column semantics.

Pattern (idempotent — re-running is a no-op once the rows match):

```go
execMigrationStep(ctx, pg, logger, &errs,
    "v0.23.X backfill repos.languages from JSONB default",
    `UPDATE aveloxis_data.repos
     SET languages = '{}'::jsonb
     WHERE languages IS NULL`)
```

### Idempotency is mandatory

Migrations re-run on every `aveloxis migrate` invocation (operator may run it multiple times during a fix cycle). A backfill that ISN'T idempotent will either:

- do duplicate work on each run (wasteful but harmless), OR
- corrupt data on each run (catastrophic).

Make the WHERE clause exclude already-backfilled rows. Common patterns:

- `WHERE column IS NULL` — only touch unset rows.
- `WHERE column IS DISTINCT FROM target_value` — handles NULL correctly (`IS DISTINCT FROM` returns TRUE when one side is NULL and the other isn't).
- `WHERE column = old_default OR column = ''` — only touch rows still at the pre-fix value.

The v0.21.2 backfill of cumulative `last_commits` is a good model:

```sql
UPDATE aveloxis_ops.collection_queue q
SET last_commits = sub.cnt
FROM (
    SELECT repo_id, COUNT(DISTINCT cmt_commit_hash) AS cnt
    FROM aveloxis_data.commits
    GROUP BY repo_id
) sub
WHERE q.repo_id = sub.repo_id
  AND q.last_commits IS DISTINCT FROM sub.cnt;   -- the idempotency gate
```

### Source-contract test

```go
func TestMigrationBackfillsLanguages(t *testing.T) {
    src, _ := os.ReadFile("migrate.go")
    code := string(src)
    if !strings.Contains(code, "v0.23.X backfill repos.languages") {
        t.Error("backfill step missing from migrate.go")
    }
    if !strings.Contains(code, "IS NULL") {
        t.Error("backfill must include an idempotency gate (WHERE ... IS NULL)")
    }
}
```

### Integration test

Test both the first-run effect and the idempotency:

```go
func TestLanguagesBackfillIsIdempotent(t *testing.T) {
    store := openTestStore(t)  // applies migrations once
    ctx := context.Background()

    // Insert a row with NULL languages.
    repoID := seedRepo(t, store, /* languages = */ nil)

    // Run RunMigrations again — backfill should fire.
    _ = RunMigrations(ctx, store, slog.New(slog.NewTextHandler(io.Discard, nil)))

    var langs string
    store.pool.QueryRow(ctx, `SELECT languages::text FROM aveloxis_data.repos WHERE repo_id = $1`, repoID).Scan(&langs)
    if langs != "{}" {
        t.Errorf("expected backfill to set languages to '{}', got %s", langs)
    }

    // Run a THIRD time. Should be a no-op.
    _ = RunMigrations(ctx, store, slog.New(slog.NewTextHandler(io.Discard, nil)))

    var langsAfter string
    store.pool.QueryRow(ctx, `SELECT languages::text FROM aveloxis_data.repos WHERE repo_id = $1`, repoID).Scan(&langsAfter)
    if langsAfter != "{}" {
        t.Errorf("re-running migrations changed the value (should be idempotent), got %s", langsAfter)
    }
}
```

## Adding a foreign key

If the FK is on a column referencing `aveloxis_data.contributors(cntrb_id)`, it needs to participate in the v0.22.1 cascade contract. See the existing `cntrb_id_cascade.go` and `cntrb_id_fk_indexes.go` — the test `TestSchemaDeclaresDeferredOnCntrbIDFKs` enforces the count (currently 17). Add your new FK to `cntrbIDChildFKs` AND bump the expected count in the test.

For other FK additions:

- Use `DEFERRABLE INITIALLY DEFERRED` if the inserts will happen in any order within a transaction (v0.22.7 made every FK deferrable for consistency).
- Add an index on the child column. Postgres doesn't auto-index FK child columns; without an index, every DELETE on the parent triggers a full scan of the child table. v0.22.6 added 15 missing such indexes — don't add the 16th.
- Use `ON UPDATE CASCADE ON DELETE RESTRICT` for FKs to mutable identity keys (like cntrb_id which sometimes gets rewritten by the migration tools). Use plain `ON DELETE NO ACTION` for FKs to immutable surrogate keys (most cases).

## Migration ordering

Within `RunMigrations`, steps run in source order. Order matters when:

- Step B depends on a column or table created by step A.
- Step B's backfill needs data shape established by step A's column add.

The convention: order steps roughly by version (oldest at top, newest at bottom). Add a comment noting which version owns each step. The v0.22.x cntrb_id work is the canonical example — `ensureOnUpdateCascadeOnCntrbIDFKs` runs BEFORE `ensureCntrbIDFKIndexes` because cascade behavior is the load-bearing change, the indexes just make it fast (pinned by `TestEnsureCntrbIDFKIndexesRunsAfterCascadeStep`).

## Materialized views

Materialized views are `dm_repo_annual`, `dm_repo_monthly`, `dm_repo_weekly`, plus group variants and Augur compatibility views. They're declared in `schema.sql` AND have helper functions for refresh:

```go
// Bulk refresh — used by the scheduler's weekly rebuild (default Saturday).
store.RefreshAllRepoAggregates(ctx)

// Per-repo — left in internal/db/aggregates.go for manual recalc, not on the hot path.
store.RefreshRepoAggregates(ctx, repoID)
```

If you add a new materialized view:

1. Declare it in `schema.sql` (CREATE MATERIALIZED VIEW ... WITH NO DATA so fresh installs land without blocking on a populated build).
2. Add its name to the list in `CreateMaterializedViewsIfNotExist`.
3. Add it to `RefreshAllRepoAggregates`.
4. Test that `aveloxis refresh-views` rebuilds it (integration test).

Operators rebuild matviews on the configured day (`collection.matview_rebuild_day`, default Saturday). Pre-existing matviews are NOT refreshed on every startup — that was a 2024 design decision to avoid multi-hour startup delays on fleet-scale databases.

## The `aveloxis migrate --skip-views` flag

When iterating on a schema fix:

```bash
aveloxis migrate --skip-views   # schema + columns + backfills, no matview rebuild
```

This was added in v0.19.5 specifically for the "iterate on a v0.19.4 schema-error fix" workflow — the matview rebuild adds significant wall-clock time per attempt, and the views are derived data anyway. Use this during dev; let operators decide when to refresh views in production.

## Version-bump checklist for a schema-touching change

1. Add the column / index / FK / backfill to `schema.sql`.
2. Add the corresponding `addColumnIfMissing` / `execCreateIndexConcurrently` / `execMigrationStep` to `migrate.go`.
3. Write source-contract tests for both files.
4. Write an integration test that runs RunMigrations + asserts the post-state.
5. Bump `internal/db/version.go`.
6. Add a `### Changes in vX.Y.Z` section to `CLAUDE.md` documenting:
   - What was added and why.
   - What the migration does on existing deployments.
   - The operator command sequence to deploy (`aveloxis stop all; aveloxis migrate --skip-views; aveloxis start all`).
   - Any operational concern (long-running CONCURRENTLY index builds, fail-closed startup behavior, etc.).
7. Run `go test ./...` AND the integration tier.
8. (Optional but recommended) Run `aveloxis data-test --released-tag <prev> --repo <test-repo>` to verify the new version doesn't lose data vs the prior release. See [`docs/guide/data-test.md`](../guide/data-test.md).

## What v0.21.5 made explicit: who can run migrations

Only `aveloxis serve` (which migrates at startup) and the dedicated `aveloxis migrate` subcommand are allowed to call `store.Migrate(ctx)`. Other CLIs (`collect`, `add-repo`, `import-from-augur`, `add-key`, etc.) used to defensively migrate; v0.21.5 removed those calls.

The reason: defensive migration meant `aveloxis import-foundations --dry-run` was triggering `CREATE INDEX CONCURRENTLY` rebuilds — surprising operators who expected dry-run to mean inspect-only. The current contract: if a column is missing, the operator must run `aveloxis migrate` first. Postgres's "column does not exist" error is self-describing enough.

If you're adding a new subcommand, do NOT call `store.Migrate`. The pre-existing test `TestNonServerCommandsDoNotMigrate` will catch it.

## When NOT to write a migration

Don't migrate when the change is purely in-process behavior:

- Adding a new column to a Go struct that doesn't map to a DB column.
- Refactoring a function signature.
- Adding a new config knob (those land in `aveloxis.json`, not the DB).

But bump the version anyway — the version is the only way operators tell two binaries apart. See [`code-conventions.md`](code-conventions.md).