Allow gateway to boot with only a Postgres connection by AntoineToussaint · Pull Request #7361 · tensorzero/tensorzero · GitHub

Allow gateway to boot with only a Postgres connection #7361

Open
AntoineToussaint wants to merge 10 commits into main from db-config-boot


AntoineToussaint commented Apr 23, 2026

Summary

  • Gateway now falls through to the DB-authoritative config load path when TENSORZERO_POSTGRES_URL is set, even without ENABLE_CONFIG_IN_DATABASE. An empty, freshly-migrated database is a valid starting point: every singleton defaults, every collection is empty.
  • First step toward a zero-config deploy: the operator supplies a Postgres URL and populates functions/variants/models via REST (or the UI, once #7275, "Allow adding/editing files not referenced by config", and #7285, "Graceful config loading degradation", land).
  • Review-driven hardening: a StartupConfig struct replaces the positional (_, _, bool) tuple; Option<&str> is threaded into load_startup_config_from_database so the env var is read once; an empty env var is filtered to None, so a shell/compose misconfiguration produces the clear "no config source" error instead of an opaque sqlx dial failure; and a WARN log fires when the gateway takes the implicit DB-boot path, because many deployments set the env var for observability/rate-limiting without intending DB config.
  • Adds the 5th and 6th live-tests flavors: a db-only-boot-e2e matrix over database: [clickhouse, postgres]. Each run uses a migrated DB with no config rows and no files on disk; the gateway boots and the REST endpoints return the defaulted config.
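
The fall-through described in the summary can be sketched roughly as follows. This is a minimal illustration, not the gateway's code: `ConfigSource`, `choose_config_source`, the parameter names, and the error text are all hypothetical, and the sketch assumes `--config-file` and `--default-config` take precedence over the database path (how the feature flag interacts with an explicit config file is not covered here).

```rust
/// Where the gateway would load its startup config from.
/// Hypothetical type for illustration; not taken from the PR's code.
#[derive(Debug, PartialEq)]
pub enum ConfigSource {
    File(String),
    BuiltInDefault,
    /// `implicit` is true when the DB path was chosen only because the
    /// Postgres URL was set, i.e. without the feature flag.
    Database { url: String, implicit: bool },
}

/// Pick a config source from the inputs the PR description mentions.
/// Assumption: file and default-config win over the database path.
pub fn choose_config_source(
    config_file: Option<&str>,
    default_config: bool,
    config_in_database_flag: bool,
    postgres_url: Option<&str>,
) -> Result<ConfigSource, String> {
    if let Some(path) = config_file {
        return Ok(ConfigSource::File(path.to_owned()));
    }
    if default_config {
        return Ok(ConfigSource::BuiltInDefault);
    }
    // An empty URL is treated as absent, so a blank env var ends up in the
    // "no config source" branch instead of a failed connection attempt.
    match postgres_url.filter(|url| !url.is_empty()) {
        Some(url) => Ok(ConfigSource::Database {
            url: url.to_owned(),
            // A caller would log at WARN level when this is true.
            implicit: !config_in_database_flag,
        }),
        // Placeholder wording; the real error message is not quoted in the PR.
        None => Err("no config source".to_owned()),
    }
}
```

With only a Postgres URL supplied, the sketch returns `ConfigSource::Database { implicit: true, .. }`, which is the case the WARN log is meant to surface.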

Test plan

  • Unit-ish: load_config_from_db_returns_defaults_on_empty_database (uses matches_pattern! field-by-field against UninitializedConfig) — passes locally.
  • E2E: db_only_boot::db_only_boot_serves_status_with_defaulted_config asserts /status returns StatusResponse { status: "ok", version: $VERSION, config_hash: non-empty }.
  • E2E: db_only_boot::db_only_boot_returns_default_config_via_config_toml_endpoint asserts /internal/config_toml returns the same hash /status reported, with empty path_contents and a TOML body that parses back as a valid table. The hash-equivalence check is the load-bearing contract (the UI depends on it).
  • CI: both db-only-boot-e2e (database: clickhouse) and db-only-boot-e2e (database: postgres) green. Existing --config-file and live-tests-config-in-database paths unchanged.
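
The contract the two E2E tests assert can be summarized in a small, std-only sketch. The field names `status`, `version`, `config_hash`, and `path_contents` come from the test plan above; the `toml` and `hash` fields on `ConfigTomlResponse`, the field types, and the function name are assumptions made for illustration. The real tests use googletest matchers and also parse the TOML body, which this sketch omits.

```rust
use std::collections::BTreeMap;

/// Shape of `/status` as described in the test plan (field types assumed).
pub struct StatusResponse {
    pub status: String,
    pub version: String,
    pub config_hash: String,
}

/// Hypothetical shape of `/internal/config_toml`; only `path_contents`
/// is named in the PR text.
pub struct ConfigTomlResponse {
    pub toml: String,
    pub hash: String,
    pub path_contents: BTreeMap<String, String>,
}

/// Check the properties the DB-only boot tests rely on.
pub fn check_db_only_boot_contract(
    status: &StatusResponse,
    config: &ConfigTomlResponse,
) -> Result<(), String> {
    if status.status != "ok" {
        return Err(format!("unexpected status: {}", status.status));
    }
    if status.config_hash.is_empty() {
        return Err("config_hash is empty".to_owned());
    }
    // The hash-equivalence check: both endpoints must describe the same config.
    if config.hash != status.config_hash {
        return Err("config_toml hash does not match /status".to_owned());
    }
    // An empty database means no user-provided templates.
    if !config.path_contents.is_empty() {
        return Err("path_contents should be empty".to_owned());
    }
    Ok(())
}
```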

What's NOT in this PR

  • REST config-write e2e coverage. This PR exercises the read roundtrip; write roundtrip coverage lives in the existing config-editing profile.
  • UI-side zero-config deploy story. Needs #7275 (Allow adding/editing files not referenced by config) and #7285 (Graceful config loading degradation) first.
  • Refactor of the CI matrix / config-hash model. See in-thread discussion — out of scope for the boot path, tracked for a future pass.

🤖 Generated with Claude Code

When `--config-file` and `--default-config` are both absent, the gateway
now falls through to the DB-authoritative load path whenever
`TENSORZERO_POSTGRES_URL` is set. Previously this required explicit
opt-in via the `ENABLE_CONFIG_IN_DATABASE` feature flag.

An empty database is a valid starting point: every singleton falls back
to its default and every collection is empty, so the gateway serves a
functional runtime with zero user config. This is the first step toward
a "zero-config deploy": the operator provides a database URL and
populates functions, variants, and models through REST endpoints.

Also adds an empty-database smoke test for `load_config_from_db`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AntoineToussaint and others added 9 commits April 23, 2026 14:45
Apply review feedback on the startup-config-from-Postgres fallback:

- Treat an empty `TENSORZERO_POSTGRES_URL` as absent so a shell/compose
  misconfiguration produces the clear "no config source" error instead
  of an opaque sqlx dial failure.
- Read the env var once and thread the `Option<String>` into
  `load_startup_config_from_database`, eliminating the double read.
- Log a prominent `WARN` when falling through to the implicit DB path
  (env var set, no feature flag, no `--config-file`) so operators see
  the fallback in startup logs. Many deployments set the env var for
  observability/rate-limiting without intending DB-config boot.
- Replace the positional `(…, …, bool /* config_in_database */)` tuple
  with a `StartupConfig` struct so callers don't rely on an
  inline-comment-documented bool.
- Introduce a `TENSORZERO_POSTGRES_URL_ENV` constant for the two new
  call sites in this file.
- Rewrite the empty-DB smoke test with `expect_that!` + `matches_pattern!`
  per `AGENTS.md` guidance, giving per-field failure diagnostics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
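
To illustrate the tuple-to-struct change in the commit above, here is a minimal sketch. Only the struct name `StartupConfig` and the meaning of the bool (`config_in_database`) come from the commit message; the generic parameters and the `config` and `extra` field names are placeholders, since the real fields are not shown.

```rust
/// Before: the third element's meaning lives only in an inline comment.
pub type StartupTuple<C, E> = (C, E, bool /* config_in_database */);

/// After: the flag has a name at every use site.
/// `config` and `extra` are placeholder fields (assumption).
#[derive(Debug, PartialEq)]
pub struct StartupConfig<C, E> {
    pub config: C,
    pub extra: E,
    pub config_in_database: bool,
}

impl<C, E> From<StartupTuple<C, E>> for StartupConfig<C, E> {
    fn from((config, extra, config_in_database): StartupTuple<C, E>) -> Self {
        StartupConfig { config, extra, config_in_database }
    }
}
```

A caller then reads `startup.config_in_database` instead of destructuring by position, so reordering or adding fields cannot silently swap values.
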
`expect_that!` needs a `#[gtest]` test context to collect failures; the
`#[sqlx::test]` macro doesn't provide one, so using it here panics with
"No test context found" instead of running the assertion. Switch to
`assert_that!`, which works without the gtest context and matches the
convention used by every other test in this file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawns the actual gateway binary with `TENSORZERO_POSTGRES_URL` set
against a migrated Postgres and nothing else (no `--config-file`, no
`--default-config`, no `ENABLE_CONFIG_IN_DATABASE` feature flag) and
verifies the gateway binds a port, serves a healthy `/health`, and
returns a well-formed `StatusResponse` from `/status`. This is the
end-to-end counterpart to the unit-level empty-DB test on
`load_config_from_db`: that one proves the loader returns defaults,
this one proves the full binary actually reaches listening state and
answers HTTP with that defaulted config.

Also factors the "wait for listening + parse bound addr + build
ChildData" tail of `start_gateway_impl` into a shared
`await_gateway_listening` helper so the new
`start_gateway_from_db_url_on_random_port` helper doesn't duplicate it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the new integration test from "does the gateway serve /health"
to the full config-in-database scenario the UI will build on top of:
migrated Postgres, no config rows, no `--config-file`, feature flag
on, then assert:

- `/health` 200
- `/status` returns `ok` + a non-empty `config_hash`
- `/internal/config_toml` returns a default editable TOML whose hash
  matches `/status`, and whose `path_contents` is empty (no
  user-provided templates)
- The TOML body parses as a valid TOML table

The helper `start_gateway_from_db_url_on_random_port` now takes an
`extra_env` slice so callers can either exercise the implicit-opt-in
path (env var only) or the full config-in-database scenario (feature
flag on) without duplicating the subprocess plumbing. Adds `toml` to
gateway dev-dependencies for assertion-side parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
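
The `extra_env` idea from the commit above can be sketched with `std::process::Command`. The function name, the binary path parameter, and the exact signature are assumptions; the real helper also waits for the gateway to listen, which is left out here.

```rust
use std::process::Command;

pub const TENSORZERO_POSTGRES_URL_ENV: &str = "TENSORZERO_POSTGRES_URL";

/// Build (but do not spawn) a gateway command that has a Postgres URL and
/// whatever extra environment the caller wants layered on top.
/// Hypothetical helper, not the one in the PR.
pub fn gateway_command(
    gateway_bin: &str,
    postgres_url: &str,
    extra_env: &[(&str, &str)],
) -> Command {
    let mut cmd = Command::new(gateway_bin);
    cmd.env(TENSORZERO_POSTGRES_URL_ENV, postgres_url);
    // An empty slice exercises the implicit path (env var only); a caller
    // can pass a feature-flag variable to exercise the explicit path.
    cmd.envs(extra_env.iter().copied());
    cmd
}
```

Passing `&[]` versus a slice containing the feature-flag variable lets one test body cover both boot paths without duplicating the subprocess setup.
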
Adds a dedicated e2e scenario, parallel to `live-tests`,
`live-tests-config-in-database`, `evaluation-tests`, and the existing
live flavors: gateway booting from a migrated-but-empty
Postgres + ClickHouse stack with `ENABLE_CONFIG_IN_DATABASE=true` and
no `--config-file`. This is the deploy shape the configure-via-UI
story builds on — schema present, no config rows, no files on disk.

New pieces, all mirroring the existing config-in-database pattern:

- `crates/tensorzero-core/tests/e2e/docker-compose.db-only-boot.yml`:
  override of `docker-compose.live.yml` that drops
  `gateway-migrate-config`, flips the feature flag, clears
  `--config-file`, and uses `!override` on `volumes` to remove every
  inherited bind mount (config TOMLs, fixtures, credentials) — so the
  gateway literally has nothing on disk to read.
- `crates/tensorzero-core/tests/e2e/db_only_boot/mod.rs`: two
  `#[gtest] #[tokio::test]` Rust tests that run inside the live-tests
  container and hit the gateway over the compose network: one asserts
  `/status` reports the default config and a non-empty hash, the other
  asserts `/internal/config_toml` returns the same hash with empty
  `path_contents` and a TOML body that parses back as a valid table.
- `crates/.config/nextest.toml`: new `db-only-boot` profile filtering
  to `db_only_boot::` tests, and `e2e`'s `default-filter` excludes
  them so they only run in their own CI job.
- `.github/workflows/db-only-boot-e2e.yml`: new reusable workflow
  standing up the stack, running the profile inside `live-tests`, and
  asserting the gateway logs show the DB-authoritative boot banner.
- `.github/workflows/general.yml`: wires the new job behind
  `detect-changes.outputs.code`; `ci/check-all-general-jobs-passed.sh`
  adds it to `ALLOWED_SKIP` so the merge queue tolerates skipped runs.

Also drops the subprocess-spawning `crates/gateway/tests/boot_from_empty_db.rs`
and its helper additions in `gateway/tests/common/mod.rs` and
`gateway/Cargo.toml` — superseded by the in-container Rust test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four small cleanups from the branch review:

- `load_startup_config_from_database` takes `Option<&str>` instead of
  `Option<String>` — the function never owned the url; caller now
  passes `postgres_url.as_deref()`.
- Consolidate `UnwrittenConfig` import into the existing
  `use tensorzero_core::config::{...}` block and drop the two inline
  long-form paths, per AGENTS.md.
- Fold the three separate `expect_that!` calls on `StatusResponse`
  into a single `matches_pattern!` — if the struct gains a field, the
  test now makes a conscious choice instead of silently ignoring it.
- Replace `toml::from_str(...).unwrap_or_else(|e| panic!(...))` with
  `assert_that!(parsed, ok(predicate(toml::Value::is_table)))` so
  success + the "is-a-table" check collapse to one googletest
  assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
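
The `Option<String>` to `Option<&str>` change in the commit above follows a common Rust pattern: the caller owns the value read from the environment and lends it out with `as_deref()`. A minimal sketch, where `non_empty`, `read_postgres_url`, and `connect_target` are hypothetical stand-ins (the real `load_startup_config_from_database` does far more than return the URL):

```rust
use std::env;

pub const TENSORZERO_POSTGRES_URL_ENV: &str = "TENSORZERO_POSTGRES_URL";

/// Treat an empty value as absent.
pub fn non_empty(raw: Option<String>) -> Option<String> {
    raw.filter(|value| !value.is_empty())
}

/// Read the env var exactly once; the caller keeps ownership of the String.
pub fn read_postgres_url() -> Option<String> {
    non_empty(env::var(TENSORZERO_POSTGRES_URL_ENV).ok())
}

/// Stand-in for the loader: it only needs to borrow the URL.
pub fn connect_target(postgres_url: Option<&str>) -> Result<&str, String> {
    postgres_url.ok_or_else(|| "no config source".to_owned())
}
```

A call site would look like `let url = read_postgres_url(); connect_target(url.as_deref())`, which borrows the owned string without cloning it.
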
Two mistakes in the initial push of the new job:

- The workflow pulls `tensorzero/live-tests:sha-$SHA` but only declared
  `build-gateway-container` in `needs:`. Adds
  `build-live-tests-container` and `build-fixtures-container` to the
  dependency list, matching `live-tests-config-in-database`. Also
  gates the job on the same fork/dependabot condition the sibling jobs
  use.
- `pre-commit`'s `check-yaml` can't parse Compose's `!override` custom
  tag, so `validate` failed on the new compose file. Excludes that
  single file from `check-yaml`; Docker Compose still validates it at
  stack-up time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CI job failed because `docker compose run live-tests` started its
full `depends_on` graph — including `fixtures-postgres`, which exits
1 when loading fixtures against a migrated-but-empty DB. The whole
point of this scenario is an empty DB, so fixture loading is a
semantic mismatch.

Override `live-tests.depends_on` with `!override` to keep only the
infra + gateway + migrations services and drop `fixtures` and
`fixtures-postgres`. The `up --wait gateway` and the subsequent
`run --rm live-tests` both pass locally after this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run the zero-config boot scenario in both observability modes:

- Postgres-config + ClickHouse-data (default TOML-config deploy shape)
- Postgres-config + Postgres-data (single-datastore deploy)

Matches the `live-tests` workflow's `database: [clickhouse, postgres]`
matrix. When `matrix.database == postgres`, sets
`TENSORZERO_INTERNAL_TEST_OBSERVABILITY_BACKEND=postgres` so the gateway
uses Postgres as the primary observability backend and exercises its
pgcron/pgvector/trigram extension checks.

The `check-all-general-jobs-passed.sh` ALLOWED_SKIP entry
(`db-only-boot-e2e`) already covers matrix-suffixed job names via the
existing `"entry ("` prefix match — no change needed there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
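
The `"entry ("` prefix match mentioned in the commit above is implemented in a shell script; the following Rust sketch only illustrates the matching rule as described. The function name is hypothetical, and accepting an exact match in addition to the prefix match is an assumption.

```rust
/// Return true when `job_name` is covered by an allow-list entry, either
/// exactly or as a matrix-suffixed name such as "entry (key: value)".
pub fn is_allowed_skip(job_name: &str, allowed_skip: &[&str]) -> bool {
    allowed_skip.iter().any(|entry| {
        job_name == *entry
            || job_name
                .strip_prefix(*entry)
                .is_some_and(|rest| rest.starts_with(" ("))
    })
}
```

Requiring the `" ("` after the entry keeps a job such as `db-only-boot-e2e-extra` from being matched by the `db-only-boot-e2e` entry.
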
3 participants