ci: various tweaks#10901
Conversation
The three unit-race jobs each ran the full test suite on a single runner, taking 60-67 minutes and sitting on the critical path of every CI run. Add a unit-race-parallel make target backed by a new scripts/unit_race_part.sh, which splits the package list round-robin into a configurable number of tranches. Known-heavy packages are listed first so they land in different tranches. Each tranche runs on its own runner: the workflow matrix becomes three flavors (plain, test_db_sqlite, test_db_postgres) times four tranches, bringing the expected per-job time down to roughly a quarter of the previous duration.
Add a trancheoffset make variable and a new itest-only-parallel target that runs tranches without re-building the itest binaries. Together with the existing tranches and parallel variables, this allows a single itest configuration to be spread over multiple machines: each machine runs ITEST_PARALLELISM tranches starting at its own offset, against the same total tranche count and shuffle seed. itest-parallel is unchanged in behavior and is now expressed as build-itest followed by itest-only-parallel.
The three postgres itest configurations were the longest jobs in the workflow at 65-81 minutes each: every job built the itest binaries from scratch and then ran all 8 tranches in parallel on a single 4-core runner. Following the approach used in lightninglabs/taproot-assets, build the itest binaries once per distinct build-tag set (plain postgres, and postgres with test_native_sql) in a dedicated job that uploads them as artifacts. The test jobs download the binaries and each runs only SPLIT_ITEST_PARALLELISM (2) of the SMALL_TRANCHES (8) tranches, spreading every postgres config over 4 runners. The shuffle seed is fixed per configuration rather than per job so that all tranche groups of a config agree on the test partition.
There was a problem hiding this comment.
Code Review
This pull request introduces parallel execution capabilities for integration and unit race tests by adding support for a tranche offset, allowing tests to be distributed across multiple machines. A new script, scripts/unit_race_part.sh, is added to run a specific tranche of unit tests in race detector mode. Feedback on the new script highlights a critical logic bug where heavy packages are not correctly filtered and run twice, as well as a robustness issue regarding uncaught pipeline failures. A refactored solution using Bash associative arrays is suggested to address these issues.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Reading the package list through process substitution meant a go list failure was invisible to set -e: every tranche would select zero packages and exit zero, letting a broken tree pass the race jobs vacuously. Capture the list via command substitution instead, which propagates the pipeline failure, and reject an empty result.
Measurement of the first split run (33-48 min per job at 8 tranches, 2 per runner) showed per-tranche wall time dominates: the itests are wait-bound rather than CPU-bound, so larger tranches set a high floor regardless of runner load. The basic configs already run 16 tranches on a single runner in 17-25 minutes total. Move the split configs to 16 tranches with 4 per runner job. The job count is unchanged; per-runner load stays below the old single runner setup (4 tranches and a dedicated postgres instance versus 8 tranches sharing one), and expected per-job time drops to roughly 15-20 minutes.
PR Severity: LOW
All changed files are LOW severity (CI/CD, make, scripts). <!-- pr-severity-bot --> |
PR Severity: LOW
Low (5 files)
All changed files fall under LOW-severity categories (.github/, make/, scripts/*). No severity bump: 5 files (threshold >20) and 352 lines (threshold >500). CI/tooling improvement adding parallelized unit race test support. To override, add a severity-override-{critical,high,medium,low} label. |
The kvdb_* and test_db_* build tags work by swapping the backend used by kvdb.GetTestBackend, channeldb.MakeTestDB and the sqldb test constructors. Only the packages that construct test stores through those helpers behave differently under the tags; the remaining ~160 packages of the suite run byte-for-byte identical tests in every variant. Each DB-variant job nonetheless ran the full suite, costing 27-35 minutes per unit job and four runners per race flavor. Teach the pkg make variable to accept a list of packages, plumb an optional package filter through unit_race_part.sh, and run the variant jobs only over the DB-touching package closure (defined as DB_UNIT_PKGS in the workflow). The plain unit and race runs still cover the full suite. The DB race flavors drop from four tranches to two, removing four runner jobs.
Both jobs cost a full runner slot and per-job setup to run a grep on go.mod and a 20-line shell script respectively. With runner-pool queueing being the dominant contributor to CI wall time, runner slots matter more than step duration. Run both as steps of the static checks job instead, dropping three jobs per run.
The job deleted all caches not accessed within 12 hours on every pull request from a write-access author. At current activity levels the caches produced within any 12-hour window already exceed the 10GB repository quota (currently 11GB across 15 entries), so the job only ever deletes entries that GitHub's LRU eviction would discard anyway. It also has a harmful failure mode: after any quiet 12-hour stretch the next PR wipes every cache, including the hot master branch ones, forcing a fleet-wide cold build. LRU eviction never does that, since master caches are kept fresh by restore-keys hits from PR jobs. Dropping the job also lets the workflow's actions: write permission fall back to read-only.
The cleanup-space action dates from when hosted runners shipped with roughly 14GB of free disk. Current ubuntu runners have 145GB disks with 88GB free before cleanup (measured in run 27379687191), far more than any job here uses, so the step only adds about two minutes of latency to the start of every job that carries it - roughly 60 runner-minutes per workflow run across its nine usages. The release workflow keeps its usage since it runs rarely and off the PR path.
Remove six itest configurations from per-PR CI:
- bitcoind-sqlite-nativesql and bitcoind-postgres-nativesql: the
test_native_sql build tag only splices in the dev migrations
from sqldb/migrations_dev.go, which is empty by design now that
all migrations live in the main line. The -experiment flavors
are therefore identical today and remain a strict superset
whenever a dev migration is in flight, so the plain nativesql
flavors add no coverage.
- bitcoind-miner: the miner is the lntest harness's block
generator; lnd never connects to it, so this config varies no
production code. Breakage in lntest's bitcoind miner support is
a test-framework concern, not a per-PR lnd one.
- bitcoind-notxindex and bitcoind-rpcpolling: each differs from
the main bitcoind run only in a thin chain-access adapter (the
txindex-less lookup fallback and the RPC polling client), both
rarely touched and partially unit-covered, while the rest of the
run is a duplicate.
- bitcoind-etcd: niche clustered deployment shape; the etcd
storage layer itself is still unit-tested on every PR via the
targeted kvdb_etcd job.
This removes nine runner jobs and roughly 230 runner-minutes per
run, which translates directly to less queueing under the org-level
concurrent-job cap.
The bitcoind interface test variants both ran with a transaction index available, so the notifier's manual historical lookup paths (confDetailsManually, historicalSpendDetails) were only exercised incidentally, via script-dispatch requests and tx-not-found races. End-to-end coverage of the no-txindex deployment shape came from the bitcoind-notxindex itest configuration, which was removed from per-PR CI. Add a bitcoind-no-txindex variant to the interface suite. It runs against the same registered bitcoind driver but with txindex disabled on the backend, forcing every historical confirmation and spend lookup through the manual scan fallbacks. This covers the differing surface of the removed itest configuration more systematically and at far lower cost, since it runs inside the existing unit test jobs.
The bitcoind-miner itest configuration was removed from per-PR CI because it varies no production lnd code: the miner is the lntest harness's block generator. Its real purpose is validating lntest's bitcoind miner support, which downstream projects consume. Restore that coverage with a dedicated path-filtered workflow that runs the configuration only when lntest itself (or the workflow) changes, on PRs and on pushes to master. This keeps the validation where it is relevant without spending a runner slot on every push. Note that the new check must not be marked required in branch protection, since it does not run on most PRs.
|
PR Severity MEDIUM. Severity changed from severity-low to severity-medium. All 8 files are CI/CD config, scripts, or test infrastructure (base: LOW), but 667 non-excluded lines changed exceeds the 500-line bump threshold, raising to MEDIUM. No production code changed. <!-- pr-severity-bot --> |
The targeted DB-variant unit jobs run their package list through a single go test invocation, which executes package test binaries in parallel. The untargeted path serializes packages via xargs -L 1, and some test fixtures depend on that: the kvdb postgres fixture starts an embedded postgres bound to the fixed port 9876, so under the kvdb_postgres tag concurrent package binaries race for the port (seen as 'process already listening on port 9876' failures in CI). Pass -p 1 to go test when more than one package is targeted, restoring the serial execution the suite has always assumed. Single-package invocations are unaffected and keep parallel builds.
The windows itest job spent roughly a quarter of its 43 minutes compiling the itest binaries on the (much slower) windows runner. All itest binaries are built with CGO_ENABLED=0, so they cross-compile cleanly: build them with GOOS=windows in the existing itest binary build job and have the windows job download the artifact and run itest-only-parallel instead. This also drops the Go toolchain setup and cache download from the windows runner entirely.
The two sqlite itest configurations were the slowest single-runner jobs left at 31-32 minutes, still running SMALL_TRANCHES (8). The basic configurations demonstrate that a single runner handles 16 tranches in about 20 minutes, since the itests are wait-bound rather than CPU-bound and benefit from finer-grained tranches. Switch the job to TRANCHES (16), leaving SMALL_TRANCHES to the windows and macOS jobs.
Two CI runs of data show the tranche holding both channeldb and invoices is consistently the slowest at 24-28 minutes against 12-17 for its siblings, with channeldb the heaviest package by a wide margin. Reorder HEAVY_PKGS so that under the four-tranche CI split channeldb shares a tranche only with light packages and invoices moves next to chainntnfs. The partition remains disjoint and covering (verified across all four tranches).
The three cross-compile jobs built all 18 release targets on every push, costing roughly 45 runner-minutes per run. Platform-specific compile errors come in a small number of classes (OS family, word size, ARM), so build one job with one target per class instead. The full target list is still built by the release workflow. Experimental: revert this commit to restore the full per-PR matrix.
The plain unit-race tranches already race-test every package on the default backend. The sqlite and postgres race flavors existed to catch races in backend-specific concurrency handling, but that surface is exercised for correctness by the non-race test_db_* unit variants on every PR, and the four flavor jobs cost roughly 50 runner-minutes per run. Experimental: revert this commit to restore the DB race flavors.
The btcd configuration differs from the bitcoind one only in the chain-access layer (BtcdNotifier and the rpcclient paths), which the chainntnfs interface test suite exercises against a real btcd node in the unit jobs on every PR. btcd itself also runs as the miner in every other itest configuration. This is the same reasoning that removed the rpcpolling and notxindex configurations. Experimental: revert this commit to restore the configuration.
The SQL itests formed a 2x2 product of engine (sqlite, postgres) and store mode (kv shim, native SQL experiment). Two cells cover all the main effects: postgres-native-experiment, where active native-SQL development and the hardest concurrency semantics live, and sqlite-shim, which covers the sqlite engine and the (stable, engine-generic) shim layer. The dropped interactions keep unit-level coverage on every PR: shim-on-postgres via the kvdb_postgres unit variant and native-on-sqlite via the test_db_sqlite unit variant, which also includes any in-flight dev migrations. What is lost is only the whole-daemon run of those two cells, at a saving of roughly 105 runner-minutes per run. Experimental: revert this commit to restore the full matrix.
The heavy-package list driving the round-robin tranche split was based on received wisdom about which packages are slow, and the CI logs show that wisdom is badly stale: channeldb runs in 6 seconds under race while invoices takes 691, with lnwire (328s, fuzz seed corpus), internal/musig2v040 (246s) and the lnwallet/test suites (420s combined) following — none of which were in the list. The rebalancing attempts were rearranging seconds-sized packages around an 11.5-minute boulder, which is why the slowest tranche stayed at 27 minutes. Replace the heavy-first round-robin with a greedy longest-processing-time assignment driven by a table of measured durations (unlisted packages get a small default). Each tranche job computes the same deterministic assignment and runs its own bucket. The estimated per-tranche weights come out at 700-702 seconds, so the slowest race job should drop from 27 to roughly 19 minutes.
Prebuilding the windows itest binaries on linux barely moved the job's duration: the Run itest step is 44 of its 45 minutes, so test execution dominates, not compilation. The same wait-bound dynamics that took the sqlite itests from 31 to 21 minutes when moving from 8 to 16 tranches should apply here, since the windows runner has the same four cores. Experimental: revert to SMALL_TRANCHES if the windows runner proves unable to handle 16 concurrent tranches.
Neither prebuilding the binaries nor doubling the tranche count moved the windows itest job below 41 minutes: its Run itest step is machine-bound at roughly 35 minutes of work, since process creation and file I/O are far more expensive on the windows runner than on linux, and a saturated machine gains nothing from finer tranches. More machines do help, so spread the 16 tranches over two windows runners with 8 apiece, reusing the prebuilt binaries and the tranche offset machinery. Expected per-job time is roughly 20 minutes. Experimental: revert this commit to restore the single windows job.
yyforyongyu
left a comment
There was a problem hiding this comment.
Thanks a lot for taking a stab at this long-running issue! The runtime reduction is very valuable, and I think the parallelization/sharding pieces are worth keeping.
That said, CI is also how we signal which supported configurations are protected from regressions. Dropping full itest coverage for user-facing modes is not just a CI cleanup; it effectively reduces support confidence for those modes. If we want to stop supporting any of them, that needs a wider discussion, release-note/deprecation planning, and likely a multi-release user-notice period.
I don’t think the following coverage can be dropped from per-PR CI without that discussion first:
-
btcdchain backend: drops full-daemonbackend=btcditest. Users running--bitcoin.node=btcdlose end-to-end regression coverage. -
bitcoindRPC polling mode: drops full-daemonbackend="bitcoind rpcpolling"itest. Users running--bitcoind.rpcpollinglose coverage for config wiring, startup, sync, and daemon-level behavior. -
bitcoindwithout txindex: drops full-daemonbackend="bitcoind notxindex"itest. The new focusedchainntnfstest is good additional coverage, but it does not replace full daemon coverage. -
etcddatabase backend: drops full-daemonbackend=bitcoind dbbackend=etcditest. Unit-levelkvdb_etcdcoverage remains, but clustered/etcd deployments lose end-to-end coverage. -
Postgres KV-shim mode: drops full-daemon
backend=bitcoind dbbackend=postgres. Postgres users not on native SQL lose full daemon coverage. -
Native SQL matrix: dropping non-experimental native SQL may be okay if
test_native_sqlis truly a superset, but dropping SQLite native SQL full itests and other engine/store-mode combinations is still a coverage reduction that should be explicitly approved. -
DB-specific race coverage: drops
unit-race tags=test_db_sqliteandunit-race tags=test_db_postgres, which makes backend-specific concurrency regressions less likely to be caught before release. -
Cross-platform release targets: reduces per-PR cross-compilation from all release targets to representative targets, so supported platform build failures may only be caught later.

Testing, testing, nothing to see here..