ci: various tweaks by jtobin · Pull Request #10901 · lightningnetwork/lnd · GitHub
Skip to content

ci: various tweaks#10901

Draft
jtobin wants to merge 23 commits into
lightningnetwork:masterfrom
jtobin:ci-tweaks
Draft

ci: various tweaks#10901
jtobin wants to merge 23 commits into
lightningnetwork:masterfrom
jtobin:ci-tweaks

Conversation

@jtobin

@jtobin jtobin commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Testing, testing, nothing to see here..

jtobin added 3 commits June 11, 2026 17:08
The three unit-race jobs each ran the full test suite on a single
runner, taking 60-67 minutes and sitting on the critical path of
every CI run.

Add a unit-race-parallel make target backed by a new
scripts/unit_race_part.sh, which splits the package list round-robin
into a configurable number of tranches. Known-heavy packages are
listed first so they land in different tranches. Each tranche runs
on its own runner: the workflow matrix becomes three flavors (plain,
test_db_sqlite, test_db_postgres) times four tranches, bringing the
expected per-job time down to roughly a quarter of the previous
duration.
Add a trancheoffset make variable and a new itest-only-parallel
target that runs tranches without re-building the itest binaries.
Together with the existing tranches and parallel variables, this
allows a single itest configuration to be spread over multiple
machines: each machine runs ITEST_PARALLELISM tranches starting at
its own offset, against the same total tranche count and shuffle
seed.

itest-parallel is unchanged in behavior and is now expressed as
build-itest followed by itest-only-parallel.
The three postgres itest configurations were the longest jobs in the
workflow at 65-81 minutes each: every job built the itest binaries
from scratch and then ran all 8 tranches in parallel on a single
4-core runner.

Following the approach used in lightninglabs/taproot-assets, build
the itest binaries once per distinct build-tag set (plain postgres,
and postgres with test_native_sql) in a dedicated job that uploads
them as artifacts. The test jobs download the binaries and each runs
only SPLIT_ITEST_PARALLELISM (2) of the SMALL_TRANCHES (8) tranches,
spreading every postgres config over 4 runners. The shuffle seed is
fixed per configuration rather than per job so that all tranche
groups of a config agree on the test partition.
@jtobin jtobin marked this pull request as draft June 11, 2026 21:50
@gemini-code-assist

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces parallel execution capabilities for integration and unit race tests by adding support for a tranche offset, allowing tests to be distributed across multiple machines. A new script, scripts/unit_race_part.sh, is added to run a specific tranche of unit tests in race detector mode. Feedback on the new script highlights a critical logic bug where heavy packages are not correctly filtered and run twice, as well as a robustness issue regarding uncaught pipeline failures. A refactored solution using Bash associative arrays is suggested to address these issues.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread scripts/unit_race_part.sh Outdated
Reading the package list through process substitution meant a go
list failure was invisible to set -e: every tranche would select
zero packages and exit zero, letting a broken tree pass the race
jobs vacuously. Capture the list via command substitution instead,
which propagates the pipeline failure, and reject an empty result.
Measurement of the first split run (33-48 min per job at 8 tranches,
2 per runner) showed per-tranche wall time dominates: the itests are
wait-bound rather than CPU-bound, so larger tranches set a high
floor regardless of runner load. The basic configs already run 16
tranches on a single runner in 17-25 minutes total.

Move the split configs to 16 tranches with 4 per runner job. The
job count is unchanged; per-runner load stays below the old single
runner setup (4 tranches and a dedicated postgres instance versus 8
tranches sharing one), and expected per-job time drops to roughly
15-20 minutes.
@github-actions github-actions Bot added the severity-low Best-effort review label Jun 12, 2026
@github-actions

Copy link
Copy Markdown

PR Severity: LOW

Automated classification | 5 files | 352 lines changed

All changed files are LOW severity (CI/CD, make, scripts).
No severity bump: 5 files, 352 lines, no critical packages.

<!-- pr-severity-bot -->

@github-actions

Copy link
Copy Markdown

PR Severity: LOW

Automated classification | 5 files | 352 lines changed

Low (5 files)
  • .github/workflows/main.yml - CI/CD workflow configuration
  • Makefile - Build/make tooling
  • make/testing_flags.mk - Make testing flags/tooling
  • scripts/itest_parallel.sh - Integration test runner script
  • scripts/unit_race_part.sh - Unit race test partitioning script (new file)

All changed files fall under LOW-severity categories (.github/, make/, scripts/*). No severity bump: 5 files (threshold >20) and 352 lines (threshold >500). CI/tooling improvement adding parallelized unit race test support.


To override, add a severity-override-{critical,high,medium,low} label.
<!-- pr-severity-bot -->

jtobin added 7 commits June 12, 2026 08:17
The kvdb_* and test_db_* build tags work by swapping the backend
used by kvdb.GetTestBackend, channeldb.MakeTestDB and the sqldb test
constructors. Only the packages that construct test stores through
those helpers behave differently under the tags; the remaining ~160
packages of the suite run byte-for-byte identical tests in every
variant. Each DB-variant job nonetheless ran the full suite, costing
27-35 minutes per unit job and four runners per race flavor.

Teach the pkg make variable to accept a list of packages, plumb an
optional package filter through unit_race_part.sh, and run the
variant jobs only over the DB-touching package closure (defined as
DB_UNIT_PKGS in the workflow). The plain unit and race runs still
cover the full suite. The DB race flavors drop from four tranches to
two, removing four runner jobs.
Both jobs cost a full runner slot and per-job setup to run a grep on
go.mod and a 20-line shell script respectively. With runner-pool
queueing being the dominant contributor to CI wall time, runner
slots matter more than step duration. Run both as steps of the
static checks job instead, dropping three jobs per run.
The job deleted all caches not accessed within 12 hours on every
pull request from a write-access author. At current activity levels
the caches produced within any 12-hour window already exceed the
10GB repository quota (currently 11GB across 15 entries), so the job
only ever deletes entries that GitHub's LRU eviction would discard
anyway. It also has a harmful failure mode: after any quiet 12-hour
stretch the next PR wipes every cache, including the hot master
branch ones, forcing a fleet-wide cold build. LRU eviction never
does that, since master caches are kept fresh by restore-keys hits
from PR jobs.

Dropping the job also lets the workflow's actions: write permission
fall back to read-only.
The cleanup-space action dates from when hosted runners shipped with
roughly 14GB of free disk. Current ubuntu runners have 145GB disks
with 88GB free before cleanup (measured in run 27379687191), far
more than any job here uses, so the step only adds about two minutes
of latency to the start of every job that carries it - roughly 60
runner-minutes per workflow run across its nine usages. The release
workflow keeps its usage since it runs rarely and off the PR path.
Remove six itest configurations from per-PR CI:

  - bitcoind-sqlite-nativesql and bitcoind-postgres-nativesql: the
    test_native_sql build tag only splices in the dev migrations
    from sqldb/migrations_dev.go, which is empty by design now that
    all migrations live in the main line. The -experiment flavors
    are therefore identical today and remain a strict superset
    whenever a dev migration is in flight, so the plain nativesql
    flavors add no coverage.

  - bitcoind-miner: the miner is the lntest harness's block
    generator; lnd never connects to it, so this config varies no
    production code. Breakage in lntest's bitcoind miner support is
    a test-framework concern, not a per-PR lnd one.

  - bitcoind-notxindex and bitcoind-rpcpolling: each differs from
    the main bitcoind run only in a thin chain-access adapter (the
    txindex-less lookup fallback and the RPC polling client), both
    rarely touched and partially unit-covered, while the rest of the
    run is a duplicate.

  - bitcoind-etcd: niche clustered deployment shape; the etcd
    storage layer itself is still unit-tested on every PR via the
    targeted kvdb_etcd job.

This removes nine runner jobs and roughly 230 runner-minutes per
run, which translates directly to less queueing under the org-level
concurrent-job cap.
The bitcoind interface test variants both ran with a transaction
index available, so the notifier's manual historical lookup paths
(confDetailsManually, historicalSpendDetails) were only exercised
incidentally, via script-dispatch requests and tx-not-found races.
End-to-end coverage of the no-txindex deployment shape came from the
bitcoind-notxindex itest configuration, which was removed from
per-PR CI.

Add a bitcoind-no-txindex variant to the interface suite. It runs
against the same registered bitcoind driver but with txindex
disabled on the backend, forcing every historical confirmation and
spend lookup through the manual scan fallbacks. This covers the
differing surface of the removed itest configuration more
systematically and at far lower cost, since it runs inside the
existing unit test jobs.
The bitcoind-miner itest configuration was removed from per-PR CI
because it varies no production lnd code: the miner is the lntest
harness's block generator. Its real purpose is validating lntest's
bitcoind miner support, which downstream projects consume.

Restore that coverage with a dedicated path-filtered workflow that
runs the configuration only when lntest itself (or the workflow)
changes, on PRs and on pushes to master. This keeps the validation
where it is relevant without spending a runner slot on every push.

Note that the new check must not be marked required in branch
protection, since it does not run on most PRs.
@github-actions github-actions Bot added severity-medium Focused review required and removed severity-low Best-effort review labels Jun 12, 2026
@github-actions

Copy link
Copy Markdown

PR Severity MEDIUM. Severity changed from severity-low to severity-medium. All 8 files are CI/CD config, scripts, or test infrastructure (base: LOW), but 667 non-excluded lines changed exceeds the 500-line bump threshold, raising to MEDIUM. No production code changed. <!-- pr-severity-bot -->

The targeted DB-variant unit jobs run their package list through a
single go test invocation, which executes package test binaries in
parallel. The untargeted path serializes packages via xargs -L 1,
and some test fixtures depend on that: the kvdb postgres fixture
starts an embedded postgres bound to the fixed port 9876, so under
the kvdb_postgres tag concurrent package binaries race for the port
(seen as 'process already listening on port 9876' failures in CI).

Pass -p 1 to go test when more than one package is targeted,
restoring the serial execution the suite has always assumed.
Single-package invocations are unaffected and keep parallel builds.
@github-actions github-actions Bot added severity-critical Requires expert review - security/consensus critical and removed severity-medium Focused review required labels Jun 12, 2026
@github-actions

Copy link
Copy Markdown

jtobin added 4 commits June 12, 2026 11:10
The windows itest job spent roughly a quarter of its 43 minutes
compiling the itest binaries on the (much slower) windows runner.
All itest binaries are built with CGO_ENABLED=0, so they
cross-compile cleanly: build them with GOOS=windows in the existing
itest binary build job and have the windows job download the
artifact and run itest-only-parallel instead. This also drops the
Go toolchain setup and cache download from the windows runner
entirely.
The two sqlite itest configurations were the slowest single-runner
jobs left at 31-32 minutes, still running SMALL_TRANCHES (8). The
basic configurations demonstrate that a single runner handles 16
tranches in about 20 minutes, since the itests are wait-bound
rather than CPU-bound and benefit from finer-grained tranches.
Switch the job to TRANCHES (16), leaving SMALL_TRANCHES to the
windows and macOS jobs.
Two CI runs of data show the tranche holding both channeldb and
invoices is consistently the slowest at 24-28 minutes against 12-17
for its siblings, with channeldb the heaviest package by a wide
margin. Reorder HEAVY_PKGS so that under the four-tranche CI split
channeldb shares a tranche only with light packages and invoices
moves next to chainntnfs. The partition remains disjoint and
covering (verified across all four tranches).
The three cross-compile jobs built all 18 release targets on every
push, costing roughly 45 runner-minutes per run. Platform-specific
compile errors come in a small number of classes (OS family, word
size, ARM), so build one job with one target per class instead. The
full target list is still built by the release workflow.

Experimental: revert this commit to restore the full per-PR matrix.
jtobin added 3 commits June 12, 2026 12:17
The plain unit-race tranches already race-test every package on the
default backend. The sqlite and postgres race flavors existed to
catch races in backend-specific concurrency handling, but that
surface is exercised for correctness by the non-race test_db_* unit
variants on every PR, and the four flavor jobs cost roughly 50
runner-minutes per run.

Experimental: revert this commit to restore the DB race flavors.
The btcd configuration differs from the bitcoind one only in the
chain-access layer (BtcdNotifier and the rpcclient paths), which the
chainntnfs interface test suite exercises against a real btcd node
in the unit jobs on every PR. btcd itself also runs as the miner in
every other itest configuration. This is the same reasoning that
removed the rpcpolling and notxindex configurations.

Experimental: revert this commit to restore the configuration.
The SQL itests formed a 2x2 product of engine (sqlite, postgres) and
store mode (kv shim, native SQL experiment). Two cells cover all the
main effects: postgres-native-experiment, where active native-SQL
development and the hardest concurrency semantics live, and
sqlite-shim, which covers the sqlite engine and the (stable,
engine-generic) shim layer. The dropped interactions keep unit-level
coverage on every PR: shim-on-postgres via the kvdb_postgres unit
variant and native-on-sqlite via the test_db_sqlite unit variant,
which also includes any in-flight dev migrations. What is lost is
only the whole-daemon run of those two cells, at a saving of roughly
105 runner-minutes per run.

Experimental: revert this commit to restore the full matrix.
jtobin added 2 commits June 12, 2026 13:46
The heavy-package list driving the round-robin tranche split was
based on received wisdom about which packages are slow, and the CI
logs show that wisdom is badly stale: channeldb runs in 6 seconds
under race while invoices takes 691, with lnwire (328s, fuzz seed
corpus), internal/musig2v040 (246s) and the lnwallet/test suites
(420s combined) following — none of which were in the list. The
rebalancing attempts were rearranging seconds-sized packages around
an 11.5-minute boulder, which is why the slowest tranche stayed at
27 minutes.

Replace the heavy-first round-robin with a greedy
longest-processing-time assignment driven by a table of measured
durations (unlisted packages get a small default). Each tranche job
computes the same deterministic assignment and runs its own bucket.
The estimated per-tranche weights come out at 700-702 seconds, so
the slowest race job should drop from 27 to roughly 19 minutes.
Prebuilding the windows itest binaries on linux barely moved the
job's duration: the Run itest step is 44 of its 45 minutes, so test
execution dominates, not compilation. The same wait-bound dynamics
that took the sqlite itests from 31 to 21 minutes when moving from
8 to 16 tranches should apply here, since the windows runner has
the same four cores.

Experimental: revert to SMALL_TRANCHES if the windows runner proves
unable to handle 16 concurrent tranches.
@github-actions github-actions Bot added severity-critical Requires expert review - security/consensus critical and removed severity-critical Requires expert review - security/consensus critical labels Jun 12, 2026
Neither prebuilding the binaries nor doubling the tranche count
moved the windows itest job below 41 minutes: its Run itest step is
machine-bound at roughly 35 minutes of work, since process creation
and file I/O are far more expensive on the windows runner than on
linux, and a saturated machine gains nothing from finer tranches.
More machines do help, so spread the 16 tranches over two windows
runners with 8 apiece, reusing the prebuilt binaries and the
tranche offset machinery. Expected per-job time is roughly 20
minutes.

Experimental: revert this commit to restore the single windows job.

@yyforyongyu yyforyongyu left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for taking a stab at this long-running issue! The runtime reduction is very valuable, and I think the parallelization/sharding pieces are worth keeping.

That said, CI is also how we signal which supported configurations are protected from regressions. Dropping full itest coverage for user-facing modes is not just a CI cleanup; it effectively reduces support confidence for those modes. If we want to stop supporting any of them, that needs a wider discussion, release-note/deprecation planning, and likely a multi-release user-notice period.

I don’t think the following coverage can be dropped from per-PR CI without that discussion first:

  • btcd chain backend: drops full-daemon backend=btcd itest. Users running --bitcoin.node=btcd lose end-to-end regression coverage.

  • bitcoind RPC polling mode: drops full-daemon backend="bitcoind rpcpolling" itest. Users running --bitcoind.rpcpolling lose coverage for config wiring, startup, sync, and daemon-level behavior.

  • bitcoind without txindex: drops full-daemon backend="bitcoind notxindex" itest. The new focused chainntnfs test is good additional coverage, but it does not replace full daemon coverage.

  • etcd database backend: drops full-daemon backend=bitcoind dbbackend=etcd itest. Unit-level kvdb_etcd coverage remains, but clustered/etcd deployments lose end-to-end coverage.

  • Postgres KV-shim mode: drops full-daemon backend=bitcoind dbbackend=postgres. Postgres users not on native SQL lose full daemon coverage.

  • Native SQL matrix: dropping non-experimental native SQL may be okay if test_native_sql is truly a superset, but dropping SQLite native SQL full itests and other engine/store-mode combinations is still a coverage reduction that should be explicitly approved.

  • DB-specific race coverage: drops unit-race tags=test_db_sqlite and unit-race tags=test_db_postgres, which makes backend-specific concurrency regressions less likely to be caught before release.

  • Cross-platform release targets: reduces per-PR cross-compilation from all release targets to representative targets, so supported platform build failures may only be caught later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-changelog severity-critical Requires expert review - security/consensus critical

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants