Do not report "Server died" when a test run hits the global time limit#106183
Conversation
The flaky and targeted stateless checks deliberately rerun the selected
tests until their time budget is exhausted (`rerun_count = 50`, so the
time limit is the intended stop condition). `clickhouse-test` raised
`StopTesting` for five unrelated reasons - server died, hung check,
`--max-failures`, and the global time limit - all exiting with the same
`STOP_TESTING_EXIT_CODE`, which the job side mapped unconditionally to a
synthetic "Server died" leaf. A clean, expected timeout was therefore
indistinguishable from a real crash.
Give the graceful time-limit stop a distinct `GLOBAL_TIME_LIMIT_EXIT_CODE`
and report it as a benign "Global time limit reached" result rather than
"Server died". Hand the budget to `clickhouse-test` via `--global_time_limit`
so it stops gracefully between tests; keep the external `Shell.run` timeout
only as a larger safety net for a genuinely frozen process.
To keep the graceful exit code from being clobbered into 143 ("Server
died") by the `killpg` feedback loop in `stop_tests`, parallel workers
now stop quietly on the deadline and the parent ignores `SIGTERM` while
terminating them.
Example of a targeted check mislabeled as "Server died" on timeout:
https://s3.amazonaws.com/clickhouse-test-reports/PRs/105943/9a0a506d1908116e6fc0532eff6a1c126b5cbcfb/stateless_tests_arm_asan_ubsan_targeted/job.log
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The external `Shell.run` safety-net timeout was `global_time_limit + 600`, but a test that starts just before the global deadline can still be inside its own per-test alarm window, which `clickhouse-test` arms as `int(args.timeout * 1.1) + 60` (720s with the default `--timeout 600`). With only a 600s margin the external `SIGTERM` could fire before the graceful per-test stop, returning `143` and reporting `Server died` - exactly the misclassification this PR removes. Widen the margin to 900s so it exceeds the per-test bound plus the worker shutdown wind-down, leaving the external kill for a genuinely frozen process. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
global_time_limit_reached is set before the parent runs the --hung-check liveness probe below, and the later if global_time_limit_reached branch forces GLOBAL_TIME_LIMIT_EXIT_CODE. If the server becomes unresponsive in the same polling iteration that crosses args.stop_time, this branch claims the benign timeout first; the immediately following hung check can still print Hung check failed and set stop_testing, but the final exit remains 3, so FTResultsProcessor reports an OK Global time limit reached leaf instead of a failed server-not-responding run. The PR says real hung/server failures keep precedence, so please run the liveness check before claiming the global timeout or clear the benign flag when the hung check fires.

The flaky and targeted stateless checks deliberately rerun the selected tests until their time budget is exhausted (
rerun_count = 50, so the global time limit is the intended stop condition, not the repeat count). However,clickhouse-testraisedStopTestingfor five unrelated reasons — server died, hung check,--max-failures, and the global time limit — all exiting with the sameSTOP_TESTING_EXIT_CODE, which the job side mapped unconditionally to a syntheticServer diedleaf. A clean, expected timeout was therefore indistinguishable from a real crash and reported asServer died.This change gives the graceful time-limit stop a distinct
GLOBAL_TIME_LIMIT_EXIT_CODEand reports it as a benignGlobal time limit reachedresult instead ofServer died. The time budget is now handed toclickhouse-testvia--global_time_limitso it stops gracefully between tests; the externalShell.runtimeout is kept only as a larger safety net for a genuinely frozen process.To keep the graceful exit code from being clobbered into
143(which is treated asServer died) by thekillpgfeedback loop instop_tests, parallel workers now stop quietly on the deadline and the parent ignoresSIGTERMwhile terminating them. Genuine server deaths and--max-failuresaborts keep precedence over the time-limit path.Example of a targeted check mislabeled as
Server diedon timeout (and corresponding PR #105943):https://s3.amazonaws.com/clickhouse-test-reports/PRs/105943/9a0a506d1908116e6fc0532eff6a1c126b5cbcfb/stateless_tests_arm_asan_ubsan_targeted/job.log
#105943
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Documentation entry for user-facing changes
🤖 Generated with Claude Code
Version info
26.6.1.310