{{ message }}
CI: do not report "Server died" when a run stops on --max-failures#107602
Merged
Algunenano merged 2 commits intoJun 16, 2026
Conversation
…ix UNKNOWN ordering The flaky check runs with `--max-failures 5`. When a test legitimately failed five times, `clickhouse-test` raised `StopTesting` with the default `STOP_TESTING_EXIT_CODE` (2) - the same code used when the server actually dies. The job side (`functional_tests_results.py`) then reported a generic "Server died" leaf and demoted every real failure to `UNKNOWN`, hiding which test actually failed even though the server was perfectly alive. Give the `--max-failures` / `--max-failures-chain` stop its own exit code (`MAX_FAILURES_EXIT_CODE = 4`), distinct from server death and the global time limit. The parent detects the max-failures stop (shared failure counter) and re-raises with the new code; the worker-side raises carry it for sequential runs. The job side reports it as "Too many test failures" and keeps the failing tests as FAIL (real, already-attributed failures) instead of demoting them. Also fix the HTML report ordering: `getStatusPriority` in `json.html` had no branch for UNKNOWN, so it fell through to the lowest priority and sorted below all the OK rows (fail > ok > unknown). Rank UNKNOWN right under the errors and failures so the rows whose real result was lost are visible at the top. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
In a parallel run the worker that reaches `--max-failures` or `--max-failures-chain` raises `StopTesting` with `MAX_FAILURES_EXIT_CODE` inside its own process, so that exception never reaches the parent. The parent reconstructed the code only from the shared total-failure counter, so the `--max-failures-chain` path (and `--max-failures 0`) fell through. Worse, reproducing it shows both parallel paths exited 143, not the default 2: the worker's `stop_tests()` does `killpg(SIGTERM)` over the whole process group, killing the parent before it can re-raise with the right code. 143 (`128 + SIGTERM`) is an aborted-run code, so the job side still reported "Server died" and demoted the real failures to `UNKNOWN`. The global time limit path avoids this because the parent initiates it and masks SIGTERM first. Fix: - Carry the worker's intended exit code to the parent through a shared `stop_exit_code` value (set before `stop_testing`); the parent re-raises with it. This covers both `--max-failures` and `--max-failures-chain`. - The max-failures / chain raises no longer call `stop_tests()`, so they do not SIGTERM the parent. The parent masks SIGTERM and terminates the workers in an orderly way, exactly like the time limit path. Server death keeps `stop_tests()` (still reported as "Server died"). - A sibling worker that only observes `stop_testing` now breaks quietly when concurrent instead of calling `stop_tests()`, mirroring the time limit handling. Add focused coverage in `ci/tests`: a unit test that `MAX_FAILURES_EXIT_CODE` maps to a "Too many test failures" leaf and keeps the failures as `FAIL` (contrasted with the aborted-run code that demotes to `UNKNOWN`), and an end-to-end test that a parallel run exits with `MAX_FAILURES_EXIT_CODE` for both the chain-limit and total-limit paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

The "Stateless tests (flaky check)" job reports "Server died" and demotes every failing test to UNKNOWN whenever a run stops on
--max-failures, even though the server is perfectly alive.clickhouse-testraisedStopTestingwith the defaultSTOP_TESTING_EXIT_CODE(2) — the same code used for a real server death — so the job side could not tell the two apart and hid which test actually failed.This gives the
--max-failures/--max-failures-chainstop its own exit code (MAX_FAILURES_EXIT_CODE = 4), distinct from server death and the global time limit. The job now reports it as "Too many test failures" and keeps the failing tests asFAIL(real, already-attributed failures) instead of demoting them toUNKNOWNand synthesizing a "Server died" leaf.It also fixes the HTML report ordering:
getStatusPriorityinci/praktika/json.htmlhad no branch forUNKNOWN, soUNKNOWNrows sorted below all theOKrows (fail > ok > unknown).UNKNOWNnow ranks right under the errors and failures, so rows whose real result was lost are visible at the top.Example report where the mislabel is visible — a single failing test (
deltaLakeLocalis not built with MSan) shown as a generic "Server died" plus fiveUNKNOWNreruns: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107341&sha=49c86b80ca9c46024be1ec61a1c3ba97a45544b8&name_0=PR&name_1=Stateless+tests+%28amd_msan%2C+flaky+check%29&name_2=TestsRelated: #107341
cc @leshikus
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Not for changelog (CI only).
Version info
26.6.1.882