CI: do not report "Server died" when a run stops on --max-failures by Algunenano · Pull Request #107602 · ClickHouse/ClickHouse · GitHub
Skip to content

CI: do not report "Server died" when a run stops on --max-failures#107602

Merged
Algunenano merged 2 commits into
ClickHouse:masterfrom
Algunenano:fix-flaky-check-mislabel
Jun 16, 2026
Merged

CI: do not report "Server died" when a run stops on --max-failures#107602
Algunenano merged 2 commits into
ClickHouse:masterfrom
Algunenano:fix-flaky-check-mislabel

Conversation

@Algunenano

@Algunenano Algunenano commented Jun 16, 2026

Copy link
Copy Markdown
Member

The "Stateless tests (flaky check)" job reports "Server died" and demotes every failing test to UNKNOWN whenever a run stops on --max-failures, even though the server is perfectly alive. clickhouse-test raised StopTesting with the default STOP_TESTING_EXIT_CODE (2) — the same code used for a real server death — so the job side could not tell the two apart and hid which test actually failed.

This gives the --max-failures / --max-failures-chain stop its own exit code (MAX_FAILURES_EXIT_CODE = 4), distinct from server death and the global time limit. The job now reports it as "Too many test failures" and keeps the failing tests as FAIL (real, already-attributed failures) instead of demoting them to UNKNOWN and synthesizing a "Server died" leaf.

It also fixes the HTML report ordering: getStatusPriority in ci/praktika/json.html had no branch for UNKNOWN, so UNKNOWN rows sorted below all the OK rows (fail > ok > unknown). UNKNOWN now ranks right under the errors and failures, so rows whose real result was lost are visible at the top.

Example report where the mislabel is visible — a single failing test (deltaLakeLocal is not built with MSan) shown as a generic "Server died" plus five UNKNOWN reruns: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107341&sha=49c86b80ca9c46024be1ec61a1c3ba97a45544b8&name_0=PR&name_1=Stateless+tests+%28amd_msan%2C+flaky+check%29&name_2=Tests

Related: #107341

cc @leshikus

Changelog category (leave one):

  • CI Fix or Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Not for changelog (CI only).

Version info

  • Merged into: 26.6.1.882

…ix UNKNOWN ordering

The flaky check runs with `--max-failures 5`. When a test legitimately failed
five times, `clickhouse-test` raised `StopTesting` with the default
`STOP_TESTING_EXIT_CODE` (2) - the same code used when the server actually dies.
The job side (`functional_tests_results.py`) then reported a generic "Server
died" leaf and demoted every real failure to `UNKNOWN`, hiding which test
actually failed even though the server was perfectly alive.

Give the `--max-failures` / `--max-failures-chain` stop its own exit code
(`MAX_FAILURES_EXIT_CODE = 4`), distinct from server death and the global time
limit. The parent detects the max-failures stop (shared failure counter) and
re-raises with the new code; the worker-side raises carry it for sequential
runs. The job side reports it as "Too many test failures" and keeps the failing
tests as FAIL (real, already-attributed failures) instead of demoting them.

Also fix the HTML report ordering: `getStatusPriority` in `json.html` had no
branch for UNKNOWN, so it fell through to the lowest priority and sorted below
all the OK rows (fail > ok > unknown). Rank UNKNOWN right under the errors and
failures so the rows whose real result was lost are visible at the top.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 16, 2026
Comment thread tests/clickhouse-test Outdated
In a parallel run the worker that reaches `--max-failures` or
`--max-failures-chain` raises `StopTesting` with `MAX_FAILURES_EXIT_CODE`
inside its own process, so that exception never reaches the parent. The
parent reconstructed the code only from the shared total-failure counter,
so the `--max-failures-chain` path (and `--max-failures 0`) fell through.

Worse, reproducing it shows both parallel paths exited 143, not the default
2: the worker's `stop_tests()` does `killpg(SIGTERM)` over the whole process
group, killing the parent before it can re-raise with the right code. 143
(`128 + SIGTERM`) is an aborted-run code, so the job side still reported
"Server died" and demoted the real failures to `UNKNOWN`. The global time
limit path avoids this because the parent initiates it and masks SIGTERM
first.

Fix:
- Carry the worker's intended exit code to the parent through a shared
  `stop_exit_code` value (set before `stop_testing`); the parent re-raises
  with it. This covers both `--max-failures` and `--max-failures-chain`.
- The max-failures / chain raises no longer call `stop_tests()`, so they do
  not SIGTERM the parent. The parent masks SIGTERM and terminates the workers
  in an orderly way, exactly like the time limit path. Server death keeps
  `stop_tests()` (still reported as "Server died").
- A sibling worker that only observes `stop_testing` now breaks quietly when
  concurrent instead of calling `stop_tests()`, mirroring the time limit
  handling.

Add focused coverage in `ci/tests`: a unit test that `MAX_FAILURES_EXIT_CODE`
maps to a "Too many test failures" leaf and keeps the failures as `FAIL`
(contrasted with the aborted-run code that demotes to `UNKNOWN`), and an
end-to-end test that a parallel run exits with `MAX_FAILURES_EXIT_CODE` for
both the chain-limit and total-limit paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@leshikus leshikus self-requested a review June 16, 2026 12:37

@leshikus leshikus left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@Algunenano Algunenano enabled auto-merge June 16, 2026 12:45
@Algunenano Algunenano disabled auto-merge June 16, 2026 13:26
@Algunenano Algunenano enabled auto-merge June 16, 2026 13:35
@Algunenano Algunenano added this pull request to the merge queue Jun 16, 2026
Merged via the queue into ClickHouse:master with commit bb53b35 Jun 16, 2026
166 checks passed
@Algunenano Algunenano deleted the fix-flaky-check-mislabel branch June 16, 2026 17:23
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants