Do not report "Server died" when a test run hits the global time limit by alexey-milovidov · Pull Request #106183 · ClickHouse/ClickHouse · GitHub
Skip to content

Do not report "Server died" when a test run hits the global time limit#106183

Merged
alexey-milovidov merged 3 commits into
masterfrom
fix-timeout-not-server-died
Jun 2, 2026
Merged

Do not report "Server died" when a test run hits the global time limit#106183
alexey-milovidov merged 3 commits into
masterfrom
fix-timeout-not-server-died

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 31, 2026

Copy link
Copy Markdown
Member

The flaky and targeted stateless checks deliberately rerun the selected tests until their time budget is exhausted (rerun_count = 50, so the global time limit is the intended stop condition, not the repeat count). However, clickhouse-test raised StopTesting for five unrelated reasons — server died, hung check, --max-failures, and the global time limit — all exiting with the same STOP_TESTING_EXIT_CODE, which the job side mapped unconditionally to a synthetic Server died leaf. A clean, expected timeout was therefore indistinguishable from a real crash and reported as Server died.

This change gives the graceful time-limit stop a distinct GLOBAL_TIME_LIMIT_EXIT_CODE and reports it as a benign Global time limit reached result instead of Server died. The time budget is now handed to clickhouse-test via --global_time_limit so it stops gracefully between tests; the external Shell.run timeout is kept only as a larger safety net for a genuinely frozen process.

To keep the graceful exit code from being clobbered into 143 (which is treated as Server died) by the killpg feedback loop in stop_tests, parallel workers now stop quietly on the deadline and the parent ignores SIGTERM while terminating them. Genuine server deaths and --max-failures aborts keep precedence over the time-limit path.

Example of a targeted check mislabeled as Server died on timeout (and corresponding PR #105943):
https://s3.amazonaws.com/clickhouse-test-reports/PRs/105943/9a0a506d1908116e6fc0532eff6a1c126b5cbcfb/stateless_tests_arm_asan_ubsan_targeted/job.log
#105943

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

🤖 Generated with Claude Code

Version info

  • Merged into: 26.6.1.310

The flaky and targeted stateless checks deliberately rerun the selected
tests until their time budget is exhausted (`rerun_count = 50`, so the
time limit is the intended stop condition). `clickhouse-test` raised
`StopTesting` for five unrelated reasons - server died, hung check,
`--max-failures`, and the global time limit - all exiting with the same
`STOP_TESTING_EXIT_CODE`, which the job side mapped unconditionally to a
synthetic "Server died" leaf. A clean, expected timeout was therefore
indistinguishable from a real crash.

Give the graceful time-limit stop a distinct `GLOBAL_TIME_LIMIT_EXIT_CODE`
and report it as a benign "Global time limit reached" result rather than
"Server died". Hand the budget to `clickhouse-test` via `--global_time_limit`
so it stops gracefully between tests; keep the external `Shell.run` timeout
only as a larger safety net for a genuinely frozen process.

To keep the graceful exit code from being clobbered into 143 ("Server
died") by the `killpg` feedback loop in `stop_tests`, parallel workers
now stop quietly on the deadline and the parent ignores `SIGTERM` while
terminating them.

Example of a targeted check mislabeled as "Server died" on timeout:
https://s3.amazonaws.com/clickhouse-test-reports/PRs/105943/9a0a506d1908116e6fc0532eff6a1c126b5cbcfb/stateless_tests_arm_asan_ubsan_targeted/job.log

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 31, 2026
Comment thread ci/jobs/functional_tests.py Outdated
Comment thread tests/clickhouse-test
@CheSema CheSema self-assigned this May 31, 2026
alexey-milovidov and others added 2 commits June 1, 2026 09:45
The external `Shell.run` safety-net timeout was `global_time_limit + 600`,
but a test that starts just before the global deadline can still be inside
its own per-test alarm window, which `clickhouse-test` arms as
`int(args.timeout * 1.1) + 60` (720s with the default `--timeout 600`).
With only a 600s margin the external `SIGTERM` could fire before the
graceful per-test stop, returning `143` and reporting `Server died` -
exactly the misclassification this PR removes.

Widen the margin to 900s so it exceeds the per-test bound plus the worker
shutdown wind-down, leaving the external kill for a genuinely frozen process.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread tests/clickhouse-test

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global_time_limit_reached is set before the parent runs the --hung-check liveness probe below, and the later if global_time_limit_reached branch forces GLOBAL_TIME_LIMIT_EXIT_CODE. If the server becomes unresponsive in the same polling iteration that crosses args.stop_time, this branch claims the benign timeout first; the immediately following hung check can still print Hung check failed and set stop_testing, but the final exit remains 3, so FTResultsProcessor reports an OK Global time limit reached leaf instead of a failed server-not-responding run. The PR says real hung/server failures keep precedence, so please run the liveness check before claiming the global timeout or clear the benign flag when the hung check fires.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok for now.

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 2, 2026
Merged via the queue into master with commit 5a647dc Jun 2, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-timeout-not-server-died branch June 2, 2026 05:07
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants