{{ message }}
Wire functions stress test to ProcessList for KILL-QUERY-style timeouts#104694
Merged
Algunenano merged 7 commits intoMay 15, 2026
Merged
Conversation
Each iteration now creates its own `QueryStatus` via `ProcessList::insert` with `max_execution_time` set to a configurable `--iteration-timeout` (default 30s). The `CancellationChecker` singleton (started for the test's lifetime) flips `QueryStatus::is_killed` when the deadline passes; `request_shutdown` additionally calls `Context::killCurrentQuery` so any cancellation-aware code observes the kill immediately. Iterations that run for more than `2 * iteration_timeout` are reported as `P_TIMEOUT_NOT_HONORED` regardless of how they terminate (success, `MEMORY_LIMIT_EXCEEDED`, `QUERY_WAS_CANCELLED`, `TIMEOUT_EXCEEDED`). The fix in any flagged function is one of: poll `CurrentThread::isQueryCanceled`, make the per-iteration work cheaper, or reject the offending input shape inside the function itself. Other cleanups: - Inner-catch `MEMORY_LIMIT_EXCEEDED || LOGICAL_ERROR` rethrow checks (7 copies) replaced by `isOuterGuardError`, which also covers `QUERY_WAS_CANCELLED` and `TIMEOUT_EXCEEDED`. - New `S_QUERY_CANCELLED` counter for iterations whose function actually observed the cancellation; surfaced in the report as `query cancelled by timeout`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
1 task
`catch (...) {}` in `request_shutdown` failed the `catch_all` style check.
Narrow to `catch (Exception &)` (the only thing `killCurrentQuery` can
throw via `getProcessListElement`'s `LOGICAL_ERROR` race) and `LOG_INFO`
the message instead of swallowing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`request_shutdown` previously called `t->context->killCurrentQuery()` from the signal listener thread while worker threads concurrently called `context->setProcessListElement(...)` on the same `Context`. `Context::process_list_elem` / `has_process_list_elem` have no synchronization, so the access was a data race. Publish the iteration's `QueryStatusPtr` under a dedicated mutex on `FunctionsStressTestThread` and cancel through it. `QueryStatus::cancelQuery` takes its own `cancel_mutex`, so the cross-thread call is safe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`iteration_timeout_seconds` (default 15s, was 30s) is the per-iteration `max_execution_time`. `iteration_too_slow_seconds` (default 20s, new) is the wall-clock threshold above which the iteration is reported as `P_TIMEOUT_NOT_HONORED`. Decoupling them avoids forcing a 2x multiplier and lets us shorten the test budget without making the too-slow check proportionally noisier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`CancellationChecker` is a process-wide singleton. Its `terminateThread` flips `stop_thread = true` to break the worker out of its loop, but nothing ever set it back to `false`. After `gtest_functions_stress` exited, any later test added to the same binary that relied on `max_execution_time` cancellation would re-enter `workerFunction`, observe `stop_thread == true` immediately, and exit without doing any work — silently disabling cancellation. Clear `stop_thread` right before `workerFunction` returns (still under the mutex) so the singleton is left in its initial state. The server starts the worker exactly once during boot, so this is a no-op for production; it only matters for callers (tests) that own the worker thread and may want to restart it. Addresses ClickHouse#104694 (comment)
Member
Author
|
Fix for the stress finding in #104875, but they need each other |
The check is driven by --iteration-too-slow, not by 2x the per-query max_execution_time. Address review feedback on PR ClickHouse#104694. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 task
antonio2368
approved these changes
May 15, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
LLVM Coverage Report
Changed lines: 1.57% (2/127) · Uncovered code |
Merged
via the queue into
ClickHouse:master
with commit May 15, 2026
88a1611
165 of 167 checks passed
This was referenced May 17, 2026
Member
1 task
This was referenced May 17, 2026
alexey-milovidov
pushed a commit
that referenced
this pull request
May 17, 2026
…replicas `OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator and lowers `max_rows_to_group_by` on the query's local context. That mutation is not propagated to remote replicas, so once CI's `automatic_parallel_replicas_mode=2` randomization is sampled the aggregator on the remotes never applies the cap and `OverflowAny` stays at zero. The test then sees `04229_on 0` instead of `04229_on 1` and fails 3/3 reruns (every PR run hits the same diff). Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS` clauses so the optimization is exercised on the local-coordinator path it is designed for. Query-level `SETTINGS` takes priority over the runner's session-level injection, so the pin is robust against randomization. No `no-*` tag is added (per `.claude/CLAUDE.md` guidance) and no other randomization is disabled — `max_threads = 1` was already pinned, the rest stays randomized. Locally reproduced with a 3-replica `default_parallel_replicas` cluster on loopback: 5/5 fails without the pin (`04229_on 0`), 20/20 passes with the pin under the same parallel-replicas session settings. Cross-PR scope (CIDB, 30d): seen on PRs #104473, #100146, #101033, #104694, #104966 — all unrelated to the test or its source PR. Reported by @alexey-milovidov on #100146.
This was referenced May 18, 2026
alexey-milovidov
pushed a commit
to Onyx2406/ClickHouse
that referenced
this pull request
May 19, 2026
The `Docker server image` job intermittently fails when fetching the prebuilt `.deb` packages from `clickhouse-builds.s3.amazonaws.com`: ``` ERROR 500: Internal Server Error. ERROR 503: Service Unavailable. ``` Three unrelated PRs hit this S3 download flake over three days (`ClickHouse#100173` on 2026-05-15 with `500`, `ClickHouse#104694` on 2026-05-14 with `503`, `ClickHouse#104853` on 2026-05-13 with `503`). On master in the last 14 days the master `Docker server image` job has 1 failure of the same shape. `docker/server/Dockerfile.alpine` already wraps `wget` with a retry helper (added in ClickHouse#100380 to absorb the same kind of transient errors during Alpine cross-architecture builds via QEMU). The Ubuntu and distroless flavours still call `wget ... || exit 1` on the very first attempt, so a single `500`/`503` from S3 kills the whole build. This change ports the same `wget_with_retry` wrapper to `docker/server/Dockerfile.ubuntu` and `docker/server/Dockerfile.distroless` for the `DIRECT_DOWNLOAD_URLS` block (the path used by CI). `WGET_RETRIES` (default `5`) and `WGET_RETRY_DELAY` (default `1s`) match the Alpine version and remain overridable via `--build-arg`. CI report (PR ClickHouse#100173, sha 3ceac71): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100173&sha=3ceac71a43a5fd4fc1a7859937b23e71b8fe42ae&name_0=PR&name_1=Docker%20server%20image Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Each iteration of the functions stress test now creates its own
QueryStatusviaProcessList::insertwithmax_execution_timeset to a configurable--iteration-timeout(default 15s). TheCancellationCheckersingleton (started for the test's lifetime) flipsQueryStatus::is_killedonce the deadline passes.request_shutdownadditionally cancels the in-flight iteration through itsQueryStatusPtr(cancellation goes throughQueryStatus::cancelQuery, which has its owncancel_mutex; we do not touch the worker'sContext::process_list_elemfrom the listener thread).Iterations whose wall-clock duration exceeds
--iteration-too-slowseconds (default 20s, independent of the timeout) are reported as a new problemtimeout_not_honored, regardless of how they terminate (success,MEMORY_LIMIT_EXCEEDED,QUERY_WAS_CANCELLED,TIMEOUT_EXCEEDED). The remedy for any flagged function is one of: pollCurrentThread::isQueryCanceled, make the per-iteration work cheaper, or reject the offending input shape inside the function itself.Other cleanups:
MEMORY_LIMIT_EXCEEDED || LOGICAL_ERRORrethrow checks (7 copies) replaced byisOuterGuardError, which also coversQUERY_WAS_CANCELLEDandTIMEOUT_EXCEEDED.S_QUERY_CANCELLEDcounter for iterations whose function actually observed the cancellation; surfaced in the report asquery cancelled by timeout.Validation runs:
--iteration-timeout 0.01 --iteration-too-slow 0.02 --duration 3 --threads 4(deliberately tiny budget): the test fails with many functions reported undertimeout_not_honored. Confirms the wiring is reachable end to end.--iteration-timeout 15 --iteration-too-slow 20,--duration 60 --threads 16): test passes;stuck threads: 0; notimeout_not_honoredreported.This lays the groundwork for natural follow-ups: drop the
function_arg_constraintscap onarrayResizeand addisQueryCanceledpolling insideGatherUtils::resizeConstantSize. Once those land, the test will validate them without further test-side changes.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Version info
26.5.1.676