Wire functions stress test to ProcessList for KILL-QUERY-style timeouts by Algunenano · Pull Request #104694 · ClickHouse/ClickHouse · GitHub
Skip to content

Wire functions stress test to ProcessList for KILL-QUERY-style timeouts#104694

Merged
Algunenano merged 7 commits into
ClickHouse:masterfrom
Algunenano:stress-test-use-process-list
May 15, 2026
Merged

Wire functions stress test to ProcessList for KILL-QUERY-style timeouts#104694
Algunenano merged 7 commits into
ClickHouse:masterfrom
Algunenano:stress-test-use-process-list

Conversation

@Algunenano

@Algunenano Algunenano commented May 12, 2026

Copy link
Copy Markdown
Member

Each iteration of the functions stress test now creates its own QueryStatus via ProcessList::insert with max_execution_time set to a configurable --iteration-timeout (default 15s). The CancellationChecker singleton (started for the test's lifetime) flips QueryStatus::is_killed once the deadline passes. request_shutdown additionally cancels the in-flight iteration through its QueryStatusPtr (cancellation goes through QueryStatus::cancelQuery, which has its own cancel_mutex; we do not touch the worker's Context::process_list_elem from the listener thread).

Iterations whose wall-clock duration exceeds --iteration-too-slow seconds (default 20s, independent of the timeout) are reported as a new problem timeout_not_honored, regardless of how they terminate (success, MEMORY_LIMIT_EXCEEDED, QUERY_WAS_CANCELLED, TIMEOUT_EXCEEDED). The remedy for any flagged function is one of: poll CurrentThread::isQueryCanceled, make the per-iteration work cheaper, or reject the offending input shape inside the function itself.

Other cleanups:

  • Inner-catch MEMORY_LIMIT_EXCEEDED || LOGICAL_ERROR rethrow checks (7 copies) replaced by isOuterGuardError, which also covers QUERY_WAS_CANCELLED and TIMEOUT_EXCEEDED.
  • New S_QUERY_CANCELLED counter for iterations whose function actually observed the cancellation; surfaced in the report as query cancelled by timeout.

Validation runs:

  • --iteration-timeout 0.01 --iteration-too-slow 0.02 --duration 3 --threads 4 (deliberately tiny budget): the test fails with many functions reported under timeout_not_honored. Confirms the wiring is reachable end to end.
  • Default settings (--iteration-timeout 15 --iteration-too-slow 20, --duration 60 --threads 16): test passes; stuck threads: 0; no timeout_not_honored reported.

This lays the groundwork for natural follow-ups: drop the function_arg_constraints cap on arrayResize and add isQueryCanceled polling inside GatherUtils::resizeConstantSize. Once those land, the test will validate them without further test-side changes.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Version info

  • Merged into: 26.5.1.676

Each iteration now creates its own `QueryStatus` via `ProcessList::insert`
with `max_execution_time` set to a configurable `--iteration-timeout`
(default 30s). The `CancellationChecker` singleton (started for the test's
lifetime) flips `QueryStatus::is_killed` when the deadline passes;
`request_shutdown` additionally calls `Context::killCurrentQuery` so any
cancellation-aware code observes the kill immediately.

Iterations that run for more than `2 * iteration_timeout` are reported as
`P_TIMEOUT_NOT_HONORED` regardless of how they terminate (success,
`MEMORY_LIMIT_EXCEEDED`, `QUERY_WAS_CANCELLED`, `TIMEOUT_EXCEEDED`).
The fix in any flagged function is one of: poll
`CurrentThread::isQueryCanceled`, make the per-iteration work cheaper, or
reject the offending input shape inside the function itself.

Other cleanups:
- Inner-catch `MEMORY_LIMIT_EXCEEDED || LOGICAL_ERROR` rethrow checks (7
  copies) replaced by `isOuterGuardError`, which also covers
  `QUERY_WAS_CANCELLED` and `TIMEOUT_EXCEEDED`.
- New `S_QUERY_CANCELLED` counter for iterations whose function actually
  observed the cancellation; surfaced in the report as
  `query cancelled by timeout`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

`catch (...) {}` in `request_shutdown` failed the `catch_all` style check.
Narrow to `catch (Exception &)` (the only thing `killCurrentQuery` can
throw via `getProcessListElement`'s `LOGICAL_ERROR` race) and `LOG_INFO`
the message instead of swallowing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Functions/tests/gtest_functions_stress.cpp Outdated
Algunenano and others added 2 commits May 12, 2026 16:30
`request_shutdown` previously called `t->context->killCurrentQuery()`
from the signal listener thread while worker threads concurrently called
`context->setProcessListElement(...)` on the same `Context`.
`Context::process_list_elem` / `has_process_list_elem` have no
synchronization, so the access was a data race.

Publish the iteration's `QueryStatusPtr` under a dedicated mutex on
`FunctionsStressTestThread` and cancel through it. `QueryStatus::cancelQuery`
takes its own `cancel_mutex`, so the cross-thread call is safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`iteration_timeout_seconds` (default 15s, was 30s) is the per-iteration
`max_execution_time`. `iteration_too_slow_seconds` (default 20s, new) is
the wall-clock threshold above which the iteration is reported as
`P_TIMEOUT_NOT_HONORED`. Decoupling them avoids forcing a 2x multiplier
and lets us shorten the test budget without making the too-slow check
proportionally noisier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Functions/tests/gtest_functions_stress.cpp
`CancellationChecker` is a process-wide singleton. Its `terminateThread`
flips `stop_thread = true` to break the worker out of its loop, but
nothing ever set it back to `false`. After `gtest_functions_stress`
exited, any later test added to the same binary that relied on
`max_execution_time` cancellation would re-enter `workerFunction`,
observe `stop_thread == true` immediately, and exit without doing any
work — silently disabling cancellation.

Clear `stop_thread` right before `workerFunction` returns (still under
the mutex) so the singleton is left in its initial state. The server
starts the worker exactly once during boot, so this is a no-op for
production; it only matters for callers (tests) that own the worker
thread and may want to restart it.

Addresses ClickHouse#104694 (comment)
@Algunenano

Copy link
Copy Markdown
Member Author

Fix for the stress finding in #104875, but they need each other

Comment thread src/Functions/tests/gtest_functions_stress.cpp Outdated
The check is driven by --iteration-too-slow, not by 2x the per-query
max_execution_time. Address review feedback on PR ClickHouse#104694.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Functions/tests/gtest_functions_stress.cpp
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.60% +0.10%

Changed lines: 1.57% (2/127) · Uncovered code

Full report · Diff report

@Algunenano Algunenano added this pull request to the merge queue May 15, 2026
@Algunenano Algunenano removed this pull request from the merge queue due to a manual request May 15, 2026
@Algunenano Algunenano added this pull request to the merge queue May 15, 2026
Merged via the queue into ClickHouse:master with commit 88a1611 May 15, 2026
165 of 167 checks passed
@Algunenano Algunenano deleted the stress-test-use-process-list branch May 15, 2026 14:07
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 15, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member

alexey-milovidov pushed a commit that referenced this pull request May 17, 2026
…replicas

`OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator
and lowers `max_rows_to_group_by` on the query's local context. That
mutation is not propagated to remote replicas, so once CI's
`automatic_parallel_replicas_mode=2` randomization is sampled the
aggregator on the remotes never applies the cap and `OverflowAny` stays
at zero. The test then sees `04229_on   0` instead of
`04229_on   1` and fails 3/3 reruns (every PR run hits the same diff).

Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS`
clauses so the optimization is exercised on the local-coordinator path
it is designed for. Query-level `SETTINGS` takes priority over the
runner's session-level injection, so the pin is robust against
randomization. No `no-*` tag is added (per
`.claude/CLAUDE.md` guidance) and no other randomization is disabled —
`max_threads = 1` was already pinned, the rest stays randomized.

Locally reproduced with a 3-replica `default_parallel_replicas` cluster
on loopback: 5/5 fails without the pin (`04229_on   0`), 20/20 passes
with the pin under the same parallel-replicas session settings.

Cross-PR scope (CIDB, 30d): seen on PRs #104473, #100146, #101033,
#104694, #104966 — all unrelated to the test or its source PR.

Reported by @alexey-milovidov on
#100146.
alexey-milovidov pushed a commit to Onyx2406/ClickHouse that referenced this pull request May 19, 2026
The `Docker server image` job intermittently fails when fetching the
prebuilt `.deb` packages from `clickhouse-builds.s3.amazonaws.com`:

```
ERROR 500: Internal Server Error.
ERROR 503: Service Unavailable.
```

Three unrelated PRs hit this S3 download flake over three days
(`ClickHouse#100173` on 2026-05-15 with `500`, `ClickHouse#104694` on 2026-05-14 with `503`,
`ClickHouse#104853` on 2026-05-13 with `503`). On master in the last 14 days the
master `Docker server image` job has 1 failure of the same shape.

`docker/server/Dockerfile.alpine` already wraps `wget` with a retry
helper (added in ClickHouse#100380 to absorb the same kind of transient errors
during Alpine cross-architecture builds via QEMU). The Ubuntu and
distroless flavours still call `wget ... || exit 1` on the very first
attempt, so a single `500`/`503` from S3 kills the whole build.

This change ports the same `wget_with_retry` wrapper to
`docker/server/Dockerfile.ubuntu` and `docker/server/Dockerfile.distroless`
for the `DIRECT_DOWNLOAD_URLS` block (the path used by CI). `WGET_RETRIES`
(default `5`) and `WGET_RETRY_DELAY` (default `1s`) match the Alpine
version and remain overridable via `--build-arg`.

CI report (PR ClickHouse#100173, sha 3ceac71): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100173&sha=3ceac71a43a5fd4fc1a7859937b23e71b8fe42ae&name_0=PR&name_1=Docker%20server%20image

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants