iframe-proxy

Algunenano · 2026-05-12T11:37:35Z

Each iteration of the functions stress test now creates its own QueryStatus via ProcessList::insert with max_execution_time set to a configurable --iteration-timeout (default 15s). The CancellationChecker singleton (started for the test's lifetime) flips QueryStatus::is_killed once the deadline passes. request_shutdown additionally cancels the in-flight iteration through its QueryStatusPtr (cancellation goes through QueryStatus::cancelQuery, which has its own cancel_mutex; we do not touch the worker's Context::process_list_elem from the listener thread).

Iterations whose wall-clock duration exceeds --iteration-too-slow seconds (default 20s, independent of the timeout) are reported as a new problem timeout_not_honored, regardless of how they terminate (success, MEMORY_LIMIT_EXCEEDED, QUERY_WAS_CANCELLED, TIMEOUT_EXCEEDED). The remedy for any flagged function is one of: poll CurrentThread::isQueryCanceled, make the per-iteration work cheaper, or reject the offending input shape inside the function itself.

Other cleanups:

Inner-catch MEMORY_LIMIT_EXCEEDED || LOGICAL_ERROR rethrow checks (7 copies) replaced by isOuterGuardError, which also covers QUERY_WAS_CANCELLED and TIMEOUT_EXCEEDED.
New S_QUERY_CANCELLED counter for iterations whose function actually observed the cancellation; surfaced in the report as query cancelled by timeout.

Validation runs:

--iteration-timeout 0.01 --iteration-too-slow 0.02 --duration 3 --threads 4 (deliberately tiny budget): the test fails with many functions reported under timeout_not_honored. Confirms the wiring is reachable end to end.
Default settings (--iteration-timeout 15 --iteration-too-slow 20, --duration 60 --threads 16): test passes; stuck threads: 0; no timeout_not_honored reported.

This lays the groundwork for natural follow-ups: drop the function_arg_constraints cap on arrayResize and add isQueryCanceled polling inside GatherUtils::resizeConstantSize. Once those land, the test will validate them without further test-side changes.

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Version info

Merged into: 26.5.1.676

Each iteration now creates its own `QueryStatus` via `ProcessList::insert` with `max_execution_time` set to a configurable `--iteration-timeout` (default 30s). The `CancellationChecker` singleton (started for the test's lifetime) flips `QueryStatus::is_killed` when the deadline passes; `request_shutdown` additionally calls `Context::killCurrentQuery` so any cancellation-aware code observes the kill immediately. Iterations that run for more than `2 * iteration_timeout` are reported as `P_TIMEOUT_NOT_HONORED` regardless of how they terminate (success, `MEMORY_LIMIT_EXCEEDED`, `QUERY_WAS_CANCELLED`, `TIMEOUT_EXCEEDED`). The fix in any flagged function is one of: poll `CurrentThread::isQueryCanceled`, make the per-iteration work cheaper, or reject the offending input shape inside the function itself. Other cleanups: - Inner-catch `MEMORY_LIMIT_EXCEEDED || LOGICAL_ERROR` rethrow checks (7 copies) replaced by `isOuterGuardError`, which also covers `QUERY_WAS_CANCELLED` and `TIMEOUT_EXCEEDED`. - New `S_QUERY_CANCELLED` counter for iterations whose function actually observed the cancellation; surfaced in the report as `query cancelled by timeout`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-12T11:38:10Z

`catch (...) {}` in `request_shutdown` failed the `catch_all` style check. Narrow to `catch (Exception &)` (the only thing `killCurrentQuery` can throw via `getProcessListElement`'s `LOGICAL_ERROR` race) and `LOG_INFO` the message instead of swallowing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`request_shutdown` previously called `t->context->killCurrentQuery()` from the signal listener thread while worker threads concurrently called `context->setProcessListElement(...)` on the same `Context`. `Context::process_list_elem` / `has_process_list_elem` have no synchronization, so the access was a data race. Publish the iteration's `QueryStatusPtr` under a dedicated mutex on `FunctionsStressTestThread` and cancel through it. `QueryStatus::cancelQuery` takes its own `cancel_mutex`, so the cross-thread call is safe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`iteration_timeout_seconds` (default 15s, was 30s) is the per-iteration `max_execution_time`. `iteration_too_slow_seconds` (default 20s, new) is the wall-clock threshold above which the iteration is reported as `P_TIMEOUT_NOT_HONORED`. Decoupling them avoids forcing a 2x multiplier and lets us shorten the test budget without making the too-slow check proportionally noisier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`CancellationChecker` is a process-wide singleton. Its `terminateThread` flips `stop_thread = true` to break the worker out of its loop, but nothing ever set it back to `false`. After `gtest_functions_stress` exited, any later test added to the same binary that relied on `max_execution_time` cancellation would re-enter `workerFunction`, observe `stop_thread == true` immediately, and exit without doing any work — silently disabling cancellation. Clear `stop_thread` right before `workerFunction` returns (still under the mutex) so the singleton is left in its initial state. The server starts the worker exactly once during boot, so this is a no-op for production; it only matters for callers (tests) that own the worker thread and may want to restart it. Addresses ClickHouse#104694 (comment)

Algunenano · 2026-05-13T16:32:31Z

Fix for the stress finding in #104875, but they need each other

The check is driven by --iteration-too-slow, not by 2x the per-query max_execution_time. Address review feedback on PR ClickHouse#104694. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-15T12:09:40Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.10%	84.10%	+0.00%
Functions	90.90%	90.90%	+0.00%
Branches	76.50%	76.60%	+0.10%

Changed lines: 1.57% (2/127) · Uncovered code

Full report · Diff report

alexey-milovidov · 2026-05-17T06:18:46Z

@alexey-milovidov

…replicas `OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator and lowers `max_rows_to_group_by` on the query's local context. That mutation is not propagated to remote replicas, so once CI's `automatic_parallel_replicas_mode=2` randomization is sampled the aggregator on the remotes never applies the cap and `OverflowAny` stays at zero. The test then sees `04229_on 0` instead of `04229_on 1` and fails 3/3 reruns (every PR run hits the same diff). Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS` clauses so the optimization is exercised on the local-coordinator path it is designed for. Query-level `SETTINGS` takes priority over the runner's session-level injection, so the pin is robust against randomization. No `no-*` tag is added (per `.claude/CLAUDE.md` guidance) and no other randomization is disabled — `max_threads = 1` was already pinned, the rest stays randomized. Locally reproduced with a 3-replica `default_parallel_replicas` cluster on loopback: 5/5 fails without the pin (`04229_on 0`), 20/20 passes with the pin under the same parallel-replicas session settings. Cross-PR scope (CIDB, 30d): seen on PRs #104473, #100146, #101033, #104694, #104966 — all unrelated to the test or its source PR. Reported by @alexey-milovidov on #100146.

The `Docker server image` job intermittently fails when fetching the prebuilt `.deb` packages from `clickhouse-builds.s3.amazonaws.com`: ``` ERROR 500: Internal Server Error. ERROR 503: Service Unavailable. ``` Three unrelated PRs hit this S3 download flake over three days (`ClickHouse#100173` on 2026-05-15 with `500`, `ClickHouse#104694` on 2026-05-14 with `503`, `ClickHouse#104853` on 2026-05-13 with `503`). On master in the last 14 days the master `Docker server image` job has 1 failure of the same shape. `docker/server/Dockerfile.alpine` already wraps `wget` with a retry helper (added in ClickHouse#100380 to absorb the same kind of transient errors during Alpine cross-architecture builds via QEMU). The Ubuntu and distroless flavours still call `wget ... || exit 1` on the very first attempt, so a single `500`/`503` from S3 kills the whole build. This change ports the same `wget_with_retry` wrapper to `docker/server/Dockerfile.ubuntu` and `docker/server/Dockerfile.distroless` for the `DIRECT_DOWNLOAD_URLS` block (the path used by CI). `WGET_RETRIES` (default `5`) and `WGET_RETRY_DELAY` (default `1s`) match the Alpine version and remain overridable via `--build-arg`. CI report (PR ClickHouse#100173, sha 3ceac71): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100173&sha=3ceac71a43a5fd4fc1a7859937b23e71b8fe42ae&name_0=PR&name_1=Docker%20server%20image Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

clickhouse-gh Bot added the pr-ci label May 12, 2026

Algunenano mentioned this pull request May 12, 2026

Cap arrayResize size argument in functions stress test #104254

Merged

1 task

clickhouse-gh Bot reviewed May 12, 2026

View reviewed changes

Comment thread src/Functions/tests/gtest_functions_stress.cpp Outdated

Algunenano and others added 2 commits May 12, 2026 16:30

clickhouse-gh Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/Functions/tests/gtest_functions_stress.cpp

alexey-milovidov mentioned this pull request May 13, 2026

ThreadPool: wake most recently idle worker thread (LIFO) when scheduling a job to reduce memory fragmentation #100177

Merged

alexey-milovidov mentioned this pull request May 13, 2026

Enable clang-tidy check for uninitialized variables #100399

Merged

clickhouse-gh Bot reviewed May 14, 2026

View reviewed changes

Comment thread src/Functions/tests/gtest_functions_stress.cpp Outdated

gtest_functions_stress: fix timeout_not_honored description text

14c4595

The check is driven by --iteration-too-slow, not by 2x the per-query max_execution_time. Address review feedback on PR ClickHouse#104694. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

antonio2368 self-assigned this May 14, 2026

alexey-milovidov mentioned this pull request May 15, 2026

Fix distributed task iterator not initialized exception with parallel replicas #100146

Merged

1 task

groeneai mentioned this pull request May 15, 2026

Stabilize 04229_trivial_group_by_limit_profile_events under parallel replicas #104996

Merged

antonio2368 approved these changes May 15, 2026

View reviewed changes

Comment thread src/Functions/tests/gtest_functions_stress.cpp

gtest_functions_stress: validate iteration-too-slow > iteration-timeout

d7c8701

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Algunenano added this pull request to the merge queue May 15, 2026

Algunenano removed this pull request from the merge queue due to a manual request May 15, 2026

Algunenano added this pull request to the merge queue May 15, 2026

Merged via the queue into ClickHouse:master with commit 88a1611 May 15, 2026
165 of 167 checks passed

Algunenano deleted the stress-test-use-process-list branch May 15, 2026 14:07

robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 15, 2026

This was referenced May 17, 2026

Retry wget downloads in Ubuntu and distroless Dockerfiles #105139

Merged

Add array_to_string as alias of arrayStringConcat #105121

Merged

alexey-milovidov mentioned this pull request May 17, 2026

Revert "Wire functions stress test to ProcessList for KILL-QUERY-style timeouts" #105143

Closed

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

This was referenced May 18, 2026

Add regression test for empty-pipe reverse-order read (STID 2635-3a9e) #105132

Merged

ORC/Parquet/Arrow: warn when URL doesn't support Range Byte offsets causing the whole file to be read. #96988

Merged

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Wire functions stress test to ProcessList for KILL-QUERY-style timeouts#104694

Wire functions stress test to ProcessList for KILL-QUERY-style timeouts#104694
Algunenano merged 7 commits into
ClickHouse:masterfrom
Algunenano:stress-test-use-process-list

Algunenano commented May 12, 2026 •

edited by robot-clickhouse

Loading

Uh oh!

clickhouse-gh Bot commented May 12, 2026 •

edited by Algunenano

Loading

Uh oh!

Uh oh!

Uh oh!

Algunenano commented May 13, 2026

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 15, 2026

Uh oh!

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

Algunenano commented May 12, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Version info

Uh oh!

clickhouse-gh Bot commented May 12, 2026 • edited by Algunenano Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

Uh oh!

Algunenano commented May 13, 2026

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 15, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Algunenano commented May 12, 2026 •

edited by robot-clickhouse

Loading

clickhouse-gh Bot commented May 12, 2026 •

edited by Algunenano

Loading