iframe-proxy

Algunenano · 2026-05-13T16:31:09Z

arrayResize ignored max_execution_time and KILL QUERY because the inner fill loop of resizeConstantSize / resizeDynamicSize did not check the cancellation flag. A worst-case input could keep one thread busy for tens of seconds (or minutes under sanitizers) past any timeout.

Adds a static CurrentThread::throwIfQueryCancelled() helper for general reuse and polls it every 1024 iterations inside both resize variants. The existing inline caller in MergeTreeIndexVectorSimilarity migrates to the helper.

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Make arrayResize honor query cancellation (max_execution_time, KILL QUERY).

The inner fill loop of resizeConstantSize / resizeDynamicSize did not poll cancellation, so a single call resizing to ~1e9 elements would keep running long past max_execution_time or KILL QUERY. Add a static CurrentThread::throwIfQueryCancelled helper and poll it every 1024 iterations inside both resize variants. Migrate the existing inline caller in MergeTreeIndexVectorSimilarity to the helper. Stuck-thread symptom seen here: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104823&sha=9df07df5a0a2f94ea2ee509d2fef72b5c1b01a5d&name_0=PR&name_1=Unit%20tests%20%28asan_ubsan%2C%20function_prop_fuzzer%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-13T16:31:47Z

Workflow [PR], commit [24e5892]

Summary: ❌

job_name	test_name	status	info
Fast test		FAIL
	04238_array_resize_cancellation_kill	FAIL	cidb
	04237_array_resize_cancellation	FAIL	cidb
Docs check		DROPPED
Fast test (arm_darwin)		DROPPED
Build (amd_debug)		DROPPED
Build (amd_asan_ubsan)		DROPPED
Build (amd_tsan)		DROPPED
Build (amd_msan)		DROPPED
Build (amd_binary)		DROPPED
Build (arm_debug)		DROPPED
Build (arm_asan_ubsan)		DROPPED

AI Review

Summary

This PR adds CurrentThread::throwIfQueryCancelled and wires cancellation polling into arrayResize and vector-index build loops so long-running CPU-bound sections observe max_execution_time and KILL QUERY. The intent is correct, but there are blocking issues: one concurrency/lifetime hazard in the new LocalConnection callback usage and two broken test expectations that currently disagree with script output.

Findings

❌ Blockers

[src/Client/LocalConnection.cpp:168] The interactive cancel callback is now callable from worker threads via CurrentThread::throwIfQueryCancelled, but the lambda still dereferences this->state and this->query_context. LocalConnection::~LocalConnection can state.reset() immediately after sendCancel, and sendQuery can replace state/query_context for a new query. This allows in-flight callback execution against destroyed or replaced state, which is a memory-safety/correctness risk.
Suggested fix: capture stable per-query objects in the callback (for example a dedicated shared state object and the specific ContextPtr) or clear/synchronize callback invocation before resetting/replacing per-query state.
[tests/queries/0_stateless/04237_array_resize_cancellation.reference:2] The reference still expects elapsed under 20s, while the script now prints elapsed under 10s. The test output cannot match the expected file as written.
Suggested fix: update either the script output string or the reference line so they match exactly.
[tests/queries/0_stateless/04238_array_resize_cancellation_kill.reference:2] The reference still expects cancelled under 20s, while the script now prints cancelled under 10s. This mismatch makes the test fail even when behavior is correct.
Suggested fix: align expected output with script output.

Final Verdict

Status: ❌ Block

Minimum required actions:

Make the LocalConnection interactive-cancel callback safe with respect to query-state/context lifetime under concurrent worker-thread invocation.
Fix the 04237 expected output mismatch (10s vs 20s).
Fix the 04238 expected output mismatch (10s vs 20s).

Algunenano · 2026-05-14T09:08:11Z

There is an issue with this approach which is the same that we had with scalar subqueries: If the cancel packet isn't read (with TCP), then we don't know the query was KILLed. We should unify the behaviour and make it simpler and in a single place if possible.

… time limit directly Two changes to make `CurrentThread::throwIfQueryCancelled` cover the in-function polling cases that the previous version missed: - Invoke `Context::interactive_cancel_callback` from the polling site. Without this, TCP `Cancel` packets sent mid-query (Ctrl-C in clickhouse-client) only reach the server when the pipeline executor next ticks, which for long-running function calls can be many seconds. Throttled to at most once per `interactive_delay` per thread so it doesn't flood the wire. - Use `QueryStatus::checkTimeLimit()` instead of `throwIfKilled()`. The former enforces both `is_killed` and the elapsed-time limit directly, so polling is responsive even when `CancellationChecker` is slow to flip `is_killed`. This is what the previous fast-test failure on PR ClickHouse#104875 hit: the deadline never fired, polling never saw `is_killed`, and the resize ran to natural completion. Adds `tests/queries/0_stateless/04238_array_resize_cancellation_kill.sh` to cover the Ctrl-C / TCP cancel path; the existing `04237_*` test already covers the `max_execution_time` path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rent invocations `CurrentThread::throwIfQueryCancelled` now invokes `interactive_cancel_callback` from arbitrary pipeline worker threads in addition to the executor's wait loop, so the callback can run concurrently from multiple threads. The TCP variant already serializes via `callback_mutex`; the local variant did not, racing on the plain-bool write to `state->is_cancelled` and on the user-supplied `process_progress_callback`. Serialize the callback body with a `try_to_lock` mutex: if another thread is already inside the callback, skip the call (it would do the same work). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 5s threshold turned out to be tight on slower CI machines (fast-test in particular). Without the polling fix this query takes ~60s on a release build (longer with sanitizers and slow VMs), so 20s still asserts the fix is in effect while leaving ample headroom. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-14T15:01:57Z

+
+if (( ELAPSED_MS < 20000 ))
+then
+    echo "cancelled under 20s"


The new script output and the reference file disagree: this script prints cancelled under 20s, but 04238_array_resize_cancellation_kill.reference still expects cancelled under 5s.

As written, the test will fail even when cancellation works correctly. Please either restore the 5s text in the script or update the reference to 20s.

04238_array_resize_cancellation_kill.sh now prints cancelled under 10s, but 04238_array_resize_cancellation_kill.reference still expects cancelled under 20s. As-is, the test output cannot match the reference even when cancellation works. Please update one side so the expected line matches the script.

…lback mutex The previous commit serialized the interactive cancel callback body but `LocalQueryState::is_cancelled` was still touched outside that lock by `LocalConnection::sendCancel`, the destructor, and the pull loop in `pullBlock`. The race is still live as long as the field is a plain `bool`. Switch the field to `std::atomic<bool>` so every read/write is race-free regardless of which thread does the access. The callback-body mutex stays because it serializes the user-supplied `process_progress_callback`, which isn't ours to assume thread-safe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Companion fix to the previous threshold-bump commit: the script prints "cancelled under 20s" but the reference still said "cancelled under 5s", which made the test fail in CI even when cancellation worked correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…he threshold The bugfix-validation job runs the new tests against the PR base (without the polling fix) to confirm they would have caught the bug. With a single `arrayResize(-1e9)` call the natural completion is ~10s on the fastest CI runners, well under the 20s threshold, so the tests passed on the base too and bugfix-validation reported "Failed to reproduce the bug." Use a 3-element `ARRAY JOIN` size vector. The resize then runs three times per query, pushing natural completion well over 20s on every CI machine, while the patched binary still aborts the first call shortly after the deadline so the test finishes quickly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-14T19:53:33Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.10%	84.10%	+0.00%
Functions	92.00%	92.00%	+0.00%
Branches	76.50%	76.60%	+0.10%

Changed lines: 82.61% (38/46) · Uncovered code

Full report · Diff report

The review prompt told the model to pass --repo to every helper that accepts it, but resolve-pr-review-thread and unresolve-pr-review-thread take only --thread-id (a globally unique GraphQL node id that already identifies the repo). Recent runs added --repo ClickHouse/ClickHouse and got "error: unrecognized arguments" from gh.py. Tighten the instruction to use --repo only on the commands that document it and call out the two helpers that reject it. Failing run: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104875/59f9cd17cc1226775a59470916300b9b7bc73a4d/code_review/job.log PR: ClickHouse#104875

Worker threads run multiple queries back-to-back. The previous code kept `last_callback_at` as a plain `thread_local` so query A's recent poll could suppress query B's first cancel check for up to `interactive_delay`. Track the query-context pointer in TLS and treat any change as a reset of the throttle, so each query gets an immediate first cancel poll. Addresses ClickHouse#104875 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…absolute index Previous code used `i % CANCELLATION_CHECK_INTERVAL == 0` where `i` is the absolute array index, starting at `array_size`. When `length - array_size` is smaller than the interval and `array_size % interval != 0` (for example resizing `[x]` to `1001`), the check never fires inside that row's append phase. With many such rows in a single block, the function spends seconds without polling and cancellation only fires at the next between-block check. Switch to `(i - array_size) % CANCELLATION_CHECK_INTERVAL == 0` at all four fill loops, so we poll every 1024 appended elements regardless of starting offset. Same cost, no new state.

clickhouse-gh · 2026-05-18T13:02:17Z

@@ -0,0 +1,2 @@
+got expected timeout exception
+elapsed under 20s


04237_array_resize_cancellation.sh now prints elapsed under 10s, but this reference still expects elapsed under 20s. That makes the test fail even when timeout cancellation behaves correctly. Please align the expected line with the script output.

clickhouse-gh · 2026-05-18T13:03:44Z

+        auto callback_mutex = std::make_shared<std::mutex>();
        query_context->setInteractiveCancelCallback(
-            [this, check_cancelled = is_cancelled_callback, progress_callback = process_progress_callback]() -> bool
+            [this, check_cancelled = is_cancelled_callback, progress_callback = process_progress_callback, callback_mutex]() -> bool


CurrentThread::throwIfQueryCancelled can now invoke this callback from worker threads, but the lambda still dereferences this->state and this->query_context. LocalConnection::~LocalConnection calls state.reset() immediately after sendCancel(), and sendQuery also replaces state/query_context for the next query. That creates a race where an in-flight callback touches destroyed or replaced state (including state->progress) or cancels the wrong query.

Please capture stable per-query objects in the callback (for example a shared state object + the specific ContextPtr) or clear/synchronize the callback before resetting/replacing state.

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 13, 2026

Algunenano mentioned this pull request May 13, 2026

Wire functions stress test to ProcessList for KILL-QUERY-style timeouts #104694

Merged

Algunenano self-assigned this May 14, 2026

Algunenano marked this pull request as draft May 14, 2026 09:06

clickhouse-gh Bot reviewed May 14, 2026

View reviewed changes

Comment thread src/Common/CurrentThread.cpp

Algunenano and others added 2 commits May 14, 2026 16:40

clickhouse-gh Bot reviewed May 14, 2026

View reviewed changes

Algunenano and others added 3 commits May 14, 2026 17:56

clickhouse-gh Bot reviewed May 14, 2026

View reviewed changes

Comment thread src/Common/CurrentThread.cpp

Algunenano mentioned this pull request May 17, 2026

Make arrayResize honor max_execution_time on heavy element types (#104877) #105147

Closed

1 task

groeneai mentioned this pull request May 18, 2026

ThreadPool: wake most recently idle worker thread (LIFO) when scheduling a job to reduce memory fragmentation #100177

Merged

clickhouse-gh Bot reviewed May 18, 2026

View reviewed changes

Comment thread src/Functions/GatherUtils/Algorithms.h Outdated

clickhouse-gh Bot reviewed May 18, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04237_array_resize_cancellation.sh Outdated

clickhouse-gh Bot reviewed May 18, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04238_array_resize_cancellation_kill.sh Outdated

Algunenano added 2 commits May 18, 2026 14:54

Update 04237_array_resize_cancellation.sh

3a1f7e3

Update 04238_array_resize_cancellation_kill.sh

24e5892

clickhouse-gh Bot reviewed May 18, 2026

View reviewed changes

Algunenano mentioned this pull request May 18, 2026

RFC: per-query cancellation event hub; faster, more uniform cancellation across the codebase #104929

Open

Algunenano closed this May 20, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make arrayResize honor query cancellation#104875

Make arrayResize honor query cancellation#104875
Algunenano wants to merge 11 commits into
ClickHouse:masterfrom
Algunenano:fix-array-resize-cancellation

Algunenano commented May 13, 2026

Uh oh!

clickhouse-gh Bot commented May 13, 2026 •

edited

Loading

Uh oh!

Algunenano commented May 14, 2026

Uh oh!

Uh oh!

clickhouse-gh Bot May 14, 2026

Uh oh!

clickhouse-gh Bot May 18, 2026

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 14, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot May 18, 2026

Uh oh!

clickhouse-gh Bot May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -0,0 +1,2 @@
		got expected timeout exception
		elapsed under 20s

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

Algunenano commented May 13, 2026

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Uh oh!

clickhouse-gh Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Findings

Final Verdict

Uh oh!

Algunenano commented May 14, 2026

Uh oh!

Uh oh!

clickhouse-gh Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 14, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

clickhouse-gh Bot commented May 13, 2026 •

edited

Loading