Make arrayResize honor query cancellation#104875
Conversation
The inner fill loop of resizeConstantSize / resizeDynamicSize did not poll cancellation, so a single call resizing to ~1e9 elements would keep running long past max_execution_time or KILL QUERY. Add a static CurrentThread::throwIfQueryCancelled helper and poll it every 1024 iterations inside both resize variants. Migrate the existing inline caller in MergeTreeIndexVectorSimilarity to the helper. Stuck-thread symptom seen here: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104823&sha=9df07df5a0a2f94ea2ee509d2fef72b5c1b01a5d&name_0=PR&name_1=Unit%20tests%20%28asan_ubsan%2C%20function_prop_fuzzer%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
There is an issue with this approach which is the same that we had with scalar subqueries: If the cancel packet isn't read (with TCP), then we don't know the query was KILLed. We should unify the behaviour and make it simpler and in a single place if possible. |
… time limit directly Two changes to make `CurrentThread::throwIfQueryCancelled` cover the in-function polling cases that the previous version missed: - Invoke `Context::interactive_cancel_callback` from the polling site. Without this, TCP `Cancel` packets sent mid-query (Ctrl-C in clickhouse-client) only reach the server when the pipeline executor next ticks, which for long-running function calls can be many seconds. Throttled to at most once per `interactive_delay` per thread so it doesn't flood the wire. - Use `QueryStatus::checkTimeLimit()` instead of `throwIfKilled()`. The former enforces both `is_killed` and the elapsed-time limit directly, so polling is responsive even when `CancellationChecker` is slow to flip `is_killed`. This is what the previous fast-test failure on PR ClickHouse#104875 hit: the deadline never fired, polling never saw `is_killed`, and the resize ran to natural completion. Adds `tests/queries/0_stateless/04238_array_resize_cancellation_kill.sh` to cover the Ctrl-C / TCP cancel path; the existing `04237_*` test already covers the `max_execution_time` path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rent invocations `CurrentThread::throwIfQueryCancelled` now invokes `interactive_cancel_callback` from arbitrary pipeline worker threads in addition to the executor's wait loop, so the callback can run concurrently from multiple threads. The TCP variant already serializes via `callback_mutex`; the local variant did not, racing on the plain-bool write to `state->is_cancelled` and on the user-supplied `process_progress_callback`. Serialize the callback body with a `try_to_lock` mutex: if another thread is already inside the callback, skip the call (it would do the same work). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5s threshold turned out to be tight on slower CI machines (fast-test in particular). Without the polling fix this query takes ~60s on a release build (longer with sanitizers and slow VMs), so 20s still asserts the fix is in effect while leaving ample headroom. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
||
| if (( ELAPSED_MS < 20000 )) | ||
| then | ||
| echo "cancelled under 20s" |
There was a problem hiding this comment.
The new script output and the reference file disagree: this script prints cancelled under 20s, but 04238_array_resize_cancellation_kill.reference still expects cancelled under 5s.
As written, the test will fail even when cancellation works correctly. Please either restore the 5s text in the script or update the reference to 20s.
There was a problem hiding this comment.
04238_array_resize_cancellation_kill.sh now prints cancelled under 10s, but 04238_array_resize_cancellation_kill.reference still expects cancelled under 20s. As-is, the test output cannot match the reference even when cancellation works. Please update one side so the expected line matches the script.
…lback mutex The previous commit serialized the interactive cancel callback body but `LocalQueryState::is_cancelled` was still touched outside that lock by `LocalConnection::sendCancel`, the destructor, and the pull loop in `pullBlock`. The race is still live as long as the field is a plain `bool`. Switch the field to `std::atomic<bool>` so every read/write is race-free regardless of which thread does the access. The callback-body mutex stays because it serializes the user-supplied `process_progress_callback`, which isn't ours to assume thread-safe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion fix to the previous threshold-bump commit: the script prints "cancelled under 20s" but the reference still said "cancelled under 5s", which made the test fail in CI even when cancellation worked correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…he threshold The bugfix-validation job runs the new tests against the PR base (without the polling fix) to confirm they would have caught the bug. With a single `arrayResize(-1e9)` call the natural completion is ~10s on the fastest CI runners, well under the 20s threshold, so the tests passed on the base too and bugfix-validation reported "Failed to reproduce the bug." Use a 3-element `ARRAY JOIN` size vector. The resize then runs three times per query, pushing natural completion well over 20s on every CI machine, while the patched binary still aborts the first call shortly after the deadline so the test finishes quickly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: 82.61% (38/46) · Uncovered code |
The review prompt told the model to pass --repo to every helper that accepts it, but resolve-pr-review-thread and unresolve-pr-review-thread take only --thread-id (a globally unique GraphQL node id that already identifies the repo). Recent runs added --repo ClickHouse/ClickHouse and got "error: unrecognized arguments" from gh.py. Tighten the instruction to use --repo only on the commands that document it and call out the two helpers that reject it. Failing run: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104875/59f9cd17cc1226775a59470916300b9b7bc73a4d/code_review/job.log PR: ClickHouse#104875
Worker threads run multiple queries back-to-back. The previous code kept `last_callback_at` as a plain `thread_local` so query A's recent poll could suppress query B's first cancel check for up to `interactive_delay`. Track the query-context pointer in TLS and treat any change as a reset of the throttle, so each query gets an immediate first cancel poll. Addresses ClickHouse#104875 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…absolute index Previous code used `i % CANCELLATION_CHECK_INTERVAL == 0` where `i` is the absolute array index, starting at `array_size`. When `length - array_size` is smaller than the interval and `array_size % interval != 0` (for example resizing `[x]` to `1001`), the check never fires inside that row's append phase. With many such rows in a single block, the function spends seconds without polling and cancellation only fires at the next between-block check. Switch to `(i - array_size) % CANCELLATION_CHECK_INTERVAL == 0` at all four fill loops, so we poll every 1024 appended elements regardless of starting offset. Same cost, no new state.
| @@ -0,0 +1,2 @@ | |||
| got expected timeout exception | |||
| elapsed under 20s | |||
There was a problem hiding this comment.
04237_array_resize_cancellation.sh now prints elapsed under 10s, but this reference still expects elapsed under 20s. That makes the test fail even when timeout cancellation behaves correctly. Please align the expected line with the script output.
There was a problem hiding this comment.
CurrentThread::throwIfQueryCancelled can now invoke this callback from worker threads, but the lambda still dereferences this->state and this->query_context. LocalConnection::~LocalConnection calls state.reset() immediately after sendCancel(), and sendQuery also replaces state/query_context for the next query. That creates a race where an in-flight callback touches destroyed or replaced state (including state->progress) or cancels the wrong query.
Please capture stable per-query objects in the callback (for example a shared state object + the specific ContextPtr) or clear/synchronize the callback before resetting/replacing state.

arrayResize ignored max_execution_time and KILL QUERY because the inner fill loop of resizeConstantSize / resizeDynamicSize did not check the cancellation flag. A worst-case input could keep one thread busy for tens of seconds (or minutes under sanitizers) past any timeout.
Adds a static
CurrentThread::throwIfQueryCancelled()helper for general reuse and polls it every 1024 iterations inside both resize variants. The existing inline caller inMergeTreeIndexVectorSimilaritymigrates to the helper.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Make
arrayResizehonor query cancellation (max_execution_time,KILL QUERY).