Make arrayResize honor query cancellation by Algunenano · Pull Request #104875 · ClickHouse/ClickHouse · GitHub
Skip to content

Make arrayResize honor query cancellation#104875

Closed
Algunenano wants to merge 11 commits into
ClickHouse:masterfrom
Algunenano:fix-array-resize-cancellation
Closed

Make arrayResize honor query cancellation#104875
Algunenano wants to merge 11 commits into
ClickHouse:masterfrom
Algunenano:fix-array-resize-cancellation

Conversation

@Algunenano

Copy link
Copy Markdown
Member

arrayResize ignored max_execution_time and KILL QUERY because the inner fill loop of resizeConstantSize / resizeDynamicSize did not check the cancellation flag. A worst-case input could keep one thread busy for tens of seconds (or minutes under sanitizers) past any timeout.

Adds a static CurrentThread::throwIfQueryCancelled() helper for general reuse and polls it every 1024 iterations inside both resize variants. The existing inline caller in MergeTreeIndexVectorSimilarity migrates to the helper.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Make arrayResize honor query cancellation (max_execution_time, KILL QUERY).

The inner fill loop of resizeConstantSize / resizeDynamicSize did not
poll cancellation, so a single call resizing to ~1e9 elements would
keep running long past max_execution_time or KILL QUERY. Add a static
CurrentThread::throwIfQueryCancelled helper and poll it every 1024
iterations inside both resize variants. Migrate the existing inline
caller in MergeTreeIndexVectorSimilarity to the helper.

Stuck-thread symptom seen here:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104823&sha=9df07df5a0a2f94ea2ee509d2fef72b5c1b01a5d&name_0=PR&name_1=Unit%20tests%20%28asan_ubsan%2C%20function_prop_fuzzer%29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 13, 2026
@Algunenano Algunenano self-assigned this May 14, 2026
@Algunenano Algunenano marked this pull request as draft May 14, 2026 09:06
@Algunenano

Copy link
Copy Markdown
Member Author

There is an issue with this approach which is the same that we had with scalar subqueries: If the cancel packet isn't read (with TCP), then we don't know the query was KILLed. We should unify the behaviour and make it simpler and in a single place if possible.

… time limit directly

Two changes to make `CurrentThread::throwIfQueryCancelled` cover the in-function
polling cases that the previous version missed:

- Invoke `Context::interactive_cancel_callback` from the polling site. Without
  this, TCP `Cancel` packets sent mid-query (Ctrl-C in clickhouse-client) only
  reach the server when the pipeline executor next ticks, which for
  long-running function calls can be many seconds. Throttled to at most once
  per `interactive_delay` per thread so it doesn't flood the wire.
- Use `QueryStatus::checkTimeLimit()` instead of `throwIfKilled()`. The former
  enforces both `is_killed` and the elapsed-time limit directly, so polling is
  responsive even when `CancellationChecker` is slow to flip `is_killed`. This
  is what the previous fast-test failure on PR ClickHouse#104875 hit: the deadline never
  fired, polling never saw `is_killed`, and the resize ran to natural
  completion.

Adds `tests/queries/0_stateless/04238_array_resize_cancellation_kill.sh` to
cover the Ctrl-C / TCP cancel path; the existing `04237_*` test already covers
the `max_execution_time` path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/CurrentThread.cpp
Algunenano and others added 2 commits May 14, 2026 16:40
…rent invocations

`CurrentThread::throwIfQueryCancelled` now invokes `interactive_cancel_callback`
from arbitrary pipeline worker threads in addition to the executor's wait loop,
so the callback can run concurrently from multiple threads. The TCP variant
already serializes via `callback_mutex`; the local variant did not, racing on
the plain-bool write to `state->is_cancelled` and on the user-supplied
`process_progress_callback`.

Serialize the callback body with a `try_to_lock` mutex: if another thread is
already inside the callback, skip the call (it would do the same work).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5s threshold turned out to be tight on slower CI machines (fast-test in
particular). Without the polling fix this query takes ~60s on a release build
(longer with sanitizers and slow VMs), so 20s still asserts the fix is in
effect while leaving ample headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

if (( ELAPSED_MS < 20000 ))
then
echo "cancelled under 20s"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new script output and the reference file disagree: this script prints cancelled under 20s, but 04238_array_resize_cancellation_kill.reference still expects cancelled under 5s.

As written, the test will fail even when cancellation works correctly. Please either restore the 5s text in the script or update the reference to 20s.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

04238_array_resize_cancellation_kill.sh now prints cancelled under 10s, but 04238_array_resize_cancellation_kill.reference still expects cancelled under 20s. As-is, the test output cannot match the reference even when cancellation works. Please update one side so the expected line matches the script.

Algunenano and others added 3 commits May 14, 2026 17:56
…lback mutex

The previous commit serialized the interactive cancel callback body but
`LocalQueryState::is_cancelled` was still touched outside that lock by
`LocalConnection::sendCancel`, the destructor, and the pull loop in
`pullBlock`. The race is still live as long as the field is a plain `bool`.

Switch the field to `std::atomic<bool>` so every read/write is race-free
regardless of which thread does the access. The callback-body mutex stays
because it serializes the user-supplied `process_progress_callback`, which
isn't ours to assume thread-safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion fix to the previous threshold-bump commit: the script prints
"cancelled under 20s" but the reference still said "cancelled under 5s",
which made the test fail in CI even when cancellation worked correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…he threshold

The bugfix-validation job runs the new tests against the PR base (without
the polling fix) to confirm they would have caught the bug. With a single
`arrayResize(-1e9)` call the natural completion is ~10s on the fastest CI
runners, well under the 20s threshold, so the tests passed on the base too
and bugfix-validation reported "Failed to reproduce the bug."

Use a 3-element `ARRAY JOIN` size vector. The resize then runs three times
per query, pushing natural completion well over 20s on every CI machine,
while the patched binary still aborts the first call shortly after the
deadline so the test finishes quickly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/CurrentThread.cpp
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 92.00% 92.00% +0.00%
Branches 76.50% 76.60% +0.10%

Changed lines: 82.61% (38/46) · Uncovered code

Full report · Diff report

Algunenano added a commit to Algunenano/ClickHouse that referenced this pull request May 15, 2026
The review prompt told the model to pass --repo to every helper that
accepts it, but resolve-pr-review-thread and unresolve-pr-review-thread
take only --thread-id (a globally unique GraphQL node id that already
identifies the repo). Recent runs added --repo ClickHouse/ClickHouse and
got "error: unrecognized arguments" from gh.py.

Tighten the instruction to use --repo only on the commands that document
it and call out the two helpers that reject it.

Failing run: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104875/59f9cd17cc1226775a59470916300b9b7bc73a4d/code_review/job.log
PR: ClickHouse#104875
Worker threads run multiple queries back-to-back. The previous code kept
`last_callback_at` as a plain `thread_local` so query A's recent poll
could suppress query B's first cancel check for up to `interactive_delay`.

Track the query-context pointer in TLS and treat any change as a reset
of the throttle, so each query gets an immediate first cancel poll.

Addresses ClickHouse#104875 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Functions/GatherUtils/Algorithms.h Outdated
…absolute index

Previous code used `i % CANCELLATION_CHECK_INTERVAL == 0` where `i` is the
absolute array index, starting at `array_size`. When `length - array_size`
is smaller than the interval and `array_size % interval != 0` (for example
resizing `[x]` to `1001`), the check never fires inside that row's append
phase. With many such rows in a single block, the function spends seconds
without polling and cancellation only fires at the next between-block
check.

Switch to `(i - array_size) % CANCELLATION_CHECK_INTERVAL == 0` at all four
fill loops, so we poll every 1024 appended elements regardless of starting
offset. Same cost, no new state.
Comment thread tests/queries/0_stateless/04237_array_resize_cancellation.sh Outdated
Comment thread tests/queries/0_stateless/04238_array_resize_cancellation_kill.sh Outdated
@@ -0,0 +1,2 @@
got expected timeout exception
elapsed under 20s

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

04237_array_resize_cancellation.sh now prints elapsed under 10s, but this reference still expects elapsed under 20s. That makes the test fail even when timeout cancellation behaves correctly. Please align the expected line with the script output.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CurrentThread::throwIfQueryCancelled can now invoke this callback from worker threads, but the lambda still dereferences this->state and this->query_context. LocalConnection::~LocalConnection calls state.reset() immediately after sendCancel(), and sendQuery also replaces state/query_context for the next query. That creates a race where an in-flight callback touches destroyed or replaced state (including state->progress) or cancels the wrong query.

Please capture stable per-query objects in the callback (for example a shared state object + the specific ContextPtr) or clear/synchronize the callback before resetting/replacing state.

@Algunenano Algunenano closed this May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant