Allow KILL QUERY/MUTATION SELECT to return multiple blocks by groeneai · Pull Request #104927 · ClickHouse/ClickHouse · GitHub
Skip to content

Allow KILL QUERY/MUTATION SELECT to return multiple blocks#104927

Open
groeneai wants to merge 8 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-kill-query-multi-block-104857
Open

Allow KILL QUERY/MUTATION SELECT to return multiple blocks#104927
groeneai wants to merge 8 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-kill-query-multi-block-104857

Conversation

@groeneai

@groeneai groeneai commented May 14, 2026

Copy link
Copy Markdown
Contributor

Fixes the Expected one block from input stream LOGICAL_ERROR (STID 4569-3618) in InterpreterKillQueryQuery::getSelectResult.

The interpreter executes an internal SELECT over system.processes / system.mutations / system.part_moves_between_shards / system.transactions to find which queries / mutations to act on. The old code asserted that the resulting pipeline emits at most one block:

while (res.empty() && executor.pull(res));
while (executor.pull(tmp_block));
if (!tmp_block.empty())
    throw Exception(ErrorCodes::LOGICAL_ERROR, "Expected one block from input stream");

That is wrong — the pipeline can legitimately produce several blocks. The AST fuzzer found a minimal reproducer (originally in @PedroTadim's fiddle https://fiddle.clickhouse.com/468380d6-18b1-4bea-be38-4186d9a9428f):

CREATE TABLE t0 (c0 UInt64, c1 UInt64) ENGINE = MergeTree ORDER BY c0;
INSERT INTO t0 SELECT number, number FROM numbers(5);
SYSTEM STOP MERGES t0;
ALTER TABLE t0 UPDATE c1 = c1 + 1 WHERE 1 SETTINGS mutations_sync = 0;
ALTER TABLE t0 UPDATE c1 = c1 + 2 WHERE 1 SETTINGS mutations_sync = 0;
KILL MUTATION WHERE (SELECT 1 WHERE database = command) OR command LIKE '%UPDATE%' TEST;

The per-row scalar subquery in WHERE causes the planner to emit one block per row of system.mutations. A small max_block_size from CI randomization produces the same shape. Either way the second executor.pull returns a non-empty block and the old code threw.

Fix: collect every produced block via concatenateBlocks from Core/Block.h. The result set is bounded by the size of the corresponding system.* table (small, in-process state), so this stays cheap. materializeBlockInplace is preserved so existing typeid_cast<const ColumnString &> callers in execute remain valid.

A regression test tests/queries/0_stateless/04238_kill_mutation_subquery_multi_block.sql pins max_block_size = 1 and uses the per-row subquery shape, so the multi-block path is exercised deterministically.

Issue: #104857
CI report (vehicle): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104419&sha=c8267c5c7e69922ec96a5837cc51e5bc710bf9cb&name_0=PR&name_1=AST%20fuzzer%20%28amd_tsan%29

Closes #104857

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a LOGICAL_ERROR "Expected one block from input stream" thrown by KILL QUERY / KILL MUTATION / KILL PART_MOVE_TO_SHARD / KILL TRANSACTION when their WHERE clause contains a per-row subquery, or when max_block_size is small enough that the internal SELECT over the relevant system.* table emits more than one block.

Documentation entry for user-facing changes

  • Documentation is unchanged

`InterpreterKillQueryQuery::getSelectResult` executes an internal
`SELECT` over `system.processes`, `system.mutations`,
`system.part_moves_between_shards`, or `system.transactions` to find
the queries / mutations to act on. The old implementation asserted
that the pipeline emits at most one block:

    while (res.empty() && executor.pull(res));
    while (executor.pull(tmp_block));
    if (!tmp_block.empty())
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Expected one block from input stream");

This is wrong: the pipeline can legitimately produce several blocks.
The AST fuzzer found a minimal reproducer (ClickHouse#104857, STID 4569-3618):

    KILL MUTATION
        WHERE (SELECT 1 WHERE database = command) OR command LIKE '%UPDATE%'
        TEST;

The per-row scalar subquery in `WHERE` causes the planner to emit one
block per row of `system.mutations`. A small `max_block_size` from CI
randomization can produce the same shape.

Fix: collect every produced block via `concatenateBlocks`. The result
set is bounded by the size of the corresponding `system.*` table, so
this is cheap. `materializeBlockInplace` is kept so existing
`typeid_cast<const ColumnString &>` callers remain valid.

Closes ClickHouse#104857.
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @KochetovNicolai @alexey-milovidov @PedroTadim — could you review this? Fixes #104857 ( in InterpreterKillQueryQuery::getSelectResult). The internal SELECT over system.* was wrongly asserting at most one block; this PR collects every pulled block via concatenateBlocks and adds a deterministic regression test using @PedroTadim's fiddle scenario.

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label May 14, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [1ee52df]

Summary:

job_name test_name status info comment
Stateless tests (amd_asan_ubsan, distributed plan, parallel, 1/2) FAIL
04256_packed_skip_indices_layout_transitions FAIL cidb
Upgrade check (amd_release) FAIL
Error message in clickhouse-server.log (see upgrade_error_messages.txt) FAIL cidb

AI Review

Summary

The PR fixes InterpreterKillQueryQuery::getSelectResult by collecting every block emitted by the internal system.* query, concatenating them, and then preserving the existing materializeBlockInplace behavior for downstream typeid_cast callers. I did not find a new correctness issue in the shared helper change or in the regression-test shape.

Missing context / blind spots
  • ⚠️ I could not rerun tests/queries/0_stateless/04238_kill_mutation_subquery_multi_block.sh locally because this environment does not have a clickhouse binary available, and the latest Praktika rollup for commit 1ee52dfe467a323c97bbf48f3e2be1da35db58f0 has not published per-job results yet.
Final Verdict

✅ No new findings on the current diff.

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 14, 2026
…idation

The regression test `04238_kill_mutation_subquery_multi_block` reproduces the
`LOGICAL_ERROR` in `InterpreterKillQueryQuery::getSelectResult` via
`${CLICKHOUSE_CLIENT}`, which means the `chassert` from the `Expected one block
from input stream` exception lands inside the long-running test server. The
functional-tests runner's hung-check then detects "server is not responding"
and terminates the runner before any test result is written to
`test_result.txt`. `FTResultsProcessor` only treats the exact strings
`Server died, terminating all processes` or `Server does not respond to
health check` as `server_died`; the runner's actual `Stopping tests,
terminating all processes...` message is not matched, so the run falls into
the `not success_finish` branch with status `ERROR` and an empty
`test_results` list. The `Bugfix validation` inversion loop only flips
`FAIL` -> `OK` (it iterates `test_result.results`), so with an empty list
`has_failure` stays `false` and the framework reports
`Failed to reproduce the bug` even though the bug did fire and left a core
dump.

Switch the reproducer to `clickhouse local` (multi-query) so the abort is
contained to a single short-lived subprocess. The test runner then observes
a normal non-zero exit code (134) and an empty stdout; the diff against
`.reference` produces a regular `FAIL` for this test, which `Bugfix
validation` then correctly inverts to `OK`. The same pattern was used for
`04214_iceberg_alter_drop_column_stale_metadata_cache_104295` in commit
c2f8daf for the identical reason.

Verified locally:
- Binary without the fix (old `InterpreterKillQueryQuery::getSelectResult`
  that throws `Expected one block from input stream`):
  test FAILs with exit 134 and the runner records a normal `FAIL`.
- Binary with the fix (`concatenateBlocks`-based collection): 30/30 passes
  with full CI randomization.

CI report for the failing bugfix-validation run on `629e5c9`:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104927&sha=629e5c96e7c1112c15c3ab21855a6d1895bf6b0f&name_0=PR&name_1=Bugfix+validation+%28functional+tests%29
@groeneai

Copy link
Copy Markdown
Contributor Author

Bugfix-validation fix pushed (commit fcdf526c348).

The previous CI run on 629e5c9 failed Bugfix validation (functional tests) with Failed to reproduce the bug. The reproducer in 04238_kill_mutation_subquery_multi_block.sql exercises a chassert (Expected one block from input stream) which aborts the long-running test server in debug/sanitizer builds. The runner's hung-check then terminates everything before a per-test result is written, so test_result.results is empty, the Bugfix validation inversion loop sees no FAIL to flip to OK, and reports "failed to reproduce" — even though the bug fired and left a core dump.

Same shape as 04214_iceberg_alter_drop_column_stale_metadata_cache_104295 from c2f8daf0f9c last week. The fix is the same: move the reproducer to clickhouse local --multiquery so the SIGABRT is contained to a short-lived subprocess. The runner observes a non-zero exit + empty stdout, the diff against .reference ("ok\n") fails normally, and Bugfix validation correctly inverts the FAIL to OK.

Verified locally:

  • Without the fix (reverted InterpreterKillQueryQuery::getSelectResult to the old Expected one block throw): runner records [ FAIL ], exit code 134 from clickhouse local.
  • With the fix (concatenateBlocks-based collection): 30/30 passes with full CI randomization via tests/clickhouse-test --test-runs 30.

The KILL MUTATION + per-row scalar subquery + max_block_size = 1 shape is unchanged — it's the same reproducer @PedroTadim provided in https://fiddle.clickhouse.com/468380d6-18b1-4bea-be38-4186d9a9428f, just wrapped in ${CLICKHOUSE_LOCAL} --multiquery.

cc @KochetovNicolai @alexey-milovidov @PedroTadim

@PedroTadim

Copy link
Copy Markdown
Member

@groeneai a CI test crashed, look

@groeneai

Copy link
Copy Markdown
Contributor Author

@PedroTadim — the failing Fast test is chronic, unrelated to this PR.

Failure: 02932_refreshable_materialized_views_2 timed out at the system wait view e step:

Reason: Timeout! Killing process group ...
... create materialized view e refresh every 1 second (x Int64) ... select x + sleepEachRow(1) as x from src ...
system wait view e;

Stack shows the client blocked in poll() at ClientBase::receiveResult when SIGTSTP fired — i.e. the test ran out of wall-clock waiting for the refresh to complete on a busy Fast-test runner. This test refreshes every 1s and pages through sleepEachRow(1), so it's timing-sensitive.

Cross-PR check (last 14d, CIDB): 15 distinct PRs hit this exact test name, plus 3 merge-queue staging hits on master — completely unrelated authors and code areas (#76867, #100276, #99384, #100539, #102485, #104172, #104225, #104418, #104440, #104523, #104778, #104902, #98939, #100752, this PR). The pattern matches the closed family-tracker #66961 (02932_refreshable_materialized_views flaky).

No overlap with this PR's changes: PR #104927 touches only src/Interpreters/InterpreterKillQueryQuery.cpp plus the new stateless test 04238_kill_mutation_subquery_multi_block.{sh,reference} — nothing in the refreshable-MV / materialized-view code path.

Verdict: chronic flake on an unrelated test. A Fast-test re-run should clear it. The PR itself (KILL QUERY/MUTATION multi-block fix + bugfix-validation regression) is unchanged at fcdf526c.

groeneai added a commit to groeneai/ClickHouse that referenced this pull request May 15, 2026
…ailure

The reproducer for issue ClickHouse#102705 runs `SELECT ... FROM system.detached_tables
WHERE uuid = ...` against a table that has already been detached. Without
the fix, the planner creates `uuid_column` (because the predicate filters
on `uuid`) but the detached branch of `getFilteredTables` leaves it empty
while `table_column` has one row. The mismatched column lengths trigger
a `LOGICAL_ERROR` which fires `chassert` in debug / sanitizer builds and
aborts the process.

When run via `clickhouse-client` against the long-running test server,
that abort lands inside the server. The runner's hung-check then
terminates the test before any result is recorded, so the `Bugfix
validation` framework cannot invert the result to `OK` and the check
fails on the PR (as observed on PR ClickHouse#104979). The original `.sql` form
of `04240_detached_tables_uuid_filter_102705` had this issue.

Switching the test to `.sh` and running the reproducer in `clickhouse
local` contains the `SIGABRT` to a short-lived subprocess: stdout is
truncated before reaching `SELECT 'ok'`, the runner sees the diff
against the `.reference`, and reports a normal `FAIL`. Bugfix validation
can then invert it to `OK` on the PR branch.

The `.sh` uses the same pattern as `04238_kill_mutation_subquery_multi_block.sh`
(PR ClickHouse#104927) and `01600_detach_permanently.sh` (Atomic database under
`--path` + `${CLICKHOUSE_TEST_UNIQUE_NAME}` working folder).

Verified locally on the rebuilt debug binary:
- with the fix → `1\nok` → matches reference (30 / 30 deterministic).
- without the fix (line 118-119 reverted, rebuilt) → `Aborted (core dumped)`
  with empty stdout → does not match reference → `FAIL`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@PedroTadim

Copy link
Copy Markdown
Member

@groeneai merge with master

@groeneai

Copy link
Copy Markdown
Contributor Author

@PedroTadim master merged in e44c095. CI will re-run automatically.

@yariks5s yariks5s self-assigned this Jun 4, 2026

@yariks5s yariks5s left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is valid, but I have one concern to @groeneai: helper is also used by KILL QUERY, KILL PART_MOVE_TO_SHARD, and KILL TRANSACTION, so those paths still rely on shared-code confidence rather than direct coverage. Let's add a separate PR with test coverage after this PR is merged.

@groeneai

groeneai commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

@yariks5s thanks for the approval. Will open a follow-up PR after this merges, adding direct multi-block coverage for KILL QUERY (system.processes), KILL PART_MOVE_TO_SHARD (system.part_moves_between_shards), and KILL TRANSACTION (system.transactions). Same pattern as 04238: per-row scalar subquery + max_block_size=1.

Session: cron:clickhouse-ci-task-worker:20260604-223800

@yariks5s

yariks5s commented Jun 22, 2026

Copy link
Copy Markdown
Member

@groeneai could you please re-run the failing job? if not please update with master once again

@groeneai

Copy link
Copy Markdown
Contributor Author

@yariks5s merged master in 22159b0a86e, full CI re-triggered. Fork contributors can't re-run individual jobs, so a master merge was the way.

For the record, neither stale failure was PR-caused: the Stateless tests (amd_tsan) hits were 00096_obfuscator_save_load / 00175_obfuscator_schema_inference (the UntrackedMemory obfuscator SIGABRT from #104690, reverted by #106365 after our old base ran), and Unit tests (asan_ubsan) had gtest verdict Approve but the job exited on the post-run dirty-repo leftover-files check. The refreshed base clears both.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI fully finished on HEAD 22159b0. The remaining red checks are pre-existing chronic trunk failures, not caused by this PR (which only touches InterpreterKillQueryQuery.cpp + the 04238_kill_mutation_subquery_multi_block test):

Both are unrelated to the KILL QUERY/MUTATION multi-block fix. Finish Workflow is green; the Mergeable Check/PR reds are just the rollup of the two chronic failures above.

@yariks5s yariks5s added this pull request to the merge queue Jun 25, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 25, 2026
groeneai and others added 2 commits June 25, 2026 17:53
After master changed Blocks to VectorWithMemoryTracking<Block> and
concatenateBlocks to take const Blocks &, the local accumulator in
getSelectResult must use the Blocks typedef instead of a plain
std::vector<Block> so it converts to the new signature. No behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@yariks5s pushed 444a8ee. The merge-queue build broke because master changed the Blocks typedef to VectorWithMemoryTracking<Block> and concatenateBlocks to take const Blocks &; the local accumulator in getSelectResult was a plain std::vector<Block> which no longer converts. Merged master and changed the accumulator to the Blocks typedef (one-line type change, no behavior change, approved logic intact). Built against merged master and the 04238 regression test passes 10/10.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 444a8ee

Every failure below has an owner (a fixing PR). None is PR-caused: this PR's diff touches only src/Interpreters/InterpreterKillQueryQuery.cpp and the 04238 test, neither of which is in any failing code path.

Check / test Reason Owner / fixing PR
Unit tests (asan_ubsan, function_prop_fuzzer) / geohashesInBox Unsupported argument types flaky fuzzer finding (7 unrelated PRs/30d, 0 master) #107063 (ours, open)
Stress test (arm_tsan) / future_error "promise destructed" (STID 2508-38c6) flaky TaskTracker teardown race (8 unrelated PRs/30d) #107383 (ours, open)
Mergeable Check / PR aggregator rollup of the two above
CH Inc sync CH Inc sync (private, not actionable)

Session id: cron:our-pr-ci-monitor:20260625-233000

@alexey-milovidov

Copy link
Copy Markdown
Member

@groeneai, we have fixed these failed unrelated checks. Please update the branch.

@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@alexey-milovidov done, master merged in 1ee52dfe467a323c97bbf48f3e2be1da35db58f0. CI will re-run. The approved fix (concatenateBlocks over all emitted blocks in getSelectResult) is unchanged; 04238_kill_mutation_subquery_multi_block passes 5/5 on the merged binary.

@alexey-milovidov

Copy link
Copy Markdown
Member

@groeneai, fix the test 04256_packed_skip_indices_layout_transitions and the upgrade check.

@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.70% +0.10%

Changed lines: Changed C/C++ lines covered: 11/11 (100.00%) · Uncovered code

Full report · Diff report

@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logical error: Expected one block from input stream (STID: 4569-3618)

5 participants