Fix exception on IN tuple() against Distributed sharded table#104966
Conversation
The new-analyzer rewriter `OptimizeShardingKeyRewriteIn::enterImpl` in
`OptimizeShardingKeyRewriteInVisitor.cpp` calls `tuple.back()` on an
empty `std::vector` when the constant on the right side of `IN` is an
empty tuple (`IN tuple()` / `IN ()`). The AST-side rewriter directly
above guards this with `tuple.size() > 1` and is safe; the analyzer
side did not.
The AST fuzzer reduced its way to:
SELECT (2 IN tuple(-2)) FROM dist_01756
WHERE dummy IN (0, 2)
GROUP BY ..., (dummy IN tuple()) AND ...
with `dist_01756` sharded on `intHash64(dummy)`, so the inner
`(dummy IN tuple())` reaches the analyzer-side rewriter and trips UB
on the empty `Tuple::back()` — observed as `Received signal 6` /
abort in libc++'s vector hardening.
`IN ()` is a degenerate query (always false) and there's nothing
useful to rewrite, so just bail out early when the tuple is empty,
matching the AST-side behavior.
Added regression test `04238_shardkey_rewrite_in_empty_tuple.sql`
that runs the previously-crashing pattern against
`test_cluster_two_shards` with `optimize_skip_unused_shards_rewrite_in`
enabled.
Found by the AST fuzzer running against PR #104478:
https://s3.amazonaws.com/clickhouse-test-reports/PRs/104478/d46a8dce38bd41d0bc0b55715ff277fdce51fba3/ast_fuzzer_amd_debug_targeted/
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version of the test put the empty `tuple()` in `WHERE` (and in
a `SELECT`-only position with no shard-pruning WHERE), so:
* `WHERE dummy IN tuple()` is always false; `optimize_skip_unused_shards`
fully prunes the cluster before the rewriter runs (`shards > 1` precondition
fails in `ClusterProxy::executeQuery`), and the empty-tuple branch in
`OptimizeShardingKeyRewriteIn::enterImpl` is never reached.
* `SELECT (dummy IN tuple()) FROM dist` with no WHERE filter does not set
`query_info.optimized_cluster`, so again the rewriter is skipped.
As a result the test reported `[ OK ]` on the master-HEAD release binary used
by the bugfix-validation job, so the validation framework reported
"Failed to reproduce the bug" and marked the check red.
Switch the trigger to the actual fuzz-reduced pattern (PR #104478,
`fuzzer.log.zst`, query around `Fuzzing step 716`):
SELECT (2 IN tuple(-2)) FROM dist
WHERE dummy IN (0, 2)
GROUP BY (dummy IN tuple()) AND materialize(-2147483649) AND (NULL GLOBAL IN (*))
FORMAT Null;
The non-empty `WHERE dummy IN (0, 2)` keeps `optimized_cluster` populated with
two shards, so the rewriter runs and visits the empty `(dummy IN tuple())`
expression in `GROUP BY`. Verified locally:
* master HEAD without the fix: aborts in
`contrib/llvm-project/libcxx/include/__vector/vector.h:444:
libc++ Hardening assertion !empty() failed: back() called on an empty vector`
(exit code 134).
* With the fix: query completes cleanly.
Bugfix-validation report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=95b2f912df1456d5d0b5d09d431882e3b9855fc3&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…yzer + DatabaseReplicated path also passes
The previous query carried the entire fuzz-reduced expression including
`materialize(-2147483649) AND (NULL GLOBAL IN (*))`. The `NULL GLOBAL IN (*)`
fragment is not required to reproduce the empty-tuple crash in
`OptimizeShardingKeyRewriteIn` (the analyzer-side rewriter visits the
inner `(dummy IN tuple())` regardless of what it is AND'd with). It does
trip the old analyzer's `GlobalSubqueriesMatcher` under the combined
`old analyzer + DatabaseReplicated + s3 storage` stateless run with
"Table <db>.dummy does not exist", because `*` expansion in `GLOBAL IN (*)`
collides with the temporary GLOBAL IN data table in that path.
Drop the noise and keep the minimal trigger:
SELECT count() FROM dist_04238
WHERE dummy IN (0, 2)
GROUP BY (dummy IN tuple())
FORMAT Null;
`WHERE dummy IN (0, 2)` keeps `optimized_cluster` populated with two
shards, so the rewriter runs and visits the empty `(dummy IN tuple())`
in `GROUP BY`. Verified locally:
* binary without the fix: aborts in `Tuple::back()` on empty vector
(libc++ hardening assertion in `__vector/vector.h:444`).
* binary with the fix: optimizer succeeds (fails downstream on
`ALL_CONNECTION_TRIES_FAILED` only because the local sandbox lacks
a listener on the cluster shards, unrelated to the rewriter).
CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=864d6fac4617209b457a697076f41727f165bbd9&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20old%20analyzer%2C%20s3%20storage%2C%20DatabaseReplicated%2C%20WasmEdge%2C%20parallel%29
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…collision Master now contains `04238_incomplete_utf8_column_name_json.sql` (merged after the original test was added). Renumber to `04240` (next available after `04239_detach_table_on_temporary_table_103475`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… number Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@groeneai the "Stress test (arm_release)" failure in the previous run is the unrelated The "Finish Workflow" failure was a transient connection refused to The "Bugfix validation (functional tests)" report says "Failed to reproduce the bug" but the test does reproduce it: under master the server crashes with the libc++ hardening abort (the |
|
This was fixed by #105146. Let's update the branch. |
…replicas `OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator and lowers `max_rows_to_group_by` on the query's local context. That mutation is not propagated to remote replicas, so once CI's `automatic_parallel_replicas_mode=2` randomization is sampled the aggregator on the remotes never applies the cap and `OverflowAny` stays at zero. The test then sees `04229_on 0` instead of `04229_on 1` and fails 3/3 reruns (every PR run hits the same diff). Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS` clauses so the optimization is exercised on the local-coordinator path it is designed for. Query-level `SETTINGS` takes priority over the runner's session-level injection, so the pin is robust against randomization. No `no-*` tag is added (per `.claude/CLAUDE.md` guidance) and no other randomization is disabled — `max_threads = 1` was already pinned, the rest stays randomized. Locally reproduced with a 3-replica `default_parallel_replicas` cluster on loopback: 5/5 fails without the pin (`04229_on 0`), 20/20 passes with the pin under the same parallel-replicas session settings. Cross-PR scope (CIDB, 30d): seen on PRs #104473, #100146, #101033, #104694, #104966 — all unrelated to the test or its source PR. Reported by @alexey-milovidov on #100146.
The previous form (`04240_shardkey_rewrite_in_empty_tuple.sql`) caused the test server itself to abort under libc++ hardening — the planner aborted inside `OptimizeShardingKeyRewriteIn::enterImpl` on `Tuple::back()` over an empty constant tuple, taking the whole `clickhouse-server` process with it. The functional test runner then classified that as `SERVER_DIED` and the bugfix-validation framework refused to invert it, reporting "Failed to reproduce the bug" (see job log: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104966/74d9e8953e17e40eb3639c76c088299ed724f6d8/bugfix_validation_functional_tests/job.log). Reshape the regression into a `.sh` test that runs the bug-triggering scenario inside a `clickhouse-local` subprocess so the abort stays contained. The subprocess gets a tiny inline config defining a two-shard `remote_servers` entry (`clickhouse-local` has no clusters by default); `EXPLAIN` is enough to trip the rewriter without actually connecting to the fake shards. On master HEAD without the fix the subprocess aborts (exit 134); the `if` branch is taken and we print `BUG`, diverging from the `OK` reference -> `FAIL`, which the bugfix-validation framework inverts to `OK` ("bug reproduced"). On the PR branch the subprocess plans cleanly -> exit 0 -> `OK` -> matches the reference. The test server stays alive either way. This follows the same pattern recently used for `04237_104840_partition_pruner_floor_null_nullable_key.sh` (commit 00df6f1). Verified locally (debug build): * master HEAD `OptimizeShardingKeyRewriteInVisitor.cpp` (no fix): `clickhouse-local` aborts with "back() called on an empty vector" (exit 134) -> stdout `BUG` -> diff against reference -> `FAIL`. * PR HEAD with fix: `EXPLAIN` succeeds -> stdout `OK` -> matches reference. Also renumber from `04240` to `04243` to avoid colliding with three unrelated tests already landed at `04240_*` on master. Bugfix-validation CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=74d9e8953e17e40eb3639c76c088299ed724f6d8&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@groeneai, the The |
|
@alexey-milovidov — answers to both:
Confirmed unrelated to this PR. The same MSan signature hits chronic across at least 7 unrelated PRs on master and PRs in the last 30 days (CIDB):
Both are the same root cause. Tracked under umbrella issue #104877. PR #105146 (Stop the bleeding) did not silence this specific signature, and PR #100225 (the prior Root cause: in Fix: filed as a separate PR — #105223 — adds the missing
Confirmed same as the segfault PR #104905 is already fixing — Will report back here once CI on #105223 confirms the |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 6/6 (100.00%) | Lost baseline coverage: none · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix an exception (
Received signal 6 (abort)in vector hardening, observable asTuple::back()on an empty vector) when executing a query of the formSELECT ... FROM <Distributed table> WHERE/HAVING/GROUP BY ... (sharding_key_column IN tuple())withoptimize_skip_unused_shards_rewrite_inenabled under the new analyzer.Documentation entry for user-facing changes
What was wrong
OptimizeShardingKeyRewriteIn::enterImplinOptimizeShardingKeyRewriteInVisitor.cppcallstuple.back()on an emptystd::vectorwhen the right side ofINis the empty literal tupletuple(). The AST-side rewriter directly above guards this withtuple.size() > 1and is safe; the analyzer-side path did not.The AST fuzzer reduced its way to:
with
dist_01756sharded onintHash64(dummy), so the inner(dummy IN tuple())reached the analyzer rewriter and tripped UB on the emptyTuple::back()— observed in CI asReceived signal Aborted (6)/verbose_abort.cpp/vector::back[abi:sqe220101]() const.Fix
Bail out early when the constant tuple is empty —
IN ()is a degenerate query (always false) and there is nothing useful to rewrite. This matches the existing AST-side behavior.Test
Added
tests/queries/0_stateless/04243_shardkey_rewrite_in_empty_tuple.sh. It runs the previously-crashing pattern (IN tuple()inGROUP BYagainst a two-shardDistributedtable sharded onintHash64(dummy), with a non-emptyWHEREINto keep shards alive so the rewriter actually runs, andoptimize_skip_unused_shards_rewrite_inenabled) inside a containedclickhouse-localsubprocess. AnEXPLAINis enough to trip the rewriter: with the fix it completes cleanly (printsOK); on master HEAD the planner aborts on the emptyTuple::back()(exit 134under libc++ hardening) and the test printsBUG. The bug-triggering scenario is isolated inclickhouse-localso the abort stays contained.Origin
Found by the AST fuzzer on PR #104478:
Version info
26.6.1.316