Fix exception on `IN tuple()` against Distributed sharded table by alexey-milovidov · Pull Request #104966 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix exception on IN tuple() against Distributed sharded table#104966

Merged
alexey-milovidov merged 16 commits into
masterfrom
fix-shardkey-rewrite-empty-tuple
Jun 2, 2026
Merged

Fix exception on IN tuple() against Distributed sharded table#104966
alexey-milovidov merged 16 commits into
masterfrom
fix-shardkey-rewrite-empty-tuple

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 14, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix an exception (Received signal 6 (abort) in vector hardening, observable as Tuple::back() on an empty vector) when executing a query of the form SELECT ... FROM <Distributed table> WHERE/HAVING/GROUP BY ... (sharding_key_column IN tuple()) with optimize_skip_unused_shards_rewrite_in enabled under the new analyzer.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

What was wrong

OptimizeShardingKeyRewriteIn::enterImpl in OptimizeShardingKeyRewriteInVisitor.cpp calls tuple.back() on an empty std::vector when the right side of IN is the empty literal tuple tuple(). The AST-side rewriter directly above guards this with tuple.size() > 1 and is safe; the analyzer-side path did not.

The AST fuzzer reduced its way to:

SELECT (2 IN tuple(-2)) FROM dist_01756
WHERE dummy IN (0, 2)
GROUP BY ..., (dummy IN tuple()) AND ...

with dist_01756 sharded on intHash64(dummy), so the inner (dummy IN tuple()) reached the analyzer rewriter and tripped UB on the empty Tuple::back() — observed in CI as Received signal Aborted (6) / verbose_abort.cpp / vector::back[abi:sqe220101]() const.

Fix

Bail out early when the constant tuple is empty — IN () is a degenerate query (always false) and there is nothing useful to rewrite. This matches the existing AST-side behavior.

Test

Added tests/queries/0_stateless/04243_shardkey_rewrite_in_empty_tuple.sh. It runs the previously-crashing pattern (IN tuple() in GROUP BY against a two-shard Distributed table sharded on intHash64(dummy), with a non-empty WHERE IN to keep shards alive so the rewriter actually runs, and optimize_skip_unused_shards_rewrite_in enabled) inside a contained clickhouse-local subprocess. An EXPLAIN is enough to trip the rewriter: with the fix it completes cleanly (prints OK); on master HEAD the planner aborts on the empty Tuple::back() (exit 134 under libc++ hardening) and the test prints BUG. The bug-triggering scenario is isolated in clickhouse-local so the abort stays contained.

Origin

Found by the AST fuzzer on PR #104478:

Version info

  • Merged into: 26.6.1.316

The new-analyzer rewriter `OptimizeShardingKeyRewriteIn::enterImpl` in
`OptimizeShardingKeyRewriteInVisitor.cpp` calls `tuple.back()` on an
empty `std::vector` when the constant on the right side of `IN` is an
empty tuple (`IN tuple()` / `IN ()`). The AST-side rewriter directly
above guards this with `tuple.size() > 1` and is safe; the analyzer
side did not.

The AST fuzzer reduced its way to:

    SELECT (2 IN tuple(-2)) FROM dist_01756
    WHERE dummy IN (0, 2)
    GROUP BY ..., (dummy IN tuple()) AND ...

with `dist_01756` sharded on `intHash64(dummy)`, so the inner
`(dummy IN tuple())` reaches the analyzer-side rewriter and trips UB
on the empty `Tuple::back()` — observed as `Received signal 6` /
abort in libc++'s vector hardening.

`IN ()` is a degenerate query (always false) and there's nothing
useful to rewrite, so just bail out early when the tuple is empty,
matching the AST-side behavior.

Added regression test `04238_shardkey_rewrite_in_empty_tuple.sql`
that runs the previously-crashing pattern against
`test_cluster_two_shards` with `optimize_skip_unused_shards_rewrite_in`
enabled.

Found by the AST fuzzer running against PR #104478:
https://s3.amazonaws.com/clickhouse-test-reports/PRs/104478/d46a8dce38bd41d0bc0b55715ff277fdce51fba3/ast_fuzzer_amd_debug_targeted/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

alexey-milovidov and others added 7 commits May 15, 2026 06:07
The previous version of the test put the empty `tuple()` in `WHERE` (and in
a `SELECT`-only position with no shard-pruning WHERE), so:

  * `WHERE dummy IN tuple()` is always false; `optimize_skip_unused_shards`
    fully prunes the cluster before the rewriter runs (`shards > 1` precondition
    fails in `ClusterProxy::executeQuery`), and the empty-tuple branch in
    `OptimizeShardingKeyRewriteIn::enterImpl` is never reached.
  * `SELECT (dummy IN tuple()) FROM dist` with no WHERE filter does not set
    `query_info.optimized_cluster`, so again the rewriter is skipped.

As a result the test reported `[ OK ]` on the master-HEAD release binary used
by the bugfix-validation job, so the validation framework reported
"Failed to reproduce the bug" and marked the check red.

Switch the trigger to the actual fuzz-reduced pattern (PR #104478,
`fuzzer.log.zst`, query around `Fuzzing step 716`):

    SELECT (2 IN tuple(-2)) FROM dist
    WHERE dummy IN (0, 2)
    GROUP BY (dummy IN tuple()) AND materialize(-2147483649) AND (NULL GLOBAL IN (*))
    FORMAT Null;

The non-empty `WHERE dummy IN (0, 2)` keeps `optimized_cluster` populated with
two shards, so the rewriter runs and visits the empty `(dummy IN tuple())`
expression in `GROUP BY`. Verified locally:

  * master HEAD without the fix: aborts in
    `contrib/llvm-project/libcxx/include/__vector/vector.h:444:
     libc++ Hardening assertion !empty() failed: back() called on an empty vector`
    (exit code 134).
  * With the fix: query completes cleanly.

Bugfix-validation report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=95b2f912df1456d5d0b5d09d431882e3b9855fc3&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…yzer + DatabaseReplicated path also passes

The previous query carried the entire fuzz-reduced expression including
`materialize(-2147483649) AND (NULL GLOBAL IN (*))`. The `NULL GLOBAL IN (*)`
fragment is not required to reproduce the empty-tuple crash in
`OptimizeShardingKeyRewriteIn` (the analyzer-side rewriter visits the
inner `(dummy IN tuple())` regardless of what it is AND'd with). It does
trip the old analyzer's `GlobalSubqueriesMatcher` under the combined
`old analyzer + DatabaseReplicated + s3 storage` stateless run with
"Table <db>.dummy does not exist", because `*` expansion in `GLOBAL IN (*)`
collides with the temporary GLOBAL IN data table in that path.

Drop the noise and keep the minimal trigger:

  SELECT count() FROM dist_04238
  WHERE dummy IN (0, 2)
  GROUP BY (dummy IN tuple())
  FORMAT Null;

`WHERE dummy IN (0, 2)` keeps `optimized_cluster` populated with two
shards, so the rewriter runs and visits the empty `(dummy IN tuple())`
in `GROUP BY`. Verified locally:
  * binary without the fix: aborts in `Tuple::back()` on empty vector
    (libc++ hardening assertion in `__vector/vector.h:444`).
  * binary with the fix: optimizer succeeds (fails downstream on
    `ALL_CONNECTION_TRIES_FAILED` only because the local sandbox lacks
    a listener on the cluster shards, unrelated to the rewriter).

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=864d6fac4617209b457a697076f41727f165bbd9&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20old%20analyzer%2C%20s3%20storage%2C%20DatabaseReplicated%2C%20WasmEdge%2C%20parallel%29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…collision

Master now contains `04238_incomplete_utf8_column_name_json.sql` (merged
after the original test was added). Renumber to `04240` (next available
after `04239_detach_table_on_temporary_table_103475`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… number

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai the "Stress test (arm_release)" failure in the previous run is the unrelated registerStorageTimeSeries null-pointer crash on server startup (segfault inside StorageTimeSeries::StorageTimeSeriestryGetTable returning nullptr then dereferenced). Fix is already in flight in #104905. Failure report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=e32382595abdea7c6d1661b7251c11367ec01062&name_0=PR&name_1=Stress%20test%20%28arm_release%29

The "Finish Workflow" failure was a transient connection refused to play.clickhouse.com during CIDB upload ([Errno 111] Connection refused) — infrastructure, not a code issue.

The "Bugfix validation (functional tests)" report says "Failed to reproduce the bug" but the test does reproduce it: under master the server crashes with the libc++ hardening abort (the Tuple::back() on empty vector). The validator just can't see the crash — when the server dies the test runner dies with it before a per-test FAIL is written, and the bugfix path explicitly skips CHECK_ERRORS (ci/jobs/functional_tests.py:339-342) so the <Fatal> log line is never promoted to a test result. The core dump (core.TCPHandler.1588-993766.zst.enc) and <Fatal> lines in clickhouse-server.err.log confirm the crash. I've merged master into the branch to refresh CI.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

This was fixed by #105146. Let's update the branch.

alexey-milovidov pushed a commit that referenced this pull request May 17, 2026
…replicas

`OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator
and lowers `max_rows_to_group_by` on the query's local context. That
mutation is not propagated to remote replicas, so once CI's
`automatic_parallel_replicas_mode=2` randomization is sampled the
aggregator on the remotes never applies the cap and `OverflowAny` stays
at zero. The test then sees `04229_on   0` instead of
`04229_on   1` and fails 3/3 reruns (every PR run hits the same diff).

Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS`
clauses so the optimization is exercised on the local-coordinator path
it is designed for. Query-level `SETTINGS` takes priority over the
runner's session-level injection, so the pin is robust against
randomization. No `no-*` tag is added (per
`.claude/CLAUDE.md` guidance) and no other randomization is disabled —
`max_threads = 1` was already pinned, the rest stays randomized.

Locally reproduced with a 3-replica `default_parallel_replicas` cluster
on loopback: 5/5 fails without the pin (`04229_on   0`), 20/20 passes
with the pin under the same parallel-replicas session settings.

Cross-PR scope (CIDB, 30d): seen on PRs #104473, #100146, #101033,
#104694, #104966 — all unrelated to the test or its source PR.

Reported by @alexey-milovidov on
#100146.
The previous form (`04240_shardkey_rewrite_in_empty_tuple.sql`) caused the
test server itself to abort under libc++ hardening — the planner aborted
inside `OptimizeShardingKeyRewriteIn::enterImpl` on `Tuple::back()` over an
empty constant tuple, taking the whole `clickhouse-server` process with it.
The functional test runner then classified that as `SERVER_DIED` and the
bugfix-validation framework refused to invert it, reporting
"Failed to reproduce the bug" (see job log:
https://s3.amazonaws.com/clickhouse-test-reports/PRs/104966/74d9e8953e17e40eb3639c76c088299ed724f6d8/bugfix_validation_functional_tests/job.log).

Reshape the regression into a `.sh` test that runs the bug-triggering scenario
inside a `clickhouse-local` subprocess so the abort stays contained. The
subprocess gets a tiny inline config defining a two-shard `remote_servers`
entry (`clickhouse-local` has no clusters by default); `EXPLAIN` is enough to
trip the rewriter without actually connecting to the fake shards.

On master HEAD without the fix the subprocess aborts (exit 134); the `if`
branch is taken and we print `BUG`, diverging from the `OK` reference -> `FAIL`,
which the bugfix-validation framework inverts to `OK` ("bug reproduced"). On
the PR branch the subprocess plans cleanly -> exit 0 -> `OK` -> matches the
reference. The test server stays alive either way.

This follows the same pattern recently used for
`04237_104840_partition_pruner_floor_null_nullable_key.sh` (commit 00df6f1).

Verified locally (debug build):
  * master HEAD `OptimizeShardingKeyRewriteInVisitor.cpp` (no fix):
    `clickhouse-local` aborts with
    "back() called on an empty vector" (exit 134) -> stdout `BUG` -> diff
    against reference -> `FAIL`.
  * PR HEAD with fix: `EXPLAIN` succeeds -> stdout `OK` -> matches reference.

Also renumber from `04240` to `04243` to avoid colliding with three
unrelated tests already landed at `04240_*` on master.

Bugfix-validation CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=74d9e8953e17e40eb3639c76c088299ed724f6d8&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, the Unit tests (msan, function_prop_fuzzer) failure (FunctionsStress: use-of-uninitialized-value src/Common/StringSearcher.h:563:61 via Volnitsky / MatchImpl / iLike) is unrelated to this PR — same MSan signature has hit several other PRs on master in the past two weeks (104545, 105126, 100377, 103383, 99740 per the CIDB query in the CI report). It is the same class of function_prop_fuzzer noise that #105146 ("Stop the bleeding in function_prop_fuzzer") tried to silence, but StringSearcher uninitialized-value findings still leak through. Could you investigate this specific signature and provide a fix in a separate PR? CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=74d9e8953e17e40eb3639c76c088299ed724f6d8&name_0=PR&name_1=Unit%20tests%20%28msan%2C%20function_prop_fuzzer%29

The Stress test (arm_msan) failure (Cannot start clickhouse-server) is the same StorageTimeSeries::StorageTimeSeries null-pointer segfault during startup that #104905 is already fixing — please coordinate with that PR. CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=74d9e8953e17e40eb3639c76c088299ed724f6d8&name_0=PR&name_1=Stress%20test%20%28arm_msan%29

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov — answers to both:

Unit tests (msan, function_prop_fuzzer)StringSearcher.h:563:61 via Volnitsky -> MatchImpl -> ilike

Confirmed unrelated to this PR. The same MSan signature hits chronic across at least 7 unrelated PRs on master and PRs in the last 30 days (CIDB):

Both are the same root cause. Tracked under umbrella issue #104877. PR #105146 (Stop the bleeding) did not silence this specific signature, and PR #100225 (the prior StringSearcher.h MSan fix) only closed the empty-needle case in compare.

Root cause: in StringSearcher<false, false>'s constructor cache loop, when convertUTF8ToCodePoint reports an invalid UTF-8 sequence in the middle of the needle (so c_u32 is empty and the if (c_u32) { ... } body is skipped), the inner loop still reads l_seq[j] / u_seq[j] and inserts them into cachel / cacheu via _mm_insert_epi8. The first-character path already handles this correctly (memcpy(l_seq, needle, src_len) in the outer else), but the loop body was missing the analogous else. The uninitialized bytes flow into the SIMD cache and surface later as MSan poison in _mm_cmpeq_epi8(v_haystack, cachel).

Fix: filed as a separate PR — #105223 — adds the missing else branch mirroring the first-character handling. Regression test 04255_stringsearcher_utf8_invalid_needle.sql exercises the path through ilike, positionCaseInsensitiveUTF8, multiSearchAnyCaseInsensitiveUTF8, startsWithCaseInsensitiveUTF8, endsWithCaseInsensitiveUTF8. Functional behaviour is unchanged (compareTrivial already rejects invalid UTF-8 needles); only the MSan poison is removed.

Stress test (arm_msan)Cannot start clickhouse-server StorageTimeSeries::StorageTimeSeries

Confirmed same as the segfault PR #104905 is already fixing — StorageTimeSeries.cpp:171 null-deref during startup. No coordination work needed here; once #104905 lands this signature should disappear for #104966's next CI run too.

Will report back here once CI on #105223 confirms the function_prop_fuzzer family is clean.

@clickhouse-gh

clickhouse-gh Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.40% 84.50% +0.10%
Functions 91.30% 91.40% +0.10%
Branches 76.90% 77.10% +0.20%

Changed lines: Changed C/C++ lines covered by tests: 6/6 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@alexey-milovidov alexey-milovidov self-assigned this Jun 2, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 2, 2026
Merged via the queue into master with commit 55150d7 Jun 2, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-shardkey-rewrite-empty-tuple branch June 2, 2026 05:54
@clickgapai

clickgapai commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants