iframe-proxy

alexey-milovidov · 2026-05-14T16:58:08Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix an exception (Received signal 6 (abort) in vector hardening, observable as Tuple::back() on an empty vector) when executing a query of the form SELECT ... FROM <Distributed table> WHERE/HAVING/GROUP BY ... (sharding_key_column IN tuple()) with optimize_skip_unused_shards_rewrite_in enabled under the new analyzer.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

What was wrong

OptimizeShardingKeyRewriteIn::enterImpl in OptimizeShardingKeyRewriteInVisitor.cpp calls tuple.back() on an empty std::vector when the right side of IN is the empty literal tuple tuple(). The AST-side rewriter directly above guards this with tuple.size() > 1 and is safe; the analyzer-side path did not.

The AST fuzzer reduced its way to:

SELECT (2 IN tuple(-2)) FROM dist_01756
WHERE dummy IN (0, 2)
GROUP BY ..., (dummy IN tuple()) AND ...

with dist_01756 sharded on intHash64(dummy), so the inner (dummy IN tuple()) reached the analyzer rewriter and tripped UB on the empty Tuple::back() — observed in CI as Received signal Aborted (6) / verbose_abort.cpp / vector::back[abi:sqe220101]() const.

Fix

Bail out early when the constant tuple is empty — IN () is a degenerate query (always false) and there is nothing useful to rewrite. This matches the existing AST-side behavior.

Test

Added tests/queries/0_stateless/04243_shardkey_rewrite_in_empty_tuple.sh. It runs the previously-crashing pattern (IN tuple() in GROUP BY against a two-shard Distributed table sharded on intHash64(dummy), with a non-empty WHERE IN to keep shards alive so the rewriter actually runs, and optimize_skip_unused_shards_rewrite_in enabled) inside a contained clickhouse-local subprocess. An EXPLAIN is enough to trip the rewriter: with the fix it completes cleanly (prints OK); on master HEAD the planner aborts on the empty Tuple::back() (exit 134 under libc++ hardening) and the test prints BUG. The bug-triggering scenario is isolated in clickhouse-local so the abort stays contained.

Origin

Found by the AST fuzzer on PR #104478:

Version info

Merged into: 26.6.1.316

The new-analyzer rewriter `OptimizeShardingKeyRewriteIn::enterImpl` in `OptimizeShardingKeyRewriteInVisitor.cpp` calls `tuple.back()` on an empty `std::vector` when the constant on the right side of `IN` is an empty tuple (`IN tuple()` / `IN ()`). The AST-side rewriter directly above guards this with `tuple.size() > 1` and is safe; the analyzer side did not. The AST fuzzer reduced its way to: SELECT (2 IN tuple(-2)) FROM dist_01756 WHERE dummy IN (0, 2) GROUP BY ..., (dummy IN tuple()) AND ... with `dist_01756` sharded on `intHash64(dummy)`, so the inner `(dummy IN tuple())` reaches the analyzer-side rewriter and trips UB on the empty `Tuple::back()` — observed as `Received signal 6` / abort in libc++'s vector hardening. `IN ()` is a degenerate query (always false) and there's nothing useful to rewrite, so just bail out early when the tuple is empty, matching the AST-side behavior. Added regression test `04238_shardkey_rewrite_in_empty_tuple.sql` that runs the previously-crashing pattern against `test_cluster_two_shards` with `optimize_skip_unused_shards_rewrite_in` enabled. Found by the AST fuzzer running against PR #104478: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104478/d46a8dce38bd41d0bc0b55715ff277fdce51fba3/ast_fuzzer_amd_debug_targeted/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-14T16:58:51Z

…e-empty-tuple

The previous version of the test put the empty `tuple()` in `WHERE` (and in a `SELECT`-only position with no shard-pruning WHERE), so: * `WHERE dummy IN tuple()` is always false; `optimize_skip_unused_shards` fully prunes the cluster before the rewriter runs (`shards > 1` precondition fails in `ClusterProxy::executeQuery`), and the empty-tuple branch in `OptimizeShardingKeyRewriteIn::enterImpl` is never reached. * `SELECT (dummy IN tuple()) FROM dist` with no WHERE filter does not set `query_info.optimized_cluster`, so again the rewriter is skipped. As a result the test reported `[ OK ]` on the master-HEAD release binary used by the bugfix-validation job, so the validation framework reported "Failed to reproduce the bug" and marked the check red. Switch the trigger to the actual fuzz-reduced pattern (PR #104478, `fuzzer.log.zst`, query around `Fuzzing step 716`): SELECT (2 IN tuple(-2)) FROM dist WHERE dummy IN (0, 2) GROUP BY (dummy IN tuple()) AND materialize(-2147483649) AND (NULL GLOBAL IN (*)) FORMAT Null; The non-empty `WHERE dummy IN (0, 2)` keeps `optimized_cluster` populated with two shards, so the rewriter runs and visits the empty `(dummy IN tuple())` expression in `GROUP BY`. Verified locally: * master HEAD without the fix: aborts in `contrib/llvm-project/libcxx/include/__vector/vector.h:444: libc++ Hardening assertion !empty() failed: back() called on an empty vector` (exit code 134). * With the fix: query completes cleanly. Bugfix-validation report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=95b2f912df1456d5d0b5d09d431882e3b9855fc3&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…yzer + DatabaseReplicated path also passes The previous query carried the entire fuzz-reduced expression including `materialize(-2147483649) AND (NULL GLOBAL IN (*))`. The `NULL GLOBAL IN (*)` fragment is not required to reproduce the empty-tuple crash in `OptimizeShardingKeyRewriteIn` (the analyzer-side rewriter visits the inner `(dummy IN tuple())` regardless of what it is AND'd with). It does trip the old analyzer's `GlobalSubqueriesMatcher` under the combined `old analyzer + DatabaseReplicated + s3 storage` stateless run with "Table <db>.dummy does not exist", because `*` expansion in `GLOBAL IN (*)` collides with the temporary GLOBAL IN data table in that path. Drop the noise and keep the minimal trigger: SELECT count() FROM dist_04238 WHERE dummy IN (0, 2) GROUP BY (dummy IN tuple()) FORMAT Null; `WHERE dummy IN (0, 2)` keeps `optimized_cluster` populated with two shards, so the rewriter runs and visits the empty `(dummy IN tuple())` in `GROUP BY`. Verified locally: * binary without the fix: aborts in `Tuple::back()` on empty vector (libc++ hardening assertion in `__vector/vector.h:444`). * binary with the fix: optimizer succeeds (fails downstream on `ALL_CONNECTION_TRIES_FAILED` only because the local sandbox lacks a listener on the cluster shards, unrelated to the rewriter). CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=864d6fac4617209b457a697076f41727f165bbd9&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20old%20analyzer%2C%20s3%20storage%2C%20DatabaseReplicated%2C%20WasmEdge%2C%20parallel%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e-empty-tuple

…collision Master now contains `04238_incomplete_utf8_column_name_json.sql` (merged after the original test was added). Renumber to `04240` (next available after `04239_detach_table_on_temporary_table_103475`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… number Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e-empty-tuple

alexey-milovidov · 2026-05-17T02:20:45Z

@groeneai the "Stress test (arm_release)" failure in the previous run is the unrelated registerStorageTimeSeries null-pointer crash on server startup (segfault inside StorageTimeSeries::StorageTimeSeries → tryGetTable returning nullptr then dereferenced). Fix is already in flight in #104905. Failure report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=e32382595abdea7c6d1661b7251c11367ec01062&name_0=PR&name_1=Stress%20test%20%28arm_release%29

The "Finish Workflow" failure was a transient connection refused to play.clickhouse.com during CIDB upload ([Errno 111] Connection refused) — infrastructure, not a code issue.

The "Bugfix validation (functional tests)" report says "Failed to reproduce the bug" but the test does reproduce it: under master the server crashes with the libc++ hardening abort (the Tuple::back() on empty vector). The validator just can't see the crash — when the server dies the test runner dies with it before a per-test FAIL is written, and the bugfix path explicitly skips CHECK_ERRORS (ci/jobs/functional_tests.py:339-342) so the <Fatal> log line is never promoted to a test result. The core dump (core.TCPHandler.1588-993766.zst.enc) and <Fatal> lines in clickhouse-server.err.log confirm the crash. I've merged master into the branch to refresh CI.

alexey-milovidov · 2026-05-17T18:40:18Z

This was fixed by #105146. Let's update the branch.

@alexey-milovidov

…replicas `OptimizeTrivialGroupByLimitPass` runs in the analyzer on the initiator and lowers `max_rows_to_group_by` on the query's local context. That mutation is not propagated to remote replicas, so once CI's `automatic_parallel_replicas_mode=2` randomization is sampled the aggregator on the remotes never applies the cap and `OverflowAny` stays at zero. The test then sees `04229_on 0` instead of `04229_on 1` and fails 3/3 reruns (every PR run hits the same diff). Pin `enable_parallel_replicas = 0` at the query level in both `SETTINGS` clauses so the optimization is exercised on the local-coordinator path it is designed for. Query-level `SETTINGS` takes priority over the runner's session-level injection, so the pin is robust against randomization. No `no-*` tag is added (per `.claude/CLAUDE.md` guidance) and no other randomization is disabled — `max_threads = 1` was already pinned, the rest stays randomized. Locally reproduced with a 3-replica `default_parallel_replicas` cluster on loopback: 5/5 fails without the pin (`04229_on 0`), 20/20 passes with the pin under the same parallel-replicas session settings. Cross-PR scope (CIDB, 30d): seen on PRs #104473, #100146, #101033, #104694, #104966 — all unrelated to the test or its source PR. Reported by @alexey-milovidov on #100146.

The previous form (`04240_shardkey_rewrite_in_empty_tuple.sql`) caused the test server itself to abort under libc++ hardening — the planner aborted inside `OptimizeShardingKeyRewriteIn::enterImpl` on `Tuple::back()` over an empty constant tuple, taking the whole `clickhouse-server` process with it. The functional test runner then classified that as `SERVER_DIED` and the bugfix-validation framework refused to invert it, reporting "Failed to reproduce the bug" (see job log: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104966/74d9e8953e17e40eb3639c76c088299ed724f6d8/bugfix_validation_functional_tests/job.log). Reshape the regression into a `.sh` test that runs the bug-triggering scenario inside a `clickhouse-local` subprocess so the abort stays contained. The subprocess gets a tiny inline config defining a two-shard `remote_servers` entry (`clickhouse-local` has no clusters by default); `EXPLAIN` is enough to trip the rewriter without actually connecting to the fake shards. On master HEAD without the fix the subprocess aborts (exit 134); the `if` branch is taken and we print `BUG`, diverging from the `OK` reference -> `FAIL`, which the bugfix-validation framework inverts to `OK` ("bug reproduced"). On the PR branch the subprocess plans cleanly -> exit 0 -> `OK` -> matches the reference. The test server stays alive either way. This follows the same pattern recently used for `04237_104840_partition_pruner_floor_null_nullable_key.sh` (commit 00df6f1). Verified locally (debug build): * master HEAD `OptimizeShardingKeyRewriteInVisitor.cpp` (no fix): `clickhouse-local` aborts with "back() called on an empty vector" (exit 134) -> stdout `BUG` -> diff against reference -> `FAIL`. * PR HEAD with fix: `EXPLAIN` succeeds -> stdout `OK` -> matches reference. Also renumber from `04240` to `04243` to avoid colliding with three unrelated tests already landed at `04240_*` on master. Bugfix-validation CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=74d9e8953e17e40eb3639c76c088299ed724f6d8&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-05-18T12:00:19Z

@groeneai, the Unit tests (msan, function_prop_fuzzer) failure (FunctionsStress: use-of-uninitialized-value src/Common/StringSearcher.h:563:61 via Volnitsky / MatchImpl / iLike) is unrelated to this PR — same MSan signature has hit several other PRs on master in the past two weeks (104545, 105126, 100377, 103383, 99740 per the CIDB query in the CI report). It is the same class of function_prop_fuzzer noise that #105146 ("Stop the bleeding in function_prop_fuzzer") tried to silence, but StringSearcher uninitialized-value findings still leak through. Could you investigate this specific signature and provide a fix in a separate PR? CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=74d9e8953e17e40eb3639c76c088299ed724f6d8&name_0=PR&name_1=Unit%20tests%20%28msan%2C%20function_prop_fuzzer%29

The Stress test (arm_msan) failure (Cannot start clickhouse-server) is the same StorageTimeSeries::StorageTimeSeries null-pointer segfault during startup that #104905 is already fixing — please coordinate with that PR. CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=74d9e8953e17e40eb3639c76c088299ed724f6d8&name_0=PR&name_1=Stress%20test%20%28arm_msan%29

groeneai · 2026-05-18T12:48:53Z

@alexey-milovidov — answers to both:

Unit tests (msan, function_prop_fuzzer) — StringSearcher.h:563:61 via Volnitsky -> MatchImpl -> ilike

Confirmed unrelated to this PR. The same MSan signature hits chronic across at least 7 unrelated PRs on master and PRs in the last 30 days (CIDB):

The exact StringSearcher.h:563:61 use-of-uninitialized-value in DB::impl::StringSearcher<false, false>::search: Revert "Revert "CI: run libFuzzers in PRs, w/o corpus upload"" #99740, Add web terminal interface to clickhouse-server #100277, Disable read-in-order when primary key selectivity is poor #100377, Fix flattenTuple and tupleToNameValuePairs not propagating NULLs for Nullable(Tuple) #103383, Add --chime option for ASCII BEL on slow query finish (issue #92718) #104545, Fix exception on IN tuple() against Distributed sharded table #104966, Speed up LIMIT BY by running it independently per MergeTree partition #105126.
The sibling shape StringSearcher.h:496:41 in compare (same predicate, just the compare path instead of search) hits other PRs in the same window.

Both are the same root cause. Tracked under umbrella issue #104877. PR #105146 (Stop the bleeding) did not silence this specific signature, and PR #100225 (the prior StringSearcher.h MSan fix) only closed the empty-needle case in compare.

Root cause: in StringSearcher<false, false>'s constructor cache loop, when convertUTF8ToCodePoint reports an invalid UTF-8 sequence in the middle of the needle (so c_u32 is empty and the if (c_u32) { ... } body is skipped), the inner loop still reads l_seq[j] / u_seq[j] and inserts them into cachel / cacheu via _mm_insert_epi8. The first-character path already handles this correctly (memcpy(l_seq, needle, src_len) in the outer else), but the loop body was missing the analogous else. The uninitialized bytes flow into the SIMD cache and surface later as MSan poison in _mm_cmpeq_epi8(v_haystack, cachel).

Fix: filed as a separate PR — #105223 — adds the missing else branch mirroring the first-character handling. Regression test 04255_stringsearcher_utf8_invalid_needle.sql exercises the path through ilike, positionCaseInsensitiveUTF8, multiSearchAnyCaseInsensitiveUTF8, startsWithCaseInsensitiveUTF8, endsWithCaseInsensitiveUTF8. Functional behaviour is unchanged (compareTrivial already rejects invalid UTF-8 needles); only the MSan poison is removed.

Stress test (arm_msan) — Cannot start clickhouse-server StorageTimeSeries::StorageTimeSeries

Confirmed same as the segfault PR #104905 is already fixing — StorageTimeSeries.cpp:171 null-deref during startup. No coordination work needed here; once #104905 lands this signature should disappear for #104966's next CI run too.

Will report back here once CI on #105223 confirms the function_prop_fuzzer family is clean.

…e-empty-tuple

clickhouse-gh · 2026-05-29T06:10:18Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.40%	84.50%	+0.10%
Functions	91.30%	91.40%	+0.10%
Branches	76.90%	77.10%	+0.20%

Changed lines: Changed C/C++ lines covered by tests: 6/6 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

alexey-milovidov

Ok.

clickgapai · 2026-06-02T06:33:35Z

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 14, 2026

groeneai mentioned this pull request May 14, 2026

Enable query condition cache for ORDER BY ... LIMIT n queries #104478

Merged

1 task

alexey-milovidov mentioned this pull request May 15, 2026

Fix distributed task iterator not initialized exception with parallel replicas #100146

Merged

1 task

groeneai mentioned this pull request May 15, 2026

Stabilize 04229_trivial_group_by_limit_profile_events under parallel replicas #104996

Merged

alexey-milovidov and others added 7 commits May 15, 2026 06:07

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

95b2f91

…e-empty-tuple

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

3a4625b

…e-empty-tuple

Rename internal table from dist_04238 to dist_04240 to match test…

e323825

… number Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

36384d8

…e-empty-tuple

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

Merge branch 'master' into fix-shardkey-rewrite-empty-tuple

74d9e89

groeneai mentioned this pull request May 18, 2026

Fix MSan use-of-uninitialized-value in UTF-8 case-insensitive StringSearcher #105223

Merged

1 task

alexey-milovidov added 4 commits May 19, 2026 18:28

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

6251006

…e-empty-tuple

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

5c54dd4

…e-empty-tuple

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

b6b5d7f

…e-empty-tuple

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

b934ac0

…e-empty-tuple

groeneai mentioned this pull request May 27, 2026

Fix Memory table abort on ALTER TABLE ... APPLY PATCHES (STID 2380-3043) #105286

Merged

1 task

alexey-milovidov added 2 commits May 28, 2026 08:36

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

d61a1c3

…e-empty-tuple

Merge remote-tracking branch 'origin/master' into fix-shardkey-rewrit…

24da549

…e-empty-tuple

alexey-milovidov commented Jun 2, 2026

View reviewed changes

alexey-milovidov self-assigned this Jun 2, 2026

alexey-milovidov added this pull request to the merge queue Jun 2, 2026

Merged via the queue into master with commit 55150d7 Jun 2, 2026
166 checks passed

alexey-milovidov deleted the fix-shardkey-rewrite-empty-tuple branch June 2, 2026 05:54

robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 2, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix exception on `IN tuple()` against Distributed sharded table#104966

Fix exception on `IN tuple()` against Distributed sharded table#104966
alexey-milovidov merged 16 commits into
masterfrom
fix-shardkey-rewrite-empty-tuple

alexey-milovidov commented May 14, 2026 •

edited by robot-clickhouse

Loading

Uh oh!

clickhouse-gh Bot commented May 14, 2026 •

edited

Loading

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

alexey-milovidov commented May 18, 2026

Uh oh!

groeneai commented May 18, 2026

Uh oh!

clickhouse-gh Bot commented May 29, 2026

Uh oh!

alexey-milovidov left a comment

Uh oh!

Uh oh!

clickgapai commented Jun 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

alexey-milovidov commented May 14, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

What was wrong

Fix

Test

Origin

Version info

Uh oh!

clickhouse-gh Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

alexey-milovidov commented May 18, 2026

Uh oh!

groeneai commented May 18, 2026

Uh oh!

clickhouse-gh Bot commented May 29, 2026

LLVM Coverage Report

Uh oh!

alexey-milovidov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

clickgapai commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Reproduction

Bisect

Affects

Suggested next step

Triage metadata

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alexey-milovidov commented May 14, 2026 •

edited by robot-clickhouse

Loading

clickhouse-gh Bot commented May 14, 2026 •

edited

Loading

clickgapai commented Jun 2, 2026 •

edited

Loading