alexey-milovidov · 2026-05-10T23:47:53Z

Add a regression test for NOT_FOUND_COLUMN_IN_BLOCK on distributed table queries under the analyzer, reported in #70356. The bug no longer reproduces on master.

Closes #70356.

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Version info

Merged into: 26.5.1.646

clickhouse-gh · 2026-05-10T23:48:27Z

…two-shard queries

The two-shard distributed queries against `test_cluster_two_shards_localhost` were non-deterministic under randomized session settings on the flaky check. Both shards of that cluster resolve to the same backing `shard_table_70356`, so each shard returns the same row. When the randomizer enables both `optimize_skip_unused_shards` and `optimize_distributed_group_by_sharding_key`, the coordinator-side GROUP BY merge is skipped (the GROUP BY column matches the sharding key `sipHash64(adid)`), and the result becomes two duplicate rows per query instead of one.

Pin `optimize_distributed_group_by_sharding_key = 0` in the two-shard queries to keep the coordinator merge stable regardless of randomized session settings. The regression scenario (`NOT_FOUND_COLUMN_IN_BLOCK` under the analyzer) is unaffected because it is a query-analysis bug, not an execution-time deduplication concern.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104551&sha=f26cca514c8fff2c932dad6a35f9f031122c8419&name_0=PR
PR: #104551
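The duplicate-row effect described in this commit message can be illustrated with a toy model. This is a minimal sketch and not ClickHouse code: `Row`, `collect`, `sameTableOnBothShards`, and the key value `adid-1` are hypothetical names chosen for illustration, and the `merge_on_coordinator` flag only stands in for whether the coordinator-side GROUP BY merge runs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One partially aggregated row as a shard might return it: GROUP BY key plus a count.
struct Row
{
    std::string key;
    std::uint64_t count;
};

// Toy coordinator: concatenates the shard results and, if requested, merges rows
// that share a GROUP BY key. Hypothetical model only, not the ClickHouse pipeline.
inline std::vector<Row> collect(const std::vector<std::vector<Row>> & shard_results, bool merge_on_coordinator)
{
    std::vector<Row> all;
    for (const auto & rows : shard_results)
        all.insert(all.end(), rows.begin(), rows.end());

    if (!merge_on_coordinator)
        return all; // rows pass through unchanged, duplicates included

    std::map<std::string, std::uint64_t> merged;
    for (const auto & row : all)
        merged[row.key] += row.count;

    std::vector<Row> result;
    for (const auto & entry : merged)
        result.push_back({entry.first, entry.second});
    return result;
}

// Both "shards" read the same backing table, so they return identical rows.
inline std::vector<std::vector<Row>> sameTableOnBothShards()
{
    std::vector<Row> rows{{"adid-1", 1}};
    return {rows, rows};
}

// With the merge there is one row per key; without it the same row appears twice.
inline void checkToyShardModel()
{
    assert(collect(sameTableOnBothShards(), true).size() == 1);
    assert(collect(sameTableOnBothShards(), false).size() == 2);
}
```

In this model, pinning the setting to 0 corresponds to always passing `merge_on_coordinator = true`, which is why the row count no longer depends on the randomized settings.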

The original repro from #70356 used `GROUP BY 1` under the analyzer without `ORDER BY`, with `FORMAT Null`. The existing variants all add `ORDER BY id`, which changes the tail of the plan and can miss bugs specific to the original no-sort path. Add `FORMAT Null` variants matching the original failure shape for both the single-shard and two-shard distributed tables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-05-13T23:57:09Z

@groeneai, please investigate the following CI failures on this PR (the PR only adds the stateless test 04219_analyzer_distributed_groupby_70356.sql and cannot cause these; reruns of the timeout failures passed):

Tests/04039_merge_tree_snapshot_teardown_race (timeout) — already tracked in Flaky test: 04039_merge_tree_snapshot_teardown_race #104866.
Tests/03753_join_runtime_filter_dynamically_disable (timeout, no linked issue) — https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104551&sha=82671b4c5d2fda8605189db8b9c5eb939cd5b858&name_0=PR&name_1=Stateless%20tests%20%28amd_msan%2C%20WasmEdge%2C%20parallel%2C%201%2F2%29
Tests/04057_transaction_version_metadata_lifecycle (result differs with reference) — already tracked in Flaky test: 04057_transaction_version_metadata_lifecycle #103152.
libFuzzer tests / execute_query_fuzzer (LOGICAL_ERROR from Context::clearCaches in src/Interpreters/Context.cpp:4579 invoked via MergeTreeData::dropAllData) — https://s3.amazonaws.com/clickhouse-test-reports/PRs/104551/82671b4c5d2fda8605189db8b9c5eb939cd5b858/libfuzzer_tests/execute_query_fuzzer/

Please provide fixes (or link existing fix PRs if work is already in progress) in separate PRs.

groeneai · 2026-05-16T01:21:27Z

@alexey-milovidov — triage results on the four CI failures (commit 82671b4c):

1. 04039_merge_tree_snapshot_teardown_race (timeout) — confirmed, tracked under Issue #104866 (WasmEdge MSan timeout family). Root cause identified by @Algunenano on 2026-05-14: x86-64-v3 ISA baseline flip combined with MSan's __msan_chain_origin instrumentation on hot paths, plus LLVM inserting vzeroupper at SSE/AVX boundaries, pushes wall time over the 180 s SYSTEM FLUSH LOGS limit. Mitigated by the x86v3 revert in PR #104971 (merged 2026-05-15); re-enable will go through PR #105019.

2. 03753_join_runtime_filter_dynamically_disable (timeout) — same WasmEdge MSan SYSTEM FLUSH LOGS query_log timeout family as item 1. Tracked under the same Issue #104866 umbrella; sightings on PR #100377, #100399, #100332, etc. Expected to clear with the x86v3 revert; if it persists post-#104971 we will file a dedicated issue.

3. 04057_transaction_version_metadata_lifecycle — confirmed already tracked in Issue #103152. No action from our side.

4. libFuzzer tests / execute_query_fuzzer — LOGICAL_ERROR from Context::clearCaches — NEW, investigated below. Will land in a separate PR.

Root cause for item 4

Crash trace (from crash-3f2b45ca5d992f663d07e4d0f8bb303ba84aaf90.trace):

#13  DB::Context::clearCaches() const            src/Interpreters/Context.cpp:4579
#14  DB::MergeTreeData::dropAllData()             src/Storages/MergeTree/MergeTreeData.cpp:3978
#15  DB::(anonymous)::validateStorage(...)        src/Interpreters/InterpreterCreateQuery.cpp:1862
#16  DB::InterpreterCreateQuery::doCreateTable(...) src/Interpreters/InterpreterCreateQuery.cpp:2093
...
#21  LLVMFuzzerTestOneInput                       src/Interpreters/fuzzers/execute_query_fuzzer.cpp:182

Path:

1. Fuzzer feeds a mutated CREATE TABLE ... ENGINE = MergeTree query.
2. InterpreterCreateQuery::doCreateTable calls validateStorage, which calls validateVirtualColumns / checkForUnsupportedColumns.
3. One of those throws (the fuzzer hits an illegal-column or virtual-column collision).
4. validateStorage's catch block calls storage.drop() → MergeTreeData::dropAllData().
5. dropAllData calls getContext()->clearCaches() (Atomic-database check at line 3996, guarded by !getStorageID().hasUUID()).
6. Context::clearCaches() walks 12 caches and throws LOGICAL_ERROR("Uncompressed cache was not created yet.") because the fuzzer Context has none of them initialized.

The fuzzer harness (execute_query_fuzzer.cpp) is intentionally minimal — it never calls setUncompressedCache, setMarkCache, setPrimaryIndexCache, setIndexUncompressedCache, setIndexMarkCache, setVectorSimilarityIndexCache, setTextIndexTokensCache, setTextIndexHeaderCache, setTextIndexPostingsCache, setMmapCache, or setQueryConditionCache. Production servers always do; the fuzzer skips it because it does not need caching to fuzz parser/planner/interpreter paths.

Context::clearCaches() (Context.cpp:4578-4625) was written to assert that the server is fully initialized, but the assertion is overzealous: clearing a null cache is semantically a no-op (the cache is trivially empty). The throws block the cleanup path for any caller that legitimately operates on a partially-initialized Context — including this fuzzer.

Cross-PR scope (CIDB, 60 days)

execute_query_fuzzer "deadly signal" hits in CIDB:

The 2026-05 cluster (4 hits across 4 unrelated PRs in 3 days) matches the new fuzzer-corpus mutation that reaches the validateStorage cleanup path. Not a regression; the corpus rolled into this area recently. The clearCaches throws are unchanged since 2023 (Robert Schulze, Anton Popov authored).

Fix plan (separate PR)

Make Context::clearCaches() defensive — replace each if (!cache) throw with if (cache) cache->clear(). The "was not created yet" exceptions were defensive assertions and provided no real protection; they only fire in fuzzer/test contexts that legitimately operate without initialized caches.
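The replacement described above can be sketched on a toy type. This is a minimal illustration under stated assumptions, not the ClickHouse source: `ToyCache`, `ToyContext`, `clearCachesStrict`, `clearCachesDefensive`, and `strictThrows` are hypothetical names, only two caches are modelled, and `std::logic_error` stands in for LOGICAL_ERROR. The second message follows the "X cache was not created yet." pattern and is not quoted from the codebase.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a cache object; not a ClickHouse cache class.
struct ToyCache
{
    std::vector<int> entries;
    void clear() { entries.clear(); }
};

// Hypothetical stand-in for Context with two optional caches.
struct ToyContext
{
    std::shared_ptr<ToyCache> uncompressed_cache; // null until initialized
    std::shared_ptr<ToyCache> mark_cache;         // null until initialized

    // Old shape: a null cache is reported as a logic error.
    void clearCachesStrict() const
    {
        if (!uncompressed_cache)
            throw std::logic_error("Uncompressed cache was not created yet.");
        uncompressed_cache->clear();

        if (!mark_cache)
            throw std::logic_error("Mark cache was not created yet.");
        mark_cache->clear();
    }

    // Proposed shape: a null cache is trivially empty, so it is skipped.
    void clearCachesDefensive() const
    {
        if (uncompressed_cache)
            uncompressed_cache->clear();
        if (mark_cache)
            mark_cache->clear();
    }
};

// Returns true if the strict variant throws for the given context.
inline bool strictThrows(const ToyContext & context)
{
    try
    {
        context.clearCachesStrict();
    }
    catch (const std::logic_error &)
    {
        return true;
    }
    return false;
}

// A context with no caches trips the strict variant but not the defensive one.
inline void checkToyContext()
{
    ToyContext uninitialized;
    assert(strictThrows(uninitialized));
    uninitialized.clearCachesDefensive(); // no-op, does not throw
    uninitialized.clearCachesDefensive(); // calling twice is also safe
}
```

The defensive variant behaves identically to the strict one whenever every cache is initialized, so in this model a fully initialized context sees no change.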

I will open a small follow-up PR with this change plus a regression test (tests/queries/0_stateless/04xxx_create_table_with_invalid_virtual_column_no_cache_init.sql) once a server-side test harness for the no-caches path is wired up — or, if a simpler approach lands, by adding setUncompressedCache(...) etc. calls in execute_query_fuzzer.cpp and noting that production callers always initialize first. I will defer to the comp-query-execution reviewers on which fix shape they prefer.

Tracking on our side: task 2026-05-13-github-clickhouseclickhouse104551--.

Context::clearCaches threw a LOGICAL_ERROR ("X cache was not created yet.") on the first null cache pointer it encountered. Production servers always initialize all caches at startup, so the throws never fired in production - but the execute_query_fuzzer libFuzzer harness (and any unit-test harness built around a minimal Context) creates a Context without calling any set*Cache initializer. When a fuzzed CREATE TABLE failed validateStorage, the cleanup path (MergeTreeData::dropAllData -> Context::clearCaches) tripped the assertion and libFuzzer reported a crash with the trace documented on PR ClickHouse#104551.

The fix replaces each "null cache -> throw" guard with a defensive "if (cache) cache->clear()" check, matching the pattern already used by every single-cache clear<X>Cache method on Context (clearUncompressedCache, clearMarkCache, clearPrimaryIndexCache, and so on - 14 sibling methods, all defensive). clearCaches was the lone outlier.

Adds a regression test in src/Interpreters/tests/gtest_context_clear_caches.cpp that copies the global test Context (whose caches are all null because gtest_global_context.cpp never sets them) and calls clearCaches twice - both calls must complete without throwing. Verified that the test fails against the unmodified source with "Logical error: 'Uncompressed cache was not created yet.'" and passes once the defensive null checks are in place.

See ClickHouse#104551 (comment)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add test for #70356

d8bda1c

Closes #70356.

clickhouse-gh Bot added the pr-ci label May 10, 2026

alexey-milovidov commented May 11, 2026

Comment thread tests/queries/0_stateless/04219_analyzer_distributed_groupby_70356.sql

Address review: also cover two-shard distributed configuration

f26cca5

Add queries against a Distributed table backed by `test_cluster_two_shards_localhost` to broaden coverage of the `NOT_FOUND_COLUMN_IN_BLOCK` regression test.

clickhouse-gh Bot reviewed May 12, 2026

Comment thread tests/queries/0_stateless/04219_analyzer_distributed_groupby_70356.sql

alexey-milovidov added 4 commits May 12, 2026 08:48

Merge remote-tracking branch 'origin/master' into milovidov/test-issue-70356

7cc5204

Address review: add DROP TABLE IF EXISTS for `dist_two_shards_table_70356`

4d6fb8c

If a previous run aborts before the trailing `DROP TABLE`, a rerun would fail with `TABLE_ALREADY_EXISTS`. Clean up the table next to the other drops at the start of the test.

Merge remote-tracking branch 'origin/master' into milovidov/test-issue-70356

b59711d

clickhouse-gh Bot reviewed May 12, 2026

Comment thread tests/queries/0_stateless/04219_analyzer_distributed_groupby_70356.sql

alexey-milovidov and others added 2 commits May 13, 2026 07:14

Merge remote-tracking branch 'origin/master' into milovidov/test-issue-70356

82671b4

alexey-milovidov mentioned this pull request May 13, 2026

Enable clang-tidy check for uninitialized variables #100399

Merged

Merge remote-tracking branch 'origin/master' into milovidov/test-issue-70356

44a33b6

alexey-milovidov added this pull request to the merge queue May 14, 2026

alexey-milovidov self-assigned this May 14, 2026

Merged via the queue into master with commit e9e9652 May 14, 2026
167 checks passed

alexey-milovidov deleted the milovidov/test-issue-70356 branch May 14, 2026 11:23

robot-ch-test-poll2 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) May 14, 2026

groeneai mentioned this pull request May 15, 2026

Replace ExpressionAnalyzer with Analyzer for standalone expression compilation #96886

Open

9 tasks

groeneai mentioned this pull request May 17, 2026

Make Context::clearCaches defensive against uninitialized caches #105148

Merged

1 task

alexey-milovidov mentioned this pull request May 18, 2026

Support WITH TIES for negative LIMIT #100930

Merged

1 task

robot-clickhouse-ci-1 mentioned this pull request Jun 27, 2026

Analyzer causes NOT_FOUND_COLUMN_IN_BLOCK on distributed table queries #70356

Closed

Add test for #70356 #104551

alexey-milovidov merged 9 commits into master from milovidov/test-issue-70356
