Add test for #70356 by alexey-milovidov · Pull Request #104551 · ClickHouse/ClickHouse · GitHub
Skip to content

Add test for #70356#104551

Merged
alexey-milovidov merged 9 commits into
masterfrom
milovidov/test-issue-70356
May 14, 2026
Merged

Add test for #70356#104551
alexey-milovidov merged 9 commits into
masterfrom
milovidov/test-issue-70356

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 10, 2026

Copy link
Copy Markdown
Member

Add a regression test for NOT_FOUND_COLUMN_IN_BLOCK on distributed table queries under the analyzer, reported in #70356. The bug no longer reproduces on master.

Closes #70356.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.646

@clickhouse-gh

clickhouse-gh Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 10, 2026
Add queries against a Distributed table backed by
`test_cluster_two_shards_localhost` to broaden coverage of the
`NOT_FOUND_COLUMN_IN_BLOCK` regression test.
…0356`

If a previous run aborts before the trailing `DROP TABLE`, a rerun would
fail with `TABLE_ALREADY_EXISTS`. Cleanup the table next to the other
drops at the start of the test.
…two-shard queries

The two-shard distributed queries against `test_cluster_two_shards_localhost`
were non-deterministic under randomized session settings on the flaky check.
Both shards of that cluster resolve to the same backing
`shard_table_70356`, so each shard returns the same row. When the
randomizer enables both `optimize_skip_unused_shards` and
`optimize_distributed_group_by_sharding_key`, the coordinator-side
GROUP BY merge is skipped (the GROUP BY column matches the sharding key
`sipHash64(adid)`), and the result becomes two duplicate rows per query
instead of one.

Pin `optimize_distributed_group_by_sharding_key = 0` in the two-shard
queries to keep the coordinator merge stable regardless of randomized
session settings. The regression scenario (`NOT_FOUND_COLUMN_IN_BLOCK`
under the analyzer) is unaffected because it is a query-analysis bug,
not an execution-time deduplication concern.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104551&sha=f26cca514c8fff2c932dad6a35f9f031122c8419&name_0=PR
PR: #104551
alexey-milovidov and others added 2 commits May 13, 2026 07:14
The original repro from #70356 used `GROUP BY 1` under the analyzer
without `ORDER BY`, with `FORMAT Null`. The existing variants all add
`ORDER BY id`, which changes the tail of the plan and can miss bugs
specific to the original no-sort path. Add `FORMAT Null` variants
matching the original failure shape for both the single-shard and
two-shard distributed tables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, please investigate the following CI failures on this PR (the PR only adds the stateless test 04219_analyzer_distributed_groupby_70356.sql and cannot cause these; reruns of the timeout failures passed):

Please provide fixes (or link existing fix PRs if work is already in progress) in separate PRs.

@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 14, 2026
@alexey-milovidov alexey-milovidov self-assigned this May 14, 2026
Merged via the queue into master with commit e9e9652 May 14, 2026
167 checks passed
@alexey-milovidov alexey-milovidov deleted the milovidov/test-issue-70356 branch May 14, 2026 11:23
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 14, 2026
@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov — triage results on the four CI failures (commit 82671b4c):

1. 04039_merge_tree_snapshot_teardown_race (timeout) — confirmed, tracked under Issue #104866 (WasmEdge MSan timeout family). Root cause identified by @Algunenano on 2026-05-14: x86-64-v3 ISA baseline flip combined with MSan's __msan_chain_origin instrumentation on hot paths, plus LLVM inserting vzeroupper at SSE/AVX boundaries, pushes wall time over the 180 s SYSTEM FLUSH LOGS limit. Mitigated by the x86v3 revert in PR #104971 (merged 2026-05-15); re-enable will go through PR #105019.

2. 03753_join_runtime_filter_dynamically_disable (timeout) — same WasmEdge MSan SYSTEM FLUSH LOGS query_log timeout family as item 1. Tracked under the same Issue #104866 umbrella; sightings on PR #100377, #100399, #100332, etc. Expected to clear with the x86v3 revert; if it persists post-#104971 we will file a dedicated issue.

3. 04057_transaction_version_metadata_lifecycle — confirmed already tracked in Issue #103152. No action from our side.

4. libFuzzer tests / execute_query_fuzzerLOGICAL_ERROR from Context::clearCachesNEW, investigated below. Will land in a separate PR.


Root cause for item 4

Crash trace (from crash-3f2b45ca5d992f663d07e4d0f8bb303ba84aaf90.trace):

#13  DB::Context::clearCaches() const            src/Interpreters/Context.cpp:4579
#14  DB::MergeTreeData::dropAllData()             src/Storages/MergeTree/MergeTreeData.cpp:3978
#15  DB::(anonymous)::validateStorage(...)        src/Interpreters/InterpreterCreateQuery.cpp:1862
#16  DB::InterpreterCreateQuery::doCreateTable(...) src/Interpreters/InterpreterCreateQuery.cpp:2093
...
#21  LLVMFuzzerTestOneInput                       src/Interpreters/fuzzers/execute_query_fuzzer.cpp:182

Path:

  1. Fuzzer feeds a mutated CREATE TABLE ... ENGINE = MergeTree query.
  2. InterpreterCreateQuery::doCreateTable calls validateStorage, which calls validateVirtualColumns / checkForUnsupportedColumns.
  3. One of those throws (the fuzzer hits an illegal-column or virtual-column collision).
  4. validateStorage's catch block calls storage.drop()MergeTreeData::dropAllData().
  5. dropAllData calls getContext()->clearCaches() (Atomic-database check at line 3996, guarded by !getStorageID().hasUUID()).
  6. Context::clearCaches() walks 12 caches and throws LOGICAL_ERROR("Uncompressed cache was not created yet.") because the fuzzer Context has none of them initialized.

The fuzzer harness (execute_query_fuzzer.cpp) is intentionally minimal — it never calls setUncompressedCache, setMarkCache, setPrimaryIndexCache, setIndexUncompressedCache, setIndexMarkCache, setVectorSimilarityIndexCache, setTextIndexTokensCache, setTextIndexHeaderCache, setTextIndexPostingsCache, setMmapCache, or setQueryConditionCache. Production servers always do; the fuzzer skips it because it does not need caching to fuzz parser/planner/interpreter paths.

Context::clearCaches() (Context.cpp:4578-4625) was written to assert that the server is fully initialized, but the assertion is overzealous: clearing a null cache is semantically a no-op (the cache is trivially empty). The throws block the cleanup path for any caller that legitimately operates on a partially-initialized Context — including this fuzzer.

Cross-PR scope (CIDB, 60 days)

execute_query_fuzzer "deadly signal" hits in CIDB:

PR count last seen
#104551 1 2026-05-13
#104859 1 2026-05-13
#96886 1 2026-05-14
#100407 1 2026-05-15
#99740 2 2026-03-20

The 2026-05 cluster (4 hits across 4 unrelated PRs in 3 days) matches the new fuzzer-corpus mutation that reaches the validateStorage cleanup path. Not a regression; the corpus rolled into this area recently. The clearCaches throws are unchanged since 2023 (Robert Schulze, Anton Popov authored).

Fix plan (separate PR)

Make Context::clearCaches() defensive — replace each if (!cache) throw with if (cache) cache->clear(). The "was not created yet" exceptions were defensive assertions and provided no real protection; they only fire in fuzzer/test contexts that legitimately operate without initialized caches.

I will open a small follow-up PR with this change plus a regression test (tests/queries/0_stateless/04xxx_create_table_with_invalid_virtual_column_no_cache_init.sql) once a server-side test harness for the no-caches path is wired up — or, if a simpler approach lands, by adding setUncompressedCache(...) etc. calls in execute_query_fuzzer.cpp and noting that production callers always initialize first. I will defer to the comp-query-execution reviewers on which fix shape they prefer.

Tracking on our side: task 2026-05-13-github-clickhouseclickhouse104551--.

pull Bot pushed a commit to sonvt1710/ClickHouse that referenced this pull request Jun 9, 2026
Context::clearCaches threw a LOGICAL_ERROR ("X cache was not created yet.")
on the first null cache pointer it encountered. Production servers always
initialize all caches at startup, so the throws never fired in production -
but the execute_query_fuzzer libFuzzer harness (and any unit-test harness
built around a minimal Context) creates a Context without calling any
set*Cache initializer. When a fuzzed CREATE TABLE failed validateStorage,
the cleanup path (MergeTreeData::dropAllData -> Context::clearCaches)
tripped the assertion and libFuzzer reported a crash with the trace
documented on PR ClickHouse#104551.

The fix replaces each "null cache -> throw" guard with a defensive
"if (cache) cache->clear()" check, matching the pattern already used by
every single-cache clear<X>Cache method on Context (clearUncompressedCache,
clearMarkCache, clearPrimaryIndexCache, and so on - 14 sibling methods,
all defensive). clearCaches was the lone outlier.

Adds a regression test in src/Interpreters/tests/gtest_context_clear_caches.cpp
that copies the global test Context (whose caches are all null because
gtest_global_context.cpp never sets them) and calls clearCaches twice -
both calls must complete without throwing. Verified that the test fails
against the unmodified source with "Logical error: 'Uncompressed cache
was not created yet.'" and passes once the defensive null checks are in
place.

See ClickHouse#104551 (comment)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Analyzer causes NOT_FOUND_COLUMN_IN_BLOCK on distributed table queries

3 participants