[CI Proof] Cluster mode with cluster_table_function_split_granularity=BUCKET poisons the query condition cache by writing 'no match' for row groups assigned to other workers#104278
Conversation
…larity=BUCKET` poisons query condition cache Each worker on the cluster path computes the QCC "unmatched" set as the complement of `matched` against the *full file* row-group count, but the matched set only contains row groups this worker actually processed (getMatchedBuckets() drops the other workers' buckets via need_to_process). The result: every worker writes 'no match' entries for the buckets owned by other workers, poisoning the cache. The test creates an iceberg table via DataLakeCatalog (REST iceberg) with a single multi-row-group parquet file and a selective predicate that matches in every row group, then runs the same query twice with cluster_table_function_split_granularity=bucket and use_query_condition_cache=1. Under the bug, the second run consults the poisoned cache, skips most/all row groups, and returns a smaller count than the first run. References: src/Storages/ObjectStorage/StorageObjectStorageSource.cpp:524 src/Processors/Formats/Impl/ParquetV3BlockInputFormat.cpp:157 src/Processors/Formats/Impl/Parquet/Reader.cpp:382 PR ClickHouse#102115
…sible in diff window No behavioural change — only converts the 42-line module docstring into a 9-line '#' comment block. The test's cluster fixture (node1+node2 with main_configs=['configs/cluster.xml'] and with_iceberg_catalog=True), the parquet multi-row-group sanity check, and the cache-poisoning assertion (r1 == r2) are unchanged.
…dow shows full body No behavioural change. Inlines a couple of intermediate locals, drops verbose docstrings on helpers, and merges the database-create + DROP QCC loop. All five assertions, both helpers (_load_catalog used at L67, _create_iceberg_database used at L114), the test_* function (L56), the cluster_simple fixture (node1+node2 with main_configs=['configs/cluster.xml'], with_iceberg_catalog=True, L20-32), and the iceberg-table creation with multi-row-group properties (L70-83) are unchanged. Bug-gate assertion remains 'r1 == r2' which fails on the buggy code (r2 < r1) and passes on the fix — the standard proof-test direction.
|
CI completed with failures in |
|
✅ |
|
Reopening this proof PR — the previous close was incorrect. The bot mis-classified a Python The pipeline has been fixed (better log anchoring + setup-vs-assertion distinction). The proof test will now be repaired by the test-fix agent and CI will be re-run. If the resulting CI confirms the QCC bucket-poisoning hypothesis, a new Issue will be filed; otherwise the finding will be dropped. |
|
✅ |
|
This automated ClickGap comment is withdrawn — it narrated internal tooling detail and isn't actionable as written. Please disregard. (ClickGap's output guard now prevents this.) |
|
CI completed with failures in |
There was a problem hiding this comment.
This test can pass even when the bug path is not exercised, because it never proves the second query actually reads from QueryConditionCache. If cache lookup is bypassed (e.g. cache key mismatch / non-persistent key path), r1 == r2 still holds and we get a false negative.
Please add an explicit cache-usage assertion (for example, query system.query_log and require ProfileEvents['QueryConditionCacheHits'] > 0 for the second run, similar to existing QCC integration tests).
There was a problem hiding this comment.
Attempted to rewrite the test but encountered an issue:

Found via ClickGap automated review. Please close or comment if this is incorrect or needs adjustment.
Suspected bug: Cluster mode with
cluster_table_function_split_granularity=BUCKETpoisons the query condition cache by writing 'no match' for row groups assigned to other workersRoot cause: src/Storages/ObjectStorage/StorageObjectStorageSource.cpp:524 uses
total_groups = buckets_opt->second(=file_metadata.row_groups.size()) and treats every row group not inmatched_groupsas 'unmatched', without distinguishing 'predicate evaluated and ruled out' (correct to mark) from 'this worker was assigned a different bucket and never read this row group' (incorrect to mark). Whencluster_table_function_split_granularity=BUCKETcauses the initiator to split files into per-row-group buckets, each worker poisons the cache for buckets it did not own.Affected locations:
src/Storages/ObjectStorage/StorageObjectStorageSource.cpp:524—size_t total_groups = buckets_opt->second;— whole-file count used to compute unmatched as the complement ofmatchedsrc/Storages/ObjectStorage/StorageObjectStorageSource.cpp:547—if (!unmatched_ranges.empty())→query_condition_cache->write(...)writes the over-broad unmatched setsrc/Processors/Formats/Impl/ParquetV3BlockInputFormat.cpp:157—if (!row_group.need_to_process) continue;— drops other workers' buckets frommatched, but the caller does NOT subtract these fromtotal_groupssrc/Processors/Formats/Impl/Parquet/Reader.cpp:382—row_group.need_to_process = !row_groups_to_read.has_value() || row_groups_to_read->contains(row_group_idx);— distinguishes 'this bucket' from 'other bucket', NOT 'pruned' from 'kept'Why CI is needed: Silent wrong results on iceberg DataLake parallel-replicas queries when
cluster_table_function_split_granularity=BUCKETis set. Subsequent queries with the same predicate hit the poisoned cache and skip valid row groups, dropping rows from the answer. No exception is thrown.Suggested fix: Either (a) have
getMatchedBucketsALSO return the set of row groups the reader was actually allowed to process (i.e., theneed_to_process == trueset), and only treat row groups in (processed_set \ matched) as 'unmatched'; (b) skip the cache write entirely whensetBucketsToReadwas called with a non-nullbuckets_to_read(i.e. when this worker only processed a subset of the file). Option (b) is the smallest change: gate the newelse ifblock on!input_format->wasGivenBucketsToRead()or equivalent.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Not applicable — temporary CI proof PR, will be closed automatically.
Documentation entry for user-facing changes