Fix flaky tests 03394_distributed_shuffle_join_with_aggregation and _with_in #109386
The test asserts an exact distributed query plan via EXPLAIN. When CI randomization enables statistics (use_statistics + materialize_statistics_on_insert + auto_statistics_types), the cost-based distributed planner reads column NDV from the statistics estimator, which changes the estimated group count and flips the aggregation strategy from Shuffle to partial+merge (see makeDistributed.cpp tryMakeDistributedAggregation and estimateReadRowsCount, gated on use_statistics). The asserted plan then diverges.

Pin use_statistics = 0 so the plan is deterministic regardless of the statistics randomization. Mirrors 03357_join_pk_sharding and 03279_join_choose_build_table, which pin the same setting for the same reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cc @davenger, could you review this? It pins
Workflow [PR], commit [da40604]
Summary: ✅
AI Review
Summary: This PR makes
Final Verdict
Status: ✅ Approve
Same fix as the sibling _with_aggregation in this PR, applied to the _with_in variant.

The test asserts an exact distributed query plan via EXPLAIN. When CI randomization enables statistics (materialize_statistics_on_insert + non-empty auto_statistics_types), the cost-based distributed planner reads column NDV from the statistics estimator (estimateReadRowsCount, gated on use_statistics in optimizeJoin.cpp), which changes the estimated group count and can flip the aggregation strategy from Shuffle to partial+merge. The asserted plan then diverges.

The sibling _with_aggregation has a CI --diagnose-random-settings verdict confirming this exact culprit (materialize_statistics_on_insert True, auto_statistics_types tdigest,minmax,basic). _with_in is structurally identical (same table, same distributed settings, same EXPLAIN plan-shape assertion; only the subquery predicate differs) and shares the same use_statistics-gated code path, so it carries the same latent flake.

Pin use_statistics = 0 so the plan is deterministic regardless of the statistics randomization. Verified the EXPLAIN output is byte-identical with use_statistics 0 vs 1 and matches the committed .reference, so the pin does not alter what the test asserts. Mirrors 03357_join_pk_sharding and 03279_join_choose_build_table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extended this PR to also cover the sibling
The distributed shuffle-join plan flip only manifests on CI's real multi-replica cluster (not reproducible single-node), so
LLVM Coverage Report
Changed lines: No C/C++ source files changed; skipping uncovered code analysis.
Newly covered by added/modified tests: 99 line(s), 3 function(s) across 44 file(s)

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
Fixes flakiness of 03394_distributed_shuffle_join_with_aggregation and its sibling 03394_distributed_shuffle_join_with_in, seen in Stateless tests runs under randomized settings (e.g. amd_tsan).

No related open issue found (checked GitHub issues for the test name and the flaky signature).

Root cause: the tests assert an exact distributed query plan via EXPLAIN. When CI randomization enables statistics (use_statistics=1 + materialize_statistics_on_insert=1 + a non-empty auto_statistics_types), the cost-based distributed planner reads column NDV / row estimates from the statistics estimator. In tryMakeDistributedAggregation (src/Processors/QueryPlan/Optimizations/makeDistributed.cpp) this changes the estimated group count and flips the aggregation strategy from Shuffle (Aggregating + ShuffleExchange) to partial+merge (Aggregating (partial) + MergingAggregated + ScatterExchange), so the asserted plan diverges from the reference. Query results are unaffected; only the EXPLAIN plan shape changes.

The whole statistics estimator path is gated on use_statistics (estimateReadRowsCount in optimizeJoin.cpp), so pinning SET use_statistics = 0 makes the plan deterministic regardless of the statistics randomization. This mirrors 03357_join_pk_sharding and 03279_join_choose_build_table, which pin the same setting for the same reason (they also assert on the join/plan shape).

CI's own --diagnose-random-settings minimizer isolated the culprits on the failing _with_aggregation run (public PR #86353, Stateless tests (amd_tsan, parallel, 2/2)): materialize_statistics_on_insert True + auto_statistics_types tdigest,minmax,basic (36/36 fail with the settings, 51/51 pass without). Public report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=86353&sha=d03b55a32fff040b8dac9918793a53b3d868cd29&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29

_with_in is structurally identical to _with_aggregation (same table, same distributed settings, same EXPLAIN plan-shape assertion; only the subquery predicate differs) and shares the same use_statistics-gated code path, so it carries the same latent flake. Verified the _with_in EXPLAIN output is byte-identical with use_statistics 0 vs 1 (statistics materialized) and matches the committed .reference, so the pin does not change what the test asserts. The distributed shuffle-join plan flip is a property of CI's real multi-replica cluster and is not reproducible on a single node, so both variants are pinned preventively on the same proven code path.

Verified locally: the fixed _with_aggregation passes 100/100 under full settings randomization and 10/10 with randomization disabled.

Version info
26.7.1.512
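
For readers who have not opened the diff, the pin described in the Description can be sketched roughly as below. This is a minimal sketch, not the actual change: only the SET statement and the setting name come from this PR. The test's real query and table are not shown in this conversation, so the plan-shape assertion is left as a commented placeholder rather than guessed, and its placement relative to the pin is an assumption.

```sql
-- Pin statistics usage at the top of the test so the cost-based distributed
-- planner does not read column NDV from the statistics estimator.
-- (Setting name and value are taken from the PR description.)
SET use_statistics = 0;

-- Placeholder (assumption): the test's existing EXPLAIN-based plan-shape
-- assertion would follow here, unchanged. Its exact text is not part of
-- this conversation, so it is intentionally not reproduced.
-- EXPLAIN ... ;
```

Because the pin only disables the statistics-driven estimate, the committed .reference is expected to stay as it is, which matches the byte-identical EXPLAIN check reported above.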