Fix flaky tests 03394_distributed_shuffle_join_with_aggregation and _with_in by groeneai · Pull Request #109386 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix flaky tests 03394_distributed_shuffle_join_with_aggregation and _with_in#109386

Merged
davenger merged 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-03394-shuffle-join-aggregation-flaky
Jul 4, 2026
Merged

Fix flaky tests 03394_distributed_shuffle_join_with_aggregation and _with_in#109386
davenger merged 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-03394-shuffle-join-aggregation-flaky

Conversation

@groeneai

@groeneai groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Fixes flakiness of 03394_distributed_shuffle_join_with_aggregation and its sibling 03394_distributed_shuffle_join_with_in, seen in Stateless tests runs under randomized settings (e.g. amd_tsan).

No related open issue found (checked GitHub issues for the test name and the flaky signature).

Root cause: the tests assert an exact distributed query plan via EXPLAIN. When CI randomization enables statistics (use_statistics=1 + materialize_statistics_on_insert=1 + a non-empty auto_statistics_types), the cost-based distributed planner reads column NDV / row estimates from the statistics estimator. In tryMakeDistributedAggregation (src/Processors/QueryPlan/Optimizations/makeDistributed.cpp) this changes the estimated group count and flips the aggregation strategy from Shuffle (Aggregating + ShuffleExchange) to partial+merge (Aggregating (partial) + MergingAggregated + ScatterExchange), so the asserted plan diverges from the reference. Query results are unaffected; only the EXPLAIN plan shape changes.

The whole statistics estimator path is gated on use_statistics (estimateReadRowsCount in optimizeJoin.cpp), so pinning SET use_statistics = 0 makes the plan deterministic regardless of the statistics randomization. This mirrors 03357_join_pk_sharding and 03279_join_choose_build_table, which pin the same setting for the same reason (they also assert on the join/plan shape).

CI's own --diagnose-random-settings minimizer isolated the culprits on the failing _with_aggregation run (public PR #86353, Stateless tests (amd_tsan, parallel, 2/2)): materialize_statistics_on_insert True + auto_statistics_types tdigest,minmax,basic (36/36 fail with the settings, 51/51 pass without). Public report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=86353&sha=d03b55a32fff040b8dac9918793a53b3d868cd29&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29

_with_in is structurally identical to _with_aggregation (same table, same distributed settings, same EXPLAIN plan-shape assertion; only the subquery predicate differs) and shares the same use_statistics-gated code path, so it carries the same latent flake. Verified the _with_in EXPLAIN output is byte-identical with use_statistics 0 vs 1 (statistics materialized) and matches the committed .reference, so the pin does not change what the test asserts. The distributed shuffle-join plan flip is a property of CI's real multi-replica cluster and is not reproducible on a single node, so both variants are pinned preventively on the same proven code path.

Verified locally: the fixed _with_aggregation passes 100/100 under full settings randomization and 10/10 with randomization disabled.

Version info

  • Merged into: 26.7.1.512

The test asserts an exact distributed query plan via EXPLAIN. When CI randomization
enables statistics (use_statistics + materialize_statistics_on_insert +
auto_statistics_types), the cost-based distributed planner reads column NDV from the
statistics estimator, which changes the estimated group count and flips the aggregation
strategy from Shuffle to partial+merge (see makeDistributed.cpp tryMakeDistributedAggregation
and estimateReadRowsCount, gated on use_statistics). The asserted plan then diverges.

Pin use_statistics = 0 so the plan is deterministic regardless of the statistics
randomization. Mirrors 03357_join_pk_sharding and 03279_join_choose_build_table, which
pin the same setting for the same reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

@groeneai

groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

cc @davenger — could you review this? It pins use_statistics = 0 in the test so the asserted distributed EXPLAIN plan stays deterministic: with statistics randomized on, tryMakeDistributedAggregation reads NDV from the estimator and flips the aggregation strategy (Shuffle -> partial+merge), which diverges from the reference.

@clickhouse-gh

clickhouse-gh Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [da40604]

Summary:


AI Review

Summary

This PR makes 03394_distributed_shuffle_join_with_aggregation and 03394_distributed_shuffle_join_with_in deterministic by pinning use_statistics = 0, so randomized statistics cannot flip the distributed aggregation strategy and invalidate the EXPLAIN references. I did not find a remaining correctness or coverage problem in the change: the pin matches other plan-sensitive tests, and statistics-enabled distributed planning is still exercised elsewhere in the 03394_* suite.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jul 4, 2026
@davenger davenger added the can be tested Allows running workflows for external contributors label Jul 4, 2026
@davenger davenger self-assigned this Jul 4, 2026
Same fix as the sibling _with_aggregation in this PR, applied to the _with_in
variant. The test asserts an exact distributed query plan via EXPLAIN. When CI
randomization enables statistics (materialize_statistics_on_insert +
non-empty auto_statistics_types), the cost-based distributed planner reads
column NDV from the statistics estimator (estimateReadRowsCount, gated on
use_statistics in optimizeJoin.cpp), which changes the estimated group count
and can flip the aggregation strategy from Shuffle to partial+merge. The
asserted plan then diverges.

The sibling _with_aggregation has a CI --diagnose-random-settings verdict
confirming this exact culprit (materialize_statistics_on_insert True,
auto_statistics_types tdigest,minmax,basic). _with_in is structurally identical
(same table, same distributed settings, same EXPLAIN plan-shape assertion; only
the subquery predicate differs) and shares the same use_statistics-gated code
path, so it carries the same latent flake.

Pin use_statistics = 0 so the plan is deterministic regardless of the statistics
randomization. Verified the EXPLAIN output is byte-identical with use_statistics
0 vs 1 and matches the committed .reference, so the pin does not alter what the
test asserts. Mirrors 03357_join_pk_sharding and 03279_join_choose_build_table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai changed the title Fix flaky test 03394_distributed_shuffle_join_with_aggregation Fix flaky tests 03394_distributed_shuffle_join_with_aggregation and _with_in Jul 4, 2026
@groeneai

groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

Extended this PR to also cover the sibling 03394_distributed_shuffle_join_with_in (commit da40604) — same root cause, same one-line fix (SET use_statistics = 0).

_with_in is structurally identical to _with_aggregation: same table, same distributed settings, same EXPLAIN plan-shape assertion; only the subquery predicate differs (lang IN ('en','de') vs lang = 'en'). It hits the same use_statistics-gated estimator path in optimizeJoin.cpp, so it carries the same latent statistics-randomization flake.

The distributed shuffle-join plan flip only manifests on CI's real multi-replica cluster (not reproducible single-node), so _with_in is pinned preventively. I did confirm the pin is orthogonal to the assertion: the _with_in EXPLAIN output is byte-identical with use_statistics 0 vs 1 (statistics materialized via auto_statistics_types = 'tdigest,minmax,basic') and matches the committed .reference.

@clickhouse-gh

clickhouse-gh Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.92% 85.93% +0.01%
Functions 91.48% 91.48% +0.00%
Branches 78.47% 78.51% +0.04%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 99 line(s), 3 function(s) across 44 file(s) · Details

Top files
  • src/Storages/MergeTree/MergeFromLogEntryTask.cpp: 7 line(s)
  • src/AggregateFunctions/AggregateFunctionSparkbar.cpp: 5 line(s)
  • src/Functions/FunctionsNumericIndexedVector.h: 5 line(s)
  • src/Functions/FunctionsExternalDictionaries.h: 4 line(s)
  • src/Functions/h3EdgeAngle.cpp: 4 line(s)

Full report

@groeneai

groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

@davenger davenger added this pull request to the merge queue Jul 4, 2026
Merged via the queue into ClickHouse:master with commit 5c08fe5 Jul 4, 2026
174 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants