davenger · 2025-08-28T11:27:09Z

A Cascades-style cost-based optimizer that chooses distribution strategies for multi-stage distributed query plans (#106020). It explores alternatives in a memo with top-down, goal-directed search and picks the cheapest plan satisfying the required distribution and sorting properties, inserting exchange operators as needed.

Implemented:

Join strategies: shuffle hash join, broadcast hash join (with ReplicatedRead — every worker repeats the same read of a small table instead of a network broadcast, assuming shared storage where all workers see the same data), local join.
Aggregation strategies: two-phase (partial + merge), shuffle by group keys, local.
Top-N: two-stage distributed top-N (per-node bounded sort, sorted-merge gather, coordinator limit).
Read strategies: parallel N-way read, replicated read, local read.
Properties and enforcers: distribution (node count, replication, partitioning columns with equivalence classes and hash cast types) and sorting, bridged by Gather/Shuffle/Broadcast/ScatterExchange and Sort enforcers.
Transformations: join commutativity, two-phase aggregation split, two-stage top-N split.
Cost model: work, network, and sequential components with configurable weights, a fixed per-exchange overhead, broadcast costed per receiving node, and statistics clamped to join kind and strictness semantics.
Fail-closed behavior: an unconvertible plan with exchange steps, an exhausted optimization task budget, and an invalid cost-config override all reject the query with a clear error instead of degrading silently.
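
As a rough illustration of the cost model item above, the sketch below combines work, network, and sequential components with configurable weights, adds a fixed overhead per exchange, and charges a broadcast once per receiving node. Every name here (CostWeights, Cost, scalarCost, shuffleExchangeCost, broadcastExchangeCost) and every constant is hypothetical; the actual formulas live in the PR's Cost.h and ARCHITECTURE.md and may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical weights; the default values are placeholders, not the PR's calibration.
struct CostWeights
{
    double work = 1.0;
    double network = 1.0;
    double sequential = 1.0;
    double exchange_overhead = 100.0;  // fixed charge per exchange operator (assumed)
};

// Hypothetical cost vector: components are accumulated separately and
// collapsed into one number only when plans are compared.
struct Cost
{
    double work = 0.0;
    double network = 0.0;
    double sequential = 0.0;
    std::size_t exchanges = 0;

    Cost operator+(const Cost & rhs) const
    {
        return {work + rhs.work, network + rhs.network, sequential + rhs.sequential, exchanges + rhs.exchanges};
    }
};

// Weighted sum of the components plus the per-exchange overhead.
double scalarCost(const Cost & cost, const CostWeights & weights)
{
    return weights.work * cost.work
        + weights.network * cost.network
        + weights.sequential * cost.sequential
        + weights.exchange_overhead * static_cast<double>(cost.exchanges);
}

// Assumption: a shuffle sends each byte over the network once.
Cost shuffleExchangeCost(double bytes)
{
    return Cost{0.0, bytes, 0.0, 1};
}

// A broadcast is charged once per receiving node, as the description states.
Cost broadcastExchangeCost(double bytes, std::size_t receiving_nodes)
{
    assert(receiving_nodes > 0);
    return Cost{0.0, bytes * static_cast<double>(receiving_nodes), 0.0, 1};
}
```

With these placeholder weights, broadcasting 1000 bytes to 4 nodes is charged 4000 network units plus the exchange overhead, against 1000 plus overhead for a shuffle of the same input, so a broadcast only wins when it avoids moving a much larger side of the join.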

Design, a worked example on TPC-H data (a simplified 3-table query traced through the memo), and current limitations are documented in src/Processors/QueryPlan/Optimizations/Cascades/ARCHITECTURE.md. Plan-shape tests cover the actual TPC-H queries (03836_tpch_join_order_plans).

Disabled by default. Requires the analyzer and the multi-stage distributed execution configuration (stateless workers):

SET enable_cascades_optimizer = 1, make_distributed_plan = 1;

For tests, param__internal_cascades_cluster_node_count overrides the cluster size, param__internal_cascades_cost_config overrides the cost weights, param__internal_join_table_stat_hints injects table statistics, and distributed_plan_execute_locally = 1 runs the distributed stages in-process.

Related: #106020

Changelog category (leave one):

Experimental Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added an experimental Cascades cost-based optimizer for distributed query plans, enabled by enable_cascades_optimizer = 1 together with make_distributed_plan = 1. It chooses by estimated cost between shuffle, broadcast, and local join strategies; two-phase, shuffle, and local aggregation; two-stage distributed top-N; and parallel and replicated reads, inserting exchange operators as needed.

Documentation entry for user-facing changes

Documentation written in /docs

clickhouse-gh · 2025-08-28T11:27:36Z

Workflow [PR], commit [10cdc0b]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
Stateless tests (arm_asan_ubsan, targeted)		FAIL
	03394_distributed_shuffle_join_with_aggregation	FAIL	cidb
	03394_distributed_shuffle_join_with_aggregation	FAIL	cidb
	03394_distributed_shuffle_join_with_aggregation	FAIL	cidb
	03394_distributed_shuffle_join_with_aggregation	FAIL	cidb
Stateless tests (arm_binary, parallel)		FAIL
	03394_distributed_shuffle_join_with_in	FAIL	cidb
	03394_distributed_shuffle_join_with_aggregation	FAIL	cidb

AI Review

Summary

This PR adds an experimental Cascades cost-based optimizer for distributed query plans. The overall design is substantial, but the current head still has several live contract violations in plan reconstruction, distributed read consistency, and task routing that can either change query results or reject otherwise valid queries late in execution, so this is not ready to merge.

Findings

❌ Blockers

[src/Processors/QueryPlan/Optimizations/Cascades/Rules/DefaultImplementation.cpp:30-46] DefaultImplementation still admits steps that the Cascades plan builder cannot reconstruct. buildBestPlan unconditionally calls clone on the chosen step, so plans containing ReadFromPreparedSource / ReadFromStorageStep can still die with NOT_IMPLEMENTED during optimization instead of either staying local or being rejected up front. Suggested fix: reject non-cloneable steps before memo construction, or restrict DefaultImplementation to step types with a working clone.
[src/Processors/QueryPlan/Optimizations/Cascades/Rules/JoinCommutativity.cpp:24-47] JoinCommutativity still ignores JoinStepLogical::isOptimized. That lets Cascades swap USING joins whose changed_types contract has already been baked into actions_after_join, but swapInputs does not rebuild that projection/cast contract, so the swapped plan can return different semantics from the original join. Suggested fix: skip isOptimized joins here and in any other transformation that reconstructs a logical join.
[src/Processors/QueryPlan/Optimizations/Cascades/Rules/ParallelReadImplementation.cpp:101-136] ReplicatedReadImplementation still advertises a replicated full copy for any ReadFromMergeTree, but it neither proves that every worker sees the same complete table nor pins the coordinator snapshot. In a broadcast join, different workers can therefore read different or partial right-hand sides, changing query results. Suggested fix: gate ReplicatedRead to storage layouts with a guaranteed shared full snapshot, and serialize the coordinator-selected part list for the replicated path too; otherwise use a single read plus BroadcastExchangeStep.
[src/Processors/QueryPlan/Optimizations/makeDistributed.cpp:1010-1039] reconciling 1 shard with N shards still clones the first single-shard task into every destination shard and then merges parameters with insert. If the first child is single-shard and the second child is distributed, the copied bucket_id / total_buckets stay at 0 / 1, so every combined task reads bucket 0 from the distributed side and the other buckets are dropped. Suggested fix: build the expanded task from the N-shard side, or overwrite routing parameters after merging child task parameters.
[src/Processors/QueryPlan/ReadFromMergeTree.cpp:3797-3815, 5102-5113] the serialized pinned-part contract is still wrong in two ways: an empty coordinator snapshot is indistinguishable from “no pinned list”, and non-empty part lists are re-sorted lexicographically before worker bucketing. That lets workers read parts the coordinator did not see and can also break the execution order required by distributed requestReadingInOrder. Suggested fix: serialize an explicit “pinned list present” flag and preserve the coordinator’s original parts_with_ranges order exactly.
[src/QueryPipeline/DistributedPlanExecutor.cpp:970-975] chooseTaskSerializationVersion still decides whether version 1 is safe by comparing producer ports with the initiator’s distributed_query.streaming_exchange_port. With per-node exchange ports, a consumer whose local fallback port differs can still receive a version-1 task and dial producer:<consumer_port> instead of the producer’s real port. Suggested fix: keep version 2 whenever producer endpoints may differ from the destination worker’s fallback port, or base the decision on the destination worker port rather than the initiator’s port.
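
The shard-reconciliation blocker above hinges on how an insert-style merge treats keys that already exist. The sketch below reproduces that behaviour with std::map, whose insert never overwrites an existing key. It assumes task parameters behave like a string-to-string map, which is an illustration only; TaskParameters, mergeKeepingExisting, and mergeOverwritingRouting are hypothetical names, not code from makeDistributed.cpp.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a task's parameter set.
using TaskParameters = std::map<std::string, std::string>;

// Mirrors the reported behaviour: std::map::insert keeps existing keys,
// so routing values copied from the single-shard task survive the merge.
TaskParameters mergeKeepingExisting(TaskParameters base, const TaskParameters & child)
{
    base.insert(child.begin(), child.end());
    return base;
}

// One possible fix: let the distributed child overwrite the routing keys,
// while all other parameters keep insert semantics.
TaskParameters mergeOverwritingRouting(TaskParameters base, const TaskParameters & child)
{
    for (const auto & entry : child)
    {
        if (entry.first == "bucket_id" || entry.first == "total_buckets")
            base[entry.first] = entry.second;
        else
            base.insert(entry);
    }
    return base;
}
```

Merging a single-shard task carrying bucket_id = 0 / total_buckets = 1 with a distributed child carrying 2 / 4 leaves 0 / 1 in place under the first function and yields 2 / 4 under the second. Building the expanded task from the N-shard side, the review's other suggestion, avoids the question entirely.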

⚠️ Majors

[src/Processors/QueryPlan/Optimizations/QueryPlanOptimizationSettings.cpp:213-215] projection rewrites still stay enabled under make_distributed_plan, while ReadFromMergeTree::isSerializable at [src/Processors/QueryPlan/ReadFromMergeTree.h:432] still reports every bucketed read as serializable. Unsupported distributed reads from projections, deferred FINAL filters, or direct text-index tasks therefore slip past planning-time validation and fail only later when serializeQueryPlan walks the fragment. Suggested fix: either disable normal / forced projection rewrites for distributed plans, or make isSerializable reflect the bucketed-read constraints and rerun the distributed-read support check after projection selection.
[src/Processors/QueryPlan/Optimizations/Cascades/Rules/AggregationImplementations.cpp:266-276] the two-phase merge path still hardcodes memory_efficient_aggregation to false, so enabling distributed_aggregation_memory_efficient = 1 no longer affects Cascades-built distributed aggregation even though that setting defaults to enabled in the legacy path. Suggested fix: thread the optimization setting into Cascades and pass the same distributed_aggregation_memory_efficient && !has_grouping_sets contract used by the rule-based planner.
[src/Processors/QueryPlan/AggregatingStep.h:87-90] the new in-order aggregation serialization payload is still unreachable because AggregatingStep::isSerializable rejects every step with a non-empty sort_description_for_merging. A remote fragment that legitimately contains such an aggregation still fails the pre-check with the error "make_distributed_plan cannot distribute this query" instead of using the serialized state. Suggested fix: make isSerializable match the serializer, or keep rejecting / planning away from in-order aggregation consistently before those steps enter a distributed fragment.
[src/Processors/QueryPlan/Optimizations/Cascades/Rules/TopNImplementation.cpp:136-137] the internal TopN merge cap is still a real LimitStep built with always_read_till_end = false. Under exact_rows_before_limit = 1, rows_before_limit_at_least then stops at that internal cap and under-reports the true pre-limit row count for distributed top-N queries. Suggested fix: either skip TwoStageTopN when the user-visible limit must read till end, or replace the internal cap with an implementation that does not block the upstream counter.
[src/Processors/QueryPlan/Optimizations/Cascades/StatisticsDerivation.cpp:366-370] deriveFilterStatistics still models every FilterStep as selectivity 1: it remaps NDVs but leaves row counts unchanged. Real post-join or computed filters can therefore be treated as if they removed no rows at all, which can flip broadcast-vs-shuffle and other downstream cost choices. Suggested fix: apply at least a heuristic row-count reduction here, or reuse the existing selectivity estimator when the filter remains close enough to a table-read path.

Tests

⚠️ Add a worker-backed regression for ReplicatedRead where the build-side table is either local/sharded or changes between task starts; that is the smallest proof that every worker sees the same right-hand-side snapshot.
⚠️ Add a distributed ORDER BY ... LIMIT regression with exact_rows_before_limit = 1 and Cascades enabled to prove that rows_before_limit_at_least survives the internal top-N merge cap.
⚠️ Add a focused distributed-read regression that selects a projection or deferred-FINAL / direct text-index path under make_distributed_plan and proves the query is rejected during support checks rather than later during fragment serialization.

Final Verdict

Status: ❌ Block

Minimum required actions:

Fix the live wrong-results / execution-contract bugs in ReplicatedReadImplementation, task-shard reconciliation, pinned distributed reads, JoinCommutativity, and exchange-port serialization.
Make unsupported distributed-read shapes fail closed during planning instead of during fragment serialization.
Add focused regressions for the distributed snapshot, exact_rows_before_limit, and fail-closed validation paths above.

clickhouse-gh · 2026-07-01T22:29:51Z

+
+ExpressionStatistics StatisticsDerivation::deriveFilterStatistics(const FilterStep & filter_step, const ExpressionStatistics & input_statistics)
+{
+    ExpressionStatistics result_statistics = input_statistics;


deriveFilterStatistics still models every FilterStep as selectivity = 1: it remaps NDVs but leaves estimated_row_count and max_row_count unchanged. After the earlier pushdown pass there are still realistic filters on computed columns or post-join expressions, so broadcast-vs-shuffle and downstream exchange costs can ignore a large cardinality reduction entirely and pick the opposite plan. Please derive at least a heuristic row-count reduction here (or reuse the existing selectivity estimator when the filter still sits on a ReadFromMergeTree path) instead of copying the input cardinality unchanged.
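
A minimal sketch of the heuristic row-count reduction requested above, under stated assumptions: RowEstimate is a simplified, hypothetical stand-in for ExpressionStatistics that keeps only the two counters named in the comment, and the default selectivity of 0.25 is an arbitrary placeholder rather than a calibrated value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for ExpressionStatistics.
struct RowEstimate
{
    double estimated_row_count = 0.0;
    std::uint64_t max_row_count = 0;
};

// Placeholder default; a real implementation would derive this from the predicate.
constexpr double default_filter_selectivity = 0.25;

RowEstimate applyFilterHeuristic(const RowEstimate & input, double selectivity = default_filter_selectivity)
{
    assert(selectivity >= 0.0 && selectivity <= 1.0);

    RowEstimate result = input;
    result.estimated_row_count = input.estimated_row_count * selectivity;

    // Keep at least one row for a non-empty input so downstream costs
    // do not collapse to zero on a guessed selectivity.
    if (input.estimated_row_count >= 1.0)
        result.estimated_row_count = std::max(1.0, result.estimated_row_count);

    // max_row_count stays as the upper bound: a filter never adds rows,
    // and a guess must not tighten a guaranteed bound.
    return result;
}
```

Leaving max_row_count untouched is a deliberate choice in this sketch: only the estimate is scaled, so anything that relies on the hard upper bound stays safe even when the guessed selectivity is wrong.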

clickhouse-gh · 2026-07-02T17:20:05Z

            writeStringBinary(part_name, ctx.out);
    }
+
+    bool has_input_order = query_info.input_order_info != nullptr;


Now that read-in-order metadata is serialized for distributed reads, distributed_read_part_names also becomes the worker-side execution-order contract for requestReadingInOrder: deserialization rebuilds result.parts_with_ranges in this exact sequence, and spreadMarkRangesAmongStreamsWithOrder then consumes that vector from the back to decide which ranges each ordered reader sees first. The std::sort(part_names...) above changes the coordinator's selectRangesToRead() order into lexicographic part-name order, which is not the same thing when the ORDER BY key is unrelated to part naming or partitioning. A distributed SortedRead can therefore claim sorted output while actually reading parts in the wrong order. Please preserve the coordinator's original parts_with_ranges order instead of sorting the names.
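
To make the ordering concern concrete, the sketch below contrasts a sorted copy of a part-name list with the untouched coordinator sequence. The function names are hypothetical and the part names in the usage note are made up; the only point is that std::sort on strings yields lexicographic order, which need not match the order the coordinator selected.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// What a worker would rebuild if the serializer sorted the names first.
std::vector<std::string> lexicographicOrder(std::vector<std::string> part_names)
{
    std::sort(part_names.begin(), part_names.end());
    return part_names;
}

// Order-preserving alternative: pass the coordinator's sequence through as-is.
std::vector<std::string> coordinatorOrder(const std::vector<std::string> & part_names)
{
    return part_names;
}
```

For a coordinator sequence of all_2_2_0, all_10_10_0, all_1_1_0, the sorted variant returns all_10_10_0, all_1_1_0, all_2_2_0, because '0' compares lower than '_'. That is neither the coordinator's order nor numeric block order, so a reader consuming the vector positionally would see different parts first.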

clickhouse-gh · 2026-07-03T16:28:41Z

@@ -4876,9 +4883,6 @@ void ReadFromMergeTree::serialize(Serialization & ctx) const
    /// non-bucket read is rebuilt and re-optimized on the worker, which re-derives them.
    if (distributed_read_bucket_count > 0)


assertFragmentSerializable still trusts ReadFromMergeTree::isSerializable(), but that method always returns true. These new bucketed-read rejections therefore slip past the pre-check and only fail later when serializeQueryPlan walks the fragment. That regresses the intended fail-closed path not only for projections, but also for deferred FINAL filters and direct text-index tasks. Please either make isSerializable() reflect these bucketed-read constraints, or move the checks into the earlier distributed-read support pass so make_distributed_plan rejects them before building the distributed plan.
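
One way to picture the first suggested fix is a pre-check that actually consults the bucketed-read constraints. In the sketch below, ReadStepSketch and its boolean fields are hypothetical; only the names isSerializable, assertFragmentSerializable, distributed_read_bucket_count, and the error text come from the discussion above. Treating every non-bucket read as serializable is an assumption based on the quoted code comment.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical model of a read step; the real ReadFromMergeTree is far richer.
struct ReadStepSketch
{
    std::size_t distributed_read_bucket_count = 0;
    bool reads_from_projection = false;
    bool has_deferred_final_filter = false;
    bool has_direct_text_index_task = false;

    bool isSerializable() const
    {
        // Assumption: a non-bucket read is rebuilt on the worker, so nothing blocks it here.
        if (distributed_read_bucket_count == 0)
            return true;
        // Bucketed reads reject the shapes the serializer cannot ship.
        return !reads_from_projection && !has_deferred_final_filter && !has_direct_text_index_task;
    }
};

// Fails closed before any serialization work starts.
void assertFragmentSerializable(const std::vector<ReadStepSketch> & fragment)
{
    for (const auto & step : fragment)
        if (!step.isSerializable())
            throw std::runtime_error("make_distributed_plan cannot distribute this query");
}

// Small helper for callers that prefer a boolean to an exception.
bool fragmentPassesPrecheck(const std::vector<ReadStepSketch> & fragment)
{
    try
    {
        assertFragmentSerializable(fragment);
        return true;
    }
    catch (const std::runtime_error &)
    {
        return false;
    }
}
```

The design point is that the predicate and the serializer reject the same shapes, so a query is refused at planning time instead of partway through shipping fragments.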

clickhouse-gh · 2026-07-04T13:49:13Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered: 2410/2681 (89.89%) · Uncovered code

Full report · Diff report

davenger added 10 commits August 27, 2025 23:54

Use setSwapInputs() for join commutativity rule

89f9df6

Build plan from best expressions

d7bf9ed

Implement clone() for Sorting step

53104e6

Disable enable_cascades_optimizer by default

622ad21

Sum inputs costs by default

9db5791

Build optimized plan from best expessions

409c951

Rebuild JoinStepLogical with swapped sides in join commutativity rule

a11fb6f

Some comments and cleanups

f5bce0b

Column NDV statistics stub

0160895

Estimate join equality predicate selectivity

ae74086

davenger marked this pull request as draft August 28, 2025 11:27

clickhouse-gh Bot added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) Aug 28, 2025

novikd self-assigned this Aug 28, 2025

davenger added 14 commits August 28, 2025 13:56

Style fixes

4112cd1

Add enable_cascades_optimizer setting

18a98ba

Fix for adding top nodes twice

a59edb8

Properly get join step

3efd985

Update to refactored JoinStepLogical

ee49ce6

Added swapInputs() method

5a0da9f

Use swapInputs() in join cmmutativity rule

fee7883

Fix typo

84d9bb6

Fix step descriptions

acb9bc4

More work on propagating statistics

93d33ef

Fix build

abeac4b

Typo

31f45ab

Properly swap sources

3b9c21f

Support getting stats from JSON hint

120b3c5

davenger force-pushed the wip_cascades branch from 852ec15 to 8da0548 September 23, 2025 12:19

Use EquivalenceClasses to add transitive equality predicates to join graph

f7b5caf

Force legacy EXPLAIN format in distributed-plan tests for stable output under randomized settings

27109bf

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh Bot reviewed Jul 1, 2026

Comment thread src/Processors/QueryPlan/Optimizations/Cascades/StatisticsDerivation.cpp

Comment thread src/Processors/QueryPlan/Optimizations/Cascades/Cost.h

davenger and others added 6 commits July 1, 2026 08:38

Cascades: validate cost-config weights and make infinite cost absorb zero weights

dc80bf3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cascades: fail closed when a plan with distributed exchanges cannot run on a worker

97e4b31

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cascades: clamp join row/max-row estimates by join kind and strictness, derive join width from output header

4ac3105

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cascades: cost the TwoStageTopN partial sort on input rows and its gather on L*node_count

84d39a0

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cascades: use initializer-list std::min/std::max in the join row clamp to fix clang-tidy

d345bbb

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Serialize the allow_narrowing flag of UnionStep so a shipped UNION fragment keeps its stream cap

3c9c5ce

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh Bot reviewed Jul 1, 2026

alexey-milovidov mentioned this pull request Jul 2, 2026

Compress hash join right-side blocks in memory before OOM / spilling #107667

Open

1 task

davenger and others added 6 commits July 2, 2026 09:46

Cascades: fix read stats byte width in the estimator branch and row upper bound in prepopulation

deca73d

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cascades: cost broadcast exchange network transfer by receiver count

4333f63

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge branch 'master' into wip_cascades

30716bd

Cascades: sync ARCHITECTURE.md with the current implementation

f762a66

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cascades: pin stat hints in 04040 for environment-stable plans, cover mismatched-type USING join swap in 04332

20416b0

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Reject PASTE JOIN under make_distributed_plan and add a defensive Cascades check for in-order aggregation

84bbcdb

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

clickhouse-gh Bot reviewed Jul 2, 2026

davenger and others added 7 commits July 2, 2026 21:05

Keep deferred FINAL filters and text-index tasks in ReadFromMergeTree clone, note rule-based-planner scope in distributed_plan setting descriptions

e5bcaf7

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cascades: reject LOCAL JOIN and non-Full sorts, pin non-deterministic steps to one node, keep query sort settings and serialize the sort limit, price bounded sorts on input rows

faddf0d

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cascades: name the cost model calibration constants and document the mixed work units

a273e04

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cascades: split pure operator costing from subtree accumulation, move cost formulas onto strategies

5f61e47

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cascades: record partial top-N physical output rows on the expression and price exchanges on the selected child

215edb0

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cascades: make the per-operator cost constants overridable through the cost config

aeccacf

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Format the LOCAL keyword of a join: dropping it failed the debug-build format-reparse check and crashed CI on the LOCAL JOIN test

24b2b02

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

clickhouse-gh Bot reviewed Jul 3, 2026

Pin query_plan_optimize_join_order_randomize in the tests asserting the swapped join tag

d03b55a

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

groeneai mentioned this pull request Jul 4, 2026

Fix flaky tests 03394_distributed_shuffle_join_with_aggregation and _with_in #109386

Open

davenger and others added 2 commits July 4, 2026 10:25

Convert logical joins for the local fallback of make_distributed_plan and keep 04310 stat hints authoritative

afbd13e

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Reuse traverseQueryPlan in the local fallback join conversion

10cdc0b

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Cascades cost-based optimizer for distributed query plans #86353
davenger wants to merge 262 commits into master from wip_cascades
