Enable allow_aggregate_partitions_independently by default#105128
Enable allow_aggregate_partitions_independently by default#105128alexey-milovidov wants to merge 35 commits into
allow_aggregate_partitions_independently by default#105128Conversation
When the partition key suits the `GROUP BY` key, ClickHouse can route each partition through a separate aggregation pipeline and skip the merging step entirely. The setting that gates this optimization, `allow_aggregate_partitions_independently`, has been off by default since it was introduced in 23.2 (#45364) because the optimization is beneficial only when there are enough partitions of roughly equal size. Runtime heuristics in `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePort` already implement those checks: they disable the optimization when the partition count is below `max_threads / 2`, above `max_number_of_partitions_for_independent_aggregation` (128 by default), or when the largest partition has more than twice the average number of rows. In every other case the optimization either pays off or is silently skipped, so the gate-setting can safely default to true. `force_aggregate_partitions_independently` (which bypasses the heuristics) is left at false. Backward-compatible behavior can be restored with `compatibility = '26.4'`.
…d_in_order_spread The test asserts the spread/merging pipeline shape from `ReadInOrderOptimizer` (with `MergingAggregatedBucketTransform` / `FinishAggregatingInOrderTransform`). With `allow_aggregate_partitions_independently` now defaulting to `true`, the two partitions are routed through independent aggregation ports (`FinalizeAggregatedTransform` × 2), so the `EXPLAIN PIPELINE` output diverges from the reference. Force the setting off in the test to preserve its original intent. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=b4bd5f71520b0cb1439283ed7e928fe47c905b07&name_0=PR&name_1=Fast%20test PR: #105128
|
No changes on performance tests. |
…rtitions-independently-by-default
The 26.5 block has been closed since the PR was opened, so register the default flip under `26.6` in `SettingsChangesHistory.cpp` and update the docs note accordingly. This also keeps `02995_new_settings_history` happy.
…rtitions-independently-by-default
…rtitions-independently-by-default
…ughSeparatePortForAggregation` in settings change reason The reason string for `allow_aggregate_partitions_independently` in `system.settings_changes` referenced `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePort`, which is not a real method name. The actual method is `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation`, so users searching the source from this text would find nothing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Fixed in c3d0390 — the reason string now uses the exact method name |
…rtitions-independently-by-default
…g settings The merge of master re-introduced history entries for `query_plan_min_columns_for_join_lazy_indexing` and `query_plan_max_limit_for_join_lazy_indexing`, but these settings were removed from `Settings.cpp` when "Lazily apply selector and replication indexes in join" was reverted on master. Iterating the history for any `compatibility` value below the latest version then tried to set a non-existent setting and threw `UNKNOWN_SETTING`, breaking all compatibility tests in `Fast test`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rtitions-independently-by-default
|
Merged Root cause: a previous merge-conflict resolution re-introduced Fix ( |
|
📊 Cloud Performance Report ✅ AI verdict: This PR flips allow_aggregate_partitions_independently on by default (plus a max_rows_to_group_by guard). That optimization only engages when a table's partition key is a function of the GROUP BY key, so partitions can be aggregated independently and the merge skipped. ClickBench's hits table isn't partitioned that way, so the runtime heuristics skip it and Q28's aggregation path is unchanged — the flagged -5.14% improvement on clickbench Q28 is off-path and reads as run-to-run variance, so we downgraded it to not-sure. No other queries were flagged on clickbench or tpch_adapted_1_official. clickbenchFlagged queries (1 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🟢 No significant changes Debug info
|
…rtitions-independently-by-default
…rtitions-independently-by-default
…rtitions-independently-by-default
|
The sole CI failure here is an in This PR only flips the default of Evidence it is a pre-existing, master-wide fuzzer-discoverable @groeneai, could you investigate this failure and provide a fix in a separate PR? If a fix is already in progress, please link it here. I have re-merged |
|
@alexey-milovidov You are right that this is unrelated to the aggregation setting. It is the chronic That failure ( SELECT DISTINCT NULL, toColumnTypeName(r), toString(toUInt32(...), v) AS r FROM t_variant_null__fuzz_48 ...
INTERSECT SELECT DISTINCT toString(v) AS r, ... FROM t_variant_null__fuzz_0where Fixed in #107374 (merged 2026-06-13 18:34 UTC, Verified your exact query against a debug build:
This PR's current HEAD Note: the join/runtime-filter manifestation of the same family is a separate fix, #106931 ( |
…rtitions-independently-by-default
…rtitions-independently-by-default
…forces The docs note pointed at the "key factors for a good performance" list as the criteria the runtime guard checks, but `requestOutputEachPartitionThroughSeparatePortForAggregation` only skips the optimisation for too few partitions (fewer than `max_threads / 2`), too many partitions (more than `max_number_of_partitions_for_independent_aggregation`), or heavily skewed partition sizes. It does not skip merely because partitions are small, so listing that factor as a runtime criterion overstated the guard. Enumerate the three checks the heuristic actually performs and note that the list below is general layout guidance, only part of which is enforced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
We can only merge this if it improves the performance of some queries with no degradation on others. |
…s set `max_rows_to_group_by` is a global `GROUP BY` limit that, in normal aggregation, is enforced during the final merge phase. The aggregate-partitions-independently optimization skips that merge and aggregates each partition on a separate pipeline, so the limit was enforced against each partition's own hash table rather than globally. With `allow_aggregate_partitions_independently` off by default this was only reachable on an opt-in path, but enabling it by default exposes the gap: a query such as `PARTITION BY a % 16 GROUP BY a SETTINGS max_rows_to_group_by = 100` could return far more than 100 groups (or apply `group_by_overflow_mode` per partition) instead of honoring the global limit. Disable the optimization in `optimizeAggregationPerPartition` when `max_rows_to_group_by != 0`, falling back to normal aggregation. This matches the two sibling skip-merge paths, which already reject the same case: `AggregatingStep::canUseShardedAggregation` returns `false` and `make_distributed_plan` throws `SUPPORT_IS_DISABLED`. Add a regression test that forces the optimization on a partitioned table whose partition key suits the `GROUP BY` key and asserts the global limit is preserved (`TOO_MANY_ROWS` with `group_by_overflow_mode = 'throw'`). This re-applies the guard that was held back in d262e0c pending the author's decision on how to resolve the gap; the disable-guard approach was chosen. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`aggregation_by_partitions.xml` forces the optimization on both server builds, so it cannot show the effect of flipping the `allow_aggregate_partitions_independently` default - the forced setting behaves identically on the master and PR builds the comparison runs. Add a test that deliberately does not set (or force) the setting, so each build uses its own compiled-in default: the old default (0) aggregates with a global merge, while the new default (1) lets the runtime heuristics aggregate each partition independently and skip the merge. The layout is favorable for the heuristics (partition key is a function of the `GROUP BY` key, many roughly equal-sized partitions within the default 128-partition cap), so the optimization fires on the PR build and the query is faster than on the master build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the test deterministic and the win measurable across machines: - Pin `max_threads = 32`. The runtime heuristic disables the optimization when the partition count is below `max_threads / 2`, which is tied to the host's core count. On a 96-core machine the natural `max_threads` is 96, so the 32-partition table (32 < 48) would not take the optimization at all and the 64-partition table only barely. Pinning `max_threads = 32` keeps both the 32- and 64-partition tables (>= 16) on the optimized path regardless of the host. - Add a 50M-row size. At 10M rows the query runs in ~0.1s and the signal is dominated by run-to-run variance; at 50M rows the optimized path is stable and clearly faster (locally ~0.33s -> ~0.19s for 64 partitions and ~0.33s -> ~0.21s for 32 partitions, both with normal aggregation as the baseline). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Re: the performance bar — Local measurement (50M rows,
No degradation on unfavorable layouts — the heuristics already skip the optimization there, consistent with the Also re-applied the |
…rtitions-independently-by-default
The new stateless test was numbered `05020`, but the highest existing test number is `04488`, which makes `Style check/test_numbers_check` fail with `Gap (4415, 5020) > 100` (the check rejects gaps of 100 or more between consecutive stateless test numbers). Rename the `.sql` and `.reference` files to `04489`, the next available number. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ion flake The test verifies that partition-independent aggregation falls back to normal aggregation when `max_rows_to_group_by` is set, by asserting the query throws `TOO_MANY_ROWS`. Under the randomized setting `optimize_aggregation_in_order = 1`, aggregation emits each group as soon as it is complete, so the aggregation hash table never grows past the limit and `max_rows_to_group_by` is never tripped - regardless of partition-independent aggregation, the query succeeds and the assertion fails. Pin `optimize_aggregation_in_order = 0` so the test deterministically exercises the global limit enforced via the merge phase that the optimization skips. CI report: #105128 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Enabling `allow_aggregate_partitions_independently` by default also makes the optimization fall back to normal aggregation whenever `max_rows_to_group_by` is set (the per-partition pipeline skips the merge phase where the global limit is enforced). The stateless test profile (`tests/config/users.d/limits.yaml`) sets `max_rows_to_group_by` to a high value (`10G`) as a safety net "so it will not limit anything", which is non-zero and therefore disables the optimization. `02521_aggregation_by_partitions` exists to verify the partition-independent pipeline, so in CI every `EXPLAIN PIPELINE` collapsed to ordinary aggregation and the test failed deterministically on every shard (`@@ -1,225 +1,94 @@`), while it passed on master where the fallback does not exist. The sibling test `01551_mergetree_read_in_order_spread` was already adjusted in this pull request for the same reason. Clear the limit with `set max_rows_to_group_by = 0` at the top so the test keeps exercising the partition-independent pipeline it is meant to verify. CI report: #105128 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rtitions-independently-by-default
The test verified the `max_rows_to_group_by` guard only through the runtime `TOO_MANY_ROWS` error, which is fragile: `max_rows_to_group_by` is enforced against each aggregation hash table, so whether it trips depends on how the partition layout maps onto threads and on whether the query is distributed. Two CI randomizations defeated it: * `enable_parallel_replicas = 1` routes the aggregation across replicas, where the limit is enforced per replica rather than globally - an orthogonal path this optimization does not touch (the guard already disables independent per-partition aggregation on the local plan when the limit is set). Pin it off so the randomizer cannot select it. * When the optimization is active and each partition gets its own aggregating transform (e.g. `max_threads` >= number of partitions), every per-partition hash table stays under the limit and nothing throws - exactly the case the optimization-disabling guard must prevent, but the runtime check observes it only probabilistically. Replace the runtime-only assertion with a deterministic `EXPLAIN PLAN` check: with `force_aggregate_partitions_independently` set, the plan reads each partition through a separate port when there is no limit, and must stop doing so once `max_rows_to_group_by` is set (the guard falls back to normal aggregation whose merge phase enforces the global limit). This is independent of `max_threads` and the other randomized settings. The runtime `TOO_MANY_ROWS` check is kept as an end-to-end confirmation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Root cause. The test asserted the
Most of the other red runs were stale pre-pin results (executed before Fix (test-only). Pin Verification. Against a recent |
The stateless test profile (`tests/config/users.d/limits.yaml`) sets
`max_rows_to_group_by = 10G` as a high "won't limit anything" safety net.
That value is non-zero, so the new guard in `optimizeAggregationPerPartition`
already disables independent per-partition aggregation. The first positive
`EXPLAIN` check in `04489_aggregate_partitions_independently_max_rows_to_group_by`
inherited that limit and therefore returned `0` instead of `1`, failing in the
`Fast test` job:
@@ -1,2 +1,2 @@
-1
+0
0
Reset `max_rows_to_group_by` to 0 before the positive case so the no-limit path
is actually exercised, then set the explicit limit only for the negative and
runtime checks. This mirrors the fix already applied to
`02521_aggregation_by_partitions`.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=9d9d2b007fc87b1fc697da9ae891954ec9bd6b76&name_0=PR&name_1=Fast%20test
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The note in `custom-partitioning-key.md` described the size-skew check as "the largest partition holds more than twice the average number of rows", implying a comparison against `sum_rows / partitions_cnt`. The actual heuristic in `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation` compares the largest partition against `2 * (sum_rows / max_threads)`, which differs from the average partition size whenever the partition count is not equal to `max_threads`. Reword the note to describe the real threshold. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the AI review's two required actions (pushed
Also updated the PR description: corrected the same skew wording, fixed the function name to Both review threads are resolved. I confirmed no other stateless test is affected by the guard under the A local full rebuild to re-run the test against a guard-enabled binary was infeasible in this environment (the worktree checkout invalidated the build cache, triggering a near-full rebuild), but the fix is logically airtight given the CI output above and the guard code ( |
…rtitions-independently-by-default
`master` added `04489_max_threads_auto_parsing_compat` (and tests at the `04490` and `04491` prefixes), so move the regression test for the `max_rows_to_group_by` guard to the next free prefix `04492`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Also merged the latest |
`04492_aggregate_partitions_independently_max_rows_to_group_by` still failed
in CI (e.g. `Stateless tests (amd_tsan, parallel, 2/2)` and the flaky checks),
including with `optimize_aggregation_in_order = 0`:
The query succeeded but the server error '158' was expected
(SELECT a FROM t_apart_max_rows GROUP BY a FORMAT Null; -- { serverError TOO_MANY_ROWS })
Root cause: `max_rows_to_group_by` is checked against each aggregation lane's
own hash table, not against a single global one. Even with the per-partition
optimization disabled (the `EXPLAIN` checks correctly return `0`), the
read-in-order spread gives each of the 16 partitions its own aggregating lane.
At `max_threads = 3` the pipeline has 19 `AggregatingTransform` lanes, each
holding ~63 distinct keys (< the limit of 100), so the limit never trips and
the query returns all 1000 groups without throwing.
This is independent of the optimization (it happens in normal aggregation too),
so the runtime check can only be made deterministic by forcing a single lane.
Pin `max_threads = 1` so one lane sees all the keys; the limit is then exercised
deterministically regardless of the partition layout and the randomized reading
settings. Verified on a freshly built binary across 60 randomized-setting runs
(0 failures). The `EXPLAIN PLAN` checks are unaffected (independent of
`max_threads`). Also corrected the comments: the limit is per-lane, not enforced
globally during the merge phase.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=4529ec352bd98263f954917d57d61b535189d034&name_0=PR
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed The earlier fix assumed the per-table-limit escape only happens when "the optimization is active and 19 Fix (test-only): pin |
|
Re-ran the two red jobs on
No re-merge: the branch is 138 commits behind but the merge base is <1 day old and |
|
Heads-up on a new test-number collision that appeared on
Neither tree collides on its own (merge base Otherwise the PR looks ready: the two earlier reds reran to |
|
This PR only flips Otherwise unchanged and ready: public CI is fully green on |
|
Status update since the
So no re-merge is needed to resolve a conflict; the flag should clear on GitHub's next mergeability recompute. If it lingers, a trivial |
…rtitions-independently-by-default
|
Re-merged |
| /// more groups (or apply `group_by_overflow_mode`) per partition than the global limit allows. | ||
| /// Fall back to normal aggregation, matching `AggregatingStep::canUseShardedAggregation` and | ||
| /// `make_distributed_plan`, both of which reject this case for the same reason. | ||
| if (aggregating_step->getParams().max_rows_to_group_by != 0) |
There was a problem hiding this comment.
This guard changes more than the new default: it also changes the pre-26.7 explicit opt-in path for allow_aggregate_partitions_independently / force_aggregate_partitions_independently. Before this PR, a query or profile that set those flags still took the per-partition path even if max_rows_to_group_by was a high non-zero safety net; after this change it silently falls back to normal aggregation for any non-zero value. The updated test 02521_aggregation_by_partitions.sql is already evidence of that behavior change: it now has to clear the inherited 10G limit just to keep exercising the old opt-in path.
The problem is that compatibility = '26.6' does not restore that old behavior. The only compatibility entry added in this PR is the default flip in SettingsChangesHistory.cpp, so existing deployments that explicitly enabled the optimization will still lose it under compatibility = '26.6' whenever they also have a non-zero max_rows_to_group_by in a profile. If that stricter fallback is intentional, it needs its own compatibility story (for example, a gated guard or an additional history-controlled setting); otherwise the PR/docs should stop promising that compatibility = '26.6' restores the pre-26.7 behavior.
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 22/22 (100.00%) · Uncovered code |

When the partition key suits the
GROUP BYkey, ClickHouse can route each partition through a separate aggregation pipeline and skip the merging step entirely. The setting that gates this optimization,allow_aggregate_partitions_independently, has been off by default since it was introduced in 23.2 (#45364) because the optimization is beneficial only when there are enough partitions of roughly equal size.Runtime heuristics in
ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation(src/Processors/QueryPlan/ReadFromMergeTree.cpp) already enforce those criteria: they disable the optimization when the partition count is belowmax_threads / 2, abovemax_number_of_partitions_for_independent_aggregation(128 by default), or when the largest partition holds more rows than twice the total number of rows divided bymax_threads. In every other case the optimization either pays off or is silently skipped, so the gate-setting can safely default totrue.force_aggregate_partitions_independently(which bypasses the heuristics) is left atfalse. The optimization skips the cross-partition merge phase, which is where the globalmax_rows_to_group_bylimit is enforced, sooptimizeAggregationPerPartitionnow falls back to normal aggregation whenevermax_rows_to_group_byis set, matching the sharded-aggregation and distributed-plan paths. Backward-compatible behavior can be restored withcompatibility = '26.6'.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Enable
allow_aggregate_partitions_independentlyby default. When aGROUP BYkey suits the partition key, ClickHouse can aggregate each partition independently and skip the global merging step. Runtime heuristics automatically skip the optimization when the partition layout would make it unfavorable (too few partitions, too many partitions, or significantly skewed partition sizes).Documentation entry for user-facing changes