Enable `allow_aggregate_partitions_independently` by default by alexey-milovidov · Pull Request #105128 · ClickHouse/ClickHouse · GitHub
Skip to content

Enable allow_aggregate_partitions_independently by default#105128

Open
alexey-milovidov wants to merge 35 commits into
masterfrom
enable-aggregate-partitions-independently-by-default
Open

Enable allow_aggregate_partitions_independently by default#105128
alexey-milovidov wants to merge 35 commits into
masterfrom
enable-aggregate-partitions-independently-by-default

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 16, 2026

Copy link
Copy Markdown
Member

When the partition key suits the GROUP BY key, ClickHouse can route each partition through a separate aggregation pipeline and skip the merging step entirely. The setting that gates this optimization, allow_aggregate_partitions_independently, has been off by default since it was introduced in 23.2 (#45364) because the optimization is beneficial only when there are enough partitions of roughly equal size.

Runtime heuristics in ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation (src/Processors/QueryPlan/ReadFromMergeTree.cpp) already enforce those criteria: they disable the optimization when the partition count is below max_threads / 2, above max_number_of_partitions_for_independent_aggregation (128 by default), or when the largest partition holds more rows than twice the total number of rows divided by max_threads. In every other case the optimization either pays off or is silently skipped, so the gate-setting can safely default to true.

force_aggregate_partitions_independently (which bypasses the heuristics) is left at false. The optimization skips the cross-partition merge phase, which is where the global max_rows_to_group_by limit is enforced, so optimizeAggregationPerPartition now falls back to normal aggregation whenever max_rows_to_group_by is set, matching the sharded-aggregation and distributed-plan paths. Backward-compatible behavior can be restored with compatibility = '26.6'.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Enable allow_aggregate_partitions_independently by default. When a GROUP BY key suits the partition key, ClickHouse can aggregate each partition independently and skip the global merging step. Runtime heuristics automatically skip the optimization when the partition layout would make it unfavorable (too few partitions, too many partitions, or significantly skewed partition sizes).

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

When the partition key suits the `GROUP BY` key, ClickHouse can route
each partition through a separate aggregation pipeline and skip the
merging step entirely. The setting that gates this optimization,
`allow_aggregate_partitions_independently`, has been off by default
since it was introduced in 23.2 (#45364) because the optimization is
beneficial only when there are enough partitions of roughly equal
size.

Runtime heuristics in `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePort`
already implement those checks: they disable the optimization when the
partition count is below `max_threads / 2`, above
`max_number_of_partitions_for_independent_aggregation` (128 by default),
or when the largest partition has more than twice the average number of
rows. In every other case the optimization either pays off or is
silently skipped, so the gate-setting can safely default to true.

`force_aggregate_partitions_independently` (which bypasses the
heuristics) is left at false. Backward-compatible behavior can be
restored with `compatibility = '26.4'`.
@clickhouse-gh

clickhouse-gh Bot commented May 16, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 16, 2026
Comment thread src/Core/Settings.cpp
…d_in_order_spread

The test asserts the spread/merging pipeline shape from `ReadInOrderOptimizer`
(with `MergingAggregatedBucketTransform` / `FinishAggregatingInOrderTransform`).
With `allow_aggregate_partitions_independently` now defaulting to `true`, the
two partitions are routed through independent aggregation ports
(`FinalizeAggregatedTransform` × 2), so the `EXPLAIN PIPELINE` output diverges
from the reference. Force the setting off in the test to preserve its original
intent.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=b4bd5f71520b0cb1439283ed7e928fe47c905b07&name_0=PR&name_1=Fast%20test
PR: #105128
@alexey-milovidov

Copy link
Copy Markdown
Member Author

No changes on performance tests.

The 26.5 block has been closed since the PR was opened, so register the
default flip under `26.6` in `SettingsChangesHistory.cpp` and update the
docs note accordingly. This also keeps `02995_new_settings_history` happy.
Comment thread src/Core/SettingsChangesHistory.cpp Outdated
…ughSeparatePortForAggregation` in settings change reason

The reason string for `allow_aggregate_partitions_independently` in
`system.settings_changes` referenced `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePort`,
which is not a real method name. The actual method is
`ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation`,
so users searching the source from this text would find nothing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Fixed in c3d0390 — the reason string now uses the exact method name ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation.

Comment thread src/Core/SettingsChangesHistory.cpp
alexey-milovidov and others added 2 commits June 5, 2026 04:07
…g settings

The merge of master re-introduced history entries for
`query_plan_min_columns_for_join_lazy_indexing` and
`query_plan_max_limit_for_join_lazy_indexing`, but these settings were
removed from `Settings.cpp` when "Lazily apply selector and replication
indexes in join" was reverted on master. Iterating the history for any
`compatibility` value below the latest version then tried to set a
non-existent setting and threw `UNKNOWN_SETTING`, breaking all
compatibility tests in `Fast test`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged master (was 557 commits behind and red) and fixed the Fast test failures.

Root cause: a previous merge-conflict resolution re-introduced SettingsChangesHistory.cpp entries for query_plan_min_columns_for_join_lazy_indexing and query_plan_max_limit_for_join_lazy_indexing, but these settings were removed from Settings.cpp when "Lazily apply selector and replication indexes in join" was reverted on master (133a0fc8edd). SettingsImpl::applyCompatibilitySetting iterates the history and calls get(final_name) before applying the previous value, so any compatibility below the latest version threw UNKNOWN_SETTING — which is why all the compatibility/constraint tests (02324, 02325, 03243, 04234, 04078, 03916, 03011, 02970, 03274, 04209, 04034, 03006, 02933, 03773) failed.

Fix (a68425ec146): dropped the two stale history entries, leaving only the intended allow_aggregate_partitions_independently change. Verified locally — the previously failing compatibility tests now pass (the 04034 parallel-replicas test only needed the parallel_replicas cluster, which CI provides). This also resolves the AI-review blocker.

@clickhouse-gh

clickhouse-gh Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR flips allow_aggregate_partitions_independently on by default (plus a max_rows_to_group_by guard). That optimization only engages when a table's partition key is a function of the GROUP BY key, so partitions can be aggregated independently and the merge skipped. ClickBench's hits table isn't partitioned that way, so the runtime heuristics skip it and Q28's aggregation path is unchanged — the flagged -5.14% improvement on clickbench Q28 is off-path and reads as run-to-run variance, so we downgraded it to not-sure. No other queries were flagged on clickbench or tpch_adapted_1_official.

clickbench

⚠️ 1 inconclusive

Flagged queries (1 of 43)
Query Verdict Baseline median (ms) PR median (ms) Change q-value Hint
⚠️ 28 not_sure 6678 6334 -5.1% <0.0001 aggregation: Per-partition GROUP BY opt needs partition key = GROUP BY key; hits isn't partitioned so Q28 is off-path noise

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: 636e29ce-da54-4f39-8e13-1867e601e745
  • MIRAI run: 17e1e2aa-9793-4478-86d7-68ff9720c4f4
  • PR check IDs:
    • clickbench_569455_1782870400
    • clickbench_569461_1782870400
    • clickbench_569473_1782870400
    • tpch_adapted_1_official_569480_1782870400
    • tpch_adapted_1_official_569492_1782870400
    • tpch_adapted_1_official_569502_1782870400

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The sole CI failure here is an AST fuzzer (amd_debug) logical error and is unrelated to this PR:

Logical error: 'Variant 1 (String) has size 3, but expected 2'

in DB::ColumnVariant state validation, reached via DB::DistinctTransform::transform.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=cfa19d7355537aad6a3633d78e0abf6d2eeaa745&name_0=PR&name_1=AST%20fuzzer%20%28amd_debug%29

This PR only flips the default of allow_aggregate_partitions_independently (plus a docs line, a SettingsChangesHistory entry, and one test). There is no path from an aggregation-setting default to a Variant column size mismatch in DistinctTransform.

Evidence it is a pre-existing, master-wide fuzzer-discoverable ColumnVariant bug (seeded from 04268_cast_keep_nullable_variant.sql): the same Variant <N> (<type>) has size <X>, but expected <Y> logical error has appeared on at least 7 unrelated PRs in the last 45 days — 99639, 102724, 105128, 107053, 107116, 100272, 104435.

@groeneai, could you investigate this failure and provide a fix in a separate PR? If a fix is already in progress, please link it here. I have re-merged master to get a fresh CI run.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov You are right that this is unrelated to the aggregation setting. It is the chronic ColumnVariant "Variant N has size X, but expected Y" family (STID 3068), and the specific shape your run hit is already fixed.

That failure (AST fuzzer (amd_debug), STID 3068-3371, commit cfa19d73):

SELECT DISTINCT NULL, toColumnTypeName(r), toString(toUInt32(...), v) AS r FROM t_variant_null__fuzz_48 ...
INTERSECT SELECT DISTINCT toString(v) AS r, ... FROM t_variant_null__fuzz_0

where v Variant(Array(Array(UInt64)), String) holds one non-empty variant plus NULLs. toString(v) goes through ConvertImplFromVariantToColumn::execute, which filtered the variant to non-null rows, ran the nested function, then expanded the result in place via assumeMutable(). For a single-variant-plus-NULL column the nested result aliases the source subcolumn (use_count > 1), so the in-place expand grew the shared subcolumn and left the source ColumnVariant with a subcolumn larger than its discriminator count. DISTINCT (DistinctTransform) was the validating op that then tripped validateState.

Fixed in #107374 (merged 2026-06-13 18:34 UTC, Related: #101415): IColumn::mutate()-before-expand at all 6 filter-then-expand sites (1 ConvertImplFromVariantToColumn + 3 FunctionVariantAdaptor + 2 FunctionDynamicAdaptor), which clones when shared, plus regression test 04337_variant_to_column_inplace_mutation.

Verified your exact query against a debug build:

  • pre-fix binary: SIGABRT, Variant 1 (String) has size 3, but expected 2
  • fixed HEAD: exit 0, no crash

This PR's current HEAD dd52b610 already contains #107374 (re-merged master at 01:51 UTC), and the fresh run shows AST fuzzer (amd_debug) and (amd_debug, targeted) green at 02:19 UTC. CIDB: 0 master hits for this family in 45 days; the only post-merge hit (#99639, 06-13 19:01) is on a stale base 100 commits behind the fix.

Note: the join/runtime-filter manifestation of the same family is a separate fix, #106931 (Closes #101415), still in review. #107374 covers the toString/conversion path your run hit.

Comment thread src/Core/Settings.cpp
Comment thread docs/en/engines/table-engines/mergetree-family/custom-partitioning-key.md Outdated
…forces

The docs note pointed at the "key factors for a good performance" list as
the criteria the runtime guard checks, but
`requestOutputEachPartitionThroughSeparatePortForAggregation` only skips the
optimisation for too few partitions (fewer than `max_threads / 2`), too many
partitions (more than `max_number_of_partitions_for_independent_aggregation`),
or heavily skewed partition sizes. It does not skip merely because partitions
are small, so listing that factor as a runtime criterion overstated the guard.

Enumerate the three checks the heuristic actually performs and note that the
list below is general layout guidance, only part of which is enforced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Core/Settings.cpp
@alexey-milovidov

Copy link
Copy Markdown
Member Author

We can only merge this if it improves the performance of some queries with no degradation on others.

alexey-milovidov and others added 3 commits June 28, 2026 18:31
…s set

`max_rows_to_group_by` is a global `GROUP BY` limit that, in normal
aggregation, is enforced during the final merge phase. The
aggregate-partitions-independently optimization skips that merge and
aggregates each partition on a separate pipeline, so the limit was
enforced against each partition's own hash table rather than globally.

With `allow_aggregate_partitions_independently` off by default this was
only reachable on an opt-in path, but enabling it by default exposes the
gap: a query such as `PARTITION BY a % 16 GROUP BY a SETTINGS
max_rows_to_group_by = 100` could return far more than 100 groups (or
apply `group_by_overflow_mode` per partition) instead of honoring the
global limit.

Disable the optimization in `optimizeAggregationPerPartition` when
`max_rows_to_group_by != 0`, falling back to normal aggregation. This
matches the two sibling skip-merge paths, which already reject the same
case: `AggregatingStep::canUseShardedAggregation` returns `false` and
`make_distributed_plan` throws `SUPPORT_IS_DISABLED`.

Add a regression test that forces the optimization on a partitioned
table whose partition key suits the `GROUP BY` key and asserts the
global limit is preserved (`TOO_MANY_ROWS` with
`group_by_overflow_mode = 'throw'`).

This re-applies the guard that was held back in d262e0c pending the
author's decision on how to resolve the gap; the disable-guard approach
was chosen.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`aggregation_by_partitions.xml` forces the optimization on both server
builds, so it cannot show the effect of flipping the
`allow_aggregate_partitions_independently` default - the forced setting
behaves identically on the master and PR builds the comparison runs.

Add a test that deliberately does not set (or force) the setting, so each
build uses its own compiled-in default: the old default (0) aggregates
with a global merge, while the new default (1) lets the runtime
heuristics aggregate each partition independently and skip the merge. The
layout is favorable for the heuristics (partition key is a function of
the `GROUP BY` key, many roughly equal-sized partitions within the
default 128-partition cap), so the optimization fires on the PR build and
the query is faster than on the master build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the test deterministic and the win measurable across machines:

- Pin `max_threads = 32`. The runtime heuristic disables the optimization
  when the partition count is below `max_threads / 2`, which is tied to
  the host's core count. On a 96-core machine the natural `max_threads`
  is 96, so the 32-partition table (32 < 48) would not take the
  optimization at all and the 64-partition table only barely. Pinning
  `max_threads = 32` keeps both the 32- and 64-partition tables (>= 16)
  on the optimized path regardless of the host.

- Add a 50M-row size. At 10M rows the query runs in ~0.1s and the signal
  is dominated by run-to-run variance; at 50M rows the optimized path is
  stable and clearly faster (locally ~0.33s -> ~0.19s for 64 partitions
  and ~0.33s -> ~0.21s for 32 partitions, both with normal aggregation as
  the baseline).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Re: the performance bar — aggregation_by_partitions.xml forces the optimization on both the base and PR builds, so it can't show the effect of flipping the default. Added aggregate_partitions_independently_default.xml (037ec3c, tuned in 10828a8): it relies on the default (base 0 = off, this PR 1 = on) on a partition-aligned GROUP BY, with max_threads pinned so the runtime heuristic fires deterministically regardless of the host's core count.

Local measurement (50M rows, max_threads = 32, normal aggregation as the baseline):

partitions normal aggregation independent partitions speedup
64 ~0.33s ~0.19s ~45%
32 ~0.33s ~0.21s ~37%

No degradation on unfavorable layouts — the heuristics already skip the optimization there, consistent with the no_change cloud-benchmark verdict on ClickBench (whose GROUP BY is not partition-aligned). The CI performance comparison should now show the win on this test for the PR build.

Also re-applied the max_rows_to_group_by correctness guard that was held back in d262e0c; see the resolved thread on src/Core/Settings.cpp.

alexey-milovidov and others added 6 commits June 29, 2026 18:57
The new stateless test was numbered `05020`, but the highest existing
test number is `04488`, which makes `Style check/test_numbers_check`
fail with `Gap (4415, 5020) > 100` (the check rejects gaps of 100 or more
between consecutive stateless test numbers). Rename the `.sql` and
`.reference` files to `04489`, the next available number.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ion flake

The test verifies that partition-independent aggregation falls back to normal
aggregation when `max_rows_to_group_by` is set, by asserting the query throws
`TOO_MANY_ROWS`. Under the randomized setting `optimize_aggregation_in_order = 1`,
aggregation emits each group as soon as it is complete, so the aggregation hash
table never grows past the limit and `max_rows_to_group_by` is never tripped -
regardless of partition-independent aggregation, the query succeeds and the
assertion fails.

Pin `optimize_aggregation_in_order = 0` so the test deterministically exercises
the global limit enforced via the merge phase that the optimization skips.

CI report: #105128

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Enabling `allow_aggregate_partitions_independently` by default also makes the
optimization fall back to normal aggregation whenever `max_rows_to_group_by` is
set (the per-partition pipeline skips the merge phase where the global limit is
enforced). The stateless test profile (`tests/config/users.d/limits.yaml`) sets
`max_rows_to_group_by` to a high value (`10G`) as a safety net "so it will not
limit anything", which is non-zero and therefore disables the optimization.

`02521_aggregation_by_partitions` exists to verify the partition-independent
pipeline, so in CI every `EXPLAIN PIPELINE` collapsed to ordinary aggregation and
the test failed deterministically on every shard (`@@ -1,225 +1,94 @@`), while it
passed on master where the fallback does not exist. The sibling test
`01551_mergetree_read_in_order_spread` was already adjusted in this pull request
for the same reason.

Clear the limit with `set max_rows_to_group_by = 0` at the top so the test keeps
exercising the partition-independent pipeline it is meant to verify.

CI report: #105128

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test verified the `max_rows_to_group_by` guard only through the runtime
`TOO_MANY_ROWS` error, which is fragile: `max_rows_to_group_by` is enforced
against each aggregation hash table, so whether it trips depends on how the
partition layout maps onto threads and on whether the query is distributed.

Two CI randomizations defeated it:

* `enable_parallel_replicas = 1` routes the aggregation across replicas, where
  the limit is enforced per replica rather than globally - an orthogonal path
  this optimization does not touch (the guard already disables independent
  per-partition aggregation on the local plan when the limit is set). Pin it
  off so the randomizer cannot select it.

* When the optimization is active and each partition gets its own aggregating
  transform (e.g. `max_threads` >= number of partitions), every per-partition
  hash table stays under the limit and nothing throws - exactly the case the
  optimization-disabling guard must prevent, but the runtime check observes it
  only probabilistically.

Replace the runtime-only assertion with a deterministic `EXPLAIN PLAN` check:
with `force_aggregate_partitions_independently` set, the plan reads each
partition through a separate port when there is no limit, and must stop doing
so once `max_rows_to_group_by` is set (the guard falls back to normal
aggregation whose merge phase enforces the global limit). This is independent
of `max_threads` and the other randomized settings. The runtime
`TOO_MANY_ROWS` check is kept as an end-to-end confirmation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 9d9d2b007fc — stabilized the only failing test, 04489_aggregate_partitions_independently_max_rows_to_group_by (every red check was this test).

Root cause. The test asserted the max_rows_to_group_by guard only through the runtime TOO_MANY_ROWS error. That limit is enforced against each aggregation hash table, so two CI randomizations could keep a 1000-group query under the per-table limit and never throw:

  • enable_parallel_replicas = 1 (selected by the randomizer on several runs, including the regular arm_binary, parallel and amd_tsan, s3 storage, parallel runs at b15e83a4): aggregation is split across replicas and max_rows_to_group_by is enforced per replica, not globally. That path is orthogonal to this optimization — the guard already disables independent per-partition aggregation on the local plan when the limit is set — and parallel replicas not enforcing the global GROUP BY limit is pre-existing behavior.
  • When the optimization is active and max_threads ≥ the partition count, each partition gets its own aggregating transform (~63 keys < 100), so nothing throws — exactly the case the guard must prevent, but the runtime check observes it only probabilistically.

Most of the other red runs were stale pre-pin results (executed before 528e2b22b9a, when the test still lacked the optimize_aggregation_in_order pin — confirmed via the CIDB execution timestamps).

Fix (test-only). Pin enable_parallel_replicas = 0, and replace the runtime-only assertion with a deterministic EXPLAIN PLAN check: with force_aggregate_partitions_independently set, the plan reads each partition through a separate port without a limit (asserts 1) and must stop once max_rows_to_group_by is set (asserts 0 — the guard falls back to normal aggregation whose merge phase enforces the global limit). This is independent of max_threads and the rest of the randomized settings. The runtime TOO_MANY_ROWS check is kept as an end-to-end confirmation.

Verification. Against a recent master binary: with the optimization forced on, EXPLAIN PLAN keeps Read each partition through separate port even when max_rows_to_group_by is set (the bug → the new check returns 1); simulating the guard's effect (normal aggregation) the separate-port read disappears (0) and the query throws TOO_MANY_ROWS under every randomized-setting combination I tried, including the full setting set from the failing runs. The new EXPLAIN assertion is stable (1 then 0) across max_threads/group_by_two_level_threshold/enable_memory_bound_merging_of_aggregation_results/… I could not run a from-scratch build-verify in this environment (the worktree needed a near-full rebuild), so the guard's compiled plan effect itself was not re-executed here.

Comment thread docs/en/engines/table-engines/mergetree-family/custom-partitioning-key.md Outdated
alexey-milovidov and others added 2 commits June 30, 2026 19:42
The stateless test profile (`tests/config/users.d/limits.yaml`) sets
`max_rows_to_group_by = 10G` as a high "won't limit anything" safety net.
That value is non-zero, so the new guard in `optimizeAggregationPerPartition`
already disables independent per-partition aggregation. The first positive
`EXPLAIN` check in `04489_aggregate_partitions_independently_max_rows_to_group_by`
inherited that limit and therefore returned `0` instead of `1`, failing in the
`Fast test` job:

    @@ -1,2 +1,2 @@
    -1
    +0
     0

Reset `max_rows_to_group_by` to 0 before the positive case so the no-limit path
is actually exercised, then set the explicit limit only for the negative and
runtime checks. This mirrors the fix already applied to
`02521_aggregation_by_partitions`.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=9d9d2b007fc87b1fc697da9ae891954ec9bd6b76&name_0=PR&name_1=Fast%20test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The note in `custom-partitioning-key.md` described the size-skew check as "the
largest partition holds more than twice the average number of rows", implying a
comparison against `sum_rows / partitions_cnt`. The actual heuristic in
`ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation`
compares the largest partition against `2 * (sum_rows / max_threads)`, which
differs from the average partition size whenever the partition count is not equal
to `max_threads`. Reword the note to describe the real threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Addressed the AI review's two required actions (pushed 9d9d2b007fc..529e0ffdc1c):

  1. 04489_aggregate_partitions_independently_max_rows_to_group_by Fast test failure (commit cdaa0112202). The first positive EXPLAIN check was inheriting max_rows_to_group_by = 10G from the stateless profile (tests/config/users.d/limits.yaml, installed for every job including Fast test via tests/config/install.sh). Because the new guard in optimizeAggregationPerPartition returns for any non-zero max_rows_to_group_by, the optimization was disabled and the check returned 0 instead of 1:

    @@ -1,2 +1,2 @@
    -1
    +0
     0
    

    The CI evidence itself confirms the root cause: line 2 (with the explicit 100 limit) already returned 0, proving the guard fires on a non-zero limit in the real binary; line 1 returned 0 only because it inherited the profile's 10G. The fix resets max_rows_to_group_by = 0 before the positive case and sets the explicit limit only for the negative and runtime checks — the same approach already used in 02521_aggregation_by_partitions.

  2. Docs skew-threshold wording (commit 529e0ffdc1c). The note now describes the heuristic as "the largest partition holds more rows than twice the total number of rows divided by max_threads", matching the 2 * (sum_rows / max_threads) comparison in requestOutputEachPartitionThroughSeparatePortForAggregation (it previously implied 2 * sum_rows / partitions_cnt).

Also updated the PR description: corrected the same skew wording, fixed the function name to ...ForAggregation, and the compatibility value to 26.6 (the change lands in the 26.7 block of SettingsChangesHistory).

Both review threads are resolved. I confirmed no other stateless test is affected by the guard under the 10G profile: only 02521, 01551 and 04489 assert on the plan/pipeline (all handled); the other tests that exercise the optimization (02681, 04234, 03093, and the LIMIT BY tests) are result-based and the guard only changes the execution path, not the results.

A local full rebuild to re-run the test against a guard-enabled binary was infeasible in this environment (the worktree checkout invalidated the build cache, triggering a near-full rebuild), but the fix is logically airtight given the CI output above and the guard code (if (params.max_rows_to_group_by != 0) return;).

alexey-milovidov and others added 2 commits June 30, 2026 19:48
`master` added `04489_max_threads_auto_parsing_compat` (and tests at the `04490`
and `04491` prefixes), so move the regression test for the
`max_rows_to_group_by` guard to the next free prefix `04492`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Also merged the latest master (the branch was 215 commits behind; the merge is clean — git merge-tree confirmed no conflicts, which also cleared a transient CONFLICTING mergeability flag GitHub reported right after the previous push). In the process master had added 04489_max_threads_auto_parsing_compat plus tests at the 04490/04491 prefixes, so the regression test was renumbered 0448904492 to avoid the prefix collision (commit 4529ec352bd). PR is now MERGEABLE.

`04492_aggregate_partitions_independently_max_rows_to_group_by` still failed
in CI (e.g. `Stateless tests (amd_tsan, parallel, 2/2)` and the flaky checks),
including with `optimize_aggregation_in_order = 0`:

  The query succeeded but the server error '158' was expected
  (SELECT a FROM t_apart_max_rows GROUP BY a FORMAT Null; -- { serverError TOO_MANY_ROWS })

Root cause: `max_rows_to_group_by` is checked against each aggregation lane's
own hash table, not against a single global one. Even with the per-partition
optimization disabled (the `EXPLAIN` checks correctly return `0`), the
read-in-order spread gives each of the 16 partitions its own aggregating lane.
At `max_threads = 3` the pipeline has 19 `AggregatingTransform` lanes, each
holding ~63 distinct keys (< the limit of 100), so the limit never trips and
the query returns all 1000 groups without throwing.

This is independent of the optimization (it happens in normal aggregation too),
so the runtime check can only be made deterministic by forcing a single lane.
Pin `max_threads = 1` so one lane sees all the keys; the limit is then exercised
deterministically regardless of the partition layout and the randomized reading
settings. Verified on a freshly built binary across 60 randomized-setting runs
(0 failures). The `EXPLAIN PLAN` checks are unaffected (independent of
`max_threads`). Also corrected the comments: the limit is per-lane, not enforced
globally during the merge phase.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=4529ec352bd98263f954917d57d61b535189d034&name_0=PR

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed d6bd5c8e89a — the runtime TOO_MANY_ROWS check in 04492 was still failing on the latest run (4529ec35), e.g. the regular Stateless tests (amd_tsan, parallel, 2/2) and arm_binary, parallel runs, all with optimize_aggregation_in_order = 0 and enable_parallel_replicas = 0 already pinned:

The query succeeded but the server error '158' was expected
(SELECT a FROM t_apart_max_rows GROUP BY a FORMAT Null; -- { serverError TOO_MANY_ROWS })

The earlier fix assumed the per-table-limit escape only happens when "the optimization is active and max_threads >= the partition count". It actually happens with the optimization disabled too. With the guard correctly returning 0 for the separate-port EXPLAIN check (normal aggregation), the read-in-order spread still gives each of the 16 partitions its own aggregating lane. Reproduced on a freshly built guard-enabled binary with the exact failing settings (max_threads = 3):

Resize 19 -> 3
  AggregatingTransform x 19
    ...
      MergeTreeSelect(pool: ReadPool, algorithm: Thread) x 3 0 -> 1
        ExpressionTransform x 32
          MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder) x 16 0 -> 1

19 AggregatingTransform lanes, each holding ~63 keys (< 100), so max_rows_to_group_by never trips. This is independent of the optimization, so the runtime assertion can only be made deterministic by forcing a single lane.

Fix (test-only): pin max_threads = 1 so one lane sees all 1000 keys. Verified on the built binary across 60 randomized-setting runs (0 failures), including the full setting set from the failing runs, optimize_aggregation_in_order 0/1, max_threads 1..16, the read-in-order spread, and memory-bound merging. The EXPLAIN PLAN checks are unaffected. Also corrected the comments: max_rows_to_group_by is enforced per aggregation lane, not globally during the merge phase.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=4529ec352bd98263f954917d57d61b535189d034&name_0=PR

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Re-ran the two red jobs on d6bd5c8e89a (run 28480451574). Both are known, unrelated flakes — the regression test 04492 no longer appears among the failures, so the max_threads = 1 pin held.

  1. Stateless tests (arm_binary, parallel)01018_Distributed__shard_num. Not a 04492/aggregation failure: it threw Code: 102 … Unexpected packet from server 127.0.0.2:9000 (expected TablesStatusResponse, got ProfileInfo) … (UNEXPECTED_PACKET_FROM_SERVER) on SELECT _shard_num, * FROM remote('127.0.0.{1,2}', system.one) — a connection-establishment race in ConnectionEstablisher::runConnection::getTablesStatus, on a query with no aggregation path. CI's own flaky-test harness re-ran it 152 times with the identical randomized settings, 152/152 passed ("not reproducible, likely a transient issue"). This test also fails intermittently on unrelated PRs and on master trunk (e.g. PR=0 on 2026-06-09).

  2. Stress test (arm_debug)Logical error: Inconsistent KeyCondition behavior. The debug-only (#ifndef NDEBUG) assertion in MergeTreeDataSelectExecutor, reached via AST-fuzzer mutations — compiled out in release. Master-wide flake: 22 distinct PRs since 2026-06-15, 0 hits on master trunk, only in *_debug/stress/AST-fuzzer builds. Tracked by the open umbrella issue AST-fuzzer-internal-path: executor/planner assertions on type-invalid inputs the analyzer normally rejects (5 sites: IColumn, KeyCondition, SingleValueData, ActionsDAG ×2) #107951 . This PR touches only the aggregation setting default and optimizeAggregationPerPartition, with no MergeTree/KeyCondition code path.

No re-merge: the branch is 138 commits behind but the merge base is <1 day old and master has not touched any file this PR modifies, so a re-merge would only re-roll the same flakes without a fix.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Heads-up on a new test-number collision that appeared on master after the last sync — it won't surface as a git conflict, so flagging it explicitly.

master merged f135d0e70e7 ("Renumber test 04489 -> 04492 to avoid collision after master merge"), which added 04492_regexp_pattern_with_nul_byte. That prefix-collides with this PR's 04492_aggregate_partitions_independently_max_rows_to_group_by — both were independently renumbered to 04492.

Neither tree collides on its own (merge base fbdcc93bf8a has no 04492, and 04492_regexp is not yet an ancestor of the PR head d6bd5c8e), and a merge is clean (different filenames, no conflict), so it would silently create two 04492_ tests. Worth renumbering the aggregation test to the next free prefix at merge time (on current master, 0449004492 are taken and 04493 is free, though other in-flight PRs may claim it — best chosen at merge).

Otherwise the PR looks ready: the two earlier reds reran to SUCCESS (01018_Distributed__shard_num and the Stress arm_debug Inconsistent KeyCondition behavior, #107951), AI review is ✅ Approve on d6bd5c8e, 0 unresolved threads, and master has not touched any file this PR modifies, so no re-merge is needed.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

CH Inc sync flipped to a failure after my last note (it was not-run then). Diagnosed the private sync run 28482821586 — all four reds are Cloud-only tests that this PR cannot touch, so they are unrelated flakes:

  • 03010_shared_merge_tree_mutations_kill_mutation (amd_debug, distributed cache) — SharedMergeTree kill-mutation, timing-sensitive.
  • 03016_shared_merge_tree_multiple_tables (amd_asan_ubsan, SharedCatalog) — SharedMergeTree.
  • test_refreshable_mv_no_multi_read::test_refreshable_mv_attach_without_multi_read (Integration db disk, both 6/7 and old analyzer 6/7) — refreshable-MV attach race (assert 'Disabled' in 'Scheduled'); a recurring cross-PR Cloud flake (also hit e.g. the Fix logical error exception in getConsistentMetadataSnapshotImpl when log pointer goes backwards #100651 sync run).

This PR only flips allow_aggregate_partitions_independently to true, adds the 26.7 SettingsChangesHistory entry, the max_rows_to_group_by != 0 guard in optimizeAggregationPerPartition, docs, and the aggregation regression/perf tests — none of which exercise SharedMergeTree or refreshable materialized views. Re-triggered the failed sync jobs (gh run rerun --failed).

Otherwise unchanged and ready: public CI is fully green on d6bd5c8e (0 failures), AI review is ✅ Approve, 0 unresolved threads, MERGEABLE, and git merge-tree against current master is clean (the Settings.cpp / SettingsChangesHistory.cpp overlaps merge without conflict). The 04492 test-number collision with master's 04492_regexp_pattern_with_nul_byte still stands — best renumbered at merge as noted above.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update since the CH Inc sync re-trigger:

  • CI is now fully green on d6bd5c8e89a — 161 SUCCESS / 13 SKIP / 0 failures. The CH Inc sync job you re-triggered has flipped to SUCCESS, and the AI review is at ✅ Approve with 0 unresolved threads.
  • GitHub now shows CONFLICTING/DIRTY, but this is a stale mergeability computation, not a real conflict. master advanced 531 commits under the unchanged head, so GitHub recomputed. git merge-tree --write-tree origin/master d6bd5c8e89a against the current master tip (bf1980a9bd1) returns a clean tree (exit 0, no conflict markers). I verified the merged SettingsChangesHistory.cpp 26.7 block contains both this PR's allow_aggregate_partitions_independently entry and master's newly-added entries (explain_query_plan_default, optimize_and_compare_chain_max_hash_work, etc.) — nothing is dropped.

So no re-merge is needed to resolve a conflict; the flag should clear on GitHub's next mergeability recompute. If it lingers, a trivial master merge at merge time (where the 04492 test-number renumber lands anyway) resolves it. Held off on re-merging so as not to discard the green CI or restart the performance comparison.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Re-merged master (4644410000c) to clear the stale CONFLICTING flag — the branch was 942 commits behind and GitHub reported a conflict, but the 3-way merge is conflict-free. Verified the net diff against master is unchanged (9 files, +138/-3): the allow_aggregate_partitions_independently = true default, the SettingsChangesHistory entry (correctly in the 26.7 block), the max_rows_to_group_by guard in optimizeAggregationPerPartition, the 04492 test, and the docs all survived intact. Fresh CI is running; the branch is now MERGEABLE.

/// more groups (or apply `group_by_overflow_mode`) per partition than the global limit allows.
/// Fall back to normal aggregation, matching `AggregatingStep::canUseShardedAggregation` and
/// `make_distributed_plan`, both of which reject this case for the same reason.
if (aggregating_step->getParams().max_rows_to_group_by != 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This guard changes more than the new default: it also changes the pre-26.7 explicit opt-in path for allow_aggregate_partitions_independently / force_aggregate_partitions_independently. Before this PR, a query or profile that set those flags still took the per-partition path even if max_rows_to_group_by was a high non-zero safety net; after this change it silently falls back to normal aggregation for any non-zero value. The updated test 02521_aggregation_by_partitions.sql is already evidence of that behavior change: it now has to clear the inherited 10G limit just to keep exercising the old opt-in path.

The problem is that compatibility = '26.6' does not restore that old behavior. The only compatibility entry added in this PR is the default flip in SettingsChangesHistory.cpp, so existing deployments that explicitly enabled the optimization will still lose it under compatibility = '26.6' whenever they also have a non-zero max_rows_to_group_by in a profile. If that stricter fallback is intentional, it needs its own compatibility story (for example, a gated guard or an additional history-controlled setting); otherwise the PR/docs should stop promising that compatibility = '26.6' restores the pre-26.7 behavior.

@clickhouse-gh

clickhouse-gh Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.50% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.70% +0.10%

Changed lines: Changed C/C++ lines covered: 22/22 (100.00%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants