iframe-proxy

alexey-milovidov · 2026-05-16T15:53:11Z

When the partition key suits the GROUP BY key, ClickHouse can route each partition through a separate aggregation pipeline and skip the merging step entirely. The setting that gates this optimization, allow_aggregate_partitions_independently, has been off by default since it was introduced in 23.2 (#45364) because the optimization is beneficial only when there are enough partitions of roughly equal size.

Runtime heuristics in ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation (src/Processors/QueryPlan/ReadFromMergeTree.cpp) already enforce those criteria: they disable the optimization when the partition count is below max_threads / 2, above max_number_of_partitions_for_independent_aggregation (128 by default), or when the largest partition holds more rows than twice the total number of rows divided by max_threads. In every other case the optimization either pays off or is silently skipped, so the gate-setting can safely default to true.

force_aggregate_partitions_independently (which bypasses the heuristics) is left at false. The optimization skips the cross-partition merge phase, which is where the global max_rows_to_group_by limit is enforced, so optimizeAggregationPerPartition now falls back to normal aggregation whenever max_rows_to_group_by is set, matching the sharded-aggregation and distributed-plan paths. Backward-compatible behavior can be restored with compatibility = '26.6'.

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Enable allow_aggregate_partitions_independently by default. When a GROUP BY key suits the partition key, ClickHouse can aggregate each partition independently and skip the global merging step. Runtime heuristics automatically skip the optimization when the partition layout would make it unfavorable (too few partitions, too many partitions, or significantly skewed partition sizes).

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

When the partition key suits the `GROUP BY` key, ClickHouse can route each partition through a separate aggregation pipeline and skip the merging step entirely. The setting that gates this optimization, `allow_aggregate_partitions_independently`, has been off by default since it was introduced in 23.2 (#45364) because the optimization is beneficial only when there are enough partitions of roughly equal size. Runtime heuristics in `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePort` already implement those checks: they disable the optimization when the partition count is below `max_threads / 2`, above `max_number_of_partitions_for_independent_aggregation` (128 by default), or when the largest partition has more than twice the average number of rows. In every other case the optimization either pays off or is silently skipped, so the gate-setting can safely default to true. `force_aggregate_partitions_independently` (which bypasses the heuristics) is left at false. Backward-compatible behavior can be restored with `compatibility = '26.4'`.

clickhouse-gh · 2026-05-16T15:53:43Z

Workflow [PR], commit [4644410]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
Fast test (arm_darwin)		FAIL
	03135_keeper_client_find_commands	FAIL	cidb
Stress test (arm_ubsan)		FAIL
	Logical error: Invalid number of rows in Chunk A column B at position C: expected D, got E (STID: 3366-2d1d)	FAIL	cidb

AI Review

Summary

This PR flips allow_aggregate_partitions_independently to default-on, records the change in SettingsChangesHistory, adds a max_rows_to_group_by fallback in optimizeAggregationPerPartition, and extends tests/docs around the optimization. The default flip itself is well-covered now, but the new fallback also changes the long-standing explicit opt-in path and that behavior change is not actually restorable via compatibility = '26.6', so the rollback contract in the PR is incomplete.

Findings

⚠️ Majors
- [src/Processors/QueryPlan/Optimizations/useDataParallelAggregation.cpp:121] The new max_rows_to_group_by != 0 early return does not just protect the new default-on path; it also changes the pre-26.7 explicit allow_aggregate_partitions_independently / force_aggregate_partitions_independently behavior for any installation that keeps a non-zero safety-net max_rows_to_group_by in a profile. The test adjustment in 02521_aggregation_by_partitions.sql is evidence: it now has to clear the inherited 10G limit to keep exercising the old opt-in path at all. Because the only compatibility entry added in this PR is the default flip, compatibility = '26.6' still cannot restore that pre-26.7 explicit opt-in behavior. Suggested fix: either gate this new fallback behind compatibility as well (or another history-controlled setting), or narrow the PR/docs claim so it no longer promises that compatibility = '26.6' restores the previous behavior.

Final Verdict

Status: ⚠️ Request changes

Minimum required action: add a compatibility path for the new max_rows_to_group_by fallback (or otherwise make the old explicit opt-in semantics restorable), or update the PR/docs so they no longer claim that compatibility = '26.6' fully restores the pre-26.7 behavior.

…d_in_order_spread The test asserts the spread/merging pipeline shape from `ReadInOrderOptimizer` (with `MergingAggregatedBucketTransform` / `FinishAggregatingInOrderTransform`). With `allow_aggregate_partitions_independently` now defaulting to `true`, the two partitions are routed through independent aggregation ports (`FinalizeAggregatedTransform` × 2), so the `EXPLAIN PIPELINE` output diverges from the reference. Force the setting off in the test to preserve its original intent. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=b4bd5f71520b0cb1439283ed7e928fe47c905b07&name_0=PR&name_1=Fast%20test PR: #105128

alexey-milovidov · 2026-05-17T23:05:19Z

No changes on performance tests.

…rtitions-independently-by-default

The 26.5 block has been closed since the PR was opened, so register the default flip under `26.6` in `SettingsChangesHistory.cpp` and update the docs note accordingly. This also keeps `02995_new_settings_history` happy.

…rtitions-independently-by-default

…ughSeparatePortForAggregation` in settings change reason The reason string for `allow_aggregate_partitions_independently` in `system.settings_changes` referenced `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePort`, which is not a real method name. The actual method is `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation`, so users searching the source from this text would find nothing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-01T03:15:39Z

Fixed in c3d0390 — the reason string now uses the exact method name ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation.

…rtitions-independently-by-default

…g settings The merge of master re-introduced history entries for `query_plan_min_columns_for_join_lazy_indexing` and `query_plan_max_limit_for_join_lazy_indexing`, but these settings were removed from `Settings.cpp` when "Lazily apply selector and replication indexes in join" was reverted on master. Iterating the history for any `compatibility` value below the latest version then tried to set a non-existent setting and threw `UNKNOWN_SETTING`, breaking all compatibility tests in `Fast test`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rtitions-independently-by-default

alexey-milovidov · 2026-06-05T04:27:23Z

Merged master (was 557 commits behind and red) and fixed the Fast test failures.

Root cause: a previous merge-conflict resolution re-introduced SettingsChangesHistory.cpp entries for query_plan_min_columns_for_join_lazy_indexing and query_plan_max_limit_for_join_lazy_indexing, but these settings were removed from Settings.cpp when "Lazily apply selector and replication indexes in join" was reverted on master (133a0fc8edd). SettingsImpl::applyCompatibilitySetting iterates the history and calls get(final_name) before applying the previous value, so any compatibility below the latest version threw UNKNOWN_SETTING — which is why all the compatibility/constraint tests (02324, 02325, 03243, 04234, 04078, 03916, 03011, 02970, 03274, 04209, 04034, 03006, 02933, 03773) failed.

Fix (a68425ec146): dropped the two stale history entries, leaving only the intended allow_aggregate_partitions_independently change. Verified locally — the previously failing compatibility tests now pass (the 04034 parallel-replicas test only needed the parallel_replicas cluster, which CI provides). This also resolves the AI-review blocker.

clickhouse-gh · 2026-06-05T07:45:28Z

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR flips allow_aggregate_partitions_independently on by default (plus a max_rows_to_group_by guard). That optimization only engages when a table's partition key is a function of the GROUP BY key, so partitions can be aggregated independently and the merge skipped. ClickBench's hits table isn't partitioned that way, so the runtime heuristics skip it and Q28's aggregation path is unchanged — the flagged -5.14% improvement on clickbench Q28 is off-path and reads as run-to-run variance, so we downgraded it to not-sure. No other queries were flagged on clickbench or tpch_adapted_1_official.

clickbench

⚠️ 1 inconclusive

Flagged queries (1 of 43)

	Query	Verdict	Baseline median (ms)	PR median (ms)	Change	q-value	Hint
⚠️	28	not_sure	6678	6334	-5.1%	<0.0001	aggregation: Per-partition GROUP BY opt needs partition key = GROUP BY key; hits isn't partitioned so Q28 is off-path noise

_{q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.}

tpch_adapted_1_official

🟢 No significant changes

Debug info

StressHouse run: 636e29ce-da54-4f39-8e13-1867e601e745
MIRAI run: 17e1e2aa-9793-4478-86d7-68ff9720c4f4
PR check IDs:
- clickbench_569455_1782870400
- clickbench_569461_1782870400
- clickbench_569473_1782870400
- tpch_adapted_1_official_569480_1782870400
- tpch_adapted_1_official_569492_1782870400
- tpch_adapted_1_official_569502_1782870400

…rtitions-independently-by-default

alexey-milovidov · 2026-06-14T01:52:05Z

The sole CI failure here is an AST fuzzer (amd_debug) logical error and is unrelated to this PR:

Logical error: 'Variant 1 (String) has size 3, but expected 2'

in DB::ColumnVariant state validation, reached via DB::DistinctTransform::transform.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=cfa19d7355537aad6a3633d78e0abf6d2eeaa745&name_0=PR&name_1=AST%20fuzzer%20%28amd_debug%29

This PR only flips the default of allow_aggregate_partitions_independently (plus a docs line, a SettingsChangesHistory entry, and one test). There is no path from an aggregation-setting default to a Variant column size mismatch in DistinctTransform.

Evidence it is a pre-existing, master-wide fuzzer-discoverable ColumnVariant bug (seeded from 04268_cast_keep_nullable_variant.sql): the same Variant <N> (<type>) has size <X>, but expected <Y> logical error has appeared on at least 7 unrelated PRs in the last 45 days — 99639, 102724, 105128, 107053, 107116, 100272, 104435.

@groeneai, could you investigate this failure and provide a fix in a separate PR? If a fix is already in progress, please link it here. I have re-merged master to get a fresh CI run.

groeneai · 2026-06-14T02:54:17Z

@alexey-milovidov You are right that this is unrelated to the aggregation setting. It is the chronic ColumnVariant "Variant N has size X, but expected Y" family (STID 3068), and the specific shape your run hit is already fixed.

That failure (AST fuzzer (amd_debug), STID 3068-3371, commit cfa19d73):

SELECT DISTINCT NULL, toColumnTypeName(r), toString(toUInt32(...), v) AS r FROM t_variant_null__fuzz_48 ...
INTERSECT SELECT DISTINCT toString(v) AS r, ... FROM t_variant_null__fuzz_0

where v Variant(Array(Array(UInt64)), String) holds one non-empty variant plus NULLs. toString(v) goes through ConvertImplFromVariantToColumn::execute, which filtered the variant to non-null rows, ran the nested function, then expanded the result in place via assumeMutable(). For a single-variant-plus-NULL column the nested result aliases the source subcolumn (use_count > 1), so the in-place expand grew the shared subcolumn and left the source ColumnVariant with a subcolumn larger than its discriminator count. DISTINCT (DistinctTransform) was the validating op that then tripped validateState.

Fixed in #107374 (merged 2026-06-13 18:34 UTC, Related: #101415): IColumn::mutate()-before-expand at all 6 filter-then-expand sites (1 ConvertImplFromVariantToColumn + 3 FunctionVariantAdaptor + 2 FunctionDynamicAdaptor), which clones when shared, plus regression test 04337_variant_to_column_inplace_mutation.

Verified your exact query against a debug build:

pre-fix binary: SIGABRT, Variant 1 (String) has size 3, but expected 2
fixed HEAD: exit 0, no crash

This PR's current HEAD dd52b610 already contains #107374 (re-merged master at 01:51 UTC), and the fresh run shows AST fuzzer (amd_debug) and (amd_debug, targeted) green at 02:19 UTC. CIDB: 0 master hits for this family in 45 days; the only post-merge hit (#99639, 06-13 19:01) is on a stale base 100 commits behind the fix.

Note: the join/runtime-filter manifestation of the same family is a separate fix, #106931 (Closes #101415), still in review. #107374 covers the toString/conversion path your run hit.

…rtitions-independently-by-default

…forces The docs note pointed at the "key factors for a good performance" list as the criteria the runtime guard checks, but `requestOutputEachPartitionThroughSeparatePortForAggregation` only skips the optimisation for too few partitions (fewer than `max_threads / 2`), too many partitions (more than `max_number_of_partitions_for_independent_aggregation`), or heavily skewed partition sizes. It does not skip merely because partitions are small, so listing that factor as a runtime criterion overstated the guard. Enumerate the three checks the heuristic actually performs and note that the list below is general layout guidance, only part of which is enforced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-28T18:20:53Z

We can only merge this if it improves the performance of some queries with no degradation on others.

…s set `max_rows_to_group_by` is a global `GROUP BY` limit that, in normal aggregation, is enforced during the final merge phase. The aggregate-partitions-independently optimization skips that merge and aggregates each partition on a separate pipeline, so the limit was enforced against each partition's own hash table rather than globally. With `allow_aggregate_partitions_independently` off by default this was only reachable on an opt-in path, but enabling it by default exposes the gap: a query such as `PARTITION BY a % 16 GROUP BY a SETTINGS max_rows_to_group_by = 100` could return far more than 100 groups (or apply `group_by_overflow_mode` per partition) instead of honoring the global limit. Disable the optimization in `optimizeAggregationPerPartition` when `max_rows_to_group_by != 0`, falling back to normal aggregation. This matches the two sibling skip-merge paths, which already reject the same case: `AggregatingStep::canUseShardedAggregation` returns `false` and `make_distributed_plan` throws `SUPPORT_IS_DISABLED`. Add a regression test that forces the optimization on a partitioned table whose partition key suits the `GROUP BY` key and asserts the global limit is preserved (`TOO_MANY_ROWS` with `group_by_overflow_mode = 'throw'`). This re-applies the guard that was held back in d262e0c pending the author's decision on how to resolve the gap; the disable-guard approach was chosen. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`aggregation_by_partitions.xml` forces the optimization on both server builds, so it cannot show the effect of flipping the `allow_aggregate_partitions_independently` default - the forced setting behaves identically on the master and PR builds the comparison runs. Add a test that deliberately does not set (or force) the setting, so each build uses its own compiled-in default: the old default (0) aggregates with a global merge, while the new default (1) lets the runtime heuristics aggregate each partition independently and skip the merge. The layout is favorable for the heuristics (partition key is a function of the `GROUP BY` key, many roughly equal-sized partitions within the default 128-partition cap), so the optimization fires on the PR build and the query is faster than on the master build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make the test deterministic and the win measurable across machines: - Pin `max_threads = 32`. The runtime heuristic disables the optimization when the partition count is below `max_threads / 2`, which is tied to the host's core count. On a 96-core machine the natural `max_threads` is 96, so the 32-partition table (32 < 48) would not take the optimization at all and the 64-partition table only barely. Pinning `max_threads = 32` keeps both the 32- and 64-partition tables (>= 16) on the optimized path regardless of the host. - Add a 50M-row size. At 10M rows the query runs in ~0.1s and the signal is dominated by run-to-run variance; at 50M rows the optimized path is stable and clearly faster (locally ~0.33s -> ~0.19s for 64 partitions and ~0.33s -> ~0.21s for 32 partitions, both with normal aggregation as the baseline). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-28T18:47:44Z

Re: the performance bar — aggregation_by_partitions.xml forces the optimization on both the base and PR builds, so it can't show the effect of flipping the default. Added aggregate_partitions_independently_default.xml (037ec3c, tuned in 10828a8): it relies on the default (base 0 = off, this PR 1 = on) on a partition-aligned GROUP BY, with max_threads pinned so the runtime heuristic fires deterministically regardless of the host's core count.

Local measurement (50M rows, max_threads = 32, normal aggregation as the baseline):

partitions	normal aggregation	independent partitions	speedup
64	~0.33s	~0.19s	~45%
32	~0.33s	~0.21s	~37%

No degradation on unfavorable layouts — the heuristics already skip the optimization there, consistent with the no_change cloud-benchmark verdict on ClickBench (whose GROUP BY is not partition-aligned). The CI performance comparison should now show the win on this test for the PR build.

Also re-applied the max_rows_to_group_by correctness guard that was held back in d262e0c; see the resolved thread on src/Core/Settings.cpp.

…rtitions-independently-by-default

The new stateless test was numbered `05020`, but the highest existing test number is `04488`, which makes `Style check/test_numbers_check` fail with `Gap (4415, 5020) > 100` (the check rejects gaps of 100 or more between consecutive stateless test numbers). Rename the `.sql` and `.reference` files to `04489`, the next available number. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ion flake The test verifies that partition-independent aggregation falls back to normal aggregation when `max_rows_to_group_by` is set, by asserting the query throws `TOO_MANY_ROWS`. Under the randomized setting `optimize_aggregation_in_order = 1`, aggregation emits each group as soon as it is complete, so the aggregation hash table never grows past the limit and `max_rows_to_group_by` is never tripped - regardless of partition-independent aggregation, the query succeeds and the assertion fails. Pin `optimize_aggregation_in_order = 0` so the test deterministically exercises the global limit enforced via the merge phase that the optimization skips. CI report: #105128 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Enabling `allow_aggregate_partitions_independently` by default also makes the optimization fall back to normal aggregation whenever `max_rows_to_group_by` is set (the per-partition pipeline skips the merge phase where the global limit is enforced). The stateless test profile (`tests/config/users.d/limits.yaml`) sets `max_rows_to_group_by` to a high value (`10G`) as a safety net "so it will not limit anything", which is non-zero and therefore disables the optimization. `02521_aggregation_by_partitions` exists to verify the partition-independent pipeline, so in CI every `EXPLAIN PIPELINE` collapsed to ordinary aggregation and the test failed deterministically on every shard (`@@ -1,225 +1,94 @@`), while it passed on master where the fallback does not exist. The sibling test `01551_mergetree_read_in_order_spread` was already adjusted in this pull request for the same reason. Clear the limit with `set max_rows_to_group_by = 0` at the top so the test keeps exercising the partition-independent pipeline it is meant to verify. CI report: #105128 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rtitions-independently-by-default

The test verified the `max_rows_to_group_by` guard only through the runtime `TOO_MANY_ROWS` error, which is fragile: `max_rows_to_group_by` is enforced against each aggregation hash table, so whether it trips depends on how the partition layout maps onto threads and on whether the query is distributed. Two CI randomizations defeated it: * `enable_parallel_replicas = 1` routes the aggregation across replicas, where the limit is enforced per replica rather than globally - an orthogonal path this optimization does not touch (the guard already disables independent per-partition aggregation on the local plan when the limit is set). Pin it off so the randomizer cannot select it. * When the optimization is active and each partition gets its own aggregating transform (e.g. `max_threads` >= number of partitions), every per-partition hash table stays under the limit and nothing throws - exactly the case the optimization-disabling guard must prevent, but the runtime check observes it only probabilistically. Replace the runtime-only assertion with a deterministic `EXPLAIN PLAN` check: with `force_aggregate_partitions_independently` set, the plan reads each partition through a separate port when there is no limit, and must stop doing so once `max_rows_to_group_by` is set (the guard falls back to normal aggregation whose merge phase enforces the global limit). This is independent of `max_threads` and the other randomized settings. The runtime `TOO_MANY_ROWS` check is kept as an end-to-end confirmation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-30T15:47:34Z

Pushed 9d9d2b007fc — stabilized the only failing test, 04489_aggregate_partitions_independently_max_rows_to_group_by (every red check was this test).

Root cause. The test asserted the max_rows_to_group_by guard only through the runtime TOO_MANY_ROWS error. That limit is enforced against each aggregation hash table, so two CI randomizations could keep a 1000-group query under the per-table limit and never throw:

enable_parallel_replicas = 1 (selected by the randomizer on several runs, including the regular arm_binary, parallel and amd_tsan, s3 storage, parallel runs at b15e83a4): aggregation is split across replicas and max_rows_to_group_by is enforced per replica, not globally. That path is orthogonal to this optimization — the guard already disables independent per-partition aggregation on the local plan when the limit is set — and parallel replicas not enforcing the global GROUP BY limit is pre-existing behavior.
When the optimization is active and max_threads ≥ the partition count, each partition gets its own aggregating transform (~63 keys < 100), so nothing throws — exactly the case the guard must prevent, but the runtime check observes it only probabilistically.

Most of the other red runs were stale pre-pin results (executed before 528e2b22b9a, when the test still lacked the optimize_aggregation_in_order pin — confirmed via the CIDB execution timestamps).

Fix (test-only). Pin enable_parallel_replicas = 0, and replace the runtime-only assertion with a deterministic EXPLAIN PLAN check: with force_aggregate_partitions_independently set, the plan reads each partition through a separate port without a limit (asserts 1) and must stop once max_rows_to_group_by is set (asserts 0 — the guard falls back to normal aggregation whose merge phase enforces the global limit). This is independent of max_threads and the rest of the randomized settings. The runtime TOO_MANY_ROWS check is kept as an end-to-end confirmation.

Verification. Against a recent master binary: with the optimization forced on, EXPLAIN PLAN keeps Read each partition through separate port even when max_rows_to_group_by is set (the bug → the new check returns 1); simulating the guard's effect (normal aggregation) the separate-port read disappears (0) and the query throws TOO_MANY_ROWS under every randomized-setting combination I tried, including the full setting set from the failing runs. The new EXPLAIN assertion is stable (1 then 0) across max_threads/group_by_two_level_threshold/enable_memory_bound_merging_of_aggregation_results/… I could not run a from-scratch build-verify in this environment (the worktree needed a near-full rebuild), so the guard's compiled plan effect itself was not re-executed here.

The stateless test profile (`tests/config/users.d/limits.yaml`) sets `max_rows_to_group_by = 10G` as a high "won't limit anything" safety net. That value is non-zero, so the new guard in `optimizeAggregationPerPartition` already disables independent per-partition aggregation. The first positive `EXPLAIN` check in `04489_aggregate_partitions_independently_max_rows_to_group_by` inherited that limit and therefore returned `0` instead of `1`, failing in the `Fast test` job: @@ -1,2 +1,2 @@ -1 +0 0 Reset `max_rows_to_group_by` to 0 before the positive case so the no-limit path is actually exercised, then set the explicit limit only for the negative and runtime checks. This mirrors the fix already applied to `02521_aggregation_by_partitions`. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=9d9d2b007fc87b1fc697da9ae891954ec9bd6b76&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The note in `custom-partitioning-key.md` described the size-skew check as "the largest partition holds more than twice the average number of rows", implying a comparison against `sum_rows / partitions_cnt`. The actual heuristic in `ReadFromMergeTree::requestOutputEachPartitionThroughSeparatePortForAggregation` compares the largest partition against `2 * (sum_rows / max_threads)`, which differs from the average partition size whenever the partition count is not equal to `max_threads`. Reword the note to describe the real threshold. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-30T19:44:43Z

Addressed the AI review's two required actions (pushed 9d9d2b007fc..529e0ffdc1c):

04489_aggregate_partitions_independently_max_rows_to_group_by Fast test failure (commit cdaa0112202). The first positive EXPLAIN check was inheriting max_rows_to_group_by = 10G from the stateless profile (tests/config/users.d/limits.yaml, installed for every job including Fast test via tests/config/install.sh). Because the new guard in optimizeAggregationPerPartition returns for any non-zero max_rows_to_group_by, the optimization was disabled and the check returned 0 instead of 1:
```
@@ -1,2 +1,2 @@
-1
+0
 0
```
The CI evidence itself confirms the root cause: line 2 (with the explicit 100 limit) already returned 0, proving the guard fires on a non-zero limit in the real binary; line 1 returned 0 only because it inherited the profile's 10G. The fix resets max_rows_to_group_by = 0 before the positive case and sets the explicit limit only for the negative and runtime checks — the same approach already used in 02521_aggregation_by_partitions.
Docs skew-threshold wording (commit 529e0ffdc1c). The note now describes the heuristic as "the largest partition holds more rows than twice the total number of rows divided by max_threads", matching the 2 * (sum_rows / max_threads) comparison in requestOutputEachPartitionThroughSeparatePortForAggregation (it previously implied 2 * sum_rows / partitions_cnt).

Also updated the PR description: corrected the same skew wording, fixed the function name to ...ForAggregation, and the compatibility value to 26.6 (the change lands in the 26.7 block of SettingsChangesHistory).

Both review threads are resolved. I confirmed no other stateless test is affected by the guard under the 10G profile: only 02521, 01551 and 04489 assert on the plan/pipeline (all handled); the other tests that exercise the optimization (02681, 04234, 03093, and the LIMIT BY tests) are result-based and the guard only changes the execution path, not the results.

A local full rebuild to re-run the test against a guard-enabled binary was infeasible in this environment (the worktree checkout invalidated the build cache, triggering a near-full rebuild), but the fix is logically airtight given the CI output above and the guard code (if (params.max_rows_to_group_by != 0) return;).

…rtitions-independently-by-default

`master` added `04489_max_threads_auto_parsing_compat` (and tests at the `04490` and `04491` prefixes), so move the regression test for the `max_rows_to_group_by` guard to the next free prefix `04492`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-30T19:55:12Z

Also merged the latest master (the branch was 215 commits behind; the merge is clean — git merge-tree confirmed no conflicts, which also cleared a transient CONFLICTING mergeability flag GitHub reported right after the previous push). In the process master had added 04489_max_threads_auto_parsing_compat plus tests at the 04490/04491 prefixes, so the regression test was renumbered 04489 → 04492 to avoid the prefix collision (commit 4529ec352bd). PR is now MERGEABLE.

`04492_aggregate_partitions_independently_max_rows_to_group_by` still failed in CI (e.g. `Stateless tests (amd_tsan, parallel, 2/2)` and the flaky checks), including with `optimize_aggregation_in_order = 0`: The query succeeded but the server error '158' was expected (SELECT a FROM t_apart_max_rows GROUP BY a FORMAT Null; -- { serverError TOO_MANY_ROWS }) Root cause: `max_rows_to_group_by` is checked against each aggregation lane's own hash table, not against a single global one. Even with the per-partition optimization disabled (the `EXPLAIN` checks correctly return `0`), the read-in-order spread gives each of the 16 partitions its own aggregating lane. At `max_threads = 3` the pipeline has 19 `AggregatingTransform` lanes, each holding ~63 distinct keys (< the limit of 100), so the limit never trips and the query returns all 1000 groups without throwing. This is independent of the optimization (it happens in normal aggregation too), so the runtime check can only be made deterministic by forcing a single lane. Pin `max_threads = 1` so one lane sees all the keys; the limit is then exercised deterministically regardless of the partition layout and the randomized reading settings. Verified on a freshly built binary across 60 randomized-setting runs (0 failures). The `EXPLAIN PLAN` checks are unaffected (independent of `max_threads`). Also corrected the comments: the limit is per-lane, not enforced globally during the merge phase. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=4529ec352bd98263f954917d57d61b535189d034&name_0=PR Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-30T22:39:20Z

Pushed d6bd5c8e89a — the runtime TOO_MANY_ROWS check in 04492 was still failing on the latest run (4529ec35), e.g. the regular Stateless tests (amd_tsan, parallel, 2/2) and arm_binary, parallel runs, all with optimize_aggregation_in_order = 0 and enable_parallel_replicas = 0 already pinned:

The query succeeded but the server error '158' was expected
(SELECT a FROM t_apart_max_rows GROUP BY a FORMAT Null; -- { serverError TOO_MANY_ROWS })

The earlier fix assumed the per-table-limit escape only happens when "the optimization is active and max_threads >= the partition count". It actually happens with the optimization disabled too. With the guard correctly returning 0 for the separate-port EXPLAIN check (normal aggregation), the read-in-order spread still gives each of the 16 partitions its own aggregating lane. Reproduced on a freshly built guard-enabled binary with the exact failing settings (max_threads = 3):

Resize 19 -> 3
  AggregatingTransform x 19
    ...
      MergeTreeSelect(pool: ReadPool, algorithm: Thread) x 3 0 -> 1
        ExpressionTransform x 32
          MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder) x 16 0 -> 1

19 AggregatingTransform lanes, each holding ~63 keys (< 100), so max_rows_to_group_by never trips. This is independent of the optimization, so the runtime assertion can only be made deterministic by forcing a single lane.

Fix (test-only): pin max_threads = 1 so one lane sees all 1000 keys. Verified on the built binary across 60 randomized-setting runs (0 failures), including the full setting set from the failing runs, optimize_aggregation_in_order 0/1, max_threads 1..16, the read-in-order spread, and memory-bound merging. The EXPLAIN PLAN checks are unaffected. Also corrected the comments: max_rows_to_group_by is enforced per aggregation lane, not globally during the merge phase.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105128&sha=4529ec352bd98263f954917d57d61b535189d034&name_0=PR

alexey-milovidov · 2026-07-01T03:54:39Z

Re-ran the two red jobs on d6bd5c8e89a (run 28480451574). Both are known, unrelated flakes — the regression test 04492 no longer appears among the failures, so the max_threads = 1 pin held.

Stateless tests (arm_binary, parallel) → 01018_Distributed__shard_num. Not a 04492/aggregation failure: it threw Code: 102 … Unexpected packet from server 127.0.0.2:9000 (expected TablesStatusResponse, got ProfileInfo) … (UNEXPECTED_PACKET_FROM_SERVER) on SELECT _shard_num, * FROM remote('127.0.0.{1,2}', system.one) — a connection-establishment race in ConnectionEstablisher::run → Connection::getTablesStatus, on a query with no aggregation path. CI's own flaky-test harness re-ran it 152 times with the identical randomized settings, 152/152 passed ("not reproducible, likely a transient issue"). This test also fails intermittently on unrelated PRs and on master trunk (e.g. PR=0 on 2026-06-09).
Stress test (arm_debug) → Logical error: Inconsistent KeyCondition behavior. The debug-only (#ifndef NDEBUG) assertion in MergeTreeDataSelectExecutor, reached via AST-fuzzer mutations — compiled out in release. Master-wide flake: 22 distinct PRs since 2026-06-15, 0 hits on master trunk, only in *_debug/stress/AST-fuzzer builds. Tracked by the open umbrella issue AST-fuzzer-internal-path: executor/planner assertions on type-invalid inputs the analyzer normally rejects (5 sites: IColumn, KeyCondition, SingleValueData, ActionsDAG ×2) #107951 . This PR touches only the aggregation setting default and optimizeAggregationPerPartition, with no MergeTree/KeyCondition code path.

No re-merge: the branch is 138 commits behind but the merge base is <1 day old and master has not touched any file this PR modifies, so a re-merge would only re-roll the same flakes without a fix.

alexey-milovidov · 2026-07-01T06:15:37Z

Heads-up on a new test-number collision that appeared on master after the last sync — it won't surface as a git conflict, so flagging it explicitly.

master merged f135d0e70e7 ("Renumber test 04489 -> 04492 to avoid collision after master merge"), which added 04492_regexp_pattern_with_nul_byte. That prefix-collides with this PR's 04492_aggregate_partitions_independently_max_rows_to_group_by — both were independently renumbered to 04492.

Neither tree collides on its own (merge base fbdcc93bf8a has no 04492, and 04492_regexp is not yet an ancestor of the PR head d6bd5c8e), and a merge is clean (different filenames, no conflict), so it would silently create two 04492_ tests. Worth renumbering the aggregation test to the next free prefix at merge time (on current master, 04490–04492 are taken and 04493 is free, though other in-flight PRs may claim it — best chosen at merge).

Otherwise the PR looks ready: the two earlier reds reran to SUCCESS (01018_Distributed__shard_num and the Stress arm_debug Inconsistent KeyCondition behavior, #107951), AI review is ✅ Approve on d6bd5c8e, 0 unresolved threads, and master has not touched any file this PR modifies, so no re-merge is needed.

alexey-milovidov · 2026-07-01T08:31:00Z

CH Inc sync flipped to a failure after my last note (it was not-run then). Diagnosed the private sync run 28482821586 — all four reds are Cloud-only tests that this PR cannot touch, so they are unrelated flakes:

03010_shared_merge_tree_mutations_kill_mutation (amd_debug, distributed cache) — SharedMergeTree kill-mutation, timing-sensitive.
03016_shared_merge_tree_multiple_tables (amd_asan_ubsan, SharedCatalog) — SharedMergeTree.
test_refreshable_mv_no_multi_read::test_refreshable_mv_attach_without_multi_read (Integration db disk, both 6/7 and old analyzer 6/7) — refreshable-MV attach race (assert 'Disabled' in 'Scheduled'); a recurring cross-PR Cloud flake (also hit e.g. the Fix logical error exception in getConsistentMetadataSnapshotImpl when log pointer goes backwards #100651 sync run).

This PR only flips allow_aggregate_partitions_independently to true, adds the 26.7 SettingsChangesHistory entry, the max_rows_to_group_by != 0 guard in optimizeAggregationPerPartition, docs, and the aggregation regression/perf tests — none of which exercise SharedMergeTree or refreshable materialized views. Re-triggered the failed sync jobs (gh run rerun --failed).

Otherwise unchanged and ready: public CI is fully green on d6bd5c8e (0 failures), AI review is ✅ Approve, 0 unresolved threads, MERGEABLE, and git merge-tree against current master is clean (the Settings.cpp / SettingsChangesHistory.cpp overlaps merge without conflict). The 04492 test-number collision with master's 04492_regexp_pattern_with_nul_byte still stands — best renumbered at merge as noted above.

alexey-milovidov · 2026-07-01T13:14:14Z

Status update since the CH Inc sync re-trigger:

CI is now fully green on d6bd5c8e89a — 161 SUCCESS / 13 SKIP / 0 failures. The CH Inc sync job you re-triggered has flipped to SUCCESS, and the AI review is at ✅ Approve with 0 unresolved threads.
GitHub now shows CONFLICTING/DIRTY, but this is a stale mergeability computation, not a real conflict. master advanced 531 commits under the unchanged head, so GitHub recomputed. git merge-tree --write-tree origin/master d6bd5c8e89a against the current master tip (bf1980a9bd1) returns a clean tree (exit 0, no conflict markers). I verified the merged SettingsChangesHistory.cpp 26.7 block contains both this PR's allow_aggregate_partitions_independently entry and master's newly-added entries (explain_query_plan_default, optimize_and_compare_chain_max_hash_work, etc.) — nothing is dropped.

So no re-merge is needed to resolve a conflict; the flag should clear on GitHub's next mergeability recompute. If it lingers, a trivial master merge at merge time (where the 04492 test-number renumber lands anyway) resolves it. Held off on re-merging so as not to discard the green CI or restart the performance comparison.

…rtitions-independently-by-default

alexey-milovidov · 2026-07-02T05:49:52Z

Re-merged master (4644410000c) to clear the stale CONFLICTING flag — the branch was 942 commits behind and GitHub reported a conflict, but the 3-way merge is conflict-free. Verified the net diff against master is unchanged (9 files, +138/-3): the allow_aggregate_partitions_independently = true default, the SettingsChangesHistory entry (correctly in the 26.7 block), the max_rows_to_group_by guard in optimizeAggregationPerPartition, the 04492 test, and the docs all survived intact. Fresh CI is running; the branch is now MERGEABLE.

clickhouse-gh · 2026-07-02T06:05:56Z

+    /// more groups (or apply `group_by_overflow_mode`) per partition than the global limit allows.
+    /// Fall back to normal aggregation, matching `AggregatingStep::canUseShardedAggregation` and
+    /// `make_distributed_plan`, both of which reject this case for the same reason.
+    if (aggregating_step->getParams().max_rows_to_group_by != 0)


This guard changes more than the new default: it also changes the pre-26.7 explicit opt-in path for allow_aggregate_partitions_independently / force_aggregate_partitions_independently. Before this PR, a query or profile that set those flags still took the per-partition path even if max_rows_to_group_by was a high non-zero safety net; after this change it silently falls back to normal aggregation for any non-zero value. The updated test 02521_aggregation_by_partitions.sql is already evidence of that behavior change: it now has to clear the inherited 10G limit just to keep exercising the old opt-in path.

The problem is that compatibility = '26.6' does not restore that old behavior. The only compatibility entry added in this PR is the default flip in SettingsChangesHistory.cpp, so existing deployments that explicitly enabled the optimization will still lose it under compatibility = '26.6' whenever they also have a non-zero max_rows_to_group_by in a profile. If that stricter fallback is intentional, it needs its own compatibility story (for example, a gated guard or an additional history-controlled setting); otherwise the PR/docs should stop promising that compatibility = '26.6' restores the pre-26.7 behavior.

clickhouse-gh · 2026-07-02T09:14:27Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered: 22/22 (100.00%) · Uncovered code

Full report · Diff report

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 16, 2026

clickhouse-gh Bot reviewed May 16, 2026

View reviewed changes

Comment thread src/Core/Settings.cpp

alexey-milovidov added 4 commits May 25, 2026 18:28

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

c24ae9b

…rtitions-independently-by-default

Move allow_aggregate_partitions_independently change to 26.6 block

cb9be35

The 26.5 block has been closed since the PR was opened, so register the default flip under `26.6` in `SettingsChangesHistory.cpp` and update the docs note accordingly. This also keeps `02995_new_settings_history` happy.

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

4b48454

…rtitions-independently-by-default

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

500f255

…rtitions-independently-by-default

clickhouse-gh Bot reviewed May 30, 2026

View reviewed changes

Comment thread src/Core/SettingsChangesHistory.cpp Outdated

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

fc88a71

…rtitions-independently-by-default

clickhouse-gh Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/Core/SettingsChangesHistory.cpp

alexey-milovidov and others added 2 commits June 5, 2026 04:07

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

535dbc2

…rtitions-independently-by-default

alexey-milovidov added 3 commits June 9, 2026 22:36

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

1c410fb

…rtitions-independently-by-default

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

cfa19d7

…rtitions-independently-by-default

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

dd52b61

…rtitions-independently-by-default

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

661f1b7

…rtitions-independently-by-default

clickhouse-gh Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread src/Core/Settings.cpp

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

d5a3073

…rtitions-independently-by-default

clickhouse-gh Bot reviewed Jun 21, 2026

View reviewed changes

Comment thread docs/en/engines/table-engines/mergetree-family/custom-partitioning-key.md Outdated

clickhouse-gh Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread src/Core/Settings.cpp

alexey-milovidov and others added 3 commits June 28, 2026 18:31

alexey-milovidov and others added 6 commits June 29, 2026 18:57

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

84d4979

…rtitions-independently-by-default

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

b15e83a

…rtitions-independently-by-default

clickhouse-gh Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04492_aggregate_partitions_independently_max_rows_to_group_by.sql

Comment thread docs/en/engines/table-engines/mergetree-family/custom-partitioning-key.md Outdated

alexey-milovidov and others added 2 commits June 30, 2026 19:42

alexey-milovidov and others added 2 commits June 30, 2026 19:48

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

302bc52

…rtitions-independently-by-default

alexey-milovidov mentioned this pull request Jul 1, 2026

Compress hash join right-side blocks in memory before OOM / spilling #107667

Open

1 task

Merge remote-tracking branch 'origin/master' into enable-aggregate-pa…

4644410

…rtitions-independently-by-default

clickhouse-gh Bot reviewed Jul 2, 2026

View reviewed changes

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

alexey-milovidov commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

clickhouse-gh Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Findings

Final Verdict

Uh oh!

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

Uh oh!

alexey-milovidov commented Jun 1, 2026

Uh oh!

Uh oh!

alexey-milovidov commented Jun 5, 2026

Uh oh!

clickhouse-gh Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

clickbench

tpch_adapted_1_official

Uh oh!

alexey-milovidov commented Jun 14, 2026

Uh oh!

groeneai commented Jun 14, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented Jun 28, 2026

Uh oh!

alexey-milovidov commented Jun 28, 2026

Uh oh!

alexey-milovidov commented Jun 30, 2026

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented Jun 30, 2026

Uh oh!

alexey-milovidov commented Jun 30, 2026

Uh oh!

alexey-milovidov commented Jun 30, 2026

Uh oh!

alexey-milovidov commented Jul 1, 2026

Uh oh!

alexey-milovidov commented Jul 1, 2026

Uh oh!

alexey-milovidov commented Jul 1, 2026

Uh oh!

alexey-milovidov commented Jul 1, 2026

Uh oh!

alexey-milovidov commented Jul 2, 2026

Uh oh!

clickhouse-gh Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot commented Jul 2, 2026

LLVM Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alexey-milovidov commented May 16, 2026 •

edited

Loading

clickhouse-gh Bot commented May 16, 2026 •

edited

Loading

clickhouse-gh Bot commented Jun 5, 2026 •

edited

Loading