Naive Sharded Aggregation for high cardinality data by nihalzp · Pull Request #104233 · ClickHouse/ClickHouse · GitHub
Skip to content

Naive Sharded Aggregation for high cardinality data#104233

Merged
nihalzp merged 32 commits into
ClickHouse:masterfrom
nihalzp:naive-sharded-aggregation
May 25, 2026
Merged

Naive Sharded Aggregation for high cardinality data#104233
nihalzp merged 32 commits into
ClickHouse:masterfrom
nihalzp:naive-sharded-aggregation

Conversation

@nihalzp

@nihalzp nihalzp commented May 6, 2026

Copy link
Copy Markdown
Member

Simplified version of #99581 that does not have many of the optimizations like no materialization during scatter, single hashing and reuse of hash, single serialization and reuse of serialized bytes.

There are mostly regressions since data is not usually high cardinality and evenly distributed but If enabled for perf tests, some tests have sped up:
image
image
image
image
image
image
image
image
image

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

New GROUP BY optimization for high cardinality evenly distributed keys that scatters rows across threads by hashing the grouping key, so each thread aggregates a disjoint subset of keys without a merge phase. Set enable_sharding_aggregator = 1 to enable it.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.6.1.128

@clickhouse-gh

clickhouse-gh Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 6, 2026

@clickhouse-gh clickhouse-gh Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting inline findings only.

@nihalzp

nihalzp commented May 7, 2026

Copy link
Copy Markdown
Member Author

Similar performance improvement as the optimized sharded aggregation for high cardinality data and aggregate functions with expensive merge.
image

Some queries are slowed down bit more cases where data copy for scatter is expensive and we rehash. The following query was slowed down by +1.535x only in the optimized version.
image

@nihalzp nihalzp requested a review from nickitat May 7, 2026 13:38
params.group_by_two_level_threshold_bytes = 0;

/// Sharded aggregation does not implement temporary-file spill/merge yet.
params.max_bytes_before_external_group_by = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling params.max_bytes_before_external_group_by here means the sharded path ignores external aggregation and always keeps hash tables in memory. With optimize_aggregation_by_sharding enabled by default, this is a behavior change for queries that previously spilled to disk: they can now hit memory-limit exceptions instead.

Please either (1) keep sharded aggregation disabled by default until spill support exists, or (2) do not take the sharded path when external aggregation settings are configured, so existing memory-safety behavior is preserved.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

params.max_bytes_before_external_group_by is still unconditionally reset to 0 in the sharded path (AggregatingStep::transformPipeline, line 344 in current head), so the earlier concern is still live.

The regression is now visible in this PR itself: multiple integration tests had to explicitly set enable_sharding_aggregator = 0 to keep external GROUP BY behavior (test_max_bytes_ratio_before_external_order_group_by_for_server, test_temporary_data_in_cache, test_tmp_policy, etc.). That means enabling sharded aggregation by default currently removes an existing spill-to-disk safety mechanism for eligible queries.

Please keep sharded aggregation off unless external aggregation is unsupported for that query shape, or gate this path when external-aggregation settings are active.

Comment thread src/Processors/Transforms/BufferedShardByHashTransform.cpp
@egor-click

egor-click commented May 7, 2026

Copy link
Copy Markdown
Contributor

Similar performance improvement as the optimized sharded aggregation for high cardinality data and aggregate functions with expensive merge. image

Some queries are slowed down bit more cases where data copy for scatter is expensive and we rehash. The following query was slowed down by +1.535x only in the optimized version. image

tbh i noticed this pr while working on perf comparison dashboard, and it looks like we have a lot of degradations, especially in quantile and some group by , some by 5-10x

@nihalzp

nihalzp commented May 7, 2026

Copy link
Copy Markdown
Member Author

tbh i noticed this pr while working on perf comparison dashboard, and it looks like we have a lot of degradations, especially in quantile and some group by , some by 5-10x

Yes, this is expected especially for low cardinality and skewed data. My comment was mainly comparing difference between this PR and #99581 (which is this PR + some optimizations). Both have many degradations.

@nickitat nickitat self-assigned this May 15, 2026
Comment thread src/Core/Settings.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp
Comment thread src/Processors/Transforms/BufferedShardByHashTransform.h
Comment thread src/Processors/Transforms/BufferedShardByHashTransform.cpp

/// Try to pull a new input chunk.
input.setNeeded();
if (input.hasData())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to limit the max queue length and don't accept new chunks while already at this limit.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch. Actually, the original optimized sharded aggregation had it but later I removed it for this naive version because I thought since we split the chunks into smaller parts the memory will be equivalent whether the content is in hash table or in IColumn chunks. But then realized, for low cardinality case, the memory difference would be quite high between queued chunks in BufferedShardByHashTransform.cpp and their memory in hash tables after aggregation.

I have added a limit of 10 chunks at most. Maybe we can also put a limit in terms of rows or bytes.

/// causing them to cluster into a small subset of hash table buckets.
/// The golden ratio constant ensures thorough bit mixing with a single multiply.
/// Combine the mix with Lemire fastrange to map into [0, num_shards) without a divide.
static constexpr size_t fibonacci_hash_multiplier = 0x9e3779b97f4a7c15ULL;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have similar logic implemented in scatterBlockByHashGeneric. Maybe we could reuse it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not use scatterBlockByHashGeneric directly because the signature requires Block and but we only have Chunk and there would be conversion overhead if used. I decided to reuse only JoinCommon::hashToSelector to avoid doing hashing manually.

@nickitat nickitat left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, modulo a few comments.

Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/Transforms/BufferedShardByHashTransform.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
@clickhouse-gh

clickhouse-gh Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.10% -0.10%
Functions 91.40% 91.40% +0.00%
Branches 76.60% 76.60% +0.00%

Changed lines: 89.34% (218/244) · Uncovered code

Full report · Diff report

@nihalzp nihalzp added this pull request to the merge queue May 25, 2026
Merged via the queue into ClickHouse:master with commit 4775f2d May 25, 2026
326 of 329 checks passed
@nihalzp nihalzp deleted the naive-sharded-aggregation branch May 25, 2026 17:14
@clickgapai

Copy link
Copy Markdown
Contributor

@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 25, 2026
DavidHe-2008 pushed a commit to DavidHe-2008/ClickHouse that referenced this pull request Jun 1, 2026
…gation

Naive Sharded Aggregation for high cardinality data
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants