Speed up `LIMIT BY` by running it independently per `MergeTree` partition by nihalzp · Pull Request #105126 · ClickHouse/ClickHouse · GitHub
Skip to content

Speed up LIMIT BY by running it independently per MergeTree partition#105126

Merged
nihalzp merged 23 commits into
ClickHouse:masterfrom
nihalzp:partitioned-limit-by
May 25, 2026
Merged

Speed up LIMIT BY by running it independently per MergeTree partition#105126
nihalzp merged 23 commits into
ClickHouse:masterfrom
nihalzp:partitioned-limit-by

Conversation

@nihalzp

@nihalzp nihalzp commented May 16, 2026

Copy link
Copy Markdown
Member

LIMIT BY queries on MergeTree tables currently call pipeline.resize(1) to merge every upstream stream into one, then attach a single LimitByTransform; the hash table sees the entire input on one thread regardless of how many partitions the data is spread across. This PR runs LIMIT BY per partition in parallel when the partition expression is a deterministic function of the LIMIT BY columns; as a result, no LIMIT BY group can ever span two partitions.

CREATE TABLE t (a UInt64, b UInt64) ENGINE = MergeTree ORDER BY tuple() PARTITION BY a % 8;
INSERT INTO t SELECT number, number FROM numbers(10000000);

EXPLAIN PIPELINE
SELECT a FROM t WHERE b > 1 LIMIT 10 BY a
SETTINGS allow_limit_by_partitions_independently = 1;
    ┌─explain──────────────────────────────────────┐
 1. │ (Expression)                                 │
 2. │ ExpressionTransform × 8                      │
 3. │   (LimitBy)                                  │
 4. │   LimitByTransform × 8                       │ <- now × 8 (was × 1)
 5. │     (Expression)                             │
 6. │     ExpressionTransform × 8                  │
 7. │       (Expression)                           │
 8. │       ExpressionTransform × 8                │
 9. │         (ReadFromMergeTree)                  │
10. │         Resize 7 → 1                         │
11. │           MergeTreeSelect(...                │
    └──────────────────────────────────────────────┘
image

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Speed up LIMIT BY queries on partitioned MergeTree tables by running LIMIT BY inside each partition's stream in parallel, instead of merging all streams into one before applying the limit. This applies when the partition expression is a deterministic function of the LIMIT BY columns, so no LIMIT BY group can span two partitions. Controlled by the new setting allow_limit_by_partitions_independently (enabled by default).

Version info

  • Merged into: 26.6.1.125

@clickhouse-gh

clickhouse-gh Bot commented May 16, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 16, 2026

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Related question - do you know why allow_aggregate_partitions_independently is not true by default? If not, I will send a PR enabling it. And the second question - do we have the same for DISTINCT? If not, let's add, it's even more important than for LIMIT BY.

@alexey-milovidov alexey-milovidov self-assigned this May 16, 2026
@nihalzp

nihalzp commented May 16, 2026

Copy link
Copy Markdown
Member Author

Ok. Related question - do you know why allow_aggregate_partitions_independently is not true by default?

Yes. Actutally, few days ago, I asked the same question to @nickitat. The issue is that for skewed data this can slow down queries significantly.

@nihalzp

nihalzp commented May 16, 2026

Copy link
Copy Markdown
Member Author

And the second question - do we have the same for DISTINCT? If not, let's add, it's even more important than for LIMIT BY.

We do not have it for DISTINCT. But it is already planned and we can reuse most of the code from this PR.

Comment on lines +368 to +372
if (optimization_settings.aggregate_partitions_independently)
optimizeAggregationPerPartition(frame_node, nodes, optimization_settings);

if (optimization_settings.limit_by_partitions_independently)
optimizeLimitByPerPartition(frame_node, nodes, optimization_settings);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These optimizations were previously run in first pass. But that's risky as some projection optimization can replace the source after we've set the flags, causing streams to no longer carry disjoint partitions but the transform still expecting them to. This current place is correct as at this point query plan has been stabilized.

@alexey-milovidov

Copy link
Copy Markdown
Member

This was fixed by #105146. Let's update the branch.

@clickhouse-gh

clickhouse-gh Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 91.40% 91.40% +0.00%
Branches 76.60% 76.60% +0.00%

Changed lines: 82.47% (127/154) | lost baseline coverage: 1 line(s) · Uncovered code

Full report · Diff report

@nihalzp nihalzp added this pull request to the merge queue May 25, 2026
Merged via the queue into ClickHouse:master with commit ab6b5ff May 25, 2026
166 of 167 checks passed
@nihalzp nihalzp deleted the partitioned-limit-by branch May 25, 2026 06:23
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 25, 2026
pull Bot pushed a commit to CrazyForks/ClickHouse that referenced this pull request May 27, 2026
The test (added in ClickHouse#105126) has ~30 scenarios that each do
INSERT + EXPLAIN, and the MSan slowdown pushes it past the per-test
timeout. It is consistently flaky on `Stateless tests (amd_msan, ...)`
while passing on every other sanitizer/build flavor.

Adding `no-msan` since the test exercises pipeline planning, not
memory correctness, so MSan adds little signal.

CI report (first failing master commit, sha 69102a31ed94):
https://d1k2gkhrlfqv31.cloudfront.net/clickhouse-test-reports-private/json.html?REF=master&sha=69102a31ed94&name_0=MasterCI&name_1=Stateless%20tests%20%28amd_msan%2C%20meta%20in%20keeper%2C%20s3%20storage%2C%20parallel%2C%202%2F2%29
DavidHe-2008 pushed a commit to DavidHe-2008/ClickHouse that referenced this pull request Jun 1, 2026
Speed up `LIMIT BY` by running it independently per `MergeTree` partition
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants