Speed up LIMIT BY by running it independently per MergeTree partition#105126
Conversation
alexey-milovidov
left a comment
There was a problem hiding this comment.
Ok. Related question - do you know why allow_aggregate_partitions_independently is not true by default? If not, I will send a PR enabling it. And the second question - do we have the same for DISTINCT? If not, let's add, it's even more important than for LIMIT BY.
Yes. Actutally, few days ago, I asked the same question to @nickitat. The issue is that for skewed data this can slow down queries significantly. |
We do not have it for |
| if (optimization_settings.aggregate_partitions_independently) | ||
| optimizeAggregationPerPartition(frame_node, nodes, optimization_settings); | ||
|
|
||
| if (optimization_settings.limit_by_partitions_independently) | ||
| optimizeLimitByPerPartition(frame_node, nodes, optimization_settings); |
There was a problem hiding this comment.
These optimizations were previously run in first pass. But that's risky as some projection optimization can replace the source after we've set the flags, causing streams to no longer carry disjoint partitions but the transform still expecting them to. This current place is correct as at this point query plan has been stabilized.
|
This was fixed by #105146. Let's update the branch. |
LLVM Coverage ReportChanged lines: 82.47% (127/154) | lost baseline coverage: 1 line(s) · Uncovered code |
ab6b5ff
The test (added in ClickHouse#105126) has ~30 scenarios that each do INSERT + EXPLAIN, and the MSan slowdown pushes it past the per-test timeout. It is consistently flaky on `Stateless tests (amd_msan, ...)` while passing on every other sanitizer/build flavor. Adding `no-msan` since the test exercises pipeline planning, not memory correctness, so MSan adds little signal. CI report (first failing master commit, sha 69102a31ed94): https://d1k2gkhrlfqv31.cloudfront.net/clickhouse-test-reports-private/json.html?REF=master&sha=69102a31ed94&name_0=MasterCI&name_1=Stateless%20tests%20%28amd_msan%2C%20meta%20in%20keeper%2C%20s3%20storage%2C%20parallel%2C%202%2F2%29
Speed up `LIMIT BY` by running it independently per `MergeTree` partition

LIMIT BYqueries onMergeTreetables currently callpipeline.resize(1)to merge every upstream stream into one, then attach a singleLimitByTransform; the hash table sees the entire input on one thread regardless of how many partitions the data is spread across. This PR runsLIMIT BYper partition in parallel when the partition expression is a deterministic function of theLIMIT BYcolumns; as a result, noLIMIT BYgroup can ever span two partitions.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Speed up
LIMIT BYqueries on partitionedMergeTreetables by runningLIMIT BYinside each partition's stream in parallel, instead of merging all streams into one before applying the limit. This applies when the partition expression is a deterministic function of theLIMIT BYcolumns, so noLIMIT BYgroup can span two partitions. Controlled by the new settingallow_limit_by_partitions_independently(enabled by default).Version info
26.6.1.125