Restore O(1) reserve in MergeSorter::mergeBatchImpl (fix ORDER BY perf regression)#108179
Conversation
|
cc @KochetovNicolai @vdimir @4ertus2 — could you review this? It restores the O(1) result-block reservation in |
|
Workflow [PR], commit [fe4258e] Summary: ❌
AI ReviewSummaryThis PR restores the Final VerdictStatus: ✅ Approve |
|
CI fully finished on HEAD e7c1bde. The remaining red checks are pre-existing chronic trunk failures, not caused by this PR:
Neither is in this PR's code path (the change guards the predicate-statistics per-granule counters in |
|
📊 Cloud Performance Report 🟢 AI verdict: This PR trims the merge-sort output reservation in SortingTransform, making it O(1) per output block and avoiding over-reserving columns for high-cardinality sorts. The improvements on the GROUP BY ... ORDER BY ... LIMIT queries (Q15, Q18, Q28, Q32, Q33, Q34) line up cleanly with that change, and Q15's 21% gain on the high-cardinality UserID sort is the strongest signal. Q4 and Q5 are pure COUNT(DISTINCT) queries with no sorting, so their flagged improvements can't come from this diff and were downgraded to not-sure as run-to-run variance. clickbench🟢 6 improved · Flagged queries (8 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🟢 No significant changes Debug info
|
| /// The size of output block will not be larger than the `max_merged_block_size`. | ||
| /// If redundant memory space is reserved, `MemoryTracker` will count more memory usage than actual usage. | ||
| size_to_reserve = std::min(size_to_reserve, max_merged_block_size); | ||
| size_t size_to_reserve = std::min(static_cast<size_t>(chunks[0].getNumRows()), max_merged_block_size); |
There was a problem hiding this comment.
I think it's possible to enlarge default reserve size here to some expected value with O(1)
size_t size_to_reserve = std::min(static_cast<size_t>(chunks[0].getNumRows()) * chunks.size(), max_merged_block_size);
or
size_t size_to_reserve = std::min(static_cast<size_t>(chunks[0].getNumRows()) * chunks.size() / 2, max_merged_block_size);
e7c1bde to
f385d86
Compare
|
@4ertus2 Adopted the first form, pushed as f385d86: |
f385d86 to
43ae659
Compare
Pre-PR validation gate (updated for 43ae659 - O(1) via precomputed
|
| # | Question | Answer |
|---|---|---|
| a | Deterministic repro? | Regressing change isolated deterministically by bisection: the issue's controlled window bd123fed28c5..d8dd2c3a1a82 has exactly one functional source change, #100963 in MergeSorter::mergeBatchImpl. ARM timing repro is the jit_sort query 12 loop; reporter measures it 9/9 on Neoverse-V2. Host here is x86, so absolute ARM numbers are validated by CI / the reporter's pipeline. |
| b | Root cause explained? | #100963 changed the per-output-block reserve from min(chunks[0].rows, cap) to min(sum_of_all_chunk_rows, cap). The output block is already capped at max_merged_block_size, and for a large sort the input chunks are each ~max_merged_block_size rows, so both reserve the same amount. The new code adds a full pass over chunks on every output block in front of the hot k-way merge loop, for no benefit. Profile shows MergeSortingTransform +10.88%. |
| c | Fix matches root cause? | Yes. chunks is not cleared until the merge finishes, so the row sum is constant across output blocks. It is summed once in the constructor (which already iterates the chunks) into total_input_rows, and the per-block reserve is min(total_input_rows, max_merged_block_size) - O(1), no per-block loop. Same reservation value as the #100963 sum form, computed once. |
| d | Test intent preserved / new tests added? | No test modified. Covering perf test is the existing tests/performance/jit_sort.xml query 12. No new functional test needed: results are unchanged because reserve() is a capacity hint. |
| e | Both directions demonstrated? | Reporter validated both directions on ARM: affected build ~+18% vs baseline; reverting the #100963 hunk recovers -13.96%. This PR restores the fast path while keeping the exact total-row bound. |
| f | Fix is general, not a narrow patch? | Corrects at the source (the reservation), not a symptom guard. The estimate is now bounded by both the true total input rows and max_merged_block_size, so it neither under-reserves (bare chunks[0] for many small chunks) nor over-reserves (a chunks[0] * chunks.size() product for uneven chunks, the case clickhouse-gh flagged). #100963 touched only this file; grepped src/Processors for the same "sum of chunk rows for reserve" pattern in sibling sort/merge paths - no other site. |
Session id: cron:clickhouse-worker-slot-6:20260623-121000
|
CI fully finished on HEAD 43ae659 (Finish Workflow SUCCESS). One non-success test check, not PR-caused:
Mergeable Check / PR reds are aggregator artifacts of the above. No PR-caused failures on this head. This PR only changes a capacity-hint |
…f regression) PR ClickHouse#100963 replaced the O(1) reservation min(chunks[0].getNumRows(), max_merged_block_size) with a sum over all chunks (capped at max_merged_block_size) computed on every output block, in front of the hot k-way merge loop. The output block is already capped at max_merged_block_size, and in the regressing jit_sort/12 case the input chunks are each about max_merged_block_size rows, so the loop produced the same reservation while adding per-block work. egor-click measured +18.73% on Neoverse-V2 (issue ClickHouse#106599). This restores the original O(1) form. chunks[0] is a single input chunk, so the value never exceeds the rows the merge can emit (no over-reservation), and there is no walk over chunks. reserve() is a capacity hint only, so the merge-sort output is byte-identical regardless of the reserved size; for uneven chunks the columns grow as needed. egor-click validated this form recovers the regression on Neoverse-V2 (about -13.96% vs affected), whereas reserving the saturated total-row sum did not. Closes: ClickHouse#106599
43ae659 to
fe4258e
Compare
Pre-PR validation gate (updated for fe4258e - bare O(1)
|
| # | Question | Answer |
|---|---|---|
| a | Deterministic repro? | Yes, on the relevant hardware. egor-click reproduced the +18.73% jit_sort/12 regression on Neoverse-V2, 9/9 runs, and git-bisected the controlled window (bd123fe -> d8dd2c3) to the single functional change #100963. x86 host cannot measure ARM perf; the measurement of record is the CI Performance Comparison + the reporter's Neoverse-V2 pipeline. |
| b | Root cause explained? | #100963 replaced min(chunks[0].getNumRows(), max_merged_block_size) (O(1)) with a sum over all input chunks computed on every output block, in front of the hot k-way merge loop. The output is already capped, so the loop adds per-block work for no benefit. Matches the issue's profile (MergeSortingTransform +10.88%). |
| c | Fix matches root cause? | Restores the exact pre-#100963 O(1) reservation. Removes both the per-block walk and the cap-saturated reserve size. egor-click's data shows both matter: the win comes from the smaller chunks[0] reserve, not only from dropping the loop. |
| d | Test intent preserved / new tests added? | No functional test change: reserve() is a capacity hint, so output is byte-identical and existing correctness tests are unaffected. The perf regression test (tests/performance/jit_sort.xml query 12) is the regression check and runs in CI Performance Comparison. |
| e | Both directions demonstrated? | egor-click validated direction on Neoverse-V2: affected binary regressed; this bare restore recovers it (affected-vs-patched about -13.96%). A total_input_rows saturate-to-cap variant did NOT recover (+0.59%), confirming the smaller reserve is the cause. Locally: compile clean; merge-sort output byte-identical to an independent reference sort. |
| f | Fix is general across code paths? | #100963 touched only MergeSorter::mergeBatchImpl in SortingTransform.cpp. Grepped Processors/Interpreters for the same sum-reserve pattern; no sibling path has it. No competing open PR. |
| g | Fix generalizes across inputs (params/datatypes/wrappers)? | N/A (perf change to a capacity-hint reserve, not a correctness code bug). The reserved size never affects results for any chunk shape (uniform, uneven, empty); confirmed it cannot over-reserve since chunks[0] <= rows the merge emits. |
| h | Backward compatible? (maintainer-approved exception only) | Yes / N/A. No setting, on-disk/wire format, or behavior change; pure restoration of a prior reservation heuristic, results identical. |
| i | Invariants and contracts preserved? | Yes. reserve() is a capacity hint with no contract beyond the hint; the merge loop governs every emitted row. chunks[0] is valid here because the reserve runs only under queue.isValid() (at least one non-empty chunk). No data-structure or concurrency invariant touched. |
Session id: cron:clickhouse-worker-slot-2:20260623-180600
|
@4ertus2 Heads up on a pivot: I reverted to the bare egor-click validated on ARM Neoverse-V2 that the larger reserve is what costs the time, independent of the per-block loop: the bare Since the bare form reserves the least, it cannot over-allocate (the clickhouse-gh concern was the opposite direction), and |
|
Are you sure that there's real degradation caused by big reserve() on arm64 and it's not a perf test fluctuation? Could you check it with some confidence level? The bare version reserve some. How much we could reserve without perf degradation? We probably could set this value as a threshold. |
|
On confidence (real vs fluctuation): egor-click ran a controlled ARM Neoverse-V2 bisect and the signal sits well above the noise floor.
So the effect (~14-19%) is over an order of magnitude larger than run-to-run variance (<1%), consistent in direction, and a clean A/B (positive on introduce, matching-magnitude negative on remove). Our CI Performance Comparison (arm_release) is a second, independent measurement of the same query; the run distribution is also visible at performance.ci.clickhouse.com/history/ for On a reserve threshold: the data shows the win is reserving the smaller value, not just dropping the per-block sum loop. Session id: cron:clickhouse-worker-slot-1:20260623-214000 |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 4/4 (100.00%) | Lost baseline coverage: none · Uncovered code |
I think it's an interesting issue itself to find the constant for ARM that does not affect perf in such reserve() cases. Could you try several variants with powers of 2 values as min_reserve_size, converting it into min_reserve_values by division by sizeof(type)? I.e. 8KB, 128KB, 1MB values. |
|
@4ertus2 The lower-bound framing is sound: Two things on running it:
egor-click's two validated points already bracket this: the bare I can prepare a variant with that lower bound for egor-click or the CI Performance Comparison to sweep the three values. I would keep the current validated bare form as the shipping fix until a candidate is arm-validated, and leave it to the assignee whether to fold a tuned constant into this PR or track it as a follow-up. |
CI finish ledger - fe4258eEvery failure below has an owner. None are PR-caused: this PR is a one-hunk Session id: cron:our-pr-ci-monitor:20260624-163000 |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a performance regression in
ORDER BY(merge sort) introduced in #100963. Queries with a sort followed byLIMITwere slightly slower because of redundant per-output-block work in the result-block reservation.Description
Reported by egor-click in #106599:
jit_sortperf test query12regressed on ARM/aarch64 (Neoverse-V2) by about +18%client_time(SELECT WatchID FROM hits_100m_single ORDER BY WatchID ASC, CounterID DESC, ClientIP ASC LIMIT 2000000).Public history: https://performance.ci.clickhouse.com/history/jit_sort/12?metric=client_time&arch=arm
Root cause. Bisecting the issue's controlled window (baseline
bd123fed28c5-> affectedd8dd2c3a1a82) leaves a single functional source change: #100963, inMergeSorter::mergeBatchImpl. It replaced the reservation sizewith a sum over all input chunks, capped at
max_merged_block_size:This both walks the whole
chunksvector on every output block (in front of the hot k-way merge loop) and reserves a larger, cap-saturated size. Matches the issue's profile (MergeSortingTransform+10.88%).Fix. Restore the pre-#100963 O(1) reservation
min(chunks[0].getNumRows(), max_merged_block_size). This is the prototype egor-click validated on ARM Neoverse-V2 (affected-vs-patched about -13.96%, 9/9 runs). egor-click's validation also showed the win needs the smaller reserve, not only dropping the per-block loop: a variant that stays O(1) but reserves the full input-row sum (saturating the cap) did not recover the regression (+0.59%).reserve()is a capacity hint only, so it never changes results or the final allocation size; the barechunks[0]estimate reserves the least and cannot over-allocate. Locally verified the merge-sort output for a mixed-direction multi-columnORDER BYis byte-identical to an independent reference sort. ARM perf is validated by the reporter's Neoverse-V2 pipeline and by the CI Performance Comparison.