Restore O(1) reserve in MergeSorter::mergeBatchImpl (fix ORDER BY perf regression) by groeneai · Pull Request #108179 · ClickHouse/ClickHouse · GitHub
Skip to content

Restore O(1) reserve in MergeSorter::mergeBatchImpl (fix ORDER BY perf regression)#108179

Open
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-106599-sortingtransform-reserve-perf
Open

Restore O(1) reserve in MergeSorter::mergeBatchImpl (fix ORDER BY perf regression)#108179
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-106599-sortingtransform-reserve-perf

Conversation

@groeneai

@groeneai groeneai commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a performance regression in ORDER BY (merge sort) introduced in #100963. Queries with a sort followed by LIMIT were slightly slower because of redundant per-output-block work in the result-block reservation.

Description

Reported by egor-click in #106599: jit_sort perf test query 12 regressed on ARM/aarch64 (Neoverse-V2) by about +18% client_time (SELECT WatchID FROM hits_100m_single ORDER BY WatchID ASC, CounterID DESC, ClientIP ASC LIMIT 2000000).

Public history: https://performance.ci.clickhouse.com/history/jit_sort/12?metric=client_time&arch=arm

Root cause. Bisecting the issue's controlled window (baseline bd123fed28c5 -> affected d8dd2c3a1a82) leaves a single functional source change: #100963, in MergeSorter::mergeBatchImpl. It replaced the reservation size

min(chunks[0].getNumRows(), max_merged_block_size)

with a sum over all input chunks, capped at max_merged_block_size:

size_to_reserve = 0;
for (auto & chunk : chunks) size_to_reserve += chunk.getNumRows();
size_to_reserve = min(size_to_reserve, max_merged_block_size);

This both walks the whole chunks vector on every output block (in front of the hot k-way merge loop) and reserves a larger, cap-saturated size. Matches the issue's profile (MergeSortingTransform +10.88%).

Fix. Restore the pre-#100963 O(1) reservation min(chunks[0].getNumRows(), max_merged_block_size). This is the prototype egor-click validated on ARM Neoverse-V2 (affected-vs-patched about -13.96%, 9/9 runs). egor-click's validation also showed the win needs the smaller reserve, not only dropping the per-block loop: a variant that stays O(1) but reserves the full input-row sum (saturating the cap) did not recover the regression (+0.59%).

reserve() is a capacity hint only, so it never changes results or the final allocation size; the bare chunks[0] estimate reserves the least and cannot over-allocate. Locally verified the merge-sort output for a mixed-direction multi-column ORDER BY is byte-identical to an independent reference sort. ARM perf is validated by the reporter's Neoverse-V2 pipeline and by the CI Performance Comparison.

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @KochetovNicolai @vdimir @4ertus2 — could you review this? It restores the O(1) result-block reservation in MergeSorter::mergeBatchImpl that #100963 turned into a per-output-block sum over all input chunks; since the block is already capped at max_merged_block_size the sum reserves the same amount but adds overhead in front of the merge loop, costing about +18% on the jit_sort ARM perf test (#106599). @4ertus2 in case the sum was intended to address an under-reservation case I missed.

@clickhouse-gh

clickhouse-gh Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [fe4258e]

Summary:

job_name test_name status info comment
Stress test (amd_msan) FAIL
Hung check failed, possible deadlock found FAIL cidb, issue
Stress test (arm_msan) FAIL
Cannot start clickhouse-server FAIL cidb
Logical error: 'Unexpected exception in refresh scheduling' (STID: 2508-3e7b) FAIL cidb
Check failed FAIL cidb
Stress test (arm_tsan) ERROR

AI Review

Summary

This PR restores the MergeSorter::mergeBatchImpl result-column reservation to the O(1) chunks[0]-based estimate, avoiding the per-output-block scan over all chunks added by #100963. I did not find correctness, safety, or PR metadata issues in the current hunk, and the performance claim is backed by the linked ARM validation and existing jit_sort performance coverage.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Jun 22, 2026
@vdimir vdimir added the can be tested Allows running workflows for external contributors label Jun 22, 2026
@vdimir vdimir self-assigned this Jun 22, 2026
@groeneai

Copy link
Copy Markdown
Contributor Author

CI fully finished on HEAD e7c1bde. The remaining red checks are pre-existing chronic trunk failures, not caused by this PR:

  • Stress test (amd_asan_ubsan) "Not-ready Set is passed as the second argument ... (STID: 0250-41a5)": chronic trunk, 30 distinct PRs + 15 master hits in 30d.
  • Integration tests (amd_asan_ubsan, db disk, old analyzer, 3/6) Timeout: chronic flaky shard, 15 distinct PRs in 14d.

Neither is in this PR's code path (the change guards the predicate-statistics per-granule counters in MergeTreeRangeReader). All Performance Comparison shards passed; Finish Workflow is green. The Mergeable Check/PR reds are just the rollup of the two chronic failures above.

@clickhouse-gh

clickhouse-gh Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

🟢 AI verdict: improvement6 query(s) improved out of 38 analysed

This PR trims the merge-sort output reservation in SortingTransform, making it O(1) per output block and avoiding over-reserving columns for high-cardinality sorts. The improvements on the GROUP BY ... ORDER BY ... LIMIT queries (Q15, Q18, Q28, Q32, Q33, Q34) line up cleanly with that change, and Q15's 21% gain on the high-cardinality UserID sort is the strongest signal. Q4 and Q5 are pure COUNT(DISTINCT) queries with no sorting, so their flagged improvements can't come from this diff and were downgraded to not-sure as run-to-run variance.

clickbench

🟢 6 improved · ⚠️ 2 inconclusive

Flagged queries (8 of 43)
Query Verdict Baseline median (ms) PR median (ms) Change q-value Hint
🟢 15 improvement 251 199 -20.7% <0.0001 memory: GROUP BY UserID then sort top-10: high-cardinality merge sort, exactly the path the reserve tweak speeds up.
🟢 18 improvement 1342 1268 -5.5% <0.0001 memory: Large GROUP BY + ORDER BY LIMIT; merge-sort reservation fix plausibly explains the gain.
🟢 28 improvement 7046 6532 -7.3% <0.0001 memory: Heavy GROUP BY/sort (6.5s); modest delta consistent with the merge-sort reservation fix.
🟢 32 improvement 1351 1250 -7.5% <0.0001 memory: High-card WatchID,ClientIP sort; measurements noisy but both tests agree on a real gain.
🟢 33 improvement 1478 1395 -5.6% <0.0001 memory: GROUP BY URL then sort top-10; on the merge-sort path this PR optimizes. Low-noise, consistent.
🟢 34 improvement 1482 1391 -6.1% <0.0001 memory: GROUP BY URL variant; same merge-sort path, consistent small gain.
⚠️ 4 not_sure 264 215 -18.6% <0.0001 memory: COUNT(DISTINCT UserID) has no ORDER BY; the merge-sort reserve change can't touch it. Off-path = variance.
⚠️ 5 not_sure 382 339 -11.3% <0.0001 memory: COUNT(DISTINCT SearchPhrase): no sort step, so the merge-sort reservation fix can't drive an 11% gain. Off-path.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: ea2630c4-1afb-4859-a63a-43aacb27ae32
  • MIRAI run: 00d48b21-5a58-4b7e-8c0c-c01b3d9c5f23
  • PR check IDs:
    • clickbench_22071_1782249114
    • clickbench_22076_1782249114
    • clickbench_22079_1782249114
    • tpch_adapted_1_official_22089_1782249114
    • tpch_adapted_1_official_22099_1782249114
    • tpch_adapted_1_official_22103_1782249114

/// The size of output block will not be larger than the `max_merged_block_size`.
/// If redundant memory space is reserved, `MemoryTracker` will count more memory usage than actual usage.
size_to_reserve = std::min(size_to_reserve, max_merged_block_size);
size_t size_to_reserve = std::min(static_cast<size_t>(chunks[0].getNumRows()), max_merged_block_size);

@4ertus2 4ertus2 Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible to enlarge default reserve size here to some expected value with O(1)

size_t size_to_reserve = std::min(static_cast<size_t>(chunks[0].getNumRows()) * chunks.size(), max_merged_block_size);

or

size_t size_to_reserve = std::min(static_cast<size_t>(chunks[0].getNumRows()) * chunks.size() / 2, max_merged_block_size);

@groeneai groeneai force-pushed the groeneai/fix-106599-sortingtransform-reserve-perf branch from e7c1bde to f385d86 Compare June 23, 2026 11:35
@groeneai

Copy link
Copy Markdown
Contributor Author

@4ertus2 Adopted the first form, pushed as f385d86: std::min(static_cast<size_t>(chunks[0].getNumRows()) * chunks.size(), max_merged_block_size). It is O(1) (one multiply, no per-block loop), the min cap bounds the product so there is no over-reservation, and for the many-chunks case it sizes nearer the merged output than the bare chunks[0] estimate. Skipped the /2 variant: in the common uniform-chunk case chunks[0].getNumRows() is already around max_merged_block_size, so both the full and halved products saturate the cap identically, while /2 would under-reserve by 2x whenever the product lands below the cap.

Comment thread src/Processors/Transforms/SortingTransform.cpp Outdated
@groeneai groeneai force-pushed the groeneai/fix-106599-sortingtransform-reserve-perf branch from f385d86 to 43ae659 Compare June 23, 2026 12:23
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate (updated for 43ae659 - O(1) via precomputed total_input_rows)

# Question Answer
a Deterministic repro? Regressing change isolated deterministically by bisection: the issue's controlled window bd123fed28c5..d8dd2c3a1a82 has exactly one functional source change, #100963 in MergeSorter::mergeBatchImpl. ARM timing repro is the jit_sort query 12 loop; reporter measures it 9/9 on Neoverse-V2. Host here is x86, so absolute ARM numbers are validated by CI / the reporter's pipeline.
b Root cause explained? #100963 changed the per-output-block reserve from min(chunks[0].rows, cap) to min(sum_of_all_chunk_rows, cap). The output block is already capped at max_merged_block_size, and for a large sort the input chunks are each ~max_merged_block_size rows, so both reserve the same amount. The new code adds a full pass over chunks on every output block in front of the hot k-way merge loop, for no benefit. Profile shows MergeSortingTransform +10.88%.
c Fix matches root cause? Yes. chunks is not cleared until the merge finishes, so the row sum is constant across output blocks. It is summed once in the constructor (which already iterates the chunks) into total_input_rows, and the per-block reserve is min(total_input_rows, max_merged_block_size) - O(1), no per-block loop. Same reservation value as the #100963 sum form, computed once.
d Test intent preserved / new tests added? No test modified. Covering perf test is the existing tests/performance/jit_sort.xml query 12. No new functional test needed: results are unchanged because reserve() is a capacity hint.
e Both directions demonstrated? Reporter validated both directions on ARM: affected build ~+18% vs baseline; reverting the #100963 hunk recovers -13.96%. This PR restores the fast path while keeping the exact total-row bound.
f Fix is general, not a narrow patch? Corrects at the source (the reservation), not a symptom guard. The estimate is now bounded by both the true total input rows and max_merged_block_size, so it neither under-reserves (bare chunks[0] for many small chunks) nor over-reserves (a chunks[0] * chunks.size() product for uneven chunks, the case clickhouse-gh flagged). #100963 touched only this file; grepped src/Processors for the same "sum of chunk rows for reserve" pattern in sibling sort/merge paths - no other site.

Session id: cron:clickhouse-worker-slot-6:20260623-121000

@groeneai

Copy link
Copy Markdown
Contributor Author

CI fully finished on HEAD 43ae659 (Finish Workflow SUCCESS). One non-success test check, not PR-caused:

  • Stress test (arm_tsan) - future_error "promise has been destructed" (STID 2508-38c6) - chronic trunk TaskTracker teardown race on shutdown, unrelated to this MergeSorter reserve change (7 unrelated PRs / 8 hits / 0 master in 30d). Tracked separately.

Mergeable Check / PR reds are aggregator artifacts of the above. No PR-caused failures on this head. This PR only changes a capacity-hint reserve() (results byte-identical); the arm/2 Performance Comparison is the measurement of record. Ready for review.

…f regression)

PR ClickHouse#100963 replaced the O(1) reservation min(chunks[0].getNumRows(),
max_merged_block_size) with a sum over all chunks (capped at
max_merged_block_size) computed on every output block, in front of the hot
k-way merge loop. The output block is already capped at max_merged_block_size,
and in the regressing jit_sort/12 case the input chunks are each about
max_merged_block_size rows, so the loop produced the same reservation while
adding per-block work. egor-click measured +18.73% on Neoverse-V2 (issue
ClickHouse#106599).

This restores the original O(1) form. chunks[0] is a single input chunk, so
the value never exceeds the rows the merge can emit (no over-reservation), and
there is no walk over chunks. reserve() is a capacity hint only, so the
merge-sort output is byte-identical regardless of the reserved size; for uneven
chunks the columns grow as needed. egor-click validated this form recovers the
regression on Neoverse-V2 (about -13.96% vs affected), whereas reserving the
saturated total-row sum did not.

Closes: ClickHouse#106599
@groeneai groeneai force-pushed the groeneai/fix-106599-sortingtransform-reserve-perf branch from 43ae659 to fe4258e Compare June 23, 2026 18:15
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate (updated for fe4258e - bare O(1) chunks[0] restore)

# Question Answer
a Deterministic repro? Yes, on the relevant hardware. egor-click reproduced the +18.73% jit_sort/12 regression on Neoverse-V2, 9/9 runs, and git-bisected the controlled window (bd123fe -> d8dd2c3) to the single functional change #100963. x86 host cannot measure ARM perf; the measurement of record is the CI Performance Comparison + the reporter's Neoverse-V2 pipeline.
b Root cause explained? #100963 replaced min(chunks[0].getNumRows(), max_merged_block_size) (O(1)) with a sum over all input chunks computed on every output block, in front of the hot k-way merge loop. The output is already capped, so the loop adds per-block work for no benefit. Matches the issue's profile (MergeSortingTransform +10.88%).
c Fix matches root cause? Restores the exact pre-#100963 O(1) reservation. Removes both the per-block walk and the cap-saturated reserve size. egor-click's data shows both matter: the win comes from the smaller chunks[0] reserve, not only from dropping the loop.
d Test intent preserved / new tests added? No functional test change: reserve() is a capacity hint, so output is byte-identical and existing correctness tests are unaffected. The perf regression test (tests/performance/jit_sort.xml query 12) is the regression check and runs in CI Performance Comparison.
e Both directions demonstrated? egor-click validated direction on Neoverse-V2: affected binary regressed; this bare restore recovers it (affected-vs-patched about -13.96%). A total_input_rows saturate-to-cap variant did NOT recover (+0.59%), confirming the smaller reserve is the cause. Locally: compile clean; merge-sort output byte-identical to an independent reference sort.
f Fix is general across code paths? #100963 touched only MergeSorter::mergeBatchImpl in SortingTransform.cpp. Grepped Processors/Interpreters for the same sum-reserve pattern; no sibling path has it. No competing open PR.
g Fix generalizes across inputs (params/datatypes/wrappers)? N/A (perf change to a capacity-hint reserve, not a correctness code bug). The reserved size never affects results for any chunk shape (uniform, uneven, empty); confirmed it cannot over-reserve since chunks[0] <= rows the merge emits.
h Backward compatible? (maintainer-approved exception only) Yes / N/A. No setting, on-disk/wire format, or behavior change; pure restoration of a prior reservation heuristic, results identical.
i Invariants and contracts preserved? Yes. reserve() is a capacity hint with no contract beyond the hint; the merge loop governs every emitted row. chunks[0] is valid here because the reserve runs only under queue.isValid() (at least one non-empty chunk). No data-structure or concurrency invariant touched.

Session id: cron:clickhouse-worker-slot-2:20260623-180600

@groeneai

Copy link
Copy Markdown
Contributor Author

@4ertus2 Heads up on a pivot: I reverted to the bare min(chunks[0].getNumRows(), max_merged_block_size) rather than enlarging the reserve.

egor-click validated on ARM Neoverse-V2 that the larger reserve is what costs the time, independent of the per-block loop: the bare chunks[0] form recovers -13.96%, while an O(1) variant that keeps the full input-row bound (min(total_input_rows, max_merged_block_size), saturating the cap in the uniform-chunk case) did not recover the regression (+0.59%). So both the chunks[0] * chunks.size() product and the total-row sum would reproduce the slowdown on the benchmark.

Since the bare form reserves the least, it cannot over-allocate (the clickhouse-gh concern was the opposite direction), and reserve() is only a hint so the merge still grows as needed for uneven chunks. Does dropping the larger reserve look acceptable to you, or did #100963 target a workload where the bigger up-front reservation matters?

@4ertus2

4ertus2 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Are you sure that there's real degradation caused by big reserve() on arm64 and it's not a perf test fluctuation? Could you check it with some confidence level?

The bare version reserve some. How much we could reserve without perf degradation? We probably could set this value as a threshold.

@groeneai

Copy link
Copy Markdown
Contributor Author

@4ertus2

On confidence (real vs fluctuation): egor-click ran a controlled ARM Neoverse-V2 bisect and the signal sits well above the noise floor.

  • Introduction: baseline bd123fed28c5 -> affected d8dd2c3a1a82 (the perf: fix reserve in SortingTransform MergeSorter #100963 window), jit_sort/12 median 0.639s -> 0.759s, +18.73%.
  • Removal (bare chunks[0] restore): affected -> patched, -13.96%.
  • Same-binary self-checks stayed under 1% (-0.89%, -0.88%, -0.03%).

So the effect (~14-19%) is over an order of magnitude larger than run-to-run variance (<1%), consistent in direction, and a clean A/B (positive on introduce, matching-magnitude negative on remove). Our CI Performance Comparison (arm_release) is a second, independent measurement of the same query; the run distribution is also visible at performance.ci.clickhouse.com/history/ for jit_sort.

On a reserve threshold: the data shows the win is reserving the smaller value, not just dropping the per-block sum loop. min(chunks[0].getNumRows(), cap) and min(total_input_rows, cap) measured differently (-13.96% vs +0.59%), which means chunks[0].getNumRows() is below max_merged_block_size for this query - otherwise both would saturate to the cap and be identical. Any reserve that saturates to the cap for this chunk shape therefore reintroduces the cost; the over-reservation itself is what is expensive on ARM. A fixed middle threshold T is possible in principle, but we only have two validated points (chunks[0] good, cap bad), so a safe T is unknown without its own ARM validation pass. chunks[0].getNumRows() is already an O(1) self-sizing estimate that tracks the data shape and is the validated form, so I would keep it rather than add a tuning constant. If a fixed threshold is worth exploring, egor-click's pipeline could sweep a few intermediate sizes.

Session id: cron:clickhouse-worker-slot-1:20260623-214000

@clickhouse-gh

clickhouse-gh Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.30% -0.10%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.50% -0.10%

Changed lines: Changed C/C++ lines covered by tests: 4/4 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@4ertus2

4ertus2 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

A fixed middle threshold T is possible in principle, but we only have two validated points (chunks[0] good, cap bad), so a safe T is unknown without its own ARM validation pass. chunks[0].getNumRows() is already an O(1) self-sizing estimate that tracks the data shape and is the validated form, so I would keep it rather than add a tuning constant. If a fixed threshold is worth exploring, egor-click's pipeline could sweep a few intermediate sizes.

I think it's an interesting issue itself to find the constant for ARM that does not affect perf in such reserve() cases. Could you try several variants with powers of 2 values as min_reserve_size, converting it into min_reserve_values by division by sizeof(type)? I.e. 8KB, 128KB, 1MB values.

@groeneai

Copy link
Copy Markdown
Contributor Author

@4ertus2 The lower-bound framing is sound: reserve = min(max(chunks[0].getNumRows(), min_reserve_values), max_merged_block_size) keeps the small-chunk enlarge you wanted, with the sweep bounding it below the arm degradation.

Two things on running it:

  1. I cannot measure arm64 perf on this x86 host. egor-click's Neoverse-V2 pipeline (the measurement of record for ARM performance regression: jit_sort query 12 is slower #106599) and our CI Performance Comparison are the only arm sources, so the sweep has to go through one of those.

  2. Concrete conversion for jit_sort (columns are UInt64, sizeof 8; default max_merged_block_size = 65409 rows, about 511 KB per column). With min_reserve_values = min_reserve_size / sizeof(type) per column:

    • 8 KB -> 1024 rows
    • 128 KB -> 16384 rows
    • 1 MB -> 131072 rows, clamped to 65409

egor-click's two validated points already bracket this: the bare chunks[0] form recovers the regression and the cap-saturating form does not, so for this query chunks[0].getNumRows() < the cap and reserving up toward the cap is the cost. So the 1 MB candidate clamps to the cap and should reproduce the slowdown; 8 KB and 128 KB are the candidates that may stay safe.

I can prepare a variant with that lower bound for egor-click or the CI Performance Comparison to sweep the three values. I would keep the current validated bare form as the shipping fix until a candidate is arm-validated, and leave it to the assignee whether to fold a tuned constant into this PR or track it as a follow-up.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger - fe4258e

Every failure below has an owner. None are PR-caused: this PR is a one-hunk MergeSorter::mergeBatchImpl reserve() change (a capacity-hint no-op on results), which cannot cause a stress-test deadlock or a RefreshTask shutdown race.

Check / test Reason Owner
Stress test (arm_msan) / Logical error 'Unexpected exception in refresh scheduling' (STID 2508-3e7b) RefreshTask::doScheduling shutdown race #105588 (ours, open)
Stress test (arm_msan) / Cannot start clickhouse-server fallout of the refresh-scheduling abort above #105588 (ours, open)
Stress test (arm_msan) / Check failed fallout of the above #105588 (ours, open)
Stress test (amd_msan) / Hung check failed, possible deadlock found chronic trunk GlobalThreadPool-shutdown deadlock (hundreds of master+PR hits/30d) active full-effort investigation; diagnostic PR #105905, fixing-PR link to follow here

Session id: cron:our-pr-ci-monitor:20260624-163000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants