perf: fix reserve in SortingTransform MergeSorter#100963
Conversation
| if (queue.isValid()) | ||
| { | ||
| size_t size_to_reserve = 0; | ||
| for (auto & chunk : chunks) |
There was a problem hiding this comment.
This adds an O(chunks.size()) scan on every mergeBatchImpl call. MergeSorter::read may call this many times (one output chunk per call), so the new scan can become O(num_output_chunks * num_input_chunks) overhead in a hot path.
Could we avoid re-scanning all chunks here? A simple approach is to saturate while summing (break once size_to_reserve >= max_merged_block_size) or cache the total row count once in the sorter state.
There was a problem hiding this comment.
This is still live on current 10deef16: mergeBatchImpl recomputes size_to_reserve by scanning all chunks on every output block. Because chunks is not reduced by queue.next, this repeats identical work and adds O(num_chunks * num_output_blocks) overhead in a hot path.
Please either precompute the capped reserve size once in MergeSorter state, or make the sum saturating and stop as soon as size_to_reserve >= max_merged_block_size.
| if (queue.isValid()) | ||
| { | ||
| size_t size_to_reserve = 0; | ||
| for (auto & chunk : chunks) |
There was a problem hiding this comment.
Changelog category is performance improvement, which requires a non-empty Changelog entry in the PR template. Please add a user-facing one-liner describing the concrete performance impact.
There was a problem hiding this comment.
This is still unresolved in the current PR body: Changelog category is performance improvement, but Changelog entry is missing.
For this category, please add a concrete user-facing one-liner describing what workload gets faster and in what scenario.
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
|
The flaky check failure is fixed in #102148, let's update the branch. |
LLVM Coverage ReportChanged lines: 100.00% (7/7) · Uncovered code |
…rf regression) PR ClickHouse#100963 changed the per-output-block reserve in MergeSorter::mergeBatchImpl from a constant to a sum over all input chunks (capped at max_merged_block_size). The output block is already bounded by max_merged_block_size, so for the common case (many input chunks, each around max_merged_block_size rows) the sum saturates the cap and reserves the same amount, but the summation walks the entire chunks vector on every output block in front of the hot k-way merge loop. This shows up as a measurable slowdown on the jit_sort perf test (ARM Neoverse-V2, query 12: ORDER BY with LIMIT, around +18% client_time). Compute the estimate in O(1) instead. The k-way merge draws from all chunks, so size the reserve as chunks[0].getNumRows() * chunks.size(), capped at max_merged_block_size. This is one multiply with no per-block loop, reserves closer to the actual merged output for the many-chunks case, and is bounded by max_merged_block_size exactly as the sum form was. For the regressing case both saturate the cap, recovering the time. reserve() is a capacity hint only, so this does not affect results. Reported and validated on ARM by egor-click in ClickHouse#106599. O(1) estimate form suggested by 4ertus2 in review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rf regression) PR ClickHouse#100963 changed the per-output-block reserve in MergeSorter::mergeBatchImpl from a constant to a sum over all input chunks (capped at max_merged_block_size). The output block is already bounded by max_merged_block_size, so for the common case (many input chunks, each around max_merged_block_size rows) the sum saturates the cap and reserves the same amount, but the summation walks the entire chunks vector on every output block in front of the hot k-way merge loop. This shows up as a measurable slowdown on the jit_sort perf test (ARM Neoverse-V2, query 12: ORDER BY with LIMIT, around +18% client_time). Compute the same estimate in O(1). chunks is not cleared until the whole merge finishes, so the sum of input rows is constant across every output block; sum it once in the constructor (which already iterates the chunks) into total_input_rows, and reserve min(total_input_rows, max_merged_block_size) per block. This is the exact value the sum form reserved, computed once instead of per block, so it neither under-reserves (as a bare chunks[0] estimate would for many small chunks) nor over-reserves for uneven chunks (as a chunks[0] * chunks.size() product would). For the regressing case it saturates the cap, recovering the time. reserve() is a capacity hint only, so this does not affect results. Reported and validated on ARM by egor-click in ClickHouse#106599. O(1) estimate suggested by 4ertus2 in review; bounding it by total input rows (instead of the product) suggested by clickhouse-gh in review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…f regression) PR ClickHouse#100963 replaced the O(1) reservation min(chunks[0].getNumRows(), max_merged_block_size) with a sum over all chunks (capped at max_merged_block_size) computed on every output block, in front of the hot k-way merge loop. The output block is already capped at max_merged_block_size, and in the regressing jit_sort/12 case the input chunks are each about max_merged_block_size rows, so the loop produced the same reservation while adding per-block work. egor-click measured +18.73% on Neoverse-V2 (issue ClickHouse#106599). This restores the original O(1) form. chunks[0] is a single input chunk, so the value never exceeds the rows the merge can emit (no over-reservation), and there is no walk over chunks. reserve() is a capacity hint only, so the merge-sort output is byte-identical regardless of the reserved size; for uneven chunks the columns grow as needed. egor-click validated this form recovers the regression on Neoverse-V2 (about -13.96% vs affected), whereas reserving the saturated total-row sum did not. Closes: ClickHouse#106599

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
partial_mergeJOIN.Version info
26.5.1.636