perf: fix reserve in SortingTransform MergeSorter by 4ertus2 · Pull Request #100963 · ClickHouse/ClickHouse · GitHub
Skip to content

perf: fix reserve in SortingTransform MergeSorter#100963

Merged
vdimir merged 7 commits into
ClickHouse:masterfrom
4ertus2:fix-reserve
May 13, 2026
Merged

perf: fix reserve in SortingTransform MergeSorter#100963
vdimir merged 7 commits into
ClickHouse:masterfrom
4ertus2:fix-reserve

Conversation

@4ertus2

@4ertus2 4ertus2 commented Mar 27, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

  • Slightly reduced memory over-allocation in partial_merge JOIN.

Version info

  • Merged into: 26.5.1.636

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Mar 31, 2026
@clickhouse-gh

clickhouse-gh Bot commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Mar 31, 2026
@nikitamikhaylov nikitamikhaylov self-assigned this Mar 31, 2026
if (queue.isValid())
{
size_t size_to_reserve = 0;
for (auto & chunk : chunks)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds an O(chunks.size()) scan on every mergeBatchImpl call. MergeSorter::read may call this many times (one output chunk per call), so the new scan can become O(num_output_chunks * num_input_chunks) overhead in a hot path.

Could we avoid re-scanning all chunks here? A simple approach is to saturate while summing (break once size_to_reserve >= max_merged_block_size) or cache the total row count once in the sorter state.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still live on current 10deef16: mergeBatchImpl recomputes size_to_reserve by scanning all chunks on every output block. Because chunks is not reduced by queue.next, this repeats identical work and adds O(num_chunks * num_output_blocks) overhead in a hot path.

Please either precompute the capped reserve size once in MergeSorter state, or make the sum saturating and stop as soon as size_to_reserve >= max_merged_block_size.

if (queue.isValid())
{
size_t size_to_reserve = 0;
for (auto & chunk : chunks)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog category is performance improvement, which requires a non-empty Changelog entry in the PR template. Please add a user-facing one-liner describing the concrete performance impact.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still unresolved in the current PR body: Changelog category is performance improvement, but Changelog entry is missing.

For this category, please add a concrete user-facing one-liner describing what workload gets faster and in what scenario.

@alexey-milovidov

Copy link
Copy Markdown
Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

@alexey-milovidov

Copy link
Copy Markdown
Member

The flaky check failure is fixed in #102148, let's update the branch.

@vdimir vdimir enabled auto-merge May 13, 2026 17:18
@clickhouse-gh

clickhouse-gh Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 92.00% 92.00% +0.00%
Branches 76.60% 76.50% -0.10%

Changed lines: 100.00% (7/7) · Uncovered code

Full report · Diff report

@vdimir vdimir added this pull request to the merge queue May 13, 2026
Merged via the queue into ClickHouse:master with commit d8dd2c3 May 13, 2026
167 checks passed
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 13, 2026
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 23, 2026
…rf regression)

PR ClickHouse#100963 changed the per-output-block reserve in
MergeSorter::mergeBatchImpl from a constant to a sum over all input
chunks (capped at max_merged_block_size). The output block is already
bounded by max_merged_block_size, so for the common case (many input
chunks, each around max_merged_block_size rows) the sum saturates the
cap and reserves the same amount, but the summation walks the entire
chunks vector on every output block in front of the hot k-way merge
loop. This shows up as a measurable slowdown on the jit_sort perf test
(ARM Neoverse-V2, query 12: ORDER BY with LIMIT, around +18%
client_time).

Compute the estimate in O(1) instead. The k-way merge draws from all
chunks, so size the reserve as chunks[0].getNumRows() * chunks.size(),
capped at max_merged_block_size. This is one multiply with no per-block
loop, reserves closer to the actual merged output for the many-chunks
case, and is bounded by max_merged_block_size exactly as the sum form
was. For the regressing case both saturate the cap, recovering the time.

reserve() is a capacity hint only, so this does not affect results.

Reported and validated on ARM by egor-click in ClickHouse#106599. O(1) estimate
form suggested by 4ertus2 in review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 23, 2026
…rf regression)

PR ClickHouse#100963 changed the per-output-block reserve in
MergeSorter::mergeBatchImpl from a constant to a sum over all input
chunks (capped at max_merged_block_size). The output block is already
bounded by max_merged_block_size, so for the common case (many input
chunks, each around max_merged_block_size rows) the sum saturates the
cap and reserves the same amount, but the summation walks the entire
chunks vector on every output block in front of the hot k-way merge
loop. This shows up as a measurable slowdown on the jit_sort perf test
(ARM Neoverse-V2, query 12: ORDER BY with LIMIT, around +18%
client_time).

Compute the same estimate in O(1). chunks is not cleared until the
whole merge finishes, so the sum of input rows is constant across every
output block; sum it once in the constructor (which already iterates
the chunks) into total_input_rows, and reserve
min(total_input_rows, max_merged_block_size) per block. This is the
exact value the sum form reserved, computed once instead of per block,
so it neither under-reserves (as a bare chunks[0] estimate would for
many small chunks) nor over-reserves for uneven chunks (as a
chunks[0] * chunks.size() product would). For the regressing case it
saturates the cap, recovering the time.

reserve() is a capacity hint only, so this does not affect results.

Reported and validated on ARM by egor-click in ClickHouse#106599. O(1) estimate
suggested by 4ertus2 in review; bounding it by total input rows
(instead of the product) suggested by clickhouse-gh in review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 23, 2026
…f regression)

PR ClickHouse#100963 replaced the O(1) reservation min(chunks[0].getNumRows(),
max_merged_block_size) with a sum over all chunks (capped at
max_merged_block_size) computed on every output block, in front of the hot
k-way merge loop. The output block is already capped at max_merged_block_size,
and in the regressing jit_sort/12 case the input chunks are each about
max_merged_block_size rows, so the loop produced the same reservation while
adding per-block work. egor-click measured +18.73% on Neoverse-V2 (issue
ClickHouse#106599).

This restores the original O(1) form. chunks[0] is a single input chunk, so
the value never exceeds the rows the merge can emit (no over-reservation), and
there is no walk over chunks. reserve() is a capacity hint only, so the
merge-sort output is byte-identical regardless of the reserved size; for uneven
chunks the columns grow as needed. egor-click validated this form recovers the
regression on Neoverse-V2 (about -13.96% vs affected), whereas reserving the
saturated total-row sum did not.

Closes: ClickHouse#106599
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants