Use ipnsort (unstable) and driftsort (stable) for general-purpose sorting by alexey-milovidov · Pull Request #106650 · ClickHouse/ClickHouse · GitHub
Skip to content

Use ipnsort (unstable) and driftsort (stable) for general-purpose sorting#106650

Merged
alexey-milovidov merged 12 commits into
masterfrom
ipnsort-driftsort
Jun 18, 2026
Merged

Use ipnsort (unstable) and driftsort (stable) for general-purpose sorting#106650
alexey-milovidov merged 12 commits into
masterfrom
ipnsort-driftsort

Conversation

@alexey-milovidov

Copy link
Copy Markdown
Member

Port the Rust standard library's current sort implementations to C++ and use them for ClickHouse's general-purpose sorts:

  • ipnsort (unstable, by Lukas Bergdoll & Orson Peters) → ::sort (base/base/sort.h), replacing pdqsort.
  • driftsort (stable, by Orson Peters & Lukas Bergdoll) → a new ::stableSort, replacing std::stable_sort call sites.

Both are vendored as header-only contribs (contrib/ipnsort, contrib/driftsort), dual-licensed MIT / Apache-2.0, the same way as contrib/pdqsort. They operate on raw pointers, so the always-contiguous iterators are converted with std::to_address; the rare non-contiguous random-access iterators keep using pdqsort / std::stable_sort. trySort, nth_element and partial_sort are unchanged.

ipnsort

ipnsort is pattern-defeating quicksort with a glidesort pivot, a sorting-network small-sort, and a heapsort fallback for the O(n log n) worst case. It detects already-sorted and reverse-sorted runs and handles them in O(n), which pdqsort (as used directly by ::sort, without the trySort pre-pass) does not always get for free.

The C++ port runs the full algorithm for trivially-copyable types (relocating bitwise via memcpy, matching the Rust original) and a correct move-based path (insertion sort + Hoare partition + heapsort) for other types.

One change from the upstream partition was made to keep throughput high for ClickHouse's most common pattern — sorting a permutation through an indirect, cache-missing comparator. The upstream in-place branchless Lomuto partition has a loop-carried dependency through the running less-than count, which serializes the comparator's load latency. This port evaluates the comparator over a block of elements in an independent (ILP-friendly) pre-pass and then performs the cheap serial rearrangement from the precomputed predicates, matching block-quicksort throughput on expensive comparators while keeping the single-pass speed on cheap ones.

driftsort

driftsort is a powersort merge tree over detected runs with a stable scratch-based quicksort for unsorted runs and a branchless bidirectional-merge small-sort. It allocates up to n/2 extra elements (a 4 KiB stack buffer for small inputs, heap otherwise). The full algorithm runs for trivially-copyable types; other types fall back to std::stable_sort.

Measurements (10M elements, single thread, aarch64, best-of)

ipnsort vs the previous ::sort engine on the workloads ::sort actually sees:

  • random UInt64: ~3x faster than pdqsort.
  • already-sorted / reverse-sorted: handled in O(n) (e.g. reverse UInt64 ~8 ms vs hundreds of ms for a naive quicksort), removing the worst case.
  • few-unique and permutation-with-indirect-comparator: on par with or faster than pdqsort.

driftsort vs std::stable_sort:

  • random: 2.4x faster; sorted: 81x; reverse: 42x; few-unique: 6x; 90%-presorted: 2.8x; permutation with indirect comparator: 2.2x.

Correctness

The ports were validated against std::sort / std::stable_sort over thousands of randomized cases (sizes 0..2M, many duplicate densities, sorted / reverse / organ-pipe / nearly-sorted inputs, trivially-copyable and non-trivially-copyable element types) under AddressSanitizer with zero mismatches; driftsort additionally verified stable (equal-key order matches std::stable_sort exactly).

Relation to #106618

This supersedes the blqsort approach in #106618. The CI performance comparison there showed blqsort improving string / multi-column sorts but regressing ORDER BY toDecimal128(...) DESC by ~4.9x (a reverse-sorted worst case). ipnsort keeps the string-sort improvements and does not have that regression.

Related: #106618

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Use ipnsort (unstable) and driftsort (stable) — C++ ports of the Rust standard library's sort implementations — for the general-purpose comparison sorts. Speeds up ORDER BY on non-numeric columns and stable sorts, and removes a worst case on reverse-sorted input.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

🤖 Generated with Claude Code

…ting

Port the Rust standard library's current sort implementations to C++ and use
them in `base/base/sort.h`:

- ipnsort (by Lukas Bergdoll & Orson Peters) backs `::sort`, replacing
  `pdqsort`. It is pattern-defeating quicksort with a glidesort pivot, a
  sorting-network small-sort and a heapsort fallback, and detects already-sorted
  and reverse-sorted runs (handling them in O(n)). The port runs the full
  algorithm for trivially-copyable types and a move-based path otherwise. Its
  partition evaluates the comparator over a block in an ILP-friendly pre-pass
  before the serial rearrangement, which keeps throughput high for expensive
  (indirect / cache-missing) comparators - the dominant ClickHouse pattern.

- driftsort (by Orson Peters & Lukas Bergdoll) backs a new `::stableSort`,
  replacing `std::stable_sort` call sites. It is a powersort merge tree over
  detected runs with a stable scratch quicksort, several times faster than
  `std::stable_sort`. It allocates up to n/2 extra memory.

Both are vendored as header-only contribs (dual MIT / Apache-2.0), like
`contrib/pdqsort`. They need raw pointers, so contiguous iterators are converted
with `std::to_address`; non-contiguous iterators keep `pdqsort` /
`std::stable_sort`. `trySort`, `nth_element` and `partial_sort` are unchanged.

Validated against `std::sort` / `std::stable_sort` over thousands of randomized
cases under AddressSanitizer (sizes, duplicate densities, patterns, and
trivially-copyable, non-trivially-copyable and non-default-constructible element
types) with zero mismatches; driftsort verified stable.

Supersedes the blqsort approach in #106618: the CI performance comparison there
showed blqsort regressing `ORDER BY toDecimal128(...) DESC` ~4.9x (a
reverse-sorted worst case); ipnsort keeps the string-sort improvements without
that regression.

Related: #106618

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added pr-performance Pull request with some performance improvements submodule changed At least one submodule changed in this PR. labels Jun 6, 2026
Comment thread contrib/ipnsort/ipnsort.h
Comment thread contrib/driftsort/driftsort.h Outdated
Comment thread base/base/sort.h
@clickhouse-gh

clickhouse-gh Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

🔴 AI verdict: degradation1 regressed, 4 improved out of 38 analysed

This PR swaps the core sorting implementations: ipnsort replaces pdqsort for unstable sort and driftsort replaces std::stable_sort, so any query doing an ORDER BY or merge is directly affected. Four of the five flagged queries improve in line with a faster sort: clickbench Q22 (-25%) and Q23 (-33%), and TPC-H Q7 (-35%) and Q20 (-50%, though that one's measurements were noisy). The one regression, clickbench Q13 (+11%), runs against the trend and compounds an existing degradation already visible on master, so it's worth a closer look.

clickbench

🔴 1 regressed · 🟢 2 improved

Flagged queries (3 of 43)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
🔴 13 regression 497 552 +11.1% <0.0001 cpu: ipnsort now backs ORDER BY count() LIMIT; +11% bucks the faster-sort trend and compounds an existing master degradation
🟢 22 improvement 334 250 -25.1% <0.0001 cpu: Faster ipnsort on the ORDER BY count() LIMIT path; -25% improvement, though the base run was noisy
🟢 23 improvement 297 198 -33.3% <0.0001 cpu: ipnsort speeds the ORDER BY count() LIMIT path; clean -33% on a stable query, both tests agree

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

🟢 2 improved

Flagged queries (2 of 22)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
🟢 7 improvement 113 74 -34.5% <0.0001 cpu: ipnsort/driftsort sort swap speeds the ORDER BY path; clean -35% with matched low variance
🟢 20 improvement 248 124 -50.0% <0.0001 cpu: Sort swap plausibly drives -50%, but underlying measurements were noisy and this query is intrinsically flaky

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

Debug info
  • StressHouse run: 15fe82f4-0be5-40e5-9e34-bd8e3cd2642a
  • MIRAI run: b67e0eb5-a933-427a-b3b5-fbd3eebee3ca
  • PR check IDs:
    • clickbench_77826_1781412051
    • clickbench_77846_1781412051
    • clickbench_77866_1781412052
    • tpch_adapted_1_official_78084_1781412081
    • tpch_adapted_1_official_78104_1781412081
    • tpch_adapted_1_official_78124_1781412082

Comment thread base/base/sort.h
alexey-milovidov and others added 3 commits June 7, 2026 04:46
…_less

`swap_if_less` performs a branchless conditional swap of two elements. In the
no-swap case `left_swap == va`, so `bit_copy(va, left_swap)` copies an element
onto itself, which is overlapping (in fact identical) storage where
`std::memcpy` is undefined behavior. Since `::sort` now routes every contiguous
call through `ipnsort`, this is hit on ordinary inputs.

Use an overlap-safe `std::memmove` for that one copy, matching the original
Rust `ptr::copy` (the non-overlapping `bit_copy`/`memcpy` is kept everywhere it
is provably non-aliasing). For a fixed `sizeof(T)` the compiler lowers the
`memmove` to the same load/store as `memcpy`, so the small-sort network stays
branchless.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The driftsort scratch buffer is reinterpreted as `T *` and accessed as `T &`,
so it must satisfy `alignof(T)`. The stack buffer uses `alignas(T)`, but the
heap path allocated with plain `new std::byte[]`, which only guarantees the
default new alignment. For an over-aligned trivially-copyable `T` (e.g.
`alignas(64)`) on the heap path this produced under-aligned `T` references.

Allocate the heap scratch with aligned `operator new[]` plus a matching
aligned `operator delete[]` deleter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add committed regression coverage for `::sort` (ipnsort) and `::stableSort`
(driftsort) in `base/base/sort.h`, comparing against `std::sort` /
`std::stable_sort` over sorted, reverse, duplicate-heavy, organ-pipe,
all-equal and random inputs across many sizes (small-sort networks,
stack-scratch and heap-scratch paths). It checks:

- equality with `std::sort` for trivially-copyable types and a custom comparator;
- `::stableSort` matches `std::stable_sort` exactly, including equal-key order (stability);
- a non-default-constructible trivially-copyable element type (the `Uninit` scratch path);
- an `alignas(64)` element type whose comparator asserts it never sees an under-aligned reference (the heap-scratch alignment path);
- a non-trivially-copyable element type (the move-based and `std::stable_sort` fallbacks).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread contrib/ipnsort/ipnsort.h
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, fix integration tests,

test_profile_max_sessions_for_user
test_disks_app_interactive

which could wrongly assert the unstable sort order result, and send a PR with the fix.

@groeneai

groeneai commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@alexey-milovidov fix PR #106730.

test_disks_app_interactive: hardened the ls helper in the test to sort listings on the Python side, decoupling assertions from the disks app sort order. Verified via simulation that reverse and shuffled server output now both pass (old helper fails on either).

test_profile_max_sessions_for_user: reviewed each assertion and found nothing order-dependent — the test uses substring expect, assert_logs_contain_with_retry, and discards KILL QUERY output. No list comparisons. If you have a specific assertion in mind I missed, please point at it and I will follow up.

alexey-milovidov and others added 2 commits June 8, 2026 17:53
The move-based Hoare partition (`partition_hoare_branchy_cyclic`, used by
`::sort` for non-trivially-copyable types) saves an in-transit element in a
`Uninit<T>` union while shuffling. If the `is_less` comparator throws while
the gap is engaged, unwinding runs the union's trivial destructor, which never
calls `~T`, leaking the saved element and leaving the range one value short.
This is a regression versus the `pdqsort` path it replaces.

Add a `GapRestorer` RAII guard (mirroring `GapGuard` used by the
trivially-copyable block partition) that, on unwind, moves the saved element
back into the hole and destroys it. Add a unit test that throws from the
comparator at every comparison index in turn and asserts nothing is leaked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 8, 2026
Under `Integration tests (amd_asan_ubsan, flaky)` praktika runs pytest-xdist
with `--dist=each` and many workers concurrently. All workers share the runner
host filesystem, so the test must scope every persistent path by
`PYTEST_XDIST_WORKER`. The previous patch only sorted `ls` output on the
client side; that addresses sort-order assertions but does nothing for
concurrent workers stomping `/var/lib/clickhouse`, the shared `test/`
directory, and the temporary `a.txt` next to `tests/integration/`.

Changes:
- `DisksClient.default_disk_root_directory` is now per-worker
  (`/var/lib/clickhouse-<gwN>`).
- `getLocalDisksClient` wipes the per-worker root on every refresh, terminates
  any orphan `clickhouse disks` subprocess from the previous iteration, and
  writes a per-worker `config_<gwN>.xml` whose `<path>` points to that root.
- `test_disks_app_interactive_list_directories_default` and
  `test_disks_app_interactive_test_move_and_write` now use per-worker
  directory and file names instead of the shared `test`/`a.txt`.
- `test_disks_app_interactive_cp_and_read` uses per-worker file and directory
  names so concurrent workers cannot delete each other's `a.txt` mid-test.

Verified locally with `python3 -m ci.praktika run "Integration tests
(amd_asan_ubsan, flaky)" --test test_disks_app_interactive` at workers=4 and
workers=8: 20/20 and 40/40 pass, vs. the previous 5/20 fail. Also verified
single-worker mode (5/5 pass) and `--count 10 --workers 4` (50/50 pass).

CI report on the previous flaky-check failure:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=106730&sha=110670122e495cdcbc23b0fc7aba7a70679fa5c6&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29

Related: ClickHouse#106650

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
alexey-milovidov pushed a commit to azat/ClickHouse that referenced this pull request Jun 9, 2026
The recursive `disks list --recursive` output is sorted server-side by
`std::sort` in `programs/disks/CommandList.cpp`, and the test asserts dict
values against alphabetically ordered lists. If that sort algorithm ever
changes (e.g. to an unstable sort that does not happen to produce
alphabetical order on equal-key inputs, or to a different ordering), the
exact-list assertions in `test_disks_app_interactive_list_directories_default`
would fail.

Sort listings on the Python side in the `ls` helper. The change is
idempotent for the current alphabetical server-side sort, and decouples
test assertions from the underlying sort algorithm.

Verified locally with `python3 -m ci.praktika run "integration" --test
"test_disks_app_interactive/test.py" --count 5` (25 passed in 11s).

Related: ClickHouse#106650
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Continued work on this PR:

The one open review thread — the tpch_adapted_1 query 4 perf flag on base/base/sort.h — is left for your call (perf/gating judgment on the broad ::sort switch).

Pushed 979096a704f.

@thevar1able thevar1able self-assigned this Jun 11, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, also submit a fix for test_disks_app_interactive in a separate PR, so we can merge that sooner.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

By the way, it is still failing in this PR.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov separate fix opened: #107262

The still-failing case here is not sort order (that was handled by #106730, whose robustness commits are already on this branch). It is a spurious TimeoutError: Disks client returned no output in the test harness: execute_query waited only 5s for the clickhouse disks subprocess to produce output, and under amd_msan at high host parallelism the subprocess gets CPU-starved past 5s on trivial cleanup commands. CIDB shows it amd_msan only, 0 master failures in 14 days. The PR raises that deadline to 60s; the 900s pytest-timeout still guards real hangs. Test-only, targets master so it can merge independently.

alexey-milovidov and others added 2 commits June 13, 2026 04:09
The test drives `clickhouse disks --test-mode` over pipes and treats the
`\a\a\a\a` SEPARATOR on stdout as the end-of-command marker. `execute_query`
polled once and, if a stderr log line arrived first, read a single stderr line
and returned without ever consuming the stdout separator. That left the
separator buffered and desynced every subsequent command, which under the
slowest sanitizer build (`amd_msan`, where log timing varies more) intermittently
surfaced as `TimeoutError: Disks client returned no output` during teardown
`rm`.

Keep polling until the stdout separator is actually observed, draining stderr
separately, and use a generous timeout so msan does not spuriously time out.
Surface an unexpected output close as an exception instead of looping.

The disks app itself does not use the sort engine changed in this PR
(`CommandList` uses `std::sort`, `DiskLocal` uses `fs::remove_all` /
`directory_iterator`), so this is a test-protocol robustness fix, not a product
change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 4d9da60f03d (merged master up to date; was ~2 days behind and red).

test_disks_app_interactive (amd_msan) — fixed

This was the only PR-related failure. History on play.clickhouse.com shows it fails only on amd_msan (the slowest sanitizer) and intermittently (~2 FAIL / 8 OK per run), never on asan/tsan/coverage/arm and never on master (3260 OK on msan). A correctness bug in the new sort would fail deterministically on every build — this is a timing/protocol flake instead.

Root cause: the test drives clickhouse disks --test-mode over pipes and uses the \a\a\a\a separator on stdout as the end-of-command marker. execute_query polled once and, if a stderr log line arrived first, read a single stderr line and returned without consuming the stdout separator, desyncing every subsequent command. Under msan (where log timing varies more) this surfaced as TimeoutError: Disks client returned no output during teardown rm. Fix: keep polling until the stdout separator is actually seen, drain stderr separately, and use a generous timeout. The disks app does not use the sort engine this PR changes (CommandList uses std::sort; DiskLocal uses fs::remove_all / directory_iterator), so this is a test-protocol fix, not a product change.

Unrelated CI failure

The Stateless tests (arm_asan_ubsan, azure, parallel) reds (04033_tpc_ds_q04/q33/q69, 00376_shard_group_uniq_array_of_int_array, 00050_min_max, 03272_json_with_progress, 01710_projections_...) are a runner-wide OOM batch: multiple parallel queries failing at the same timestamp with current RSS: 53.70 GiB, maximum: 53.69 GiB (MEMORY_LIMIT_EXCEEDED / OvercommitTracker), and the 01710 UNKNOWN_TABLE is a cascade. These will get fresh CI on the push.

@groeneai, please take a look at the arm_asan_ubsan runner-OOM batch flake on this PR if it recurs after the rerun: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=106650&sha=979096a704f567400b11f081f20bfab1bd2f2b7f&name_0=PR&name_1=Stateless%20tests%20%28arm_asan_ubsan%2C%20azure%2C%20parallel%29

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov Confirmed: runner-wide OOM, not a #106650 sort bug.

On 979096a704f5 the 9 arm_asan_ubsan, azure, parallel reds are all OvercommitTracker MEMORY_LIMIT_EXCEEDED (rc 241) at current RSS: 53.85 GiB, maximum: 53.69 GiB (the azure runner cap), clustered in one ~50s server window. The rc-210 Connection reset ones (q33, 01710, 03991) and the 01710 UNKNOWN_TABLE are cascades from the same event. Scope rules out the PR: that exact check failed 758 distinct tests across 94 distinct PRs on 2026-06-10. A sort regression would fail deterministically on every build; these are host-memory victims that pass on rerun.

The rerun on 4d9da60f03d has not reached that job yet (only Fast test / Style / Config done so far). I will re-check when arm_asan_ubsan, azure, parallel completes and only flag it if it recurs with a non-OOM signature; a repeat OOM is the same infra and clears on rerun.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, fix AST fuzzer (amd_debug, targeted): Logical error: Not-ready Set is passed as the second argument for function 'A (STID: 0250-424f).

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov The AST fuzzer "Not-ready Set" (STID 0250-424f) is not a regression of this PR. It is a pre-existing DDL-validation hole: a key expression like ORDER BY (x IN some_table) is accepted at CREATE and then crashes on the first INSERT, because the writer evaluates the key but the IN set is never built. x IN (SELECT ...) is already rejected (subquery check), but the table-identifier RHS slips through.

Fix in a separate PR: #107424 - extends the existing key-expression check to also reject IN with a table on the RHS (ORDER BY / PRIMARY KEY / PARTITION BY / skip INDEX). Literal sets like x IN (1,2,3) stay allowed.

Reproducer (deterministic, plain master):

CREATE TABLE t (x Int32) ENGINE = Memory;
CREATE TABLE u (a Int32) ENGINE = MergeTree ORDER BY (a IN t) AS SELECT 1;
-- Logical error: Not-ready Set ...

@clickhouse-gh

clickhouse-gh Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.70% 84.70% +0.00%
Functions 92.40% 92.40% +0.00%
Branches 77.30% 77.30% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 206/222 (92.79%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 1 line(s) · Uncovered code

Full report · Diff report

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The only red on the current head (8dc97c8b23f) is Performance Comparison (amd_release, master_head, 4/6); everything else is green (CH Inc sync, Style check, all functional/integration/sanitizer jobs). The Mergeable Check and PR failures are just aggregates of that one shard.

The flagged queries are run-to-run noise on amd/4, not regressions from this change:

  • Cross-shard inconsistency proves machine noise. 11 of the 17 "slower" queries are on amd/4. The same queries are not flagged on the corresponding arm/4 shard: direct_dictionary (a dictGet/dictHas workload with no sort) and string_subcolumn_in_tuple (SELECT count() ... WHERE NOT ignore(t.sN), also no sort) are flagged slower on amd/4 but clean on arm/4. Sort-unrelated queries regressing on one runner but not the architectural twin is the signature of a slow runner in that run, not a code change.
  • The "sort-shaped" regressions don't touch the changed code. order_with_limit (#0/#2/#6/#11/#13/#16) and sparse_column #40 (ORDER BY str DESC LIMIT 100) all carry a LIMIT. This PR only swaps the full sort primitive (::sortipnsort, plus the new ::stableSort); partial_sort and nth_element are unchanged. IColumn::getPermutationImpl routes any 0 < limit < data_size to partial_sort and reaches the changed ::sort only when limit == 0. For these LIMIT n queries (n = 1…15000, well under the block size) every block goes through the unchanged partial_sort, so the new path is never exercised.

I re-ran the amd/4 Performance Comparison shard to get a confirming run.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

SQL-level benchmark (synthetic + ClickBench + TPC-H)

I ran a clean A/B of this branch against the master sort path to see how the change lands on real queries.

Setup. Two builds of 26.6.1.1, same RelWithDebInfo config, pointed at the same data dir:

  • baseline — master sort path (pdqsort for ::sort, std::stable_sort);
  • ipnsort — this branch (ipnsort for ::sort, driftsort for ::stableSort).

Datasets: ClickBench hits (100M), TPC-H SF1, and synthetic 50M-row tables. Each query: 1 warm-up + 3 timed runs, min wall-clock, FORMAT Null. Control queries (radix ORDER BY, scan-only) match to ~1.0×, so the deltas are attributable to the sort.

Key finding

ORDER BY on <=64-bit numerics uses radix sort (untouched by this PR). The comparison ::sort is reached by String / FixedString / Decimal128-256 / Int128-256 / arraySort / multi-column tie-breaks.

At production parallelism (96 threads) a full ORDER BY is dominated by the unchanged serial k-way merge of small per-chunk sorts, so the sort swap is invisible there — no regression, no speed-up. The win shows where the comparison sort actually dominates the query, and through ClickHouse's indirect, cache-missing permutation comparator it is a realistic ~1.1–1.55× (not the 3× of the raw-UInt64 microbenchmark, which goes through radix in SQL anyway). driftsort is not SQL-visible: the windowFunnel / sequenceNextNode per-group stable sort is a tiny fraction of those aggregates with a cheap timestamp comparator.

Where it shows

arraySort is the cleanest case — a per-row sort with no merge and no block-materialization:

CREATE TABLE arrs (a Array(String)) ENGINE = MergeTree ORDER BY tuple();
INSERT INTO arrs SELECT arrayMap(i -> lower(hex(sipHash64(number, i))), range(200)) FROM numbers_mt(100000);
SELECT arraySort(a) FROM arrs FORMAT Null;                 -- 0.864s -> 0.557s  = 1.55x

Few-unique strings (sort dominates), and single-thread / large-block sorts that bypass the parallel merge:

SELECT s FROM s_fewunique ORDER BY s FORMAT Null;          -- 50M, ~100 distinct, 96 threads: 0.213 -> 0.167 = 1.28x

-- one big sort, no parallel merge:
SELECT s FROM s_rand    ORDER BY s SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;  -- 50M String:    32.41 -> 27.04 = 1.20x
SELECT s FROM s_fewunique ORDER BY s SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;-- 50M few-unique:  5.05 ->  3.75 = 1.35x
SELECT x FROM dec_rand  ORDER BY x SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;  -- 50M Decimal128: 17.38 -> 15.70 = 1.11x
SELECT x FROM i128_rand ORDER BY x SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;  -- 50M Int128:     13.52 -> 12.28 = 1.10x

TPC-H SF1 — single-threaded text-column sorts:

SELECT l_comment FROM lineitem ORDER BY l_comment SETTINGS max_threads = 1 FORMAT Null;  -- 6M:    2.44 -> 2.10 = 1.16x
SELECT o_comment FROM orders   ORDER BY o_comment SETTINGS max_threads = 1 FORMAT Null;  -- 1.5M:  0.53 -> 0.45 = 1.18x
SELECT p_name    FROM part     ORDER BY p_name    SETTINGS max_threads = 1 FORMAT Null;  -- 200k:  0.047 -> 0.035 = 1.34x

Single-thread / large-block summary, min wall-clock seconds:

1-shot single-thread sort baseline ipnsort speed-up
50M few-unique strings 5.045 3.749 1.35×
50M random strings 32.410 27.042 1.20×
50M Decimal128 17.377 15.700 1.11×
50M Int128 13.524 12.277 1.10×
6M TPC-H l_comment 2.466 2.226 1.11×
UInt64 radix CONTROL 7.360 7.805 0.94× (radix path, unchanged)

Where it's masked — full ORDER BY at 96 threads (merge-bound, "no regression")

SELECT URL FROM hits ORDER BY URL FORMAT Null;                                 -- 100M: 5.53 -> 5.45 = 1.01x
SELECT URL, Title, Referer FROM hits ORDER BY URL, Title, Referer FORMAT Null; -- 100M, 3-col: 12.28 -> 12.09 = 1.02x
SELECT WatchID FROM hits ORDER BY WatchID FORMAT Null;                         -- radix CONTROL: 9.22 -> 9.18 = 1.00x

ORDER BY Title / Referer (100M), synthetic 50M random String/Decimal128/Int128 at 96 threads, and windowFunnel / sequenceNextNode are all ~1.0× as well.

Notes

  • The microbenchmark speed-ups don't translate to ORDER BY wall-clock at default parallelism — the per-chunk sorts are small and the serial merge dominates. Worth stating so reviewers don't expect 3× on ORDER BY.
  • The cleanest SQL demonstration is arraySort on a large array column (per-row, no merge) — a good candidate for a tests/performance test (1.55× here).
  • A simple reverse-sorted Decimal128 input is not a pdqsort worst case here (both ~0.05s) — pdqsort's pattern detection already handles it; the reverse-sorted worst-case removal isn't reproduced by a plain reverse-sorted SQL input.

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 18, 2026
Merged via the queue into master with commit ba7fe30 Jun 18, 2026
327 of 328 checks passed
@alexey-milovidov alexey-milovidov deleted the ipnsort-driftsort branch June 18, 2026 23:31
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants