iframe-proxy

alexey-milovidov · 2026-06-06T22:34:47Z

Port the Rust standard library's current sort implementations to C++ and use them for ClickHouse's general-purpose sorts:

ipnsort (unstable, by Lukas Bergdoll & Orson Peters) → ::sort (base/base/sort.h), replacing pdqsort.
driftsort (stable, by Orson Peters & Lukas Bergdoll) → a new ::stableSort, replacing std::stable_sort call sites.

Both are vendored as header-only contribs (contrib/ipnsort, contrib/driftsort), dual-licensed MIT / Apache-2.0, the same way as contrib/pdqsort. They operate on raw pointers, so the always-contiguous iterators are converted with std::to_address; the rare non-contiguous random-access iterators keep using pdqsort / std::stable_sort. trySort, nth_element and partial_sort are unchanged.

ipnsort

ipnsort is pattern-defeating quicksort with a glidesort pivot, a sorting-network small-sort, and a heapsort fallback for the O(n log n) worst case. It detects already-sorted and reverse-sorted runs and handles them in O(n), which pdqsort (as used directly by ::sort, without the trySort pre-pass) does not always get for free.

The C++ port runs the full algorithm for trivially-copyable types (relocating bitwise via memcpy, matching the Rust original) and a correct move-based path (insertion sort + Hoare partition + heapsort) for other types.

One change from the upstream partition was made to keep throughput high for ClickHouse's most common pattern — sorting a permutation through an indirect, cache-missing comparator. The upstream in-place branchless Lomuto partition has a loop-carried dependency through the running less-than count, which serializes the comparator's load latency. This port evaluates the comparator over a block of elements in an independent (ILP-friendly) pre-pass and then performs the cheap serial rearrangement from the precomputed predicates, matching block-quicksort throughput on expensive comparators while keeping the single-pass speed on cheap ones.

driftsort

driftsort is a powersort merge tree over detected runs with a stable scratch-based quicksort for unsorted runs and a branchless bidirectional-merge small-sort. It allocates up to n/2 extra elements (a 4 KiB stack buffer for small inputs, heap otherwise). The full algorithm runs for trivially-copyable types; other types fall back to std::stable_sort.

Measurements (10M elements, single thread, aarch64, best-of)

ipnsort vs the previous ::sort engine on the workloads ::sort actually sees:

random UInt64: ~3x faster than pdqsort.
already-sorted / reverse-sorted: handled in O(n) (e.g. reverse UInt64 ~8 ms vs hundreds of ms for a naive quicksort), removing the worst case.
few-unique and permutation-with-indirect-comparator: on par with or faster than pdqsort.

driftsort vs std::stable_sort:

random: 2.4x faster; sorted: 81x; reverse: 42x; few-unique: 6x; 90%-presorted: 2.8x; permutation with indirect comparator: 2.2x.

Correctness

The ports were validated against std::sort / std::stable_sort over thousands of randomized cases (sizes 0..2M, many duplicate densities, sorted / reverse / organ-pipe / nearly-sorted inputs, trivially-copyable and non-trivially-copyable element types) under AddressSanitizer with zero mismatches; driftsort additionally verified stable (equal-key order matches std::stable_sort exactly).

Relation to #106618

This supersedes the blqsort approach in #106618. The CI performance comparison there showed blqsort improving string / multi-column sorts but regressing ORDER BY toDecimal128(...) DESC by ~4.9x (a reverse-sorted worst case). ipnsort keeps the string-sort improvements and does not have that regression.

Related: #106618

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Use ipnsort (unstable) and driftsort (stable) — C++ ports of the Rust standard library's sort implementations — for the general-purpose comparison sorts. Speeds up ORDER BY on non-numeric columns and stable sorts, and removes a worst case on reverse-sorted input.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

🤖 Generated with Claude Code

…ting Port the Rust standard library's current sort implementations to C++ and use them in `base/base/sort.h`: - ipnsort (by Lukas Bergdoll & Orson Peters) backs `::sort`, replacing `pdqsort`. It is pattern-defeating quicksort with a glidesort pivot, a sorting-network small-sort and a heapsort fallback, and detects already-sorted and reverse-sorted runs (handling them in O(n)). The port runs the full algorithm for trivially-copyable types and a move-based path otherwise. Its partition evaluates the comparator over a block in an ILP-friendly pre-pass before the serial rearrangement, which keeps throughput high for expensive (indirect / cache-missing) comparators - the dominant ClickHouse pattern. - driftsort (by Orson Peters & Lukas Bergdoll) backs a new `::stableSort`, replacing `std::stable_sort` call sites. It is a powersort merge tree over detected runs with a stable scratch quicksort, several times faster than `std::stable_sort`. It allocates up to n/2 extra memory. Both are vendored as header-only contribs (dual MIT / Apache-2.0), like `contrib/pdqsort`. They need raw pointers, so contiguous iterators are converted with `std::to_address`; non-contiguous iterators keep `pdqsort` / `std::stable_sort`. `trySort`, `nth_element` and `partial_sort` are unchanged. Validated against `std::sort` / `std::stable_sort` over thousands of randomized cases under AddressSanitizer (sizes, duplicate densities, patterns, and trivially-copyable, non-trivially-copyable and non-default-constructible element types) with zero mismatches; driftsort verified stable. Supersedes the blqsort approach in #106618: the CI performance comparison there showed blqsort regressing `ORDER BY toDecimal128(...) DESC` ~4.9x (a reverse-sorted worst case); ipnsort keeps the string-sort improvements without that regression. Related: #106618 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-06-06T22:35:21Z

clickhouse-gh · 2026-06-07T00:57:01Z

📊 Cloud Performance Report

🔴 AI verdict: degradation — 1 regressed, 4 improved out of 38 analysed

This PR swaps the core sorting implementations: ipnsort replaces pdqsort for unstable sort and driftsort replaces std::stable_sort, so any query doing an ORDER BY or merge is directly affected. Four of the five flagged queries improve in line with a faster sort: clickbench Q22 (-25%) and Q23 (-33%), and TPC-H Q7 (-35%) and Q20 (-50%, though that one's measurements were noisy). The one regression, clickbench Q13 (+11%), runs against the trend and compounds an existing degradation already visible on master, so it's worth a closer look.

clickbench

🔴 1 regressed · 🟢 2 improved

Flagged queries (3 of 43)

	Query	Verdict	Baseline med (ms)	PR med (ms)	Change	q-value	Hint
🔴	13	regression	497	552	+11.1%	<0.0001	cpu: ipnsort now backs ORDER BY count() LIMIT; +11% bucks the faster-sort trend and compounds an existing master degradation
🟢	22	improvement	334	250	-25.1%	<0.0001	cpu: Faster ipnsort on the ORDER BY count() LIMIT path; -25% improvement, though the base run was noisy
🟢	23	improvement	297	198	-33.3%	<0.0001	cpu: ipnsort speeds the ORDER BY count() LIMIT path; clean -33% on a stable query, both tests agree

_{q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.}

tpch_adapted_1_official

🟢 2 improved

Flagged queries (2 of 22)

	Query	Verdict	Baseline med (ms)	PR med (ms)	Change	q-value	Hint
🟢	7	improvement	113	74	-34.5%	<0.0001	cpu: ipnsort/driftsort sort swap speeds the ORDER BY path; clean -35% with matched low variance
🟢	20	improvement	248	124	-50.0%	<0.0001	cpu: Sort swap plausibly drives -50%, but underlying measurements were noisy and this query is intrinsically flaky

_{q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.}

Debug info

StressHouse run: 15fe82f4-0be5-40e5-9e34-bd8e3cd2642a
MIRAI run: b67e0eb5-a933-427a-b3b5-fbd3eebee3ca
PR check IDs:
- clickbench_77826_1781412051
- clickbench_77846_1781412051
- clickbench_77866_1781412052
- tpch_adapted_1_official_78084_1781412081
- tpch_adapted_1_official_78104_1781412081
- tpch_adapted_1_official_78124_1781412082

…_less `swap_if_less` performs a branchless conditional swap of two elements. In the no-swap case `left_swap == va`, so `bit_copy(va, left_swap)` copies an element onto itself, which is overlapping (in fact identical) storage where `std::memcpy` is undefined behavior. Since `::sort` now routes every contiguous call through `ipnsort`, this is hit on ordinary inputs. Use an overlap-safe `std::memmove` for that one copy, matching the original Rust `ptr::copy` (the non-overlapping `bit_copy`/`memcpy` is kept everywhere it is provably non-aliasing). For a fixed `sizeof(T)` the compiler lowers the `memmove` to the same load/store as `memcpy`, so the small-sort network stays branchless. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The driftsort scratch buffer is reinterpreted as `T *` and accessed as `T &`, so it must satisfy `alignof(T)`. The stack buffer uses `alignas(T)`, but the heap path allocated with plain `new std::byte[]`, which only guarantees the default new alignment. For an over-aligned trivially-copyable `T` (e.g. `alignas(64)`) on the heap path this produced under-aligned `T` references. Allocate the heap scratch with aligned `operator new[]` plus a matching aligned `operator delete[]` deleter. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add committed regression coverage for `::sort` (ipnsort) and `::stableSort` (driftsort) in `base/base/sort.h`, comparing against `std::sort` / `std::stable_sort` over sorted, reverse, duplicate-heavy, organ-pipe, all-equal and random inputs across many sizes (small-sort networks, stack-scratch and heap-scratch paths). It checks: - equality with `std::sort` for trivially-copyable types and a custom comparator; - `::stableSort` matches `std::stable_sort` exactly, including equal-key order (stability); - a non-default-constructible trivially-copyable element type (the `Uninit` scratch path); - an `alignas(64)` element type whose comparator asserts it never sees an under-aligned reference (the heap-scratch alignment path); - a non-trivially-copyable element type (the move-based and `std::stable_sort` fallbacks). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-08T14:10:04Z

@groeneai, fix integration tests,

test_profile_max_sessions_for_user
test_disks_app_interactive

which could wrongly assert the unstable sort order result, and send a PR with the fix.

groeneai · 2026-06-08T15:35:47Z

@alexey-milovidov fix PR #106730.

test_disks_app_interactive: hardened the ls helper in the test to sort listings on the Python side, decoupling assertions from the disks app sort order. Verified via simulation that reverse and shuffled server output now both pass (old helper fails on either).

test_profile_max_sessions_for_user: reviewed each assertion and found nothing order-dependent — the test uses substring expect, assert_logs_contain_with_retry, and discards KILL QUERY output. No list comparisons. If you have a specific assertion in mind I missed, please point at it and I will follow up.

The move-based Hoare partition (`partition_hoare_branchy_cyclic`, used by `::sort` for non-trivially-copyable types) saves an in-transit element in a `Uninit<T>` union while shuffling. If the `is_less` comparator throws while the gap is engaged, unwinding runs the union's trivial destructor, which never calls `~T`, leaking the saved element and leaving the range one value short. This is a regression versus the `pdqsort` path it replaces. Add a `GapRestorer` RAII guard (mirroring `GapGuard` used by the trivially-copyable block partition) that, on unwind, moves the saved element back into the hole and destroys it. Add a unit test that throws from the comparator at every comparison index in turn and asserts nothing is leaked. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Under `Integration tests (amd_asan_ubsan, flaky)` praktika runs pytest-xdist with `--dist=each` and many workers concurrently. All workers share the runner host filesystem, so the test must scope every persistent path by `PYTEST_XDIST_WORKER`. The previous patch only sorted `ls` output on the client side; that addresses sort-order assertions but does nothing for concurrent workers stomping `/var/lib/clickhouse`, the shared `test/` directory, and the temporary `a.txt` next to `tests/integration/`. Changes: - `DisksClient.default_disk_root_directory` is now per-worker (`/var/lib/clickhouse-<gwN>`). - `getLocalDisksClient` wipes the per-worker root on every refresh, terminates any orphan `clickhouse disks` subprocess from the previous iteration, and writes a per-worker `config_<gwN>.xml` whose `<path>` points to that root. - `test_disks_app_interactive_list_directories_default` and `test_disks_app_interactive_test_move_and_write` now use per-worker directory and file names instead of the shared `test`/`a.txt`. - `test_disks_app_interactive_cp_and_read` uses per-worker file and directory names so concurrent workers cannot delete each other's `a.txt` mid-test. Verified locally with `python3 -m ci.praktika run "Integration tests (amd_asan_ubsan, flaky)" --test test_disks_app_interactive` at workers=4 and workers=8: 20/20 and 40/40 pass, vs. the previous 5/20 fail. Also verified single-worker mode (5/5 pass) and `--count 10 --workers 4` (50/50 pass). CI report on the previous flaky-check failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=106730&sha=110670122e495cdcbc23b0fc7aba7a70679fa5c6&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29 Related: ClickHouse#106650 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The recursive `disks list --recursive` output is sorted server-side by `std::sort` in `programs/disks/CommandList.cpp`, and the test asserts dict values against alphabetically ordered lists. If that sort algorithm ever changes (e.g. to an unstable sort that does not happen to produce alphabetical order on equal-key inputs, or to a different ordering), the exact-list assertions in `test_disks_app_interactive_list_directories_default` would fail. Sort listings on the Python side in the `ls` helper. The change is idempotent for the current alphabetical server-side sort, and decouples test assertions from the underlying sort algorithm. Verified locally with `python3 -m ci.praktika run "integration" --test "test_disks_app_interactive/test.py" --count 5` (25 passed in 11s). Related: ClickHouse#106650

alexey-milovidov · 2026-06-10T01:12:44Z

Continued work on this PR:

Merged master (clean, 0 conflicts). The branch was ~1420 commits behind. This brings in two fixes that clear most of the red CI:
- Fix test_disks_app_interactive ls helper to be robust to disks app sort order #106730 (Fix test_disks_app_interactive ls helper to be robust to disks app sort order) — the test_disks_app_interactive timeouts were caused by this PR's unstable ::sort changing the disks-app ls output order; the test was made order-robust on master.
- Fix LOGICAL_ERROR Virtual row boundary violated under read-in-order over Union #106664 (Fix LOGICAL_ERROR Virtual row boundary violated under read-in-order over Union) — clears the Virtual row boundary violated in MergingSortedAlgorithm (STID 2651-3359) and the co-occurring Server died in amd_debug stateless (the LOGICAL_ERROR aborts the server in debug builds).
Exception-safety fix (contrib/ipnsort/ipnsort.h, commit 2a6b1bff57a) is now built and validated: ran unit_tests_dbms Sort.*, all 6 tests pass including the new Sort.MoveBasedPartitionExceptionSafety. The corresponding review thread is resolved.
The remaining test_profile_max_sessions_for_user timeout (amd_msan integration) is an unrelated low-frequency flake — it has failed on master/other PRs (Fix memory usage growth in fs cache #106387, Make Profile Events ±30x faster by using Per CPU atomics for server-wide and users #105056) and is a session-count test with no relation to sorting.

The one open review thread — the tpch_adapted_1 query 4 perf flag on base/base/sort.h — is left for your call (perf/gating judgment on the broad ::sort switch).

Pushed 979096a704f.

alexey-milovidov · 2026-06-11T17:18:20Z

@groeneai, also submit a fix for test_disks_app_interactive in a separate PR, so we can merge that sooner.

alexey-milovidov · 2026-06-11T17:19:11Z

By the way, it is still failing in this PR.

groeneai · 2026-06-11T18:00:51Z

@alexey-milovidov separate fix opened: #107262

The still-failing case here is not sort order (that was handled by #106730, whose robustness commits are already on this branch). It is a spurious TimeoutError: Disks client returned no output in the test harness: execute_query waited only 5s for the clickhouse disks subprocess to produce output, and under amd_msan at high host parallelism the subprocess gets CPU-starved past 5s on trivial cleanup commands. CIDB shows it amd_msan only, 0 master failures in 14 days. The PR raises that deadline to 60s; the 900s pytest-timeout still guards real hangs. Test-only, targets master so it can merge independently.

The test drives `clickhouse disks --test-mode` over pipes and treats the `\a\a\a\a` SEPARATOR on stdout as the end-of-command marker. `execute_query` polled once and, if a stderr log line arrived first, read a single stderr line and returned without ever consuming the stdout separator. That left the separator buffered and desynced every subsequent command, which under the slowest sanitizer build (`amd_msan`, where log timing varies more) intermittently surfaced as `TimeoutError: Disks client returned no output` during teardown `rm`. Keep polling until the stdout separator is actually observed, draining stderr separately, and use a generous timeout so msan does not spuriously time out. Surface an unexpected output close as an exception instead of looping. The disks app itself does not use the sort engine changed in this PR (`CommandList` uses `std::sort`, `DiskLocal` uses `fs::remove_all` / `directory_iterator`), so this is a test-protocol robustness fix, not a product change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alexey-milovidov · 2026-06-13T04:10:07Z

Pushed 4d9da60f03d (merged master up to date; was ~2 days behind and red).

`test_disks_app_interactive` (amd_msan) — fixed

This was the only PR-related failure. History on play.clickhouse.com shows it fails only on amd_msan (the slowest sanitizer) and intermittently (~2 FAIL / 8 OK per run), never on asan/tsan/coverage/arm and never on master (3260 OK on msan). A correctness bug in the new sort would fail deterministically on every build — this is a timing/protocol flake instead.

Root cause: the test drives clickhouse disks --test-mode over pipes and uses the \a\a\a\a separator on stdout as the end-of-command marker. execute_query polled once and, if a stderr log line arrived first, read a single stderr line and returned without consuming the stdout separator, desyncing every subsequent command. Under msan (where log timing varies more) this surfaced as TimeoutError: Disks client returned no output during teardown rm. Fix: keep polling until the stdout separator is actually seen, drain stderr separately, and use a generous timeout. The disks app does not use the sort engine this PR changes (CommandList uses std::sort; DiskLocal uses fs::remove_all / directory_iterator), so this is a test-protocol fix, not a product change.

Unrelated CI failure

The Stateless tests (arm_asan_ubsan, azure, parallel) reds (04033_tpc_ds_q04/q33/q69, 00376_shard_group_uniq_array_of_int_array, 00050_min_max, 03272_json_with_progress, 01710_projections_...) are a runner-wide OOM batch: multiple parallel queries failing at the same timestamp with current RSS: 53.70 GiB, maximum: 53.69 GiB (MEMORY_LIMIT_EXCEEDED / OvercommitTracker), and the 01710 UNKNOWN_TABLE is a cascade. These will get fresh CI on the push.

@groeneai, please take a look at the arm_asan_ubsan runner-OOM batch flake on this PR if it recurs after the rerun: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=106650&sha=979096a704f567400b11f081f20bfab1bd2f2b7f&name_0=PR&name_1=Stateless%20tests%20%28arm_asan_ubsan%2C%20azure%2C%20parallel%29

groeneai · 2026-06-13T04:48:57Z

@alexey-milovidov Confirmed: runner-wide OOM, not a #106650 sort bug.

On 979096a704f5 the 9 arm_asan_ubsan, azure, parallel reds are all OvercommitTracker MEMORY_LIMIT_EXCEEDED (rc 241) at current RSS: 53.85 GiB, maximum: 53.69 GiB (the azure runner cap), clustered in one ~50s server window. The rc-210 Connection reset ones (q33, 01710, 03991) and the 01710 UNKNOWN_TABLE are cascades from the same event. Scope rules out the PR: that exact check failed 758 distinct tests across 94 distinct PRs on 2026-06-10. A sort regression would fail deterministically on every build; these are host-memory victims that pass on rerun.

The rerun on 4d9da60f03d has not reached that job yet (only Fast test / Style / Config done so far). I will re-check when arm_asan_ubsan, azure, parallel completes and only flag it if it recurs with a non-OOM signature; a repeat OOM is the same infra and clears on rerun.

alexey-milovidov · 2026-06-13T21:15:40Z

@groeneai, fix AST fuzzer (amd_debug, targeted): Logical error: Not-ready Set is passed as the second argument for function 'A (STID: 0250-424f).

groeneai · 2026-06-13T23:32:36Z

@alexey-milovidov The AST fuzzer "Not-ready Set" (STID 0250-424f) is not a regression of this PR. It is a pre-existing DDL-validation hole: a key expression like ORDER BY (x IN some_table) is accepted at CREATE and then crashes on the first INSERT, because the writer evaluates the key but the IN set is never built. x IN (SELECT ...) is already rejected (subquery check), but the table-identifier RHS slips through.

Fix in a separate PR: #107424 - extends the existing key-expression check to also reject IN with a table on the RHS (ORDER BY / PRIMARY KEY / PARTITION BY / skip INDEX). Literal sets like x IN (1,2,3) stay allowed.

Reproducer (deterministic, plain master):

CREATE TABLE t (x Int32) ENGINE = Memory;
CREATE TABLE u (a Int32) ENGINE = MergeTree ORDER BY (a IN t) AS SELECT 1;
-- Logical error: Not-ready Set ...

clickhouse-gh · 2026-06-14T06:44:53Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.70%	84.70%	+0.00%
Functions	92.40%	92.40%	+0.00%
Branches	77.30%	77.30%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 206/222 (92.79%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 1 line(s) · Uncovered code

Full report · Diff report

alexey-milovidov · 2026-06-14T14:02:34Z

The only red on the current head (8dc97c8b23f) is Performance Comparison (amd_release, master_head, 4/6); everything else is green (CH Inc sync, Style check, all functional/integration/sanitizer jobs). The Mergeable Check and PR failures are just aggregates of that one shard.

The flagged queries are run-to-run noise on amd/4, not regressions from this change:

Cross-shard inconsistency proves machine noise. 11 of the 17 "slower" queries are on amd/4. The same queries are not flagged on the corresponding arm/4 shard: direct_dictionary (a dictGet/dictHas workload with no sort) and string_subcolumn_in_tuple (SELECT count() ... WHERE NOT ignore(t.sN), also no sort) are flagged slower on amd/4 but clean on arm/4. Sort-unrelated queries regressing on one runner but not the architectural twin is the signature of a slow runner in that run, not a code change.
The "sort-shaped" regressions don't touch the changed code. order_with_limit (#0/#2/#6/#11/#13/#16) and sparse_column #40 (ORDER BY str DESC LIMIT 100) all carry a LIMIT. This PR only swaps the full sort primitive (::sort → ipnsort, plus the new ::stableSort); partial_sort and nth_element are unchanged. IColumn::getPermutationImpl routes any 0 < limit < data_size to partial_sort and reaches the changed ::sort only when limit == 0. For these LIMIT n queries (n = 1…15000, well under the block size) every block goes through the unchanged partial_sort, so the new path is never exercised.

I re-ran the amd/4 Performance Comparison shard to get a confirming run.

alexey-milovidov · 2026-06-18T01:16:48Z

SQL-level benchmark (synthetic + ClickBench + TPC-H)

I ran a clean A/B of this branch against the master sort path to see how the change lands on real queries.

Setup. Two builds of 26.6.1.1, same RelWithDebInfo config, pointed at the same data dir:

baseline — master sort path (pdqsort for ::sort, std::stable_sort);
ipnsort — this branch (ipnsort for ::sort, driftsort for ::stableSort).

Datasets: ClickBench hits (100M), TPC-H SF1, and synthetic 50M-row tables. Each query: 1 warm-up + 3 timed runs, min wall-clock, FORMAT Null. Control queries (radix ORDER BY, scan-only) match to ~1.0×, so the deltas are attributable to the sort.

Key finding

ORDER BY on <=64-bit numerics uses radix sort (untouched by this PR). The comparison ::sort is reached by String / FixedString / Decimal128-256 / Int128-256 / arraySort / multi-column tie-breaks.

At production parallelism (96 threads) a full ORDER BY is dominated by the unchanged serial k-way merge of small per-chunk sorts, so the sort swap is invisible there — no regression, no speed-up. The win shows where the comparison sort actually dominates the query, and through ClickHouse's indirect, cache-missing permutation comparator it is a realistic ~1.1–1.55× (not the 3× of the raw-UInt64 microbenchmark, which goes through radix in SQL anyway). driftsort is not SQL-visible: the windowFunnel / sequenceNextNode per-group stable sort is a tiny fraction of those aggregates with a cheap timestamp comparator.

Where it shows

arraySort is the cleanest case — a per-row sort with no merge and no block-materialization:

CREATE TABLE arrs (a Array(String)) ENGINE = MergeTree ORDER BY tuple();
INSERT INTO arrs SELECT arrayMap(i -> lower(hex(sipHash64(number, i))), range(200)) FROM numbers_mt(100000);
SELECT arraySort(a) FROM arrs FORMAT Null;                 -- 0.864s -> 0.557s  = 1.55x

Few-unique strings (sort dominates), and single-thread / large-block sorts that bypass the parallel merge:

SELECT s FROM s_fewunique ORDER BY s FORMAT Null;          -- 50M, ~100 distinct, 96 threads: 0.213 -> 0.167 = 1.28x

-- one big sort, no parallel merge:
SELECT s FROM s_rand    ORDER BY s SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;  -- 50M String:    32.41 -> 27.04 = 1.20x
SELECT s FROM s_fewunique ORDER BY s SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;-- 50M few-unique:  5.05 ->  3.75 = 1.35x
SELECT x FROM dec_rand  ORDER BY x SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;  -- 50M Decimal128: 17.38 -> 15.70 = 1.11x
SELECT x FROM i128_rand ORDER BY x SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;  -- 50M Int128:     13.52 -> 12.28 = 1.10x

TPC-H SF1 — single-threaded text-column sorts:

SELECT l_comment FROM lineitem ORDER BY l_comment SETTINGS max_threads = 1 FORMAT Null;  -- 6M:    2.44 -> 2.10 = 1.16x
SELECT o_comment FROM orders   ORDER BY o_comment SETTINGS max_threads = 1 FORMAT Null;  -- 1.5M:  0.53 -> 0.45 = 1.18x
SELECT p_name    FROM part     ORDER BY p_name    SETTINGS max_threads = 1 FORMAT Null;  -- 200k:  0.047 -> 0.035 = 1.34x

Single-thread / large-block summary, min wall-clock seconds:

Where it's masked — full `ORDER BY` at 96 threads (merge-bound, "no regression")

SELECT URL FROM hits ORDER BY URL FORMAT Null;                                 -- 100M: 5.53 -> 5.45 = 1.01x
SELECT URL, Title, Referer FROM hits ORDER BY URL, Title, Referer FORMAT Null; -- 100M, 3-col: 12.28 -> 12.09 = 1.02x
SELECT WatchID FROM hits ORDER BY WatchID FORMAT Null;                         -- radix CONTROL: 9.22 -> 9.18 = 1.00x

ORDER BY Title / Referer (100M), synthetic 50M random String/Decimal128/Int128 at 96 threads, and windowFunnel / sequenceNextNode are all ~1.0× as well.

Notes

The microbenchmark speed-ups don't translate to ORDER BY wall-clock at default parallelism — the per-chunk sorts are small and the serial merge dominates. Worth stating so reviewers don't expect 3× on ORDER BY.
The cleanest SQL demonstration is arraySort on a large array column (per-row, no merge) — a good candidate for a tests/performance test (1.55× here).
A simple reverse-sorted Decimal128 input is not a pdqsort worst case here (both ~0.05s) — pdqsort's pattern detection already handles it; the reverse-sorted worst-case removal isn't reproduced by a plain reverse-sorted SQL input.

clickhouse-gh Bot added pr-performance Pull request with some performance improvements submodule changed At least one submodule changed in this PR. labels Jun 6, 2026

clickhouse-gh Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread contrib/ipnsort/ipnsort.h

clickhouse-gh Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread contrib/driftsort/driftsort.h Outdated

clickhouse-gh Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread base/base/sort.h

clickhouse-gh Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread base/base/sort.h

alexey-milovidov and others added 3 commits June 7, 2026 04:46

clickhouse-gh Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread contrib/ipnsort/ipnsort.h

groeneai mentioned this pull request Jun 8, 2026

Fix test_disks_app_interactive ls helper to be robust to disks app sort order #106730

Merged

1 task

alexey-milovidov and others added 2 commits June 8, 2026 17:53

Merge remote-tracking branch 'origin/master' into ipnsort-driftsort

3eaead3

alexey-milovidov mentioned this pull request Jun 10, 2026

Make clickhouse-local temporary storage limit configurable #106689

Merged

Merge remote-tracking branch 'origin/master' into ipnsort-driftsort

979096a

thevar1able self-assigned this Jun 11, 2026

groeneai mentioned this pull request Jun 11, 2026

Increase disks app interactive test command timeout #107262

Closed

Merge remote-tracking branch 'origin/master' into ipnsort-driftsort

2bcf4d7

alexey-milovidov mentioned this pull request Jun 13, 2026

Add embedded documentation and a system table for dictionary layouts #106182

Merged

1 task

alexey-milovidov and others added 2 commits June 13, 2026 04:09

Merge remote-tracking branch 'origin/master' into ipnsort-driftsort

4d9da60

Merge branch 'master' into ipnsort-driftsort

fbc0e4d

alexey-milovidov mentioned this pull request Jun 13, 2026

Skip macOS-incompatible distributed tests 04327 and 04336 #107376

Merged

groeneai mentioned this pull request Jun 13, 2026

Forbid table references in IN operator in key expressions #107424

Merged

Merge branch 'master' into ipnsort-driftsort

8dc97c8

thevar1able approved these changes Jun 18, 2026

View reviewed changes

alexey-milovidov added this pull request to the merge queue Jun 18, 2026

alexey-milovidov mentioned this pull request Jun 18, 2026

Use branchless quicksort (blqsort) for the general-purpose sort #106618

Closed

1 task

Merged via the queue into master with commit ba7fe30 Jun 18, 2026
327 of 328 checks passed

alexey-milovidov deleted the ipnsort-driftsort branch June 18, 2026 23:31

robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 19, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

alexey-milovidov commented Jun 6, 2026

ipnsort

driftsort

Measurements (10M elements, single thread, aarch64, best-of)

Correctness

Relation to #106618

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

clickhouse-gh Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

clickbench

tpch_adapted_1_official

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented Jun 8, 2026

Uh oh!

groeneai commented Jun 8, 2026

Uh oh!

alexey-milovidov commented Jun 10, 2026

Uh oh!

alexey-milovidov commented Jun 11, 2026

Uh oh!

alexey-milovidov commented Jun 11, 2026

Uh oh!

groeneai commented Jun 11, 2026

Uh oh!

alexey-milovidov commented Jun 13, 2026

test_disks_app_interactive (amd_msan) — fixed

Unrelated CI failure

Uh oh!

groeneai commented Jun 13, 2026

Uh oh!

alexey-milovidov commented Jun 13, 2026

Uh oh!

groeneai commented Jun 13, 2026

Uh oh!

clickhouse-gh Bot commented Jun 14, 2026

LLVM Coverage Report

Uh oh!

alexey-milovidov commented Jun 14, 2026

Uh oh!

alexey-milovidov commented Jun 18, 2026

SQL-level benchmark (synthetic + ClickBench + TPC-H)

Key finding

Where it shows

Where it's masked — full ORDER BY at 96 threads (merge-bound, "no regression")

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

clickhouse-gh Bot commented Jun 6, 2026 •

edited

Loading

clickhouse-gh Bot commented Jun 7, 2026 •

edited

Loading

`test_disks_app_interactive` (amd_msan) — fixed

Where it's masked — full `ORDER BY` at 96 threads (merge-bound, "no regression")