Use ipnsort (unstable) and driftsort (stable) for general-purpose sorting#106650
Conversation
…ting Port the Rust standard library's current sort implementations to C++ and use them in `base/base/sort.h`: - ipnsort (by Lukas Bergdoll & Orson Peters) backs `::sort`, replacing `pdqsort`. It is pattern-defeating quicksort with a glidesort pivot, a sorting-network small-sort and a heapsort fallback, and detects already-sorted and reverse-sorted runs (handling them in O(n)). The port runs the full algorithm for trivially-copyable types and a move-based path otherwise. Its partition evaluates the comparator over a block in an ILP-friendly pre-pass before the serial rearrangement, which keeps throughput high for expensive (indirect / cache-missing) comparators - the dominant ClickHouse pattern. - driftsort (by Orson Peters & Lukas Bergdoll) backs a new `::stableSort`, replacing `std::stable_sort` call sites. It is a powersort merge tree over detected runs with a stable scratch quicksort, several times faster than `std::stable_sort`. It allocates up to n/2 extra memory. Both are vendored as header-only contribs (dual MIT / Apache-2.0), like `contrib/pdqsort`. They need raw pointers, so contiguous iterators are converted with `std::to_address`; non-contiguous iterators keep `pdqsort` / `std::stable_sort`. `trySort`, `nth_element` and `partial_sort` are unchanged. Validated against `std::sort` / `std::stable_sort` over thousands of randomized cases under AddressSanitizer (sizes, duplicate densities, patterns, and trivially-copyable, non-trivially-copyable and non-default-constructible element types) with zero mismatches; driftsort verified stable. Supersedes the blqsort approach in #106618: the CI performance comparison there showed blqsort regressing `ORDER BY toDecimal128(...) DESC` ~4.9x (a reverse-sorted worst case); ipnsort keeps the string-sort improvements without that regression. Related: #106618 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
📊 Cloud Performance Report 🔴 AI verdict: This PR swaps the core sorting implementations: ipnsort replaces pdqsort for unstable sort and driftsort replaces std::stable_sort, so any query doing an ORDER BY or merge is directly affected. Four of the five flagged queries improve in line with a faster sort: clickbench Q22 (-25%) and Q23 (-33%), and TPC-H Q7 (-35%) and Q20 (-50%, though that one's measurements were noisy). The one regression, clickbench Q13 (+11%), runs against the trend and compounds an existing degradation already visible on master, so it's worth a closer look. clickbench🔴 1 regressed · 🟢 2 improved Flagged queries (3 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🟢 2 improved Flagged queries (2 of 22)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. Debug info
|
…_less `swap_if_less` performs a branchless conditional swap of two elements. In the no-swap case `left_swap == va`, so `bit_copy(va, left_swap)` copies an element onto itself, which is overlapping (in fact identical) storage where `std::memcpy` is undefined behavior. Since `::sort` now routes every contiguous call through `ipnsort`, this is hit on ordinary inputs. Use an overlap-safe `std::memmove` for that one copy, matching the original Rust `ptr::copy` (the non-overlapping `bit_copy`/`memcpy` is kept everywhere it is provably non-aliasing). For a fixed `sizeof(T)` the compiler lowers the `memmove` to the same load/store as `memcpy`, so the small-sort network stays branchless. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The driftsort scratch buffer is reinterpreted as `T *` and accessed as `T &`, so it must satisfy `alignof(T)`. The stack buffer uses `alignas(T)`, but the heap path allocated with plain `new std::byte[]`, which only guarantees the default new alignment. For an over-aligned trivially-copyable `T` (e.g. `alignas(64)`) on the heap path this produced under-aligned `T` references. Allocate the heap scratch with aligned `operator new[]` plus a matching aligned `operator delete[]` deleter. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add committed regression coverage for `::sort` (ipnsort) and `::stableSort` (driftsort) in `base/base/sort.h`, comparing against `std::sort` / `std::stable_sort` over sorted, reverse, duplicate-heavy, organ-pipe, all-equal and random inputs across many sizes (small-sort networks, stack-scratch and heap-scratch paths). It checks: - equality with `std::sort` for trivially-copyable types and a custom comparator; - `::stableSort` matches `std::stable_sort` exactly, including equal-key order (stability); - a non-default-constructible trivially-copyable element type (the `Uninit` scratch path); - an `alignas(64)` element type whose comparator asserts it never sees an under-aligned reference (the heap-scratch alignment path); - a non-trivially-copyable element type (the move-based and `std::stable_sort` fallbacks). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@groeneai, fix integration tests, test_profile_max_sessions_for_user which could wrongly assert the unstable sort order result, and send a PR with the fix. |
|
@alexey-milovidov fix PR #106730.
|
The move-based Hoare partition (`partition_hoare_branchy_cyclic`, used by `::sort` for non-trivially-copyable types) saves an in-transit element in a `Uninit<T>` union while shuffling. If the `is_less` comparator throws while the gap is engaged, unwinding runs the union's trivial destructor, which never calls `~T`, leaking the saved element and leaving the range one value short. This is a regression versus the `pdqsort` path it replaces. Add a `GapRestorer` RAII guard (mirroring `GapGuard` used by the trivially-copyable block partition) that, on unwind, moves the saved element back into the hole and destroys it. Add a unit test that throws from the comparator at every comparison index in turn and asserts nothing is leaked. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under `Integration tests (amd_asan_ubsan, flaky)` praktika runs pytest-xdist with `--dist=each` and many workers concurrently. All workers share the runner host filesystem, so the test must scope every persistent path by `PYTEST_XDIST_WORKER`. The previous patch only sorted `ls` output on the client side; that addresses sort-order assertions but does nothing for concurrent workers stomping `/var/lib/clickhouse`, the shared `test/` directory, and the temporary `a.txt` next to `tests/integration/`. Changes: - `DisksClient.default_disk_root_directory` is now per-worker (`/var/lib/clickhouse-<gwN>`). - `getLocalDisksClient` wipes the per-worker root on every refresh, terminates any orphan `clickhouse disks` subprocess from the previous iteration, and writes a per-worker `config_<gwN>.xml` whose `<path>` points to that root. - `test_disks_app_interactive_list_directories_default` and `test_disks_app_interactive_test_move_and_write` now use per-worker directory and file names instead of the shared `test`/`a.txt`. - `test_disks_app_interactive_cp_and_read` uses per-worker file and directory names so concurrent workers cannot delete each other's `a.txt` mid-test. Verified locally with `python3 -m ci.praktika run "Integration tests (amd_asan_ubsan, flaky)" --test test_disks_app_interactive` at workers=4 and workers=8: 20/20 and 40/40 pass, vs. the previous 5/20 fail. Also verified single-worker mode (5/5 pass) and `--count 10 --workers 4` (50/50 pass). CI report on the previous flaky-check failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=106730&sha=110670122e495cdcbc23b0fc7aba7a70679fa5c6&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29 Related: ClickHouse#106650 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The recursive `disks list --recursive` output is sorted server-side by `std::sort` in `programs/disks/CommandList.cpp`, and the test asserts dict values against alphabetically ordered lists. If that sort algorithm ever changes (e.g. to an unstable sort that does not happen to produce alphabetical order on equal-key inputs, or to a different ordering), the exact-list assertions in `test_disks_app_interactive_list_directories_default` would fail. Sort listings on the Python side in the `ls` helper. The change is idempotent for the current alphabetical server-side sort, and decouples test assertions from the underlying sort algorithm. Verified locally with `python3 -m ci.praktika run "integration" --test "test_disks_app_interactive/test.py" --count 5` (25 passed in 11s). Related: ClickHouse#106650
|
Continued work on this PR:
The one open review thread — the Pushed |
|
@groeneai, also submit a fix for test_disks_app_interactive in a separate PR, so we can merge that sooner. |
|
By the way, it is still failing in this PR. |
|
@alexey-milovidov separate fix opened: #107262 The still-failing case here is not sort order (that was handled by #106730, whose robustness commits are already on this branch). It is a spurious |
The test drives `clickhouse disks --test-mode` over pipes and treats the `\a\a\a\a` SEPARATOR on stdout as the end-of-command marker. `execute_query` polled once and, if a stderr log line arrived first, read a single stderr line and returned without ever consuming the stdout separator. That left the separator buffered and desynced every subsequent command, which under the slowest sanitizer build (`amd_msan`, where log timing varies more) intermittently surfaced as `TimeoutError: Disks client returned no output` during teardown `rm`. Keep polling until the stdout separator is actually observed, draining stderr separately, and use a generous timeout so msan does not spuriously time out. Surface an unexpected output close as an exception instead of looping. The disks app itself does not use the sort engine changed in this PR (`CommandList` uses `std::sort`, `DiskLocal` uses `fs::remove_all` / `directory_iterator`), so this is a test-protocol robustness fix, not a product change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed
|
|
@alexey-milovidov Confirmed: runner-wide OOM, not a #106650 sort bug. On The rerun on |
|
@groeneai, fix AST fuzzer (amd_debug, targeted): Logical error: Not-ready Set is passed as the second argument for function 'A (STID: 0250-424f). |
|
@alexey-milovidov The AST fuzzer "Not-ready Set" (STID 0250-424f) is not a regression of this PR. It is a pre-existing DDL-validation hole: a key expression like Fix in a separate PR: #107424 - extends the existing key-expression check to also reject IN with a table on the RHS (ORDER BY / PRIMARY KEY / PARTITION BY / skip INDEX). Literal sets like Reproducer (deterministic, plain master): CREATE TABLE t (x Int32) ENGINE = Memory;
CREATE TABLE u (a Int32) ENGINE = MergeTree ORDER BY (a IN t) AS SELECT 1;
-- Logical error: Not-ready Set ... |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 206/222 (92.79%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 1 line(s) · Uncovered code |
|
The only red on the current head ( The flagged queries are run-to-run noise on
I re-ran the |
SQL-level benchmark (synthetic + ClickBench + TPC-H)I ran a clean A/B of this branch against the master sort path to see how the change lands on real queries. Setup. Two builds of
Datasets: ClickBench Key finding
At production parallelism (96 threads) a full Where it shows
CREATE TABLE arrs (a Array(String)) ENGINE = MergeTree ORDER BY tuple();
INSERT INTO arrs SELECT arrayMap(i -> lower(hex(sipHash64(number, i))), range(200)) FROM numbers_mt(100000);
SELECT arraySort(a) FROM arrs FORMAT Null; -- 0.864s -> 0.557s = 1.55xFew-unique strings (sort dominates), and single-thread / large-block sorts that bypass the parallel merge: SELECT s FROM s_fewunique ORDER BY s FORMAT Null; -- 50M, ~100 distinct, 96 threads: 0.213 -> 0.167 = 1.28x
-- one big sort, no parallel merge:
SELECT s FROM s_rand ORDER BY s SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null; -- 50M String: 32.41 -> 27.04 = 1.20x
SELECT s FROM s_fewunique ORDER BY s SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null;-- 50M few-unique: 5.05 -> 3.75 = 1.35x
SELECT x FROM dec_rand ORDER BY x SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null; -- 50M Decimal128: 17.38 -> 15.70 = 1.11x
SELECT x FROM i128_rand ORDER BY x SETTINGS max_threads = 1, max_block_size = 200000000 FORMAT Null; -- 50M Int128: 13.52 -> 12.28 = 1.10xTPC-H SF1 — single-threaded text-column sorts: SELECT l_comment FROM lineitem ORDER BY l_comment SETTINGS max_threads = 1 FORMAT Null; -- 6M: 2.44 -> 2.10 = 1.16x
SELECT o_comment FROM orders ORDER BY o_comment SETTINGS max_threads = 1 FORMAT Null; -- 1.5M: 0.53 -> 0.45 = 1.18x
SELECT p_name FROM part ORDER BY p_name SETTINGS max_threads = 1 FORMAT Null; -- 200k: 0.047 -> 0.035 = 1.34xSingle-thread / large-block summary, min wall-clock seconds: Where it's masked — full
|

Port the Rust standard library's current sort implementations to C++ and use them for ClickHouse's general-purpose sorts:
::sort(base/base/sort.h), replacingpdqsort.::stableSort, replacingstd::stable_sortcall sites.Both are vendored as header-only contribs (
contrib/ipnsort,contrib/driftsort), dual-licensed MIT / Apache-2.0, the same way ascontrib/pdqsort. They operate on raw pointers, so the always-contiguous iterators are converted withstd::to_address; the rare non-contiguous random-access iterators keep usingpdqsort/std::stable_sort.trySort,nth_elementandpartial_sortare unchanged.ipnsort
ipnsortis pattern-defeating quicksort with a glidesort pivot, a sorting-network small-sort, and a heapsort fallback for theO(n log n)worst case. It detects already-sorted and reverse-sorted runs and handles them inO(n), whichpdqsort(as used directly by::sort, without thetrySortpre-pass) does not always get for free.The C++ port runs the full algorithm for trivially-copyable types (relocating bitwise via
memcpy, matching the Rust original) and a correct move-based path (insertion sort + Hoare partition + heapsort) for other types.One change from the upstream partition was made to keep throughput high for ClickHouse's most common pattern — sorting a permutation through an indirect, cache-missing comparator. The upstream in-place branchless Lomuto partition has a loop-carried dependency through the running less-than count, which serializes the comparator's load latency. This port evaluates the comparator over a block of elements in an independent (ILP-friendly) pre-pass and then performs the cheap serial rearrangement from the precomputed predicates, matching block-quicksort throughput on expensive comparators while keeping the single-pass speed on cheap ones.
driftsort
driftsortis a powersort merge tree over detected runs with a stable scratch-based quicksort for unsorted runs and a branchless bidirectional-merge small-sort. It allocates up ton/2extra elements (a 4 KiB stack buffer for small inputs, heap otherwise). The full algorithm runs for trivially-copyable types; other types fall back tostd::stable_sort.Measurements (10M elements, single thread, aarch64, best-of)
ipnsort vs the previous
::sortengine on the workloads::sortactually sees:UInt64: ~3x faster thanpdqsort.O(n)(e.g. reverseUInt64~8 ms vs hundreds of ms for a naive quicksort), removing the worst case.pdqsort.driftsort vs
std::stable_sort:Correctness
The ports were validated against
std::sort/std::stable_sortover thousands of randomized cases (sizes 0..2M, many duplicate densities, sorted / reverse / organ-pipe / nearly-sorted inputs, trivially-copyable and non-trivially-copyable element types) under AddressSanitizer with zero mismatches; driftsort additionally verified stable (equal-key order matchesstd::stable_sortexactly).Relation to #106618
This supersedes the
blqsortapproach in #106618. The CI performance comparison there showedblqsortimproving string / multi-column sorts but regressingORDER BY toDecimal128(...) DESCby ~4.9x (a reverse-sorted worst case). ipnsort keeps the string-sort improvements and does not have that regression.Related: #106618
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Use
ipnsort(unstable) anddriftsort(stable) — C++ ports of the Rust standard library's sort implementations — for the general-purpose comparison sorts. Speeds upORDER BYon non-numeric columns and stable sorts, and removes a worst case on reverse-sorted input.Documentation entry for user-facing changes
🤖 Generated with Claude Code