Add a SIMD-friendly memcmp to glibc-compatibility#107563
Conversation
Provide a ClickHouse replacement for the libc `memcmp`, built and linked into `GLIBC_COMPATIBILITY` builds alongside the existing custom `memcpy`. The default `memcmp` in some libc implementations (notably `musl`) is a naive byte-by-byte loop that the compiler cannot auto-vectorize because of the early-exit on mismatch. For typical ClickHouse workloads (hash aggregation / join on long string keys) that loop dominates the profile. `inline_memcmp` compares 16 bytes at a time using SSE2 on x86_64 and NEON on aarch64, with word-at-a-time and byte-at-a-time tails. It is exposed through `memcmp.h` for inlining at call sites and as a strong `extern "C"` `memcmp` symbol in `memcmp.cpp`, mirroring the structure of the sibling `memcpy` target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Let's tag as ci-performance to force AMD performance tests to run and see. We can always implement dynamic dispatch if needed and compare, or only enable it for musl builds |
| endif () | ||
|
|
||
| target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY}) | ||
| target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY} ${MEMCMP_LIBRARY}) |
There was a problem hiding this comment.
This makes the new memcmp interposer part of every GLIBC_COMPATIBILITY build. In the current tree base/glibc-compatibility is not added for USE_MUSL, so this primarily replaces glibc's IFUNC-selected memcmp in regular glibc compatibility builds, and the PR description already measures that path as slower. Since string-key grouping and joins hit this path heavily, this turns a musl-specific optimization into a glibc hot-path regression. Please gate the interposer to the libc/build mode that needs it, or add a CMake option defaulting off for glibc builds and keep the inline helper separate.
We're aiming to have musl as default build. |
| * It matches the C memcmp contract: returns the signed difference of the first | ||
| * pair of differing bytes (as unsigned char), or 0 if the ranges are equal. | ||
| */ | ||
| inline int inline_memcmp(const void * vl, const void * vr, size_t n) |
There was a problem hiding this comment.
This replaces the process-wide memcmp, including ordering-sensitive callers, but the PR only mentions external fuzzing and adds no in-tree regression guard. That leaves future edits to the SSE2/NEON mask logic, tail path, or signed-difference behavior untested in CI. Please add a focused gtest or fuzzer that compares inline_memcmp against libc for lengths around 0..32 and larger blocks, all mismatch positions, unaligned offsets, and both sign directions on the supported architectures.
| endif () | ||
|
|
||
| target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY}) | ||
| target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY} ${MEMCMP_LIBRARY}) |
There was a problem hiding this comment.
Linking memcmp only through global-libs leaves the replacement dependent on archive extraction. The existing memcpy interposer has to be added as $<TARGET_OBJECTS:memcpy> in clickhouse_add_executable for ENABLE_THINLTO, because ThinLTO can introduce libcalls after the archive has been scanned. memcmp has the same risk: source or optimized comparisons may become a late memcmp libcall and then bind to libc instead of this strong symbol. Please force the memcmp object into ThinLTO executables the same way, or use an equivalent whole-archive/object-library link path.

Split out of #97452 (
Build musl from sources): thememcmpspecialization is independent of the rest of that PR, so it lives on its own here.Adds a ClickHouse replacement for the libc
memcmp, built and linked intoGLIBC_COMPATIBILITYbuilds alongside the existing custommemcpy(base/glibc-compatibility/memcpy).inline_memcmpcompares 16 bytes at a time using SSE2 on x86_64 and NEON on aarch64, with word-at-a-time and byte-at-a-time tails. It is exposed throughmemcmp.hfor inlining at call sites and as a strongextern "C"memcmpsymbol inmemcmp.cpp.Motivation: the default
memcmpin some libc implementations (notablymusl) is a naive byte-by-byte loop that the compiler cannot auto-vectorize because of the early-exit on mismatch. For typical ClickHouse workloads (hash aggregation / join on long string keys) that loop dominates the profile.Correctness was fuzzed against the reference memcmp (602k cases each on x86_64/SSE2 and aarch64/NEON under qemu), covering all branches, unaligned offsets, and sign cases.
Draft, because there is a tradeoff to discuss: on glibc builds this overrides glibc's IFUNC-dispatched AVX2 memcmp, which is faster than this SSE2/NEON version (measured ~0.22s vs ~0.26s on group_by_multiple_strings). The win only materializes against musl's naive loop. The linking strategy (always override vs. only when the libc memcmp is slow) is open for review.
Related: #97452
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add a SIMD-friendly
memcmpimplementation (SSE2 on x86_64, NEON on aarch64) for builds that use glibc compatibility.