Add a SIMD-friendly memcmp to glibc-compatibility by thevar1able · Pull Request #107563 · ClickHouse/ClickHouse · GitHub
Skip to content

Add a SIMD-friendly memcmp to glibc-compatibility#107563

Closed
thevar1able wants to merge 1 commit into
masterfrom
custom-memcmp
Closed

Add a SIMD-friendly memcmp to glibc-compatibility#107563
thevar1able wants to merge 1 commit into
masterfrom
custom-memcmp

Conversation

@thevar1able

@thevar1able thevar1able commented Jun 15, 2026

Copy link
Copy Markdown
Member

Split out of #97452 (Build musl from sources): the memcmp specialization is independent of the rest of that PR, so it lives on its own here.

Adds a ClickHouse replacement for the libc memcmp, built and linked into GLIBC_COMPATIBILITY builds alongside the existing custom memcpy (base/glibc-compatibility/memcpy). inline_memcmp compares 16 bytes at a time using SSE2 on x86_64 and NEON on aarch64, with word-at-a-time and byte-at-a-time tails. It is exposed through memcmp.h for inlining at call sites and as a strong extern "C" memcmp symbol in memcmp.cpp.

Motivation: the default memcmp in some libc implementations (notably musl) is a naive byte-by-byte loop that the compiler cannot auto-vectorize because of the early-exit on mismatch. For typical ClickHouse workloads (hash aggregation / join on long string keys) that loop dominates the profile.

Correctness was fuzzed against the reference memcmp (602k cases each on x86_64/SSE2 and aarch64/NEON under qemu), covering all branches, unaligned offsets, and sign cases.

Draft, because there is a tradeoff to discuss: on glibc builds this overrides glibc's IFUNC-dispatched AVX2 memcmp, which is faster than this SSE2/NEON version (measured ~0.22s vs ~0.26s on group_by_multiple_strings). The win only materializes against musl's naive loop. The linking strategy (always override vs. only when the libc memcmp is slow) is open for review.

Related: #97452

Changelog category (leave one):

  • Build/Testing/Packaging Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add a SIMD-friendly memcmp implementation (SSE2 on x86_64, NEON on aarch64) for builds that use glibc compatibility.

Provide a ClickHouse replacement for the libc `memcmp`, built and linked
into `GLIBC_COMPATIBILITY` builds alongside the existing custom `memcpy`.

The default `memcmp` in some libc implementations (notably `musl`) is a
naive byte-by-byte loop that the compiler cannot auto-vectorize because of
the early-exit on mismatch. For typical ClickHouse workloads (hash
aggregation / join on long string keys) that loop dominates the profile.

`inline_memcmp` compares 16 bytes at a time using SSE2 on x86_64 and NEON
on aarch64, with word-at-a-time and byte-at-a-time tails. It is exposed
through `memcmp.h` for inlining at call sites and as a strong `extern "C"`
`memcmp` symbol in `memcmp.cpp`, mirroring the structure of the sibling
`memcpy` target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-build Pull request with build/testing/packaging improvement label Jun 15, 2026
@Algunenano

Copy link
Copy Markdown
Member

Draft, because there is a tradeoff to discuss: on glibc builds this overrides glibc's IFUNC-dispatched AVX2 memcmp, which is faster than this SSE2/NEON version (measured ~0.22s vs ~0.26s on group_by_multiple_strings). The win only materializes against musl's naive loop. The linking strategy (always override vs. only when the libc memcmp is slow) is open for review.

Let's tag as ci-performance to force AMD performance tests to run and see. We can always implement dynamic dispatch if needed and compare, or only enable it for musl builds

@thevar1able thevar1able added the ci-performance performance only label Jun 15, 2026
endif ()

target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY})
target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY} ${MEMCMP_LIBRARY})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the new memcmp interposer part of every GLIBC_COMPATIBILITY build. In the current tree base/glibc-compatibility is not added for USE_MUSL, so this primarily replaces glibc's IFUNC-selected memcmp in regular glibc compatibility builds, and the PR description already measures that path as slower. Since string-key grouping and joins hit this path heavily, this turns a musl-specific optimization into a glibc hot-path regression. Please gate the interposer to the libc/build mode that needs it, or add a CMake option defaulting off for glibc builds and keep the inline helper separate.

@thevar1able

thevar1able commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

@Algunenano

enable it for musl builds

We're aiming to have musl as default build.

* It matches the C memcmp contract: returns the signed difference of the first
* pair of differing bytes (as unsigned char), or 0 if the ranges are equal.
*/
inline int inline_memcmp(const void * vl, const void * vr, size_t n)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces the process-wide memcmp, including ordering-sensitive callers, but the PR only mentions external fuzzing and adds no in-tree regression guard. That leaves future edits to the SSE2/NEON mask logic, tail path, or signed-difference behavior untested in CI. Please add a focused gtest or fuzzer that compares inline_memcmp against libc for lengths around 0..32 and larger blocks, all mismatch positions, unaligned offsets, and both sign directions on the supported architectures.

endif ()

target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY})
target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY} ${MEMCMP_LIBRARY})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking memcmp only through global-libs leaves the replacement dependent on archive extraction. The existing memcpy interposer has to be added as $<TARGET_OBJECTS:memcpy> in clickhouse_add_executable for ENABLE_THINLTO, because ThinLTO can introduce libcalls after the archive has been scanned. memcmp has the same risk: source or optimized comparisons may become a late memcmp libcall and then bind to libc instead of this strong symbol. Please force the memcmp object into ThinLTO executables the same way, or use an equivalent whole-archive/object-library link path.

@thevar1able

Copy link
Copy Markdown
Member Author

@thevar1able thevar1able deleted the custom-memcmp branch June 15, 2026 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-performance performance only pr-build Pull request with build/testing/packaging improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants