thevar1able · 2026-06-15T19:53:26Z

Split out of #97452 (Build musl from sources): the memcmp specialization is independent of the rest of that PR, so it lives on its own here.

Adds a ClickHouse replacement for the libc memcmp, built and linked into GLIBC_COMPATIBILITY builds alongside the existing custom memcpy (base/glibc-compatibility/memcpy). inline_memcmp compares 16 bytes at a time using SSE2 on x86_64 and NEON on aarch64, with word-at-a-time and byte-at-a-time tails. It is exposed through memcmp.h for inlining at call sites and as a strong extern "C" memcmp symbol in memcmp.cpp.
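To make the description above concrete, here is a minimal sketch of a 16-bytes-at-a-time compare with word and byte tails. It is not the code from this PR: the name `sketch_memcmp` is hypothetical, the NEON branch simply falls back to the scalar tail to locate the differing byte, and `__builtin_ctz` assumes GCC or Clang.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__)
#    include <emmintrin.h>
#elif defined(__aarch64__)
#    include <arm_neon.h>
#endif

/// Hypothetical illustration, not the PR's inline_memcmp.
inline int sketch_memcmp(const void * vl, const void * vr, size_t n)
{
    const auto * l = static_cast<const unsigned char *>(vl);
    const auto * r = static_cast<const unsigned char *>(vr);

#if defined(__SSE2__)
    while (n >= 16)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(l));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(r));
        /// Bit i of the mask is set when byte i is equal in both blocks.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)));
        if (mask != 0xFFFFu)
        {
            /// Index of the first differing byte (GCC/Clang builtin).
            unsigned idx = static_cast<unsigned>(__builtin_ctz(~mask & 0xFFFFu));
            return static_cast<int>(l[idx]) - static_cast<int>(r[idx]);
        }
        l += 16; r += 16; n -= 16;
    }
#elif defined(__aarch64__)
    while (n >= 16)
    {
        uint8x16_t eq = vceqq_u8(vld1q_u8(l), vld1q_u8(r));
        /// The minimum lane is 0xFF only when all 16 bytes are equal.
        if (vminvq_u8(eq) != 0xFF)
            break; /// Let the scalar tail below locate the differing byte.
        l += 16; r += 16; n -= 16;
    }
#endif

    /// Word-at-a-time tail: skip equal 8-byte words.
    while (n >= 8)
    {
        uint64_t a;
        uint64_t b;
        std::memcpy(&a, l, 8);
        std::memcpy(&b, r, 8);
        if (a != b)
            break;
        l += 8; r += 8; n -= 8;
    }

    /// Byte-at-a-time tail: compare as unsigned char, per the C contract.
    for (; n; --n, ++l, ++r)
        if (*l != *r)
            return static_cast<int>(*l) - static_cast<int>(*r);
    return 0;
}
```

The unaligned loads (`_mm_loadu_si128`, `vld1q_u8`) are what let the same loop run at any offset without an alignment prologue.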

Motivation: the default memcmp in some libc implementations (notably musl) is a naive byte-by-byte loop that the compiler cannot auto-vectorize because of the early-exit on mismatch. For typical ClickHouse workloads (hash aggregation / join on long string keys) that loop dominates the profile.
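For reference, the kind of loop described above looks roughly like the following. This is an illustrative reconstruction under the hypothetical name `naive_memcmp`, not a copy of any libc source. The trip count depends on the data, and a compiler generally cannot load bytes past the first mismatch speculatively, which is a common reason such loops stay scalar.

```cpp
#include <cstddef>

/// Hypothetical byte-by-byte reference loop with an early exit on mismatch.
inline int naive_memcmp(const void * vl, const void * vr, size_t n)
{
    const auto * l = static_cast<const unsigned char *>(vl);
    const auto * r = static_cast<const unsigned char *>(vr);
    /// Advance while bytes match; stop at the first difference or at the end.
    for (; n && *l == *r; --n, ++l, ++r) {}
    /// Both operands are promoted to int, so the result is a signed difference.
    return n ? *l - *r : 0;
}
```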

Correctness was fuzzed against the reference memcmp (602k cases each on x86_64/SSE2 and aarch64/NEON under qemu), covering all branches, unaligned offsets, and sign cases.

Draft, because there is a tradeoff to discuss: on glibc builds this overrides glibc's IFUNC-dispatched AVX2 memcmp, which is faster than this SSE2/NEON version (measured ~0.22s vs ~0.26s on group_by_multiple_strings). The win only materializes against musl's naive loop. The linking strategy (always override vs. only when the libc memcmp is slow) is open for review.

Related: #97452

Changelog category (leave one):

Build/Testing/Packaging Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add a SIMD-friendly memcmp implementation (SSE2 on x86_64, NEON on aarch64) for builds that use glibc compatibility.

Provide a ClickHouse replacement for the libc `memcmp`, built and linked into `GLIBC_COMPATIBILITY` builds alongside the existing custom `memcpy`.

The default `memcmp` in some libc implementations (notably `musl`) is a naive byte-by-byte loop that the compiler cannot auto-vectorize because of the early exit on mismatch. For typical ClickHouse workloads (hash aggregation / join on long string keys) that loop dominates the profile.

`inline_memcmp` compares 16 bytes at a time using SSE2 on x86_64 and NEON on aarch64, with word-at-a-time and byte-at-a-time tails. It is exposed through `memcmp.h` for inlining at call sites and as a strong `extern "C"` `memcmp` symbol in `memcmp.cpp`, mirroring the structure of the sibling `memcpy` target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-06-15T19:54:01Z

Workflow [PR], commit [9516880]

AI Review

Summary

This PR adds a GLIBC_COMPATIBILITY replacement for libc memcmp, with SSE2 and NEON implementations. I do not think it is ready to merge because the current wiring turns a musl-motivated optimization into an unconditional glibc hot-path replacement, does not follow the existing ThinLTO object-forcing pattern used by memcpy, and lacks committed regression coverage for the new process-wide primitive.

Missing context / blind spots

⚠️ The intended integration with the related musl-from-source work is outside this PR. In the current tree base/glibc-compatibility is skipped for USE_MUSL, so the review cannot validate that this interposer is actually limited to the slow-libc target that motivated it.

Findings

❌ Blockers

[base/glibc-compatibility/CMakeLists.txt:54] The new memcmp interposer is linked into every GLIBC_COMPATIBILITY build. In the current tree that means regular glibc compatibility builds replace glibc's IFUNC-selected memcmp, while the PR description already reports that glibc path as slower. Since string-key grouping and joins hit memcmp heavily, this is a hot-path regression rather than a safe musl-specific optimization.
Suggested fix: gate the interposer to the libc/build mode that needs it, or add a CMake option defaulting off for glibc builds and keep inline_memcmp available separately.

⚠️ Majors

[base/glibc-compatibility/CMakeLists.txt:54] Linking memcmp only through global-libs leaves the strong replacement dependent on static archive extraction. The existing memcpy interposer is forced into ThinLTO executables with $<TARGET_OBJECTS:memcpy> because ThinLTO can create libcalls after archive scanning; memcmp has the same replacement contract and should not rely on pre-existing undefined references.
Suggested fix: force the memcmp object into ENABLE_THINLTO executables the same way, or use an equivalent whole-archive/object-library link path.

Tests

⚠️ [base/glibc-compatibility/memcmp/memcmp.h:28] The PR mentions external fuzzing, but adds no in-tree guard for a process-wide memcmp replacement. Please add a focused gtest or fuzzer comparing inline_memcmp with libc across 0..32 and larger sizes, all mismatch positions, unaligned offsets, and both sign directions on x86_64 and aarch64.
⚠️ The performance evidence is insufficient for the promised rollout. The PR body includes a glibc measurement that is slower, but no committed or linked before/after benchmark for the intended slow-libc build mode. Please provide the smallest benchmark that demonstrates the target win and confirms no regression in the build modes where the interposer is enabled.

Final Verdict

Status: ❌ Block

Minimum required actions: gate the global interposer so glibc builds do not regress, make the ThinLTO link path force the replacement symbol reliably, and add in-tree correctness coverage plus target performance evidence.

Algunenano · 2026-06-15T19:57:13Z

> Draft, because there is a tradeoff to discuss: on glibc builds this overrides glibc's IFUNC-dispatched AVX2 memcmp, which is faster than this SSE2/NEON version (measured ~0.22s vs ~0.26s on group_by_multiple_strings). The win only materializes against musl's naive loop. The linking strategy (always override vs. only when the libc memcmp is slow) is open for review.

Let's tag it as ci-performance to force the AMD performance tests to run and see. We can always implement dynamic dispatch if needed and compare, or only enable it for musl builds.
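A rough sketch of what the dynamic-dispatch option could look like, assuming GCC or Clang on x86 for the CPU check. All names here (`memcmp_baseline`, `memcmp_wide`, `select_memcmp`, `dispatched_memcmp`) are hypothetical, and both implementations are stand-ins that forward to `std::memcmp` so the example stays self-contained.

```cpp
#include <cstddef>
#include <cstring>

using memcmp_fn = int (*)(const void *, const void *, size_t);

/// Stand-in for a baseline (e.g. SSE2/NEON) implementation. Hypothetical.
inline int memcmp_baseline(const void * l, const void * r, size_t n)
{
    return std::memcmp(l, r, n);
}

/// Stand-in for a wider (e.g. AVX2) implementation. Hypothetical.
inline int memcmp_wide(const void * l, const void * r, size_t n)
{
    return std::memcmp(l, r, n);
}

/// Pick an implementation once, based on what the running CPU supports.
inline memcmp_fn select_memcmp()
{
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2"))
        return memcmp_wide;
#endif
    return memcmp_baseline;
}

/// The function-local static is initialized on first use, so the CPU check
/// runs once per process.
inline int dispatched_memcmp(const void * l, const void * r, size_t n)
{
    static const memcmp_fn impl = select_memcmp();
    return impl(l, r, n);
}
```

A real interposer for the global `memcmp` symbol would need more care than this, since the symbol can be called before static initialization is safe; the sketch only illustrates the selection step.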

clickhouse-gh · 2026-06-15T19:59:50Z

    endif ()

-    target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY})
+    target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY} ${MEMCMP_LIBRARY})


This makes the new memcmp interposer part of every GLIBC_COMPATIBILITY build. In the current tree base/glibc-compatibility is not added for USE_MUSL, so this primarily replaces glibc's IFUNC-selected memcmp in regular glibc compatibility builds, and the PR description already measures that path as slower. Since string-key grouping and joins hit this path heavily, this turns a musl-specific optimization into a glibc hot-path regression. Please gate the interposer to the libc/build mode that needs it, or add a CMake option defaulting off for glibc builds and keep the inline helper separate.

thevar1able · 2026-06-15T19:59:50Z

@Algunenano

> enable it for musl builds

We're aiming to have musl as the default build.

clickhouse-gh · 2026-06-15T19:59:50Z

+  * It matches the C memcmp contract: returns the signed difference of the first
+  * pair of differing bytes (as unsigned char), or 0 if the ranges are equal.
+  */
+inline int inline_memcmp(const void * vl, const void * vr, size_t n)


This replaces the process-wide memcmp, including ordering-sensitive callers, but the PR only mentions external fuzzing and adds no in-tree regression guard. That leaves future edits to the SSE2/NEON mask logic, tail path, or signed-difference behavior untested in CI. Please add a focused gtest or fuzzer that compares inline_memcmp against libc for lengths around 0..32 and larger blocks, all mismatch positions, unaligned offsets, and both sign directions on the supported architectures.
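The requested coverage could be sketched as a small differential check like the one below. It is a standalone illustration rather than a gtest from the tree: `candidate_memcmp` is a hypothetical word-at-a-time stand-in for the function under test, `count_mismatches` is a hypothetical helper, and only the sign of the result is compared, because that is all the C standard guarantees for libc `memcmp`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using memcmp_fn = int (*)(const void *, const void *, size_t);

inline int sign(int v) { return (v > 0) - (v < 0); }

/// Hypothetical function under test: 8-byte words, then a byte tail.
inline int candidate_memcmp(const void * vl, const void * vr, size_t n)
{
    const auto * l = static_cast<const unsigned char *>(vl);
    const auto * r = static_cast<const unsigned char *>(vr);
    for (; n >= 8; l += 8, r += 8, n -= 8)
    {
        uint64_t x;
        uint64_t y;
        std::memcpy(&x, l, 8);
        std::memcpy(&y, r, 8);
        if (x != y)
            break;
    }
    for (; n; --n, ++l, ++r)
        if (*l != *r)
            return *l < *r ? -1 : 1;
    return 0;
}

/// Returns how many times the candidate disagrees in sign with std::memcmp.
inline size_t count_mismatches(memcmp_fn candidate, size_t max_len = 80, size_t max_offset = 3)
{
    size_t failures = 0;
    std::vector<unsigned char> a(max_len + max_offset);
    std::vector<unsigned char> b(max_len + max_offset);
    for (size_t off_a = 0; off_a <= max_offset; ++off_a)
    for (size_t off_b = 0; off_b <= max_offset; ++off_b)
    for (size_t len = 0; len <= max_len; ++len)
    {
        unsigned char * pa = a.data() + off_a; /// Unaligned start on the left side.
        unsigned char * pb = b.data() + off_b; /// Unaligned start on the right side.
        for (size_t i = 0; i < len; ++i)
            pa[i] = pb[i] = static_cast<unsigned char>(i * 31 + 7);

        failures += candidate(pa, pb, len) != 0; /// Equal ranges must give 0.

        for (size_t pos = 0; pos < len; ++pos)
        {
            pb[pos] ^= 0x80; /// Flip the high bit to catch signed-char bugs.
            /// Swapping the arguments exercises both sign directions.
            failures += sign(candidate(pa, pb, len)) != sign(std::memcmp(pa, pb, len));
            failures += sign(candidate(pb, pa, len)) != sign(std::memcmp(pb, pa, len));
            pb[pos] ^= 0x80; /// Restore before moving the mismatch position.
        }
    }
    return failures;
}
```

Calling `count_mismatches(candidate_memcmp)` and expecting zero covers lengths 0..80, every mismatch position, start offsets 0..3 on each side, and both argument orders. The same helper can be pointed at any function with the `memcmp` signature.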

clickhouse-gh · 2026-06-15T19:59:50Z

    endif ()

-    target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY})
+    target_link_libraries(global-libs INTERFACE glibc-compatibility ${MEMCPY_LIBRARY} ${MEMCMP_LIBRARY})


Linking memcmp only through global-libs leaves the replacement dependent on archive extraction. The existing memcpy interposer has to be added as $<TARGET_OBJECTS:memcpy> in clickhouse_add_executable for ENABLE_THINLTO, because ThinLTO can introduce libcalls after the archive has been scanned. memcmp has the same risk: source or optimized comparisons may become a late memcmp libcall and then bind to libc instead of this strong symbol. Please force the memcmp object into ThinLTO executables the same way, or use an equivalent whole-archive/object-library link path.

thevar1able · 2026-06-15T20:05:21Z

clickhouse-gh Bot added the pr-build label (Pull request with build/testing/packaging improvement) Jun 15, 2026

thevar1able added the ci-performance label (performance only) Jun 15, 2026

clickhouse-gh Bot reviewed Jun 15, 2026


thevar1able closed this Jun 15, 2026

thevar1able deleted the custom-memcmp branch June 15, 2026 20:06

Add a SIMD-friendly memcmp to glibc-compatibility #107563

thevar1able wants to merge 1 commit into master from custom-memcmp
