Replace hand-written AVX-512 intrinsics in arrayDotProduct with platform-independent auto-vectorizable loops#101571
Conversation
…tform-independent auto-vectorizable loops Use `MULTITARGET_FUNCTION_X86_V4_V3` to compile a simple dot product kernel for Default (SSE2/NEON), x86_64_v3 (AVX2), and x86_64_v4 (AVX-512) targets. The kernel uses manually-unrolled independent accumulators (128/sizeof(T)) to break floating-point dependency chains, enabling auto-vectorization.
IRainman
left a comment
There was a problem hiding this comment.
You can use std::accumulate in many places here. ;)
Ergus
left a comment
There was a problem hiding this comment.
Overall looks good.
But please; clarify the comment about the else path in the new code const path L134. All the others are minor details.
| Kernel::template accumulate<ResultType>(states[j], static_cast<ResultType>(data_x[current_offset + i + j]), static_cast<ResultType>(data_y[current_offset + i + j])); | ||
| /// SIMD-optimized path for same-type floating point | ||
| #if USE_MULTITARGET_CODE | ||
| if (isArchSupported(TargetArch::x86_64_v4)) |
There was a problem hiding this comment.
Very minor: The idiomatic pattern of hoisting the dispatch outside the loop would look marginally cleaner, but it makes no measurable difference. I should not have listed it as a "major".
There was a problem hiding this comment.
isArchSupported is a compile-time constant inside each clone, so no runtime cost. Will revisit if the structure gets more complex.
|
|
||
| constexpr size_t n = is_float32 ? 16 : 8; | ||
|
|
||
| for (; i + n < i_max; i += n) |
There was a problem hiding this comment.
I see that you improved this apparent "issue" that was in the old code:
For example for Float32 array of exactly 16 elements 0 + 16 < 16 is false.
So the SIMD loop never executes: all 16 elements fall to the scalar tail. Same for any array whose size is an exact multiple of n: the last SIMD-eligible chunk was always handed off to the tail.
I see that you fixed this with the unrolled loop. It worth mentioning also that fix in the PR description.
There was a problem hiding this comment.
Updated the PR description.
|
Workflow [PR], commit [0894799] Summary: ✅ AI ReviewSummaryThis PR replaces hand-written AVX-512 intrinsics in ClickHouse Rules
Final Verdict
|
| /// SIMD-optimized path for same-type floating point | ||
| #if USE_MULTITARGET_CODE | ||
| if (isArchSupported(TargetArch::x86_64_v4)) | ||
| result_data[row] = dotProductImpl_x86_64_v4<ResultType>(&data_x[current_offset], &data_y[current_offset], array_size); |
There was a problem hiding this comment.
❌ Potential UB on empty arrays in release builds.
When array_size == 0, this still forms &data_x[current_offset] / &data_y[current_offset]. For empty arrays or trailing empty rows, current_offset can be equal to data_*.size(), so operator[] is out-of-bounds even though the kernel will not dereference for count == 0.
Please avoid operator[] here and use data() + current_offset (valid for one-past) for both pointers.
There was a problem hiding this comment.
Fixed — using data.data() + offset now.
|
@Ergus Done — const-left else branch now has the same unrolled structure with a comment clarifying it only handles mixed-type inputs. |
| /// Process chunks in vectorized manner | ||
| static constexpr size_t VEC_SIZE = 4; | ||
| typename Kernel::template State<ResultType> states[VEC_SIZE]; | ||
| for (; i + VEC_SIZE < array_size; i += VEC_SIZE) |
There was a problem hiding this comment.
This comment is slightly inaccurate: this branch is also taken when both arguments have the same non-floating type (for example Int32 x Int32), because ResultType is widened and the SIMD condition is false. Could you reword it to avoid implying it is only for mixed-type inputs?
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
| SELECT arrayDotProduct([]::Array(UInt8), []::Array(UInt8)); | ||
|
|
||
| -- Mixed empty/non-empty via table (exercises per-row offset logic) | ||
| SELECT arrayDotProduct(x, y) FROM VALUES('x Array(Float32), y Array(Float32)', |
There was a problem hiding this comment.
const-left execution path that had its own UB fix (data_x.data() replacing &data_x[0]).
Please add a case like:
SELECT arrayDotProduct([]::Array(Float32), y)
FROM VALUES('y Array(Float32)', ([],), ([1, 2, 3]));This ensures the executeWithLeftArgConst branch is covered for empty constant left arrays.
Ergus
left a comment
There was a problem hiding this comment.
LGTM, but there are a couple of minor details pending to be solved.
Also a profiling result to ensure that this change doesn't impact performance is very recommended considering that we are relying more in the (black box) compiler capabilities.
| Kernel::template accumulate<ResultType>(states[j], static_cast<ResultType>(data_x[i + j]), static_cast<ResultType>(data_y[current_offset + i + j])); | ||
| /// Scalar path for mixed types / integer types. | ||
| /// This branch is only reached when left and right have different types | ||
| /// (e.g. Int32 × Float64) — not a hot path, but we keep the same |
There was a problem hiding this comment.
Is it possible to reach this for same-type non-float inputs like Int32 × Int32? because ResultType is widened in that case and the SIMD condition std::is_same_v<ResultType, LeftType> could be false
There was a problem hiding this comment.
Yes, exactly. Fixed the comment to reflect that.
|
|
||
| static constexpr size_t VEC_SIZE = 4; | ||
| typename Kernel::template State<ResultType> states[VEC_SIZE]; | ||
| for (; i + VEC_SIZE < array_size; i += VEC_SIZE) |
There was a problem hiding this comment.
When array_size % 4 == 0 using this < will make the last chunk go into the tail handling loop. That's correct, but could impact performance a bit. could we check if using <= here is correct??
Adding a test for that specific case is also a good idea.
There was a problem hiding this comment.
All minor details addressed — fixed the comment wording, added const-left empty array tests, and fixed the < vs <= loop condition in the scalar fallback.
|
The failures of "Flaky check" in "functions_bad_arguments" will be fixed by #101994. |
|
The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158 |
|
The flaky failure of |
|
The |
LLVM Coverage Report
Changed lines: 86.05% (148/172) | lost baseline coverage: 45 line(s) · Uncovered code |

Replace hand-written AVX-512 intrinsics in
arrayDotProductwith platform-independent auto-vectorizable loops that the compiler can lower to optimal SIMD on any target.The old code only had an AVX-512F fast path (
accumulateCombinewith_mm512_fmadd_ps/pd). The new implementation usesMULTITARGET_FUNCTION_X86_V4_V3to generate x86_64_v4 (AVX-512), x86_64_v3 (AVX2+FMA), and a default (SSE2 / NEON) variant from a single source loop. Manual unrolling with 128/sizeof(T) independent accumulators breaks FP dependency chains so the compiler emits FMA across all targets.Also fixes a latent off-by-one in the old SIMD loop condition (
i + n < countinstead ofi + n <= count), which caused arrays whose size was an exact multiple of the SIMD width to fall through entirely to the scalar tail.Round 2 fixes (review feedback from @Ergus and clickhouse-gh[bot]):
&data[offset]withdata.data() + offsetto avoid out-of-bounds subscript on zero-length vectors.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Replace hand-written AVX-512 intrinsics in
arrayDotProductwith platform-independent auto-vectorizable loops, adding AVX2 and ARM NEON support.Documentation entry for user-facing changes
No user-facing behavior changes. Same function, same results, broader SIMD coverage.
Version info
26.4.1.812