Replace hand-written AVX-512 intrinsics in `arrayDotProduct` with platform-independent auto-vectorizable loops by fastio · Pull Request #101571 · ClickHouse/ClickHouse · GitHub
Skip to content

Replace hand-written AVX-512 intrinsics in arrayDotProduct with platform-independent auto-vectorizable loops#101571

Merged
Ergus merged 10 commits into
ClickHouse:masterfrom
fastio:feature-simd-dot-product
Apr 10, 2026
Merged

Replace hand-written AVX-512 intrinsics in arrayDotProduct with platform-independent auto-vectorizable loops#101571
Ergus merged 10 commits into
ClickHouse:masterfrom
fastio:feature-simd-dot-product

Conversation

@fastio

@fastio fastio commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Replace hand-written AVX-512 intrinsics in arrayDotProduct with platform-independent auto-vectorizable loops that the compiler can lower to optimal SIMD on any target.

The old code only had an AVX-512F fast path (accumulateCombine with _mm512_fmadd_ps/pd). The new implementation uses MULTITARGET_FUNCTION_X86_V4_V3 to generate x86_64_v4 (AVX-512), x86_64_v3 (AVX2+FMA), and a default (SSE2 / NEON) variant from a single source loop. Manual unrolling with 128/sizeof(T) independent accumulators breaks FP dependency chains so the compiler emits FMA across all targets.

Also fixes a latent off-by-one in the old SIMD loop condition (i + n < count instead of i + n <= count), which caused arrays whose size was an exact multiple of the SIMD width to fall through entirely to the scalar tail.

Round 2 fixes (review feedback from @Ergus and clickhouse-gh[bot]):

  • Fix undefined behavior when arrays are empty: replace &data[offset] with data.data() + offset to avoid out-of-bounds subscript on zero-length vectors.
  • Fix accumulator count comment: explain FMA latency hiding rationale instead of incorrect register-width calculation.
  • Unify const-left scalar path with non-const path: use the same multi-accumulator structure for consistency, with a comment noting this branch only handles mixed-type combinations.
  • Add regression test for empty array inputs.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Replace hand-written AVX-512 intrinsics in arrayDotProduct with platform-independent auto-vectorizable loops, adding AVX2 and ARM NEON support.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

No user-facing behavior changes. Same function, same results, broader SIMD coverage.

Version info

  • Merged into: 26.4.1.812

…tform-independent auto-vectorizable loops

Use `MULTITARGET_FUNCTION_X86_V4_V3` to compile a simple dot product kernel
for Default (SSE2/NEON), x86_64_v3 (AVX2), and x86_64_v4 (AVX-512) targets.
The kernel uses manually-unrolled independent accumulators (128/sizeof(T))
to break floating-point dependency chains, enabling auto-vectorization.

@IRainman IRainman left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use std::accumulate in many places here. ;)

@fastio

fastio commented Apr 2, 2026

Copy link
Copy Markdown
Contributor Author

@Ergus Ergus left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good.

But please; clarify the comment about the else path in the new code const path L134. All the others are minor details.

Comment thread src/Functions/array/arrayDotProduct.cpp
Comment thread src/Functions/array/arrayDotProduct.cpp
Kernel::template accumulate<ResultType>(states[j], static_cast<ResultType>(data_x[current_offset + i + j]), static_cast<ResultType>(data_y[current_offset + i + j]));
/// SIMD-optimized path for same-type floating point
#if USE_MULTITARGET_CODE
if (isArchSupported(TargetArch::x86_64_v4))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor: The idiomatic pattern of hoisting the dispatch outside the loop would look marginally cleaner, but it makes no measurable difference. I should not have listed it as a "major".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isArchSupported is a compile-time constant inside each clone, so no runtime cost. Will revisit if the structure gets more complex.

Comment thread src/Functions/array/arrayDotProduct.cpp Outdated

constexpr size_t n = is_float32 ? 16 : 8;

for (; i + n < i_max; i += n)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you improved this apparent "issue" that was in the old code:

For example for Float32 array of exactly 16 elements 0 + 16 < 16 is false.

So the SIMD loop never executes: all 16 elements fall to the scalar tail. Same for any array whose size is an exact multiple of n: the last SIMD-eligible chunk was always handed off to the tail.

I see that you fixed this with the unrolled loop. It worth mentioning also that fix in the PR description.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR description.

Comment thread src/Functions/array/arrayDotProduct.cpp
@Ergus Ergus self-assigned this Apr 2, 2026
@clickhouse-gh

clickhouse-gh Bot commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [0894799]

Summary:


AI Review

Summary

This PR replaces hand-written AVX-512 intrinsics in arrayDotProduct with a multiversion auto-vectorized kernel (x86_64_v4, x86_64_v3, default), fixes previously reported UB around empty arrays by switching to data.data() + offset, and adds regression coverage for empty arrays and aligned chunk boundaries. I did not find any remaining high-confidence correctness, safety, concurrency, or compatibility issues in the current diff.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
No large/binary files
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Apr 2, 2026
@Ergus Ergus added can be tested Allows running workflows for external contributors and removed pr-performance Pull request with some performance improvements labels Apr 2, 2026
@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Apr 2, 2026
Comment thread src/Functions/array/arrayDotProduct.cpp Outdated
/// SIMD-optimized path for same-type floating point
#if USE_MULTITARGET_CODE
if (isArchSupported(TargetArch::x86_64_v4))
result_data[row] = dotProductImpl_x86_64_v4<ResultType>(&data_x[current_offset], &data_y[current_offset], array_size);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Potential UB on empty arrays in release builds.

When array_size == 0, this still forms &data_x[current_offset] / &data_y[current_offset]. For empty arrays or trailing empty rows, current_offset can be equal to data_*.size(), so operator[] is out-of-bounds even though the kernel will not dereference for count == 0.

Please avoid operator[] here and use data() + current_offset (valid for one-past) for both pointers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — using data.data() + offset now.

Comment thread src/Functions/array/arrayDotProduct.cpp Outdated
@fastio

fastio commented Apr 3, 2026

Copy link
Copy Markdown
Contributor Author

@Ergus Done — const-left else branch now has the same unrolled structure with a comment clarifying it only handles mixed-type inputs.

/// Process chunks in vectorized manner
static constexpr size_t VEC_SIZE = 4;
typename Kernel::template State<ResultType> states[VEC_SIZE];
for (; i + VEC_SIZE < array_size; i += VEC_SIZE)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is slightly inaccurate: this branch is also taken when both arguments have the same non-floating type (for example Int32 x Int32), because ResultType is widened and the SIMD condition is false. Could you reword it to avoid implying it is only for mixed-type inputs?

@alexey-milovidov

Copy link
Copy Markdown
Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

SELECT arrayDotProduct([]::Array(UInt8), []::Array(UInt8));

-- Mixed empty/non-empty via table (exercises per-row offset logic)
SELECT arrayDotProduct(x, y) FROM VALUES('x Array(Float32), y Array(Float32)',

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ The new regression test validates empty arrays in the non-const/non-const path, but it does not exercise the const-left execution path that had its own UB fix (data_x.data() replacing &data_x[0]).

Please add a case like:

SELECT arrayDotProduct([]::Array(Float32), y)
FROM VALUES('y Array(Float32)', ([],), ([1, 2, 3]));

This ensures the executeWithLeftArgConst branch is covered for empty constant left arrays.

@Ergus Ergus left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but there are a couple of minor details pending to be solved.

Also a profiling result to ensure that this change doesn't impact performance is very recommended considering that we are relying more in the (black box) compiler capabilities.

Comment thread src/Functions/array/arrayDotProduct.cpp Outdated
Kernel::template accumulate<ResultType>(states[j], static_cast<ResultType>(data_x[i + j]), static_cast<ResultType>(data_y[current_offset + i + j]));
/// Scalar path for mixed types / integer types.
/// This branch is only reached when left and right have different types
/// (e.g. Int32 × Float64) — not a hot path, but we keep the same

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to reach this for same-type non-float inputs like Int32 × Int32? because ResultType is widened in that case and the SIMD condition std::is_same_v<ResultType, LeftType> could be false

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. Fixed the comment to reflect that.

Comment thread tests/queries/0_stateless/04061_array_dot_product_empty_arrays.sql
Comment thread src/Functions/array/arrayDotProduct.cpp Outdated

static constexpr size_t VEC_SIZE = 4;
typename Kernel::template State<ResultType> states[VEC_SIZE];
for (; i + VEC_SIZE < array_size; i += VEC_SIZE)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When array_size % 4 == 0 using this < will make the last chunk go into the tail handling loop. That's correct, but could impact performance a bit. could we check if using <= here is correct??

Adding a test for that specific case is also a good idea.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All minor details addressed — fixed the comment wording, added const-left empty array tests, and fixed the < vs <= loop condition in the scalar fallback.

@alexey-milovidov

Copy link
Copy Markdown
Member

The failures of "Flaky check" in "functions_bad_arguments" will be fixed by #101994.

@alexey-milovidov

Copy link
Copy Markdown
Member

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

@alexey-milovidov

Copy link
Copy Markdown
Member

The flaky failure of 02494_query_cache_http_introspection in this PR's CI is addressed by #102165.

@alexey-milovidov

Copy link
Copy Markdown
Member

The Can't adjust last granule error in CI is a known issue. The fix is in #101641

@clickhouse-gh

clickhouse-gh Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.00% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 86.05% (148/172) | lost baseline coverage: 45 line(s) · Uncovered code

Full report · Diff report

@Ergus Ergus added this pull request to the merge queue Apr 10, 2026
Merged via the queue into ClickHouse:master with commit af58312 Apr 10, 2026
164 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 10, 2026
@rschu1ze

Copy link
Copy Markdown
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants