
fastio · 2026-04-02T00:01:06Z

Replace hand-written AVX-512 intrinsics in arrayDotProduct with platform-independent auto-vectorizable loops that the compiler can lower to optimal SIMD on any target.

The old code only had an AVX-512F fast path (accumulateCombine with _mm512_fmadd_ps/pd). The new implementation uses MULTITARGET_FUNCTION_X86_V4_V3 to generate x86_64_v4 (AVX-512), x86_64_v3 (AVX2+FMA), and a default (SSE2 / NEON) variant from a single source loop. Manual unrolling with 128/sizeof(T) independent accumulators breaks FP dependency chains so the compiler emits FMA across all targets.
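For readers who want to see the shape of such a kernel, here is a minimal standalone sketch. It is not the code from this PR: the name `dotProductSketch` is hypothetical, the `MULTITARGET_FUNCTION_X86_V4_V3` wrapper is left out so the snippet compiles on its own, and only the 128/sizeof(T) accumulator count is taken from the description above. Whether the compiler actually vectorizes the inner loop and emits FMA depends on the target and flags.

```cpp
#include <cstddef>

/// Sketch of a multi-accumulator dot product (hypothetical name, not the PR's kernel).
template <typename T>
T dotProductSketch(const T * x, const T * y, std::size_t count)
{
    /// Accumulator count as stated in the PR description.
    constexpr std::size_t num_acc = 128 / sizeof(T);

    /// Independent partial sums: lane j never depends on lane k != j.
    T acc[num_acc] = {};
    std::size_t i = 0;

    /// Main loop. The bound is `<=` so a final full chunk is handled here.
    for (; i + num_acc <= count; i += num_acc)
        for (std::size_t j = 0; j < num_acc; ++j)
            acc[j] += x[i + j] * y[i + j];

    /// Reduce the partial sums into one value.
    T result = 0;
    for (std::size_t j = 0; j < num_acc; ++j)
        result += acc[j];

    /// Scalar tail for the remaining count % num_acc elements.
    for (; i < count; ++i)
        result += x[i] * y[i];

    return result;
}
```

Because each `acc[j]` is its own chain, the inner loop can be vectorized across `j` without reordering any individual sum, which is why this pattern does not need relaxed floating-point flags.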

Also fixes a latent off-by-one in the old SIMD loop condition (i + n < count instead of i + n <= count). With the old condition, the last full SIMD-width chunk of any array whose size was an exact multiple of the SIMD width was handed to the scalar tail; an array of exactly one SIMD width never entered the SIMD loop at all.
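The effect of the two loop bounds can be checked with a small compile-time helper. This is an illustrative sketch only; `elementsInChunkLoop` is a hypothetical name and does not exist in the codebase.

```cpp
#include <cstddef>

/// Returns how many elements a chunked loop of width `n` covers before the
/// scalar tail takes over, using either the old (`<`) or the fixed (`<=`) bound.
constexpr std::size_t elementsInChunkLoop(std::size_t count, std::size_t n, bool inclusive)
{
    std::size_t i = 0;
    while (inclusive ? (i + n <= count) : (i + n < count))
        i += n;
    return i;
}

/// Exactly one chunk: the old bound never enters the loop.
static_assert(elementsInChunkLoop(16, 16, false) == 0);
static_assert(elementsInChunkLoop(16, 16, true) == 16);

/// Two chunks: the old bound leaves the last full chunk to the tail.
static_assert(elementsInChunkLoop(32, 16, false) == 16);
static_assert(elementsInChunkLoop(32, 16, true) == 32);

/// Sizes that are not a multiple of the width behave the same either way.
static_assert(elementsInChunkLoop(20, 16, false) == 16);
static_assert(elementsInChunkLoop(20, 16, true) == 16);
```

The results are identical for both bounds in terms of correctness; only the amount of work left to the scalar tail differs.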

Round 2 fixes (review feedback from @Ergus and clickhouse-gh[bot]):

Fix undefined behavior when arrays are empty: replace &data[offset] with data.data() + offset to avoid out-of-bounds subscript on zero-length vectors.
Fix accumulator count comment: explain FMA latency hiding rationale instead of incorrect register-width calculation.
Unify const-left scalar path with non-const path: use the same multi-accumulator structure for consistency, with a comment noting this branch only handles mixed-type combinations.
Add regression test for empty array inputs.

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Replace hand-written AVX-512 intrinsics in arrayDotProduct with platform-independent auto-vectorizable loops, adding AVX2 and ARM NEON support.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

No user-facing behavior changes. Same function, same results, broader SIMD coverage.

Version info

Merged into: 26.4.1.812

…tform-independent auto-vectorizable loops Use `MULTITARGET_FUNCTION_X86_V4_V3` to compile a simple dot product kernel for Default (SSE2/NEON), x86_64_v3 (AVX2), and x86_64_v4 (AVX-512) targets. The kernel uses manually-unrolled independent accumulators (128/sizeof(T)) to break floating-point dependency chains, enabling auto-vectorization.

IRainman

You can use std::accumulate in many places here. ;)

fastio · 2026-04-02T00:36:46Z

Ergus

Overall looks good.

But please clarify the comment about the else path in the new const code path (L134). All the others are minor details.

Ergus · 2026-04-02T16:26:02Z

-                    Kernel::template accumulate<ResultType>(states[j], static_cast<ResultType>(data_x[current_offset + i + j]), static_cast<ResultType>(data_y[current_offset + i + j]));
+                /// SIMD-optimized path for same-type floating point
+#if USE_MULTITARGET_CODE
+                if (isArchSupported(TargetArch::x86_64_v4))


Very minor: The idiomatic pattern of hoisting the dispatch outside the loop would look marginally cleaner, but it makes no measurable difference. I should not have listed it as a "major".

isArchSupported is a compile-time constant inside each clone, so no runtime cost. Will revisit if the structure gets more complex.

Ergus · 2026-04-02T16:42:46Z

-
-        constexpr size_t n = is_float32 ? 16 : 8;
-
-        for (; i + n < i_max; i += n)


I see that you improved this apparent "issue" that was in the old code:

For example, for a Float32 array of exactly 16 elements, 0 + 16 < 16 is false.

So the SIMD loop never executes: all 16 elements fall to the scalar tail. The same applies to any array whose size is an exact multiple of n: the last SIMD-eligible chunk was always handed off to the tail.

I see that you fixed this with the unrolled loop. It is also worth mentioning that fix in the PR description.

Updated the PR description.

clickhouse-gh · 2026-04-02T16:55:28Z

Workflow [PR], commit [0894799]

Summary: ✅

AI Review

Summary

This PR replaces hand-written AVX-512 intrinsics in arrayDotProduct with a multiversion auto-vectorized kernel (x86_64_v4, x86_64_v3, default), fixes previously reported UB around empty arrays by switching to data.data() + offset, and adds regression coverage for empty arrays and aligned chunk boundaries. I did not find any remaining high-confidence correctness, safety, concurrency, or compatibility issues in the current diff.

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	➖
Core-area scrutiny	✅
No test removal	✅
Experimental gate	➖
No magic constants	✅
Backward compatibility	✅
`SettingsChangesHistory.cpp`	➖
PR metadata quality	✅
Safe rollout	✅
Compilation time	✅
No large/binary files	✅

Final Verdict

Status: ✅ Approve

clickhouse-gh · 2026-04-02T16:59:44Z

+                /// SIMD-optimized path for same-type floating point
+#if USE_MULTITARGET_CODE
+                if (isArchSupported(TargetArch::x86_64_v4))
+                    result_data[row] = dotProductImpl_x86_64_v4<ResultType>(&data_x[current_offset], &data_y[current_offset], array_size);


❌ Potential UB on empty arrays in release builds.

When array_size == 0, this still forms &data_x[current_offset] / &data_y[current_offset]. For empty arrays or trailing empty rows, current_offset can be equal to data_*.size(), so operator[] is out-of-bounds even though the kernel will not dereference for count == 0.

Please avoid operator[] here and use data() + current_offset (valid for one-past) for both pointers.

Fixed — using data.data() + offset now.
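As an illustration of why the pointer form is safe, here is a minimal sketch. It uses `std::vector` as a stand-in for the column storage, which is an assumption for the sake of a standalone example, and `sumFrom` is a hypothetical helper rather than anything in this PR.

```cpp
#include <cstddef>
#include <vector>

/// Hypothetical helper: sums `count` elements of `data` starting at `offset`.
float sumFrom(const std::vector<float> & data, std::size_t offset, std::size_t count)
{
    /// data.data() + offset is well-defined for any offset <= data.size(),
    /// including the one-past-the-end position. Writing &data[offset] with
    /// offset == data.size() would be an out-of-bounds subscript, even if
    /// the resulting pointer were never dereferenced.
    const float * begin = data.data() + offset;

    float sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += begin[i];
    return sum;
}
```

With `count == 0` the loop body never runs, so a trailing empty row at `offset == data.size()` is handled without touching memory.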


fastio · 2026-04-03T04:18:15Z

@Ergus Done — const-left else branch now has the same unrolled structure with a comment clarifying it only handles mixed-type inputs.

clickhouse-gh · 2026-04-06T14:43:13Z

-            /// Process chunks in vectorized manner
-            static constexpr size_t VEC_SIZE = 4;
-            typename Kernel::template State<ResultType> states[VEC_SIZE];
-            for (; i + VEC_SIZE < array_size; i += VEC_SIZE)


This comment is slightly inaccurate: this branch is also taken when both arguments have the same non-floating type (for example Int32 x Int32), because ResultType is widened and the SIMD condition is false. Could you reword it to avoid implying it is only for mixed-type inputs?

alexey-milovidov · 2026-04-07T00:29:11Z

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

clickhouse-gh · 2026-04-07T01:33:32Z

+SELECT arrayDotProduct([]::Array(UInt8), []::Array(UInt8));
+
+-- Mixed empty/non-empty via table (exercises per-row offset logic)
+SELECT arrayDotProduct(x, y) FROM VALUES('x Array(Float32), y Array(Float32)',


⚠️ The new regression test validates empty arrays in the non-const/non-const path, but it does not exercise the const-left execution path that had its own UB fix (data_x.data() replacing &data_x[0]).

Please add a case like:

SELECT arrayDotProduct([]::Array(Float32), y) FROM VALUES('y Array(Float32)', ([],), ([1, 2, 3]));

This ensures the executeWithLeftArgConst branch is covered for empty constant left arrays.

Ergus

LGTM, but there are a couple of minor details pending to be solved.

Also, a profiling result showing that this change doesn't hurt performance is highly recommended, considering that we are relying more on the (black-box) compiler's capabilities.

Ergus · 2026-04-07T14:11:43Z

-                    Kernel::template accumulate<ResultType>(states[j], static_cast<ResultType>(data_x[i + j]), static_cast<ResultType>(data_y[current_offset + i + j]));
+                /// Scalar path for mixed types / integer types.
+                /// This branch is only reached when left and right have different types
+                /// (e.g. Int32 × Float64) — not a hot path, but we keep the same


Is it possible to reach this for same-type non-float inputs like Int32 × Int32? ResultType is widened in that case, so the SIMD condition std::is_same_v<ResultType, LeftType> could be false.

Yes, exactly. Fixed the comment to reflect that.
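The point can be illustrated with a small compile-time sketch. The result-type mapping below (`ResultOf`) and the predicate `takesSimdPath` are assumptions made up for this example; the actual mapping and condition in arrayDotProduct.cpp are not reproduced in this thread beyond the `std::is_same_v<ResultType, LeftType>` check.

```cpp
#include <cstdint>
#include <type_traits>

/// Hypothetical result-type mapping, for illustration only:
/// floating-point inputs keep their type, Int32 is widened to Int64.
template <typename T> struct ResultOf { using type = T; };
template <> struct ResultOf<std::int32_t> { using type = std::int64_t; };

/// Mirrors the shape of the SIMD condition discussed above (assumed form).
template <typename LeftType, typename RightType>
constexpr bool takesSimdPath()
{
    using ResultType = typename ResultOf<LeftType>::type;
    return std::is_same_v<LeftType, RightType>
        && std::is_floating_point_v<LeftType>
        && std::is_same_v<ResultType, LeftType>;
}

static_assert(takesSimdPath<float, float>());               /// same-type float: SIMD path
static_assert(!takesSimdPath<std::int32_t, double>());      /// mixed types: fallback
static_assert(!takesSimdPath<std::int32_t, std::int32_t>()); /// same-type integer: also fallback
```

Under these assumptions the fallback branch covers both mixed-type inputs and same-type integer inputs, which is what the reworded comment needs to say.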

Ergus · 2026-04-07T14:18:48Z

+
+                static constexpr size_t VEC_SIZE = 4;
+                typename Kernel::template State<ResultType> states[VEC_SIZE];
+                for (; i + VEC_SIZE < array_size; i += VEC_SIZE)


When array_size % 4 == 0, using < here sends the last chunk into the tail-handling loop. That's correct, but it could impact performance a bit. Could we check whether using <= here is correct?

Adding a test for that specific case is also a good idea.

All minor details addressed — fixed the comment wording, added const-left empty array tests, and fixed the < vs <= loop condition in the scalar fallback.

alexey-milovidov · 2026-04-07T19:47:07Z

The failures of "Flaky check" in "functions_bad_arguments" will be fixed by #101994.

alexey-milovidov · 2026-04-09T03:23:54Z

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

alexey-milovidov · 2026-04-09T05:40:40Z

The flaky failure of 02494_query_cache_http_introspection in this PR's CI is addressed by #102165.

alexey-milovidov · 2026-04-09T21:01:41Z

The Can't adjust last granule error in CI is a known issue. The fix is in #101641

clickhouse-gh · 2026-04-10T15:03:36Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.00%	84.00%	+0.00%
Functions	90.90%	90.90%	+0.00%
Branches	76.50%	76.50%	+0.00%

Changed lines: 86.05% (148/172) | lost baseline coverage: 45 line(s) · Uncovered code

Full report · Diff report

rschu1ze · 2026-04-13T10:04:01Z

IRainman reviewed Apr 2, 2026

View reviewed changes

Ergus reviewed Apr 2, 2026

View reviewed changes

Ergus self-assigned this Apr 2, 2026

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Apr 2, 2026

Ergus added can be tested Allows running workflows for external contributors and removed pr-performance Pull request with some performance improvements labels Apr 2, 2026

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Apr 2, 2026

clickhouse-gh Bot reviewed Apr 2, 2026

View reviewed changes

Comment thread src/Functions/array/arrayDotProduct.cpp Outdated

Fix UB on empty arrays, correct FMA comment, and unify const-left scalar path

ed1c961

Merge branch 'ClickHouse:master' into feature-simd-dot-product

b61cd45

clickhouse-gh Bot reviewed Apr 6, 2026

View reviewed changes

Merge branch 'master' into feature-simd-dot-product

1a84e40

clickhouse-gh Bot reviewed Apr 7, 2026

View reviewed changes

Merge branch 'ClickHouse:master' into feature-simd-dot-product

ce4100e

Ergus reviewed Apr 7, 2026

View reviewed changes

alexey-milovidov and others added 2 commits April 8, 2026 03:25

Merge remote-tracking branch 'origin/master' into tmp-101571

4d95e5c

fix bug

ae0218f

Merge branch 'master' into feature-simd-dot-product

d1a065a

alexey-milovidov and others added 2 commits April 9, 2026 14:05

Merge branch 'master' into feature-simd-dot-product

b3f2fde

Merge branch 'ClickHouse:master' into feature-simd-dot-product

0894799

Ergus added this pull request to the merge queue Apr 10, 2026

Merged via the queue into ClickHouse:master with commit af58312 Apr 10, 2026
164 checks passed

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 10, 2026

