Vectorize `arrayNorm` with runtime CPU dispatch by nickitat · Pull Request #106211 · ClickHouse/ClickHouse · GitHub

Vectorize `arrayNorm` with runtime CPU dispatch #106211

Merged
alexey-milovidov merged 6 commits into master from arraynorm-multitarget on Jun 6, 2026

Conversation

@nickitat

@nickitat nickitat commented May 31, 2026

Member

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Improved performance of the arrayNorm function.


[Two screenshots, taken 2026-06-04 at 10:23:13 and 10:23:26]

New situation:

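  ┌──────────────────────┬──────────────────┬──────────────────────────┬──────────────────────────────┐
  │        Metric        │    Input type    │          master          │         this branch          │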
  ├──────────────────────┼──────────────────┼──────────────────────────┼──────────────────────────────┤
  │ L1/L2/L2Squared/Linf │ Float32, Float64 │ scalar (VEC_SIZE=4 loop) │ v4 batched kernel            │
  ├──────────────────────┼──────────────────┼──────────────────────────┼──────────────────────────────┤
  │ L1/L2/L2Squared/Linf │ BFloat16         │ scalar                   │ v4 batched (widened→Float32) │
  ├──────────────────────┼──────────────────┼──────────────────────────┼──────────────────────────────┤
  │ L1/L2/L2Squared/Linf │ integers         │ scalar                   │ v4 batched (widened→Float64) │
  ├──────────────────────┼──────────────────┼──────────────────────────┼──────────────────────────────┤
  │ Lp                   │ any              │ scalar (pow)             │ scalar (pow)                 │
  └──────────────────────┴──────────────────┴──────────────────────────┴──────────────────────────────┘

I will upload changes to arrayDistance with a separate PR.

Version info

  • Merged into: 26.6.1.457

@nickitat nickitat added the pr-performance (Pull request with some performance improvements) and ci-performance (performance only) labels May 31, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 31, 2026

Contributor

@nickitat nickitat removed the ci-performance (performance only) label May 31, 2026
Comment thread src/Functions/array/CMakeLists.txt Outdated
Comment thread src/Functions/array/arrayNorm.cpp Outdated
Comment thread src/Functions/array/CMakeLists.txt Outdated
@nickitat

nickitat commented Jun 1, 2026

Member Author

Waiting for #105019 to merge first, then will rebase and remove the v3 specialisation.

Add an auto-vectorized single-array norm reduction kernel for the same-type
floating-point paths (`Float32`/`Float64`), modeled after the `arrayDotProduct`
kernel: 16-way manual unrolling with independent accumulators breaks the FP
dependency chain so the compiler keeps several SIMD registers in flight and, for
`L2`/`L2Squared`, fuses `a*b + c` into FMA.
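
The unrolling idea described above can be sketched as follows. This is a minimal illustration, not the PR's kernel: the function name `l2SquaredUnrolled` and the constant `UNROLL` are hypothetical, and whether the compiler actually vectorizes the inner loop or emits FMA depends on the target flags and the `-ffp-contract` setting.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical name; the actual kernel from the PR is not shown in this thread.
// Sum of squares using 16 independent accumulators.
template <typename T>
T l2SquaredUnrolled(const T * data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    constexpr std::size_t UNROLL = 16;
    T acc[UNROLL] = {};

    std::size_t i = 0;
    // Main loop: every lane owns its accumulator, so no addition has to wait
    // for the previous iteration's result and the inner loop maps onto SIMD lanes.
    for (; i + UNROLL <= size; i += UNROLL)
        for (std::size_t j = 0; j < UNROLL; ++j)
            acc[j] += data[i + j] * data[i + j]; // a*b + c shape: an FMA candidate

    // Combine the partial sums.
    T result = 0;
    for (T partial : acc)
        result += partial;

    // Tail: fewer than UNROLL elements remain.
    for (; i < size; ++i)
        result += data[i] * data[i];

    return result;
}
```

A single running sum would force each floating-point addition to wait for the previous one, because the compiler may not reassociate the additions without relaxed floating-point flags. Splitting the sum into independent lanes makes the reassociation explicit in the source.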

The kernel is emitted via `MULTITARGET_FUNCTION_X86_V4`, producing an `x86_64_v4`
(AVX-512) specialisation plus a default (SSE2/NEON) variant; the caller dispatches
to AVX-512 when available and otherwise uses the baseline variant. Only the v4
specialisation is generated: on `v4`-capable CPUs it is always selected, and the
file is already compiled at `-march=x86-64-v2` (the existing pin that keeps the
default reductions at 128-bit to avoid the SLP YMM regression on the `BFloat16`
paths), so a separate `v3` variant is not worthwhile.
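
For readers unfamiliar with the dispatch pattern, here is a rough sketch of what a baseline variant plus an AVX-512 variant selected at runtime can look like. It does not use the `MULTITARGET_FUNCTION_X86_V4` macro; it relies on the GCC/Clang `target` attribute and `__builtin_cpu_supports` instead. The function names and the `HAS_X86_DISPATCH` guard are hypothetical, and checking only `avx512f` is a simplification of the full `x86_64_v4` feature set.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical guard: the target attribute and CPU probe below are GCC/Clang on x86-64 only.
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#    define HAS_X86_DISPATCH 1
#else
#    define HAS_X86_DISPATCH 0
#endif

// Baseline variant, compiled with the translation unit's default target flags.
float l1NormDefault(const float * data, std::size_t size)
{
    float sum = 0;
    for (std::size_t i = 0; i < size; ++i)
        sum += data[i] < 0 ? -data[i] : data[i];
    return sum;
}

#if HAS_X86_DISPATCH
// Same source, but the compiler is allowed to use AVX-512 instructions here.
__attribute__((target("avx512f")))
float l1NormAVX512(const float * data, std::size_t size)
{
    float sum = 0;
    for (std::size_t i = 0; i < size; ++i)
        sum += data[i] < 0 ? -data[i] : data[i];
    return sum;
}
#endif

// Caller: probe the CPU at runtime and pick a variant.
float l1Norm(const float * data, std::size_t size)
{
    assert(data != nullptr || size == 0);
#if HAS_X86_DISPATCH
    if (__builtin_cpu_supports("avx512f"))
        return l1NormAVX512(data, size);
#endif
    return l1NormDefault(data, size);
}
```

Because the two variants are compiled for different targets, the AVX-512 one cannot be inlined into a baseline caller, so every dispatch is a real function call.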

`-ffp-contract=fast` is re-enabled for `arrayNorm.cpp` (appended after the v2 pin)
so the AVX-512 specialisation can fuse FMA on the `L2`/`L2Squared` reductions; the
global `-ffp-contract=off` otherwise suppresses it. Widened types (integers,
`BFloat16`) keep the scalar reduction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nickitat nickitat force-pushed the arraynorm-multitarget branch 2 times, most recently from 024a754 to 2e90249 Compare June 2, 2026 12:07
nickitat and others added 4 commits June 2, 2026 12:25
…ream continuous

The `x86_64_v4` (AVX-512) specialisation cannot be inlined into the `v2`-baseline
caller, so dispatching it per row imposed a hard call boundary every ~150 elements.
On bandwidth-bound paths (`L2Norm` over `Array(Float64)`: 1 FMA per 8 bytes) that
boundary interrupts the hardware prefetcher's stream: the wide AVX-512 loads then
outrun memory and stall. On AMD Zen5 this made the AVX-512 `L2`/`Float64` kernel
~15% *slower* than the scalar loop (which is throttled enough to stream cleanly),
even though it executes 3x fewer instructions.

Move the whole row loop inside the multitarget function (`normBatchImpl`), so the
column is processed in a single AVX-512-attributed call and the loads stay
contiguous across row boundaries. Measured on AMD Zen5, `L2Norm(Array(Float64))`
over 5M x 150: cache-misses drop from 286M to 81M, IPC 1.10 -> 1.25, and the kernel
goes from 0.87x to 1.02x of the scalar baseline (1.21x vs the per-row variant).
No regression on the cases that were already faster; results are unchanged
(bit-identical to the per-row kernel). The scalar path for widened/`BFloat16` types
is unaffected.
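
A simplified picture of the batched shape, assuming the column is stored as one flat array plus cumulative end offsets per row. The name `l2NormBatch` and the use of `std::vector` are illustrative only; `normBatchImpl` itself is not reproduced here, and the unrolling is left out to keep the row loop visible.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a batched kernel: one call handles every row.
// `offsets[r]` is assumed to be the end position of row r in the flat `data` array.
std::vector<double> l2NormBatch(const std::vector<double> & data, const std::vector<std::size_t> & offsets)
{
    assert(offsets.empty() || offsets.back() <= data.size());

    std::vector<double> result(offsets.size());
    std::size_t begin = 0;
    for (std::size_t row = 0; row < offsets.size(); ++row)
    {
        const std::size_t end = offsets[row];
        double sum = 0;
        for (std::size_t i = begin; i < end; ++i)
            sum += data[i] * data[i];
        result[row] = std::sqrt(sum);
        begin = end; // the next row starts where this one ended, so loads stay contiguous
    }
    return result;
}
```

With the row loop inside the function, the dispatch cost is paid once per column rather than once per row, and the memory accesses form one forward stream over `data`.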

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apply modernize-loop-convert to the batched kernel's combine loop (same as 9844723).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nickitat nickitat changed the title Vectorize arrayNorm with runtime CPU dispatch Vectorize arrayNorm with runtime CPU dispatch Jun 2, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 3, 2026

Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 39 queries analysed

This PR rewrites only the array-norm reduction kernel (arrayNorm.cpp) into a single batched, manually-unrolled multitarget call with an AVX-512 path, plus an FMA-contraction compile flag scoped to that one file. The three flagged ClickBench improvements, Q4 (-14.7%), Q15 (-16.5%), and Q33 (-5.5%), run plain aggregations and GROUP BYs that never invoke any array-norm function, so the changed code is not on their execution path. Although both tests agree the deltas are consistent within this run, a change that touches code these queries don't execute cannot plausibly speed them up; the shifts are run-to-run/build variance. All three are downgraded to not-sure.

clickbench

⚠️ 3 inconclusive

Flagged queries (3 of 43)
| Query | Verdict | Baseline med (ms) | PR med (ms) | Change | q-value | Hint |
|---|---|---|---|---|---|---|
| ⚠️ 4 | not_sure | 265 | 226 | -14.7% | <0.0001 | cpu: PR only revectorizes arrayNorm kernel; Q4 (AVG) calls no array-norm function, so this -14.7% is off-path variance |
| ⚠️ 15 | not_sure | 249 | 208 | -16.5% | <0.0001 | cpu: Q15 GROUP BY exercises none of the changed array-norm code path; -16.5% is unrelated to this PR |
| ⚠️ 33 | not_sure | 1514 | 1430 | -5.5% | <0.0001 | cpu: Q33 doesn't call array-norm functions; -5.5% can't come from the arrayNorm kernel change |

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.
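
As background on the q-value, the textbook Benjamini-Hochberg adjustment can be sketched as below. This is the standard procedure only; how MIRAI computes its q-values is not stated in this thread, and the function name `bhAdjust` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Textbook Benjamini-Hochberg adjusted p-values (q-values), returned in input order.
std::vector<double> bhAdjust(const std::vector<double> & p)
{
    const std::size_t m = p.size();

    // Indices of p sorted by ascending p-value.
    std::vector<std::size_t> order(m);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) { return p[a] < p[b]; });

    std::vector<double> q(m);
    double running_min = 1.0;
    // Walk from the largest p-value down: q_(k) = min over j >= k of p_(j) * m / j, capped at 1.
    for (std::size_t k = m; k-- > 0;)
    {
        const double scaled = p[order[k]] * static_cast<double>(m) / static_cast<double>(k + 1);
        running_min = std::min(running_min, scaled);
        q[order[k]] = running_min;
    }
    return q;
}
```

A query would then be flagged when its adjusted value falls below the chosen threshold, for example `q[i] < 0.10`.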

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: 9d29f7cb-b127-4bd9-a016-54a9478724f7
  • MIRAI run: 815c794b-2c10-474b-a9cb-05a7d6a31c77
  • PR check IDs:
    • clickbench_345149_1780528493
    • clickbench_345163_1780528493
    • clickbench_345169_1780528493
    • tpch_adapted_1_official_345175_1780528493
    • tpch_adapted_1_official_345177_1780528493
    • tpch_adapted_1_official_345192_1780528493

Previously only the same-type floating point cases (`Float32`/`Float64`)
went through the vectorized batched `normBatchImpl`; `BFloat16` and the
integer types fell into a scalar fallback loop because the kernel operated
on `ResultType` directly and could not consume a narrower input column.

Template `normBatchImpl` on `ArgumentType` and widen each element to
`ResultType` with a `static_cast` inside the accumulate calls. The widening
(`BFloat16` -> `Float32`, integers -> `Float64`) is exact and lets every
type take the same AVX-512 batched path, so the row loop stays a single
multitarget call with a continuous load stream.
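
The widening step can be illustrated with a small template. Everything here is a sketch under assumptions: the `BFloat16` struct is a minimal hypothetical stand-in (the upper 16 bits of a binary32 value), and `l1NormWidened` is not the PR's `normBatchImpl`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical minimal BFloat16: the upper 16 bits of an IEEE-754 binary32 value.
struct BFloat16
{
    std::uint16_t bits;

    explicit operator float() const
    {
        const std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
        float out;
        std::memcpy(&out, &wide, sizeof(out));
        return out;
    }
};

// One kernel for every input type: each element is widened to ResultType
// before it is accumulated. For Float32 -> Float32 the cast is the identity.
template <typename ResultType, typename ArgumentType>
ResultType l1NormWidened(const ArgumentType * data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    ResultType sum = 0;
    for (std::size_t i = 0; i < size; ++i)
    {
        const ResultType x = static_cast<ResultType>(data[i]); // widen first
        sum += x < 0 ? -x : x;
    }
    return sum;
}
```

For example, `l1NormWidened<float>(bf16_data, n)` accumulates in `Float32`, while `l1NormWidened<double>(int8_data, n)` accumulates in `Float64`; the loop body is the same in both instantiations.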

`Float32`/`Float64` are unchanged: the `static_cast` is the identity there,
so the generated kernel is byte-for-byte identical (verified at runtime,
~1.00x on AMD Zen5 and Intel GNR). `BFloat16` norm is now vectorized:
L1/L2 ~2.0-2.2x and Linf ~1.3x on Zen5; L1 1.6x, L2 1.8x, Linf 1.8x on GNR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
}
/// The entire row loop is handled in a single multitarget call (runtime-dispatched to AVX-512 when
/// available, else the baseline variant), keeping the load stream continuous across rows. The kernel
/// widens each element to `ResultType` internally, so `BFloat16` (-> Float32) and integers (-> Float64)

Contributor


The latest change routes BFloat16 through this runtime-dispatched normBatchImpl path, but the existing norm coverage only exercises UInt8, Float32, and Float64; 03269_bf16 covers distance functions, not L1Norm/L2Norm/LpNorm/LinfNorm. That means a regression in BFloat16 widening or the target-specific reduction would still pass line coverage via other template instantiations. Please add a focused stateless case for Array(BFloat16) covering L1Norm, L2Norm, L2SquaredNorm, LpNorm, and LinfNorm, including empty, shorter-than-16, exactly-16, and tail lengths.

@clickhouse-gh

clickhouse-gh Bot commented Jun 4, 2026

Contributor

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.40% | 84.50% | +0.10% |
| Functions | 92.40% | 92.40% | +0.00% |
| Branches | 77.00% | 77.10% | +0.10% |

Changed lines: Changed C/C++ lines covered by tests: 7/7 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 7 line(s) · Uncovered code

Full report · Diff report

@nickitat nickitat marked this pull request as ready for review June 4, 2026 08:24
@alexey-milovidov alexey-milovidov self-assigned this Jun 6, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 6, 2026
Merged via the queue into master with commit 2e9d1ab Jun 6, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the arraynorm-multitarget branch June 6, 2026 16:19
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud (The PR is synced to the cloud repo) label Jun 6, 2026
@clickgapai

Contributor


Labels

  • pr-performance: Pull request with some performance improvements
  • pr-synced-to-cloud: The PR is synced to the cloud repo


4 participants