Vectorize arrayNorm with runtime CPU dispatch #106211
Waiting for #105019 to merge first, then will rebase and remove the v3 specialisation.
Add an auto-vectorized single-array norm reduction kernel for the same-type floating-point paths (`Float32`/`Float64`), modeled after the `arrayDotProduct` kernel: 16-way manual unrolling with independent accumulators breaks the FP dependency chain so the compiler keeps several SIMD registers in flight and, for `L2`/`L2Squared`, fuses `a*b + c` into FMA.

The kernel is emitted via `MULTITARGET_FUNCTION_X86_V4`, producing an `x86_64_v4` (AVX-512) specialisation plus a default (SSE2/NEON) variant; the caller dispatches to AVX-512 when available and otherwise uses the baseline variant.

Only the v4 specialisation is generated: on `v4`-capable CPUs it is always selected, and the file is already compiled at `-march=x86-64-v2` (the existing pin that keeps the default reductions at 128-bit to avoid the SLP YMM regression on the `BFloat16` paths), so a separate `v3` variant is not worthwhile.

`-ffp-contract=fast` is re-enabled for `arrayNorm.cpp` (appended after the v2 pin) so the AVX-512 specialisation can fuse FMA on the `L2`/`L2Squared` reductions; the global `-ffp-contract=off` otherwise suppresses it.

Widened types (integers, `BFloat16`) keep the scalar reduction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
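For readers unfamiliar with the accumulator trick, here is a minimal sketch of the idea for the `L2Squared` case. It is not the kernel from this PR: the name `l2SquaredUnrolled` and the plain-array accumulator layout are assumptions made for illustration, and whether the compiler actually vectorizes the loop or emits FMA depends on the target and flags.

```cpp
#include <cstddef>

// Hypothetical illustration, not the PR's kernel: squared L2 norm with
// 16 independent partial sums. Each partial sum depends only on itself,
// so consecutive additions do not wait on one another and the compiler
// is free to map the inner loop onto SIMD registers.
template <typename T>
T l2SquaredUnrolled(const T * data, std::size_t size)
{
    constexpr std::size_t lanes = 16;
    T acc[lanes] = {};

    // Main loop: consume 16 elements per iteration, one per accumulator.
    std::size_t i = 0;
    for (; i + lanes <= size; i += lanes)
        for (std::size_t j = 0; j < lanes; ++j)
            acc[j] += data[i + j] * data[i + j];

    // Fold the partial sums into a single value.
    T result = 0;
    for (std::size_t j = 0; j < lanes; ++j)
        result += acc[j];

    // Tail: the remaining size % 16 elements are handled one by one.
    for (; i < size; ++i)
        result += data[i] * data[i];

    return result;
}
```

Because the summation order differs from a single-accumulator loop, results for general floating-point input may differ in the last bits from a naive reduction.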
024a754 to 2e90249
…ream continuous

The `x86_64_v4` (AVX-512) specialisation cannot be inlined into the `v2`-baseline caller, so dispatching it per row imposed a hard call boundary every ~150 elements. On bandwidth-bound paths (`L2Norm` over `Array(Float64)`: 1 FMA per 8 bytes) that boundary interrupts the hardware prefetcher's stream: the wide AVX-512 loads then outrun memory and stall. On AMD Zen5 this made the AVX-512 `L2`/`Float64` kernel ~15% *slower* than the scalar loop (which is throttled enough to stream cleanly), even though it executes 3x fewer instructions.

Move the whole row loop inside the multitarget function (`normBatchImpl`), so the column is processed in a single AVX-512-attributed call and the loads stay contiguous across row boundaries.

Measured on AMD Zen5, `L2Norm(Array(Float64))` over 5M x 150: cache-misses drop from 286M to 81M, IPC 1.10 -> 1.25, and the kernel goes from 0.87x to 1.02x of the scalar baseline (1.21x vs the per-row variant). No regression on the cases that were already faster; results are unchanged (bit-identical to the per-row kernel). The scalar path for widened/`BFloat16` types is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
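The restructuring can be pictured with the sketch below, which keeps the loop over rows inside one function instead of calling a reduction once per row. It is a simplified stand-in, not `normBatchImpl` itself: the name `l2SquaredBatch`, the `std::vector` parameters, and the assumption that `offsets[row]` holds the cumulative end position of each row in the flattened data are all illustrative, and the inner reduction is left un-unrolled for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of the batched shape: one call walks the whole
// flattened column, so there is no function-call boundary between rows.
// Assumption: offsets[row] is the end index (exclusive) of that row in data.
template <typename T>
void l2SquaredBatch(const T * data, const std::vector<std::uint64_t> & offsets, std::vector<T> & out)
{
    out.resize(offsets.size());

    std::size_t begin = 0;
    for (std::size_t row = 0; row < offsets.size(); ++row)
    {
        const std::size_t end = offsets[row];

        // Reduce one row; the next row starts exactly where this one ended,
        // so the loads form one continuous forward stream over data.
        T sum = 0;
        for (std::size_t i = begin; i < end; ++i)
            sum += data[i] * data[i];

        out[row] = sum;
        begin = end;
    }
}
```

In a multitarget setup, the target attribute would be attached to this outer function, which is what lets the per-row reduction be inlined into code compiled for the same target.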
📊 Cloud Performance Report

✅ AI verdict: This PR rewrites only the array-norm reduction kernel (arrayNorm.cpp) into a single batched, manually-unrolled multitarget call with an AVX-512 path, plus an FMA-contraction compile flag scoped to that one file. The three flagged ClickBench improvements, Q4 (-14.7%), Q15 (-16.5%), and Q33 (-5.6%), run plain aggregations and GROUP BYs that never invoke any array-norm function, so the changed code is not on their execution path. Although both tests agree the deltas are consistent within this run, a change that touches code these queries don't execute cannot plausibly speed them up; the shifts are run-to-run/build variance. All three are downgraded to not-sure.

clickbench
Flagged queries (3 of 43)

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10), which is the value the verdict is based on.

tpch_adapted_1_official
🟢 No significant changes
Previously only the same-type floating point cases (`Float32`/`Float64`) went through the vectorized batched `normBatchImpl`; `BFloat16` and the integer types fell into a scalar fallback loop because the kernel operated on `ResultType` directly and could not consume a narrower input column.

Template `normBatchImpl` on `ArgumentType` and widen each element to `ResultType` with a `static_cast` inside the accumulate calls. The widening (`BFloat16` -> `Float32`, integers -> `Float64`) is exact and lets every type take the same AVX-512 batched path, so the row loop stays a single multitarget call with a continuous load stream.

`Float32`/`Float64` are unchanged: the `static_cast` is the identity there, so the generated kernel is byte-for-byte identical (verified at runtime, ~1.00x on AMD Zen5 and Intel GNR).

`BFloat16` norm is now vectorized: L1/L2 ~2.0-2.2x and Linf ~1.3x on Zen5; L1 1.6x, L2 1.8x, Linf 1.8x on GNR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
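A rough sketch of the widening pattern, shown for an L1 reduction over integer and floating-point input. The function name `l1NormWidened` is hypothetical and the loop is deliberately not unrolled; the point is only that the cast happens per element inside the accumulation, so a narrow input column can feed an accumulator of a wider type.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration: the kernel is templated on both the element
// type of the input column and the (possibly wider) accumulator type.
// When both types are the same, the static_cast is a no-op.
template <typename ResultType, typename ArgumentType>
ResultType l1NormWidened(const ArgumentType * data, std::size_t size)
{
    ResultType sum = 0;
    for (std::size_t i = 0; i < size; ++i)
    {
        // Widen first, then take the absolute value in the wide type, so
        // values such as INT8_MIN do not overflow in the narrow type.
        sum += std::abs(static_cast<ResultType>(data[i]));
    }
    return sum;
}
```

For example, `l1NormWidened<double>(ptr, n)` over `std::uint8_t` data accumulates in `double`, so a sum exceeding 255 is represented correctly.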
}
/// The entire row loop is handled in a single multitarget call (runtime-dispatched to AVX-512 when
/// available, else the baseline variant), keeping the load stream continuous across rows. The kernel
/// widens each element to `ResultType` internally, so `BFloat16` (-> Float32) and integers (-> Float64)
The latest change routes BFloat16 through this runtime-dispatched normBatchImpl path, but the existing norm coverage only exercises UInt8, Float32, and Float64; 03269_bf16 covers distance functions, not L1Norm/L2Norm/LpNorm/LinfNorm. That means a regression in BFloat16 widening or the target-specific reduction would still pass line coverage via other template instantiations. Please add a focused stateless case for Array(BFloat16) covering L1Norm, L2Norm, L2SquaredNorm, LpNorm, and LinfNorm, including empty, shorter-than-16, exactly-16, and tail lengths.
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 7/7 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 7 line(s)

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Improved performance of the `arrayNorm` function.

New situation:
I will upload changes to `arrayDistance` with a separate PR.

Version info
26.6.1.457