nickitat · 2026-05-31T18:01:51Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Improved performance of the arrayNorm function.

New situation:

  │        Metric        │    Input type    │          master          │         this branch          │
  ├──────────────────────┼──────────────────┼──────────────────────────┼──────────────────────────────┤
  │ L1/L2/L2Squared/Linf │ Float32, Float64 │ scalar (VEC_SIZE=4 loop) │ v4 batched kernel            │
  ├──────────────────────┼──────────────────┼──────────────────────────┼──────────────────────────────┤
  │ L1/L2/L2Squared/Linf │ BFloat16         │ scalar                   │ v4 batched (widened→Float32) │
  ├──────────────────────┼──────────────────┼──────────────────────────┼──────────────────────────────┤
  │ L1/L2/L2Squared/Linf │ integers         │ scalar                   │ v4 batched (widened→Float64) │
  ├──────────────────────┼──────────────────┼──────────────────────────┼──────────────────────────────┤
  │ Lp                   │ any              │ scalar (pow)             │ scalar (pow)                 │
  └──────────────────────┴──────────────────┴──────────────────────────┴──────────────────────────────┘

I will upload changes to arrayDistance with a separate PR.

Version info

Merged into: 26.6.1.457

clickhouse-gh · 2026-05-31T18:02:26Z

Workflow [PR], commit [4d37f17]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR moves arrayNorm reductions into a runtime-dispatched AVX-512-capable batched kernel and enables -ffp-contract=fast for arrayNorm.cpp. The implementation direction looks plausible, but I am not approving yet because the latest BFloat16 routing through the new kernel is not covered by focused tests, and the PR metadata still has todo as the changelog entry.

PR Metadata

Changelog category: Performance Improvement matches the change.
Changelog entry: required for this category, but currently todo.
Suggested replacement: Improved performance of L1Norm, L2Norm, L2SquaredNorm, LpNorm, and LinfNorm on array inputs by using runtime CPU dispatch for vectorized reductions.

Missing context / blind spots

⚠️ No local build directory exists in this checkout, so I did not run 02283_array_norm locally. The PR-level CI report currently has no failed tests; a local/stateless run with the added BFloat16 case would close this gap.

Findings

💡 Nits

[src/Functions/array/CMakeLists.txt:38] [dismissed by author -- #106211, discussion_r3330962144] The current comment still says -ffp-contract=fast is re-enabled for "these two TUs" and only describes arrayDotProduct.cpp / arrayDistance.cpp, while line 47 now adds arrayNorm.cpp too. I still consider this real because the scope of the floating-point contraction exception remains misleading in the current code; update the wording to list all three translation units or say "these TUs".

Tests

⚠️ [src/Functions/array/arrayNorm.cpp:295] The latest change routes BFloat16 through normBatchImpl, but existing norm coverage only exercises UInt8, Float32, and Float64; 03269_bf16 covers distance functions, not norm functions. Please add a focused stateless test for Array(BFloat16) covering L1Norm, L2Norm, L2SquaredNorm, LpNorm, and LinfNorm, including empty, shorter-than-16, exactly-16, and tail lengths.

Final Verdict

Status: ⚠️ Request changes

Minimum required actions: replace the todo changelog entry and add the focused Array(BFloat16) norm coverage. The stale CMakeLists.txt wording should be fixed while touching the PR.

nickitat · 2026-06-01T11:53:48Z

Waiting for #105019 to merge first, then will rebase and remove the v3 specialisation.

Add an auto-vectorized single-array norm reduction kernel for the same-type floating-point paths (`Float32`/`Float64`), modeled after the `arrayDotProduct` kernel: 16-way manual unrolling with independent accumulators breaks the FP dependency chain so the compiler keeps several SIMD registers in flight and, for `L2`/`L2Squared`, fuses `a*b + c` into FMA.

The kernel is emitted via `MULTITARGET_FUNCTION_X86_V4`, producing an `x86_64_v4` (AVX-512) specialisation plus a default (SSE2/NEON) variant; the caller dispatches to AVX-512 when available and otherwise uses the baseline variant. Only the v4 specialisation is generated: on `v4`-capable CPUs it is always selected, and the file is already compiled at `-march=x86-64-v2` (the existing pin that keeps the default reductions at 128-bit to avoid the SLP YMM regression on the `BFloat16` paths), so a separate `v3` variant is not worthwhile.

`-ffp-contract=fast` is re-enabled for `arrayNorm.cpp` (appended after the v2 pin) so the AVX-512 specialisation can fuse FMA on the `L2`/`L2Squared` reductions; the global `-ffp-contract=off` otherwise suppresses it.

Widened types (integers, `BFloat16`) keep the scalar reduction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
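For readers unfamiliar with the technique, here is a minimal sketch of a 16-way unrolled reduction with independent accumulators, shown for `L2Squared` over `Float32`. It is not the code from `arrayNorm.cpp`: the names `l2SquaredUnrolled` and `kLanes` are hypothetical, and the multitarget macro and runtime dispatch are left out.

```cpp
#include <cassert>
#include <cstddef>

/// Illustrative only: hypothetical names, not the identifiers used in arrayNorm.cpp.
constexpr std::size_t kLanes = 16;

float l2SquaredUnrolled(const float * data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    /// Independent partial sums: each lane only depends on its own previous value,
    /// so consecutive additions do not form one long dependency chain.
    float acc[kLanes] = {};

    std::size_t i = 0;
    for (; i + kLanes <= size; i += kLanes)
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            acc[lane] += data[i + lane] * data[i + lane]; /// multiply-add, a candidate for FMA contraction

    /// Combine the partial sums once, after the main loop.
    float result = 0;
    for (float partial : acc)
        result += partial;

    /// Scalar tail for the remaining size % kLanes elements.
    for (; i < size; ++i)
        result += data[i] * data[i];

    return result;
}
```

Whether the compiler actually turns the inner loop into wide SIMD instructions and fused multiply-adds depends on the target flags and the floating-point contraction setting, which is why the commit message discusses both.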

…ream continuous

The `x86_64_v4` (AVX-512) specialisation cannot be inlined into the `v2`-baseline caller, so dispatching it per row imposed a hard call boundary every ~150 elements. On bandwidth-bound paths (`L2Norm` over `Array(Float64)`: 1 FMA per 8 bytes) that boundary interrupts the hardware prefetcher's stream: the wide AVX-512 loads then outrun memory and stall. On AMD Zen5 this made the AVX-512 `L2`/`Float64` kernel ~15% *slower* than the scalar loop (which is throttled enough to stream cleanly), even though it executes 3x fewer instructions.

Move the whole row loop inside the multitarget function (`normBatchImpl`), so the column is processed in a single AVX-512-attributed call and the loads stay contiguous across row boundaries.

Measured on AMD Zen5, `L2Norm(Array(Float64))` over 5M x 150: cache-misses drop from 286M to 81M, IPC 1.10 -> 1.25, and the kernel goes from 0.87x to 1.02x of the scalar baseline (1.21x vs the per-row variant). No regression on the cases that were already faster; results are unchanged (bit-identical to the per-row kernel). The scalar path for widened/`BFloat16` types is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
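The shape of that change can be sketched as follows. This is a simplified stand-in, not the real `normBatchImpl`: the name `l2NormBatch` is hypothetical, the per-row reduction is a plain loop rather than the unrolled kernel, and it assumes the flattened array column is described by cumulative end offsets (one per row).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

/// Hypothetical batched kernel: one call handles every row of the column.
/// Assumption: offsets[row] is the end index of that row in the flat `data` buffer.
void l2NormBatch(const double * data, const std::vector<std::uint64_t> & offsets, std::vector<double> & out)
{
    assert(out.size() == offsets.size());

    std::size_t begin = 0;
    for (std::size_t row = 0; row < offsets.size(); ++row)
    {
        const std::size_t end = offsets[row];

        /// The row loop lives inside the kernel, so there is no function-call
        /// boundary between rows and memory is read front to back.
        double sum = 0;
        for (std::size_t i = begin; i < end; ++i)
            sum += data[i] * data[i];

        out[row] = std::sqrt(sum);
        begin = end;
    }
}
```

In the PR the equivalent function is the one carrying the target attribute, so the dispatch decision is made once per column instead of once per row.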

Apply modernize-loop-convert to the batched kernel's combine loop (same as 9844723). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-06-03T12:47:59Z

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 39 queries analysed

This PR rewrites only the array-norm reduction kernel (arrayNorm.cpp) into a single batched, manually-unrolled multitarget call with an AVX-512 path, plus an FMA-contraction compile flag scoped to that one file. The three flagged ClickBench improvements — Q4 (-14.7%), Q15 (-16.5%), and Q33 (-5.6%) — run plain aggregations and GROUP BYs that never invoke any array-norm function, so the changed code is not on their execution path. Although both tests agree the deltas are consistent within this run, a change that touches code these queries don't execute cannot plausibly speed them up; the shifts are run-to-run/build variance. All three are downgraded to not-sure.

clickbench

⚠️ 3 inconclusive

Flagged queries (3 of 43)

| | Query | Verdict | Baseline med (ms) | PR med (ms) | Change | q-value | Hint |
|---|---|---|---|---|---|---|---|
| ⚠️ | 4 | not_sure | 265 | 226 | -14.7% | <0.0001 | cpu: PR only revectorizes arrayNorm kernel; Q4 (AVG) calls no array-norm function, so this -14.7% is off-path variance |
| ⚠️ | 15 | not_sure | 249 | 208 | -16.5% | <0.0001 | cpu: Q15 GROUP BY exercises none of the changed array-norm code path; -16.5% is unrelated to this PR |
| ⚠️ | 33 | not_sure | 1514 | 1430 | -5.5% | <0.0001 | cpu: Q33 doesn't call array-norm functions; -5.6% can't come from the arrayNorm kernel change |

_q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10), the value the verdict is based on._

tpch_adapted_1_official

🟢 No significant changes

Debug info

StressHouse run: 9d29f7cb-b127-4bd9-a016-54a9478724f7
MIRAI run: 815c794b-2c10-474b-a9cb-05a7d6a31c77
PR check IDs:
- clickbench_345149_1780528493
- clickbench_345163_1780528493
- clickbench_345169_1780528493
- tpch_adapted_1_official_345175_1780528493
- tpch_adapted_1_official_345177_1780528493
- tpch_adapted_1_official_345192_1780528493

Previously only the same-type floating point cases (`Float32`/`Float64`) went through the vectorized batched `normBatchImpl`; `BFloat16` and the integer types fell into a scalar fallback loop because the kernel operated on `ResultType` directly and could not consume a narrower input column.

Template `normBatchImpl` on `ArgumentType` and widen each element to `ResultType` with a `static_cast` inside the accumulate calls. The widening (`BFloat16` -> `Float32`, integers -> `Float64`) is exact and lets every type take the same AVX-512 batched path, so the row loop stays a single multitarget call with a continuous load stream.

`Float32`/`Float64` are unchanged: the `static_cast` is the identity there, so the generated kernel is byte-for-byte identical (verified at runtime, ~1.00x on AMD Zen5 and Intel GNR). `BFloat16` norm is now vectorized: L1/L2 ~2.0-2.2x and Linf ~1.3x on Zen5; L1 1.6x, L2 1.8x, Linf 1.8x on GNR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
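A rough sketch of the widening idea, assuming nothing about the real signature of `normBatchImpl`: the reduction is templated on both the element type and the accumulator type and casts each element before accumulating. `l1NormWidened` and `bfloat16BitsToFloat` are hypothetical names; the latter only illustrates why `BFloat16` -> `Float32` loses nothing, using a raw 16-bit pattern instead of ClickHouse's own `BFloat16` type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

/// Hypothetical helper: a BFloat16 value is the upper 16 bits of an IEEE-754 binary32,
/// so widening is a 16-bit left shift followed by a bit copy.
inline float bfloat16BitsToFloat(std::uint16_t bits)
{
    const std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
    float result;
    std::memcpy(&result, &widened, sizeof(result));
    return result;
}

/// Hypothetical L1 reduction templated on the input element type.
/// When ArgumentType == ResultType the static_cast is a no-op.
template <typename ResultType, typename ArgumentType>
ResultType l1NormWidened(const ArgumentType * data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    ResultType sum = 0;
    for (std::size_t i = 0; i < size; ++i)
    {
        /// Widen first, then take the absolute value in the wider type,
        /// so e.g. Int8 -128 does not overflow when negated.
        const ResultType x = static_cast<ResultType>(data[i]);
        sum += x < 0 ? -x : x;
    }
    return sum;
}
```

For example, `l1NormWidened<double>` over the `Int8` values `{-128, 127, -1}` accumulates in `Float64` and yields 256, which the narrow type could not represent.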

clickhouse-gh · 2026-06-03T21:28:58Z

-        }
+        /// The entire row loop is handled in a single multitarget call (runtime-dispatched to AVX-512 when
+        /// available, else the baseline variant), keeping the load stream continuous across rows. The kernel
+        /// widens each element to `ResultType` internally, so `BFloat16` (-> Float32) and integers (-> Float64)


The latest change routes BFloat16 through this runtime-dispatched normBatchImpl path, but the existing norm coverage only exercises UInt8, Float32, and Float64; 03269_bf16 covers distance functions, not L1Norm/L2Norm/LpNorm/LinfNorm. That means a regression in BFloat16 widening or the target-specific reduction would still pass line coverage via other template instantiations. Please add a focused stateless case for Array(BFloat16) covering L1Norm, L2Norm, L2SquaredNorm, LpNorm, and LinfNorm, including empty, shorter-than-16, exactly-16, and tail lengths.
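To illustrate why those particular lengths matter, here is a small standalone sketch (not the requested stateless SQL test, and not ClickHouse code) that compares a hypothetical 16-way unrolled `Linf` reduction against a plain reference loop at lengths 0, 5, 16, and 21. All names are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kUnroll = 16; /// hypothetical unroll width, matching the 16-way kernel described in the PR

/// Unrolled max-of-absolute-values with a scalar tail.
float linfUnrolled(const std::vector<float> & v)
{
    float acc[kUnroll] = {};
    std::size_t i = 0;
    for (; i + kUnroll <= v.size(); i += kUnroll)
        for (std::size_t lane = 0; lane < kUnroll; ++lane)
            acc[lane] = std::max(acc[lane], std::fabs(v[i + lane]));

    float result = 0;
    for (float partial : acc)
        result = std::max(result, partial);
    for (; i < v.size(); ++i)
        result = std::max(result, std::fabs(v[i]));
    return result;
}

/// Straightforward reference implementation.
float linfReference(const std::vector<float> & v)
{
    float result = 0;
    for (float x : v)
        result = std::max(result, std::fabs(x));
    return result;
}

/// Lengths: empty, shorter than one block, exactly one block, one block plus a tail.
void checkBoundaryLengths()
{
    const std::size_t lengths[] = {0, 5, 16, 21};
    for (std::size_t n : lengths)
    {
        std::vector<float> v(n);
        for (std::size_t k = 0; k < n; ++k)
            v[k] = (k % 2 ? -1.0f : 1.0f) * static_cast<float>(k + 1); /// the largest magnitude is the last element
        assert(linfUnrolled(v) == linfReference(v));
    }
}
```

Placing the largest magnitude in the last element means the 5- and 21-element cases only pass if the scalar tail is handled, which is the kind of regression the review comment is worried about.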

clickhouse-gh · 2026-06-04T00:37:37Z

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.40% | 84.50% | +0.10% |
| Functions | 92.40% | 92.40% | +0.00% |
| Branches | 77.00% | 77.10% | +0.10% |

Changed lines: Changed C/C++ lines covered by tests: 7/7 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 7 line(s) · Uncovered code

clickgapai · 2026-06-06T18:50:55Z

nickitat added the pr-performance and ci-performance labels May 31, 2026

nickitat removed the ci-performance label May 31, 2026

clickhouse-gh Bot reviewed May 31, 2026

Comment thread src/Functions/array/CMakeLists.txt Outdated

clickhouse-gh Bot reviewed May 31, 2026

Comment thread src/Functions/array/arrayNorm.cpp Outdated

clickhouse-gh Bot reviewed May 31, 2026

Comment thread src/Functions/array/CMakeLists.txt Outdated

nickitat force-pushed the arraynorm-multitarget branch 2 times, most recently from 024a754 to 2e90249 Compare June 2, 2026 12:07

nickitat and others added 4 commits June 2, 2026 12:25

fix tidy

9844723

fix tidy

8cc5a73

Apply modernize-loop-convert to the batched kernel's combine loop (same as 9844723). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix style

afe0a5c

nickitat changed the title ~~Vectorize arrayNorm with runtime CPU dispatch~~ Vectorize arrayNorm with runtime CPU dispatch Jun 2, 2026

clickhouse-gh Bot reviewed Jun 3, 2026

nickitat marked this pull request as ready for review June 4, 2026 08:24

alexey-milovidov approved these changes Jun 6, 2026

alexey-milovidov self-assigned this Jun 6, 2026

alexey-milovidov added this pull request to the merge queue Jun 6, 2026

Merged via the queue into master with commit 2e9d1ab Jun 6, 2026
166 checks passed

alexey-milovidov deleted the arraynorm-multitarget branch June 6, 2026 16:19

robot-ch-test-poll4 added the pr-synced-to-cloud label Jun 6, 2026

clickgapai mentioned this pull request Jun 6, 2026

Add test: arrayNorm over Array(BFloat16) (widen-to-Float32 kernel path) is untested across the whole suite #106644

Merged

1 task

groeneai mentioned this pull request Jun 8, 2026

Fix STID 1941-1bfa: stop mutating call-syntax arrayElement on kql_array_sort_* #106691

Closed

1 task

github-merge-queue Bot pushed a commit that referenced this pull request Jun 10, 2026

Add 1 test(s) for coverage gaps in PR #106211

9160411

Vectorize `arrayNorm` with runtime CPU dispatch #106211

alexey-milovidov merged 6 commits into master from arraynorm-multitarget
