perf(Statistics): devirtualize varPopStable for Float64 by riyaneel · Pull Request #99460 · ClickHouse/ClickHouse · GitHub
Skip to content

perf(Statistics): devirtualize varPopStable for Float64#99460

Merged
alexey-milovidov merged 6 commits into
ClickHouse:masterfrom
riyaneel:perf/devirtualize-stable-variance
Apr 18, 2026
Merged

perf(Statistics): devirtualize varPopStable for Float64#99460
alexey-milovidov merged 6 commits into
ClickHouse:masterfrom
riyaneel:perf/devirtualize-stable-variance

Conversation

@riyaneel

@riyaneel riyaneel commented Mar 13, 2026

Copy link
Copy Markdown
Contributor

Description

This PR devirtualizes the Welford inner loop for varPopStable, varSampStable, stddevPopStable, and stddevSampStable when aggregating Float64 columns.

By overriding addBatchSinglePlace and iterating directly over the raw ColumnFloat64 data pointer, we remove the IColumn::getFloat64() virtual call per row, allowing the compiler to fully inline the logic.

Behavior Change & CI Impact:
Exposing the loop to the optimizer allows Clang/GCC to keep intermediate values in registers and utilize FMA (vfmadd231sd), rather than forcing a 64-bit memory store on every row.
This causes an expected 1-ULP precision drift compared to the old virtualized code. Consequently, the SQLLogic test fails with exactly 18 new query mismatches against SQLite (174,004 vs the current 173,986 limit). This drift is mathematically acceptable and is an explicit trade-off for the performance gain.

Perf (500M rows, Float64):

  • Master: 0.211s, 3.71B instructions, 633M branches.
  • This PR: 0.167s, 2.42B instructions, 420M branches.

(~1.3B instructions saved by dropping the vtable lookups and enabling FMA).

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Speed up var*Stable and stddev*Stable functions for Float64 columns by devirtualizing the inner loop. Note: this enables compiler optimizations (FMA/registers) that alter floating-point results at the ULP level.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.4.1.1049

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Mar 13, 2026
@clickhouse-gh

clickhouse-gh Bot commented Mar 14, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added pr-performance Pull request with some performance improvements manual approve Manual approve required to run CI labels Mar 14, 2026
@riyaneel riyaneel force-pushed the perf/devirtualize-stable-variance branch from 73d3607 to 8176975 Compare March 15, 2026 05:27

#pragma clang fp contract(off)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#pragma clang fp contract(off) is unconditional, but this file is also built with GCC. Under -Wall -Werror, GCC reports unknown-pragmas for this line and fails compilation.

Please guard this pragma with #if defined(__clang__) / #endif (or use another portable way) so non-clang builds keep compiling.

@riyaneel riyaneel force-pushed the perf/devirtualize-stable-variance branch 2 times, most recently from cf78cb2 to a653786 Compare March 16, 2026 01:09
void addBatchSinglePlace(size_t row_begin, size_t row_end, AggregateDataPtr __restrict place, const IColumn ** columns, Arena * arena, ssize_t if_argument_pos) const override
{
#if defined(__clang__)
#pragma clang fp contract(off)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few suspicions:

  • missing comments;
  • this switch extends to the end of file, it is not turned off below;
  • as "stable" versions of statistical functions use an algorithm like Kahan Summation, this switch can make it pointless.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see

This altered floating-point precision at the ULP level and broke cross-engine consistency in SQLLogic test.

but

  1. We need to move comments in the code.
  2. Does the difference really matter?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexey-milovidov You are right. Forcing the compiler to drop precision with pragmas defeats the purpose of the stable algorithm. I've removed all of them and reverted to the clean devirtualized loop.

However, allowing the compiler to properly optimize it (FMA/registers) slightly changes the intermediate precision compared to the old virtualized code (which forced a 64-bit memory store per row). Because of this, the SQLLogic test fails with exactly 18 new query mismatches against SQLite (174,004 vs the 173,986 limit).

Should I just bump the allowed threshold in the SQLLogic config, or is there a preferred way to handle this expected precision drift?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexey-milovidov Just pushed a new commit to address your remaining points:

  1. Not at all. For statistical variance, a 1-ULP difference is completely acceptable, especially considering the massive CPU cycle savings (and the fact that keeping values in registers is mathematically sound).

  2. Done. I've added a clear comment directly above the devirtualized loop in addBatchSinglePlace explaining that allowing FMA/register promotion causes this expected ULP drift.

(I tried to find the SQLLogic config to bump the threshold to 174,004 myself but couldn't locate the exact file. Let me know if you want me to update it somewhere specific, or if it's handled on your end!)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexey-milovidov Just a quick heads-up, the CI is currently red, but the two failing checks seem completely unrelated to this PR (flaky tests inherited from master):

  1. Stateless tests (amd_tsan) failed on 04004_alter_modify_column_ttl_without_type
  2. BuzzHouse (amd_debug) failed with an Inconsistent AST formatting logical error.

I don't have the permissions to re-trigger the CI jobs. Could you or someone from the team restart these flaky checks when you have a moment?

@alexey-milovidov alexey-milovidov self-assigned this Mar 16, 2026
@riyaneel riyaneel force-pushed the perf/devirtualize-stable-variance branch 2 times, most recently from 415e13e to 0d41731 Compare March 16, 2026 18:47
/// Devirtualizing this loop allows the compiler to keep intermediate values in registers
/// and utilize FMA, rather than forcing a 64-bit memory store on every row.
/// This alters the floating-point result at the ULP level, which is mathematically
/// acceptable but causes a known divergence in cross-engine SQLLogic tests against SQLite.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ This fast-path intentionally changes floating-point results (the comment mentions known SQLLogic divergence vs SQLite, and the PR discussion reports +18 mismatches beyond the current limit).

For var*Stable/stddev*Stable, this is a behavior change, not just a perf change. Please make the compatibility decision explicit in this PR: either preserve previous bit-level behavior in this loop (with compiler-guarded FP-contract controls) or update the affected SQLLogic expectation/threshold in-tree together with clear rationale.

@alexey-milovidov

Copy link
Copy Markdown
Member

Let's permit fused-multiply-add and add exceptions to SQLLogicTest.

@riyaneel riyaneel force-pushed the perf/devirtualize-stable-variance branch from 0d41731 to c65d27b Compare March 29, 2026 21:48
/// Devirtualizing this loop allows the compiler to keep intermediate values in registers
/// and utilize FMA, rather than forcing a 64-bit memory store on every row.
/// This alters the floating-point result at the ULP level, which is mathematically
/// acceptable but causes a known divergence in cross-engine SQLLogic tests against SQLite.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ The behavior change here is broader than the PR metadata says: this AggregateFunctionVariance path is shared by stddevSampStable/stddevPopStable too, and the nearby comment explicitly documents ULP-level divergence from previous results.

Please update the PR Changelog entry text accordingly (it currently says "preserving strict IEEE 754 math precision", which conflicts with this code comment) and mention all affected stable functions.

@riyaneel riyaneel force-pushed the perf/devirtualize-stable-variance branch from c65d27b to 1b77103 Compare March 30, 2026 17:23
Comment thread src/AggregateFunctions/AggregateFunctionStatistics.cpp
@riyaneel riyaneel force-pushed the perf/devirtualize-stable-variance branch from 1b77103 to 7a7d873 Compare April 6, 2026 22:38
@alexey-milovidov

Copy link
Copy Markdown
Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

@riyaneel

riyaneel commented Apr 7, 2026

Copy link
Copy Markdown
Contributor Author

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

ACK. I will wait for #101239 to be merged and rebase my branch afterward.

@alexey-milovidov

Copy link
Copy Markdown
Member

The failures of "Flaky check" in "functions_bad_arguments" will be fixed by #101994.

@riyaneel

riyaneel commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

I merged master including #101994, but the Stress test (arm_msan) failing with STID: 4179-5154 in ConnectionPoolWithFailover. As this appears to be an issue unrelated to the aggregate function changes. Should I wait for a specific fix?

@alexey-milovidov

Copy link
Copy Markdown
Member

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

@clickhouse-gh

clickhouse-gh Bot commented Apr 12, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.10% +0.10%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.60% +0.10%

Changed lines: 82.86% (29/35) | lost baseline coverage: 3 line(s) · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Apr 18, 2026
Merged via the queue into ClickHouse:master with commit 118028a Apr 18, 2026
164 checks passed
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 18, 2026
@riyaneel riyaneel deleted the perf/devirtualize-stable-variance branch April 18, 2026 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors manual approve Manual approve required to run CI pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants