perf(Statistics): devirtualize varPopStable for Float64#99460
Conversation
73d3607 to
8176975
Compare
|
|
||
| #pragma clang fp contract(off) |
There was a problem hiding this comment.
❌ #pragma clang fp contract(off) is unconditional, but this file is also built with GCC. Under -Wall -Werror, GCC reports unknown-pragmas for this line and fails compilation.
Please guard this pragma with #if defined(__clang__) / #endif (or use another portable way) so non-clang builds keep compiling.
cf78cb2 to
a653786
Compare
| void addBatchSinglePlace(size_t row_begin, size_t row_end, AggregateDataPtr __restrict place, const IColumn ** columns, Arena * arena, ssize_t if_argument_pos) const override | ||
| { | ||
| #if defined(__clang__) | ||
| #pragma clang fp contract(off) |
There was a problem hiding this comment.
There are a few suspicions:
- missing comments;
- this switch extends to the end of file, it is not turned off below;
- as "stable" versions of statistical functions use an algorithm like Kahan Summation, this switch can make it pointless.
There was a problem hiding this comment.
Ok, I see
This altered floating-point precision at the ULP level and broke cross-engine consistency in SQLLogic test.
but
- We need to move comments in the code.
- Does the difference really matter?
There was a problem hiding this comment.
@alexey-milovidov You are right. Forcing the compiler to drop precision with pragmas defeats the purpose of the stable algorithm. I've removed all of them and reverted to the clean devirtualized loop.
However, allowing the compiler to properly optimize it (FMA/registers) slightly changes the intermediate precision compared to the old virtualized code (which forced a 64-bit memory store per row). Because of this, the SQLLogic test fails with exactly 18 new query mismatches against SQLite (174,004 vs the 173,986 limit).
Should I just bump the allowed threshold in the SQLLogic config, or is there a preferred way to handle this expected precision drift?
There was a problem hiding this comment.
@alexey-milovidov Just pushed a new commit to address your remaining points:
-
Not at all. For statistical variance, a 1-ULP difference is completely acceptable, especially considering the massive CPU cycle savings (and the fact that keeping values in registers is mathematically sound).
-
Done. I've added a clear comment directly above the devirtualized loop in addBatchSinglePlace explaining that allowing FMA/register promotion causes this expected ULP drift.
(I tried to find the SQLLogic config to bump the threshold to 174,004 myself but couldn't locate the exact file. Let me know if you want me to update it somewhere specific, or if it's handled on your end!)
There was a problem hiding this comment.
@alexey-milovidov Just a quick heads-up, the CI is currently red, but the two failing checks seem completely unrelated to this PR (flaky tests inherited from master):
Stateless tests (amd_tsan)failed on04004_alter_modify_column_ttl_without_typeBuzzHouse (amd_debug)failed with anInconsistent AST formattinglogical error.
I don't have the permissions to re-trigger the CI jobs. Could you or someone from the team restart these flaky checks when you have a moment?
415e13e to
0d41731
Compare
| /// Devirtualizing this loop allows the compiler to keep intermediate values in registers | ||
| /// and utilize FMA, rather than forcing a 64-bit memory store on every row. | ||
| /// This alters the floating-point result at the ULP level, which is mathematically | ||
| /// acceptable but causes a known divergence in cross-engine SQLLogic tests against SQLite. |
There was a problem hiding this comment.
For var*Stable/stddev*Stable, this is a behavior change, not just a perf change. Please make the compatibility decision explicit in this PR: either preserve previous bit-level behavior in this loop (with compiler-guarded FP-contract controls) or update the affected SQLLogic expectation/threshold in-tree together with clear rationale.
|
Let's permit fused-multiply-add and add exceptions to SQLLogicTest. |
0d41731 to
c65d27b
Compare
| /// Devirtualizing this loop allows the compiler to keep intermediate values in registers | ||
| /// and utilize FMA, rather than forcing a 64-bit memory store on every row. | ||
| /// This alters the floating-point result at the ULP level, which is mathematically | ||
| /// acceptable but causes a known divergence in cross-engine SQLLogic tests against SQLite. |
There was a problem hiding this comment.
AggregateFunctionVariance path is shared by stddevSampStable/stddevPopStable too, and the nearby comment explicitly documents ULP-level divergence from previous results.
Please update the PR Changelog entry text accordingly (it currently says "preserving strict IEEE 754 math precision", which conflicts with this code comment) and mention all affected stable functions.
c65d27b to
1b77103
Compare
1b77103 to
7a7d873
Compare
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
|
The failures of "Flaky check" in "functions_bad_arguments" will be fixed by #101994. |
|
I merged master including #101994, but the Stress test (arm_msan) failing with STID: 4179-5154 in ConnectionPoolWithFailover. As this appears to be an issue unrelated to the aggregate function changes. Should I wait for a specific fix? |
|
The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158 |
LLVM Coverage ReportChanged lines: 82.86% (29/35) | lost baseline coverage: 3 line(s) · Uncovered code |

Description
This PR devirtualizes the Welford inner loop for
varPopStable,varSampStable,stddevPopStable, andstddevSampStablewhen aggregatingFloat64columns.By overriding
addBatchSinglePlaceand iterating directly over the rawColumnFloat64data pointer, we remove theIColumn::getFloat64()virtual call per row, allowing the compiler to fully inline the logic.Behavior Change & CI Impact:
Exposing the loop to the optimizer allows Clang/GCC to keep intermediate values in registers and utilize FMA (
vfmadd231sd), rather than forcing a 64-bit memory store on every row.This causes an expected 1-ULP precision drift compared to the old virtualized code. Consequently, the
SQLLogic testfails with exactly 18 new query mismatches against SQLite (174,004 vs the current 173,986 limit). This drift is mathematically acceptable and is an explicit trade-off for the performance gain.Perf (500M rows, Float64):
(~1.3B instructions saved by dropping the vtable lookups and enabling FMA).
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Speed up
var*Stableandstddev*Stablefunctions forFloat64columns by devirtualizing the inner loop. Note: this enables compiler optimizations (FMA/registers) that alter floating-point results at the ULP level.Documentation entry for user-facing changes
Version info
26.4.1.1049