Aggregate function combinators: TOTALS and BY#103540
Conversation
| auto scratch_col = final_col->cloneEmpty(); | ||
| post_state->finalizeInto(*scratch_col, nullptr, 0, arena); | ||
|
|
||
| final_col = final_col->cloneEmpty(); |
There was a problem hiding this comment.
convertToBlockImplFinal can pre-insert one row for null key (has_null_key_data) before this function runs. In this branch, final_col is reset with cloneEmpty() and then filled only for places.size() rows, but places does not include the pre-inserted null-key row.
That drops the already-inserted row for combinator aggregates and can leave key/result columns with different sizes on nullable-key aggregation paths.
Concrete trace:
- before this block:
key_columns[0].size() == 1(null-key row),final_col.size() == 1 places.size() == N(non-null groups)- after
final_col = final_col->cloneEmpty(); ... insert N rows:final_col.size() == N(lost null-key row)
Please preserve the pre-inserted row (or include null-key state in this post-aggregation write path).
| /// TOTALS/BY combinators | ||
| bool totals_combinator = false; | ||
| std::optional<Names> by_columns; | ||
| std::optional<Names> order_by_columns; |
There was a problem hiding this comment.
AggregateDescription now carries semantic fields (totals_combinator, by_columns, order_by_columns), but serializeAggregateDescriptions/deserializeAggregateDescriptions still encode only column name + args + function + params.
That makes plan serialization lossy: whenever an AggregatingStep/MergingAggregatedStep is serialized and restored (for example distributed/remote execution paths), combinator metadata is dropped and execution can silently fall back to plain aggregate semantics.
Please extend aggregate-description serialization with these new fields (with explicit versioning and backward-compatible defaults) so TOTALS/BY/ORDER BY survive plan round-trips.
| #define COMPUTE_POST_TOTALS_ONLY(NAME) \ | ||
| else if (data_variants.type == AggregatedDataVariants::Type::NAME) \ | ||
| { \ | ||
| data_variants.NAME->data.forEachValue([&](const auto & /*key*/, auto & mapped) \ |
There was a problem hiding this comment.
prepareChunkAndFillSingleLevel/prepareChunksAndFillTwoLevel compute post-aggregation states only via forEachValue, so the nullable-key special row (data.hasNullKeyData / data.getNullKeyData) is never absorbed into post_aggregation.
That makes TOTALS/BY results incorrect for nullable-key aggregation: the null-key group contributes to regular aggregates (it is inserted in convertToBlockImplFinal) but is missing from post-aggregated combinator states.
Please include the null-key state in the post-aggregation pass (for both single-level and two-level paths), otherwise combinator output diverges from actual grouped data whenever a null-key row is present.
| /// hash table once, before any chunk conversion starts. This has to happen here | ||
| /// (and in prepareChunksAndFillTwoLevel), because insertResultsIntoColumns sees | ||
| /// only per-chunk places and cannot merge globally. | ||
| if (final && !post_aggregation.empty() && !post_aggregation.isComputed()) |
There was a problem hiding this comment.
This precompute pass iterates only hash-table cells, but it never absorbs data_variants.without_key (the overflow aggregate state used with group_by_overflow_mode='any').
As a result, combinator results miss rows that were folded into the overflow state. At minimum TOTALS should include that state; otherwise f(x TOTALS) becomes inconsistent with the actual aggregated input when overflow is enabled.
| throw Exception(ErrorCodes::LOGICAL_ERROR, | ||
| "BY post-state: key_columns must be provided during finalize"); | ||
|
|
||
| auto key = serializeByKey(key_columns, row_num, keys_arena); |
There was a problem hiding this comment.
finalizeInto serializes the BY key into keys_arena for every output row before lookup. Because arena allocations are monotonic, this turns finalization into an extra O(rows × key_size) retained allocation, which can become very large for big result sets.
Can we avoid arena-backed serialization on the lookup path (e.g. transient buffer / non-owning serialization), and reserve arena allocations only when inserting new keys in absorb?
| for (size_t i = 0; | ||
| i < by_nodes.size(); ++i) | ||
| { | ||
| if (auto * id = by_nodes[i] |
There was a problem hiding this comment.
calculateFunctionProjectionName serializes BY keys only for IdentifierNode / ColumnNode. For valid BY expressions (for example, sum(x BY toDate(ts)) with the same expression in GROUP BY), this branch emits an empty key list, so different BY expressions collapse to the same projection name like sum(x BY ).
That can cause analyzer-level name collisions/dedup and mis-bind aggregates that should stay distinct. Please serialize each BY node via the same generic naming path used elsewhere (for example, calculateActionNodeName equivalent), not only identifier/column fast paths.
af918f6 to
cd0af1d
Compare
| bool by_combinator = false; | ||
|
|
||
| /// Columns after BY: avg(x BY a, b) | ||
| ASTPtr by_combinator_columns; |
There was a problem hiding this comment.
by_combinator_columns is a semantic child, but ASTFunction::clone and ASTFunction::updateTreeHashImpl were not updated. A cloned function keeps a shallow pointer to the original BY list and does not put it back into children, so AST visitors/hash/checks can miss or conflate sum(x BY a); TOTALS has the same hash issue because it is only a boolean. Please deep-clone and attach by_combinator_columns in ASTFunction::clone and include both combinator markers in ASTFunction::updateTreeHashImpl.
| /// Build a const IColumn* array from out_cols.raw_key_columns for BY lookups. | ||
| /// (raw_key_columns is already a vector of IColumn* in the same order as params.keys.) | ||
| std::vector<const IColumn *> out_key_cols(params.keys_size); | ||
| for (size_t i = 0; i < params.keys_size; ++i) |
There was a problem hiding this comment.
BY lookup positions are resolved against the original GROUP BY key order, but this uses raw_key_columns after convertToBlockImplFinal has called method.shuffleKeyColumns(out_cols.raw_key_columns, key_sizes). For fixed composite-key methods that can reorder the pointer vector by key size, so a BY aggregate over the second key can serialize a different output column during finalizeInto. The precompute path has the opposite problem: it calls method_ref.insertKeyIntoColumns(key, raw_key_cols, key_sizes, ...) without applying the same shuffle first.
That makes BY states and final lookups disagree for some key layouts, producing missed lookups or values for the wrong subset. Please keep a key-column array in the original params.keys order for PostAggregationState lookups, or carry an explicit original-position mapping through shuffleKeyColumns and use it consistently in both precompute and finalization.
| /// prepareChunksAndFillTwoLevelImpl kicks in. This guarantees no race between | ||
| /// post-state accumulation and parallel bucket-to-chunk conversion. | ||
| if (final) | ||
| computePostAggregationStates(data_variants); |
There was a problem hiding this comment.
The test and comment now say two-level aggregation is not supported for TOTALS / BY, but this path still enters two-level finalization whenever the thresholds convert data_variants to two-level. prepareChunksAndFillTwoLevel computes post states and then converts buckets instead of throwing, so the new T13 query will execute here rather than raise the expected NOT_IMPLEMENTED exception.
If two-level is intentionally unsupported for now, please reject it here with the same !post_aggregation.empty() guard used in the distributed merge path before calling computePostAggregationStates. Otherwise the test should assert correct two-level results and cover the BY cases too.
| /// (and in prepareChunksAndFillTwoLevel), because insertResultsIntoColumns sees | ||
| /// only per-chunk places and cannot merge globally. | ||
| if (final) | ||
| computePostAggregationStates(data_variants); |
There was a problem hiding this comment.
prepareChunkAndFillSingleLevel is also reached from Aggregator::mergeBlocks in the memory-efficient/external aggregation path, where each call may finalize only one bucket's merged chunks. Calling computePostAggregationStates here then makes that bucket-local state final for the whole Aggregator because PostAggregationManager::computeOnce suppresses later calls. In that path sum(x TOTALS) / sum(x BY k) can be finalized from one bucket instead of all buckets, or the first computed bucket can be reused for subsequent buckets. Please reject these combinators in the bucket-by-bucket merge path as in mergeAndConvertOneBucketToChunk, or compute post-aggregation only after all buckets that contribute to the result have been merged into one complete variant.
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR #106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. #89360, #103540, #106120, #106386, #106404, #106522, #105102, #106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR #106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. #89360, #103540, #106120, #106386, #106404, #106522, #105102, #106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR #106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. #89360, #103540, #106120, #106386, #106404, #106522, #105102, #106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR ClickHouse#106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. ClickHouse#89360, ClickHouse#103540, ClickHouse#106120, ClickHouse#106386, ClickHouse#106404, ClickHouse#106522, ClickHouse#105102, ClickHouse#106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
…rg::alter PR #106102 (merged 2026-06-05 17:43 UTC) refactored Iceberg::alter to declare last_version and compression_method outside an if/else that assigns them in both branches. Clang-tidy cppcoreguidelines-init-variables flags the bare declarations and Build (arm_tidy) is built with -warnings-as-errors, so the build fails for every PR whose CI ran on master after this commit. CIDB shows 30+ unrelated PRs hitting this in the last 2 days (e.g. #89360, #103540, #106120, #106386, #106404, #106522, #105102, #106590). Initialize both variables to safe defaults at declaration. They are unconditionally overwritten in both if and else branches before use, so behavior is unchanged. The new defaults are only relevant if a future code path skips both assignments, in which case last_version=0 and compression_method=CompressionMethod::None are sane no-op values (the same defaults the old structured-binding form would produce through default-construction of the destructured aggregate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d72cac5)
| return; | ||
|
|
||
| /// Do not fuse functions with TOTALS/BY combinators — fusing loses the combinator flag. | ||
| if (function_node->hasTotalsCombinator() || function_node->hasByCombinator()) |
There was a problem hiding this comment.
TOTALS / BY need to be treated as semantic modifiers by every aggregate rewrite, not only by FuseFunctionsPass. Several default-on sibling passes still rewrite aggregate FunctionNodes without checking these bits, so they can silently change the post-aggregation contract.
Concrete example: AggregateFunctionOfGroupByKeysPass can replace any(city BY country) in GROUP BY country, city with plain city, because the argument is a group key. The expected BY country result is one post-aggregated value shared by all cities in a country, but after the rewrite it becomes the per-row city key. Similar guards are needed for other aggregate rewrites that clear/replace arguments or create a new aggregate node (NormalizeCountVariantsPass, SumIfToCountIfPass, RewriteAggregateFunctionWithIfPass, UniqToCountPass, etc.), or they must explicitly preserve and revalidate the combinator semantics.
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered by tests: 385/453 (84.99%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 7 line(s) · Uncovered code |
`test_replication_without_zookeeper/test.py::test_startup_without_zookeeper` intermittently fails with `kazoo.exceptions.NotEmptyError` inside `drop_zk`, which calls `zk.delete(path="/clickhouse", recursive=True)`. The drop happens while `node1` (ClickHouse) is still running, so its `ReplicatedMergeTree` background threads keep re-creating ZooKeeper nodes under `/clickhouse`. kazoo's recursive delete races with that: between `get_children` and `delete` a new child node can appear, and kazoo raises `NotEmptyError` without retrying. The test already used `cluster.run_kazoo_commands_with_retries`, but with the default `repeats=1`. In that helper the retry loop is `for i in range(repeats - 1)`, so `repeats=1` performs zero retries — the callback runs exactly once with no exception handling. Pass `repeats=5` (matching the existing usages in `helpers/cluster.py`) so the drop is retried on the transient `NotEmptyError`. This is a pre-existing flaky test (same failure seen on unrelated PRs, e.g. ClickHouse#103540 and ClickHouse#101757), not a regression. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=d40ea5d0da4103b54971ac01a9960ef9153242bb&name_0=MasterCI&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20db%20disk%2C%20old%20analyzer%2C%203%2F6%29 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Changelog category
Changelog entry
Add
TOTALSandBYcombinators for aggregate functions.Description
This PR implements two new aggregate function combinators:
TOTALS: computes the aggregate over the entire dataset, independent of
GROUP BY. Example:SELECT country, avg(salary), avg(salary TOTALS) FROM t GROUP BY country— the third column is the same value (global average) in every row.BY: computes the aggregate over a subset of
GROUP BYkeys. Example:SELECT country, city, avg(salary), avg(salary BY country) FROM t GROUP BY country, city— the fourth column is the per-country average, same value for rows sharing the same country.Closes part of #34156. The
ORDER BYcombinator from the same issue is implemented in a separate PR.Approach
Both combinators are implemented as post-aggregation of the main hash table, not as wrappers around the aggregate function. The reason: TOTALS and BY require access to states of multiple groups (TOTALS — all groups, BY — groups sharing a subset of keys), which a wrapper can't provide because it operates within a single group.
After the main aggregation hash table is filled, a single pass merges per-group states into auxiliary post-aggregation states (
PostAggregationState):When the result chunk is formed, values for aggregates with combinators are taken from the post-aggregation states instead of per-group states.
Compared to a JOIN-based implementation (separate aggregation per BY-subset + LEFT JOIN), this requires only one pass over input data and one main hash table.
Components changed
ExpressionListParsers.cpp,ASTFunction.h/cpp): recognizesTOTALSandBY <cols>inside function arguments.QueryTreeBuilder.cpp,FunctionNode.h/cpp): propagates combinator info to query tree.resolveFunction.cpp,QueryAnalyzer.cpp): resolves BY columns to data sources, generates unique projection names.PlannerActionsVisitor.cpp,PlannerExpressionAnalysis.cpp,PlannerAggregation.cpp): validates BY columns are in GROUP BY, fillsAggregateDescriptionwith combinator info.Aggregator.h/cpp): addsPostAggregationManager, calls post-processing at finalization points, uses post-states ininsertResultsIntoColumns.PostAggregationState.h/cpp): the post-aggregation machinery itself.FuseFunctionsPass.cpp): skips fusingsum/avgintosumCountfor aggregates with combinators (preserves combinator information).Both single-level and two-level hash aggregation are supported.
Validation
BY columns must be a subset of GROUP BY keys. Checked at planning time (
analyzeAggregation) and double-checked atPostAggregationManager::init(defense in depth). Error:BAD_ARGUMENTSwith a message identifying the offending column and aggregate.