Fix correlated subquery + GROUP BY ROLLUP under group_by_use_nulls#104350
Fix correlated subquery + GROUP BY ROLLUP under group_by_use_nulls#104350groeneai wants to merge 21 commits into
Conversation
A correlated column reference resolved inside a subquery used to keep its original (non-Nullable) type even when the column is a GROUP BY key of an outer query under `group_by_use_nulls = 1` with `WITH ROLLUP`/`CUBE`/ `GROUPING SETS`. The analyzer's `nullable_group_by_keys` walk stopped at the first enclosing `QUERY` scope, so the outer scope where the column was defined was never consulted for the inner reference. The planner then built the inner subquery's `ActionsDAG` (CAST wrappers, aggregate functions, etc.) with the original argument types. After the decorrelation pass replaced `PLACEHOLDER` nodes with `INPUT` nodes whose actual types were Nullable (because the outer GROUP BY made the keys Nullable), the precomputed wrappers received columns of an unexpected type and threw `Logical error: 'Bad cast from type DB::ColumnNullable to DB::ColumnVector<...>'` during header inference. The fix lifts the walk so it also visits enclosing outer-query scopes for column nodes whose source table expression is not registered in the current scope's FROM (i.e. correlated). At each scope, the `expressions_in_resolve_process_stack.hasAggregateFunction()` exclusion is evaluated against THAT scope's stack, so an outer scope's `nullable_group_by_keys` correctly applies even when the inner subquery is currently inside an inner aggregate function call (the inner aggregate operates on the outer's already-Nullable values). For correlated matches the existing column node is mutated in place instead of being replaced with a clone of the GROUP BY key node. The same `shared_ptr` is referenced by the enclosing `QueryNode`'s `correlated_columns` list (registered via `addCorrelatedColumn`); cloning would leave the planner's `correlated_columns_set` pointing to the original non-Nullable node, and `PlannerActionsVisitor::visitColumn` (which uses type-sensitive equality) would then fail to identify the projected column as correlated and skip the `PLACEHOLDER` step that the decorrelation pass replaces. Closes ClickHouse#91119.
|
cc @novikd @KochetovNicolai — could you review this? It fixes #91119 by extending the analyzer's |
|
cc @PedroTadim — invited me here. The reproducer from #91119 (and related cases — CUBE, GROUPING SETS, aggregate function over correlated outer column, HAVING/WHERE with correlated subqueries, Nullable source) all return correct results with this fix. Also non-regressed against the existing |
|
Workflow [PR], commit [775264e] Summary: ❌
AI ReviewSummaryThis PR reworks Final Verdict✅ No new findings on the current PR head. |
CI Status — covered SHA
|
| Check | Test | Disposition |
|---|---|---|
Fast test |
Server died |
Chronic Fast-test infra flake. The Fast-test runner reports WARNING: Timeout exceeded [300], sending SIGTERM followed by Hung check failed: server is not responding. CIDB 30d shows Server died is endemic across 100+ PRs (top: PR #104754=33, PR #100185=83) and master (20 hits). Last-test-before-hang on this run was 01548_uncomparable_columns_in_keys at 06:13:31, then a 42-second silence before the timeout — clearly not produced by the analyzer changes in this PR (analyzer/group-by-use-nulls). |
Mergeable Check / PR / CH Inc sync |
n/a | Aggregator/infra rollup of the above. |
Code change unchanged since 2026-05-08. The 22 related existing tests still pass; the comprehensive new test 04210_91119_correlated_subquery_group_by_use_nulls.sql (12 query shapes) verifies the analyzer-level fix. Awaiting @novikd / @KochetovNicolai re-review.
— cron:our-pr-ci-monitor:20260604-073000
The previous commit replaced the OR-accumulated aggregate-function check in
`resolveExpressionNode` with a per-scope check, intending to allow the walk to
continue past the enclosing `QUERY` scope for correlated subqueries. That
broke `* APPLY x -> agg(x, key)` queries under `group_by_use_nulls = 1`:
* The LAMBDA scope sees the aggregate on its `expressions_in_resolve_process_stack`
but its own `nullable_group_by_keys` is empty, so the per-scope check skips
that scope's lookup.
* The outer `QUERY` scope holds the keys but its own stack does not see the
aggregate (the aggregate lives inside the lambda body), so the per-scope
check finds the key and wraps the inner reference as `Nullable`.
* The aggregate's `addBatchSinglePlace` then runs on a `ColumnNullable` where
its signature expects `ColumnVector<UInt64>` (or `ColumnString` for the
`min` family), and the `assert_cast` faults inside `SingleValueDataFixed::set`.
Fast test on PR ClickHouse#104350 SHA `30c602753b` reproduced this on
`04067_group_by_use_nulls_apply_aggregate.sql` (server died, signal 11 in
`AggregateFunctionNullVariadic::addBatchSinglePlace`). The exact same shape
appears in the AST-fuzzer findings tracked under STID `0250-30be`,
`1499-22cb`, `3336-34e8`, and `3336-2f87`.
Restore the OR-accumulated `in_aggregate_function_scope` flag for scopes
within the same query nesting level, and reset it when crossing into an
outer `QUERY` scope. The reset preserves the new correlated-subquery
behaviour: an outer scope's GROUP BY keys are evaluated relative to the
outer scope's own resolution context, so an aggregate in an inner subquery
must not shield against the outer scope's `nullable_group_by_keys`.
Adds a regression test
`04326_91119_apply_lambda_aggregate_rollup.sql` covering nine variants of
`* APPLY <lambda>` plus an aggregate plus `ROLLUP` / `CUBE` / `GROUPING SETS`
under `group_by_use_nulls = 1`, including the regex-filtered APPLY shape from
issue ClickHouse#98190 (STID `1499-22cb`). The existing
`04067_group_by_use_nulls_apply_aggregate` continues to pass.
Local verification: master HEAD passes; PR HEAD without this fix aborts on
`04067` query 1; PR HEAD with this fix passes `04067`, `04210` (PR's own
test), and `04326` (new regression test).
Related: ClickHouse#91119
Related: ClickHouse#98190
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pre-PR validation gate
Session id: cron:clickhouse-ci-task-worker:20260608-134700 |
CI Ledger - covered SHA: 6f52db2Workflow
The Fast-test failure on Session: cron:our-pr-ci-monitor:20260608-193000 |
…d-rollup-cast-wrapper
…lls to a correlated column `nullable_group_by_keys` matching uses query-tree equality, and `ColumnNode::isEqualImpl` compares only the column name and type, not the column source. For a correlated reference this meant an intermediate query with a same-named, same-typed `GROUP BY` key (under `WITH ROLLUP`/`CUBE`/`GROUPING SETS`) could match the outer column and wrap it as `Nullable`, even though that intermediate query's rollup does not produce `NULL`s for the outer column. Require that the scope whose `nullable_group_by_keys` is being applied actually owns the column's source table expression (the same ownership test already used at the `QUERY`-boundary break). Otherwise keep walking outwards to the owning scope. Single-level correlated cases (the only ones currently reachable; deeper correlation through an intermediate grouped query is `NOT_IMPLEMENTED` in the planner) are unaffected, as confirmed by the existing tests. Also update the stale block comment to describe the actual OR-accumulate-within-query / reset-at-`QUERY`-boundary rule for the aggregate-function exclusion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Merged Addressed both review threads in 62084af:
Verified locally: the #91119 reproducer now returns Remaining CI reds are all the BuzzHouse |
CI Ledger — covered SHA:
|
| QueryTreeNodePtr column_source; | ||
| if (auto * column_node = node->as<ColumnNode>()) | ||
| column_source = column_node->getColumnSourceOrNull(); |
There was a problem hiding this comment.
I guess it won't work in cases like GROUP BY c % 2
There was a problem hiding this comment.
Confirmed locally. With
SET enable_analyzer=1, allow_experimental_correlated_subqueries=1, group_by_use_nulls=1;
EXPLAIN QUERY TREE
SELECT (SELECT c % 2)
FROM (SELECT toInt32(3) AS c) t
GROUP BY c % 2 WITH ROLLUP;the inner query tree is
QUERY id: 2, alias: __table1, is_subquery: 1, is_correlated: 1
PROJECTION COLUMNS
modulo(c, 2) Int16 <-- should be Nullable(Int16)
PROJECTION
FUNCTION id: 7, function_name: modulo, result_type: Int16
ARGUMENTS
COLUMN id: 4, column_name: c, result_type: Int32, source_id: 5 <-- should be Nullable(Int32)
The inner expression is left non-Nullable because the walk only captures column_source for a bare ColumnNode, so a FunctionNode like c % 2 (or any compound expression that contains a correlated column) hits if (!column_source) break; at the inner QUERY boundary and never reaches the outer scope's nullable_group_by_keys, which stores c % 2 itself, not bare c.
The planner currently rejects this shape with NOT_IMPLEMENTED: can't find correlated column ... in current header: __grouping_set, modulo(__table3.c, 2_UInt8), so it does not surface to users today, but the analyzer is producing the wrong type. The right shape of the fix is to drive the walk by "this expression contains a correlated column whose source lives in an outer scope" rather than by node->as<ColumnNode>()->getColumnSourceOrNull(). I will pivot in that direction as part of addressing C3 / C4 below.
| bool is_correlated_to_outer_scope = column_source | ||
| && !scope.registered_table_expression_nodes.contains(column_source) | ||
| && scope_ptr != &scope; | ||
| if (is_correlated_to_outer_scope) | ||
| { | ||
| /// `nullable_group_by_keys` matching uses query-tree equality, which for | ||
| /// columns (`ColumnNode::isEqualImpl`) compares only the column name and | ||
| /// type, not the column source. For a correlated reference this means an | ||
| /// intermediate query with a same-named, same-typed GROUP BY key could | ||
| /// match `node` even though that query's rollup does not produce `NULL`s | ||
| /// for the outer column. Only the outer query that actually owns the | ||
| /// column's source table expression may apply its `group_by_use_nulls` | ||
| /// nullability, so require source ownership before wrapping; otherwise keep | ||
| /// walking outwards to the owning scope. | ||
| if (scope_ptr->registered_table_expression_nodes.contains(column_source)) | ||
| { | ||
| node->convertToNullable(); | ||
| break; | ||
| } |
There was a problem hiding this comment.
Same root cause as C1. The branch wraps the bare-column path: when is_correlated_to_outer_scope is true (so node is a bare ColumnNode) AND we have reached the source-owning scope, the column is mutated in place and we break out. For the bare-c shape this works (e.g. (SELECT c) FROM t GROUP BY c WITH ROLLUP, group_by_use_nulls=1 correctly produces Nullable(Int32) on the inner column). The issue is the surrounding loop, not this branch in isolation: once we leave the bare-column case (e.g. the inner expression is c % 2), column_source is null and the walk stops at the inner QUERY boundary, so we never get the chance to apply node->convertToNullable() to the compound expression here. C3 covers the actual stop condition; once that's fixed the in-place mutation here needs to be widened to compound expressions too (mutate node rather than clone, otherwise the outer QueryNode::correlated_columns_list and the planner's correlated_columns_set end up referencing the original non-Nullable column).
| if (!column_source) | ||
| break; |
There was a problem hiding this comment.
This condition does not make sense.
There was a problem hiding this comment.
Agreed. if (!column_source) break; is what makes the C1 / C2 case slip through: a non-ColumnNode (e.g. a FunctionNode for c % 2) carries no column_source, so we stop at the inner QUERY boundary even when the expression is correlated through child columns whose real source lives further out. The condition was a literal port of the previous column_source design, but the design itself is too narrow.
What we actually want at this point is "keep walking iff the expression we are resolving is correlated relative to the current scope", which is a property of the expression's correlated descendants, not just its top-level node. Concretely: walk node's subtree once before the loop, collect all ColumnNode sources that are NOT registered in &scope's registered_table_expression_nodes, and use that set to drive both (a) whether we walk past the inner QUERY and (b) which outer scope is the source-owning one. For a bare-column node, the set is just {column_source} and the behaviour is unchanged; for a compound expression like c % 2, the set is {c} and the walk continues to the outer scope where nullable_group_by_keys.find(c % 2) matches. A query whose expression mixes correlated columns from different outer scopes already triggers NOT_IMPLEMENTED upstream in checkCorrelatedColumn / decorrelation, so a single owning-source assumption is safe to make here.
I will rework the condition along those lines and re-test the cases that were the original reason for the walk-past-QUERY (the (SELECT c) shape, 04210, 04326).
| if (scope_ptr->registered_table_expression_nodes.contains(column_source)) | ||
| break; | ||
|
|
||
| in_aggregate_function_scope = false; |
There was a problem hiding this comment.
Agreed, and reworked. In the rewritten applyGroupByUseNullsToExpression, the reset only happens when we deliberately continue the walk into an outer query for a correlated expression that has not yet reached its source-owning scope:
if (!is_correlated() || ownsCorrelatedSource(*scope_ptr, *correlated_columns))
break;
is_aggregate_argument = false;So we never reset (and never walk outward) for a non-correlated expression, and we stop as soon as we reach the scope that owns the correlated column's source. The reset itself is correct for the cases that do continue: the inner subquery's aggregate context (e.g. the inner argMax in * APPLY x -> argMax(x, key)) constrains only the inner query's own keys; it must not suppress wrapping of a reference that is correlated to an outer query's GROUP BY key, because the values fed into the inner query are the outer query's post-rollup Nullables. The previous unconditional reset at every QUERY boundary was the part that did not hold up.
Addresses review feedback on the group_by_use_nulls Nullable-wrapping block in QueryAnalyzer::resolveExpressionNode. The previous version decided whether to walk past an enclosing QUERY scope for a correlated reference using node->as<ColumnNode>()->getColumnSourceOrNull(). That only recognises a bare column as correlated, so a compound expression such as `c % 2` (the GROUP BY key being `c % 2`) was left non-Nullable, and the stop condition `if (!column_source) break;` conflated "not a column" with "not correlated". Extract the logic into applyGroupByUseNullsToExpression and drive it by whether the expression is correlated relative to the current scope, computed with the existing CorrelatedColumnsCollector (the same mechanism used elsewhere in the resolver). The walk now continues past a QUERY boundary only for a genuinely correlated expression and only up to the scope that owns the correlated column's source, so an intermediate query with a same-named GROUP BY key cannot match. A correlated match wraps the node in place to preserve the shared_ptr identity that the planner's correlated-columns set relies on. A fast path returns early when no enclosing scope has any nullable_group_by_keys, and the correlated-columns analysis is computed lazily, so queries without group_by_use_nulls pay only a cheap scope-chain check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed What changed per comment:
One honesty note on scope: the compound-key correlated shapes (C1/C2) are currently rejected by the planner before execution, on both master and this branch: So this part of the change fixes the analyzer's type, which is the correctness issue you flagged, but there is no runnable query that demonstrates a row-level difference for it yet. The reachable shapes (bare GROUP BY key, with bare or compound correlated references, plus the original Pre-PR validation gate
Session id: cron:clickhouse-worker-slot-5:20260610-090200 |
The previous commit drove the group_by_use_nulls Nullable-wrapping in applyGroupByUseNullsToExpression by expression correlation, but used `scope_ptr == &scope` as the discriminator for the non-correlated clone-and-wrap branch. A GROUP BY key referenced inside a lambda body, e.g. `arrayMap(x -> intDiv(x, key), ...)` where `key` is a GROUP BY key of the surrounding query, is resolved with `&scope` pointing at the LAMBDA scope while the key lives in the enclosing QUERY scope. Such a reference is not correlated, so it matched neither the `scope_ptr == &scope` clone branch nor the correlated `ownsCorrelatedSource` branch, and was left non-Nullable. That produced two regressions visible in the Fast test: - 04001_logical_error_with_cube: the lambda-body key kept its non-Nullable type, so arrayMap returned Array(UInt8) instead of Array(Nullable(UInt8)), which changed DISTINCT semantics and emitted an extra row. - 03023_group_by_use_nulls_analyzer_crashes: the type mismatch between the wrapped projection column and the unwrapped lambda-captured column produced `Cannot capture column N because it has incompatible type: got Nullable(UInt8), but UInt8 is expected` (LOGICAL_ERROR). The matched key belongs to the same query as the reference whenever the expression is not correlated (whether resolved directly in the query or inside one of its lambda scopes), so take the clone-and-wrap branch for any non-correlated match. The `scope_ptr == &scope` check is kept as a fast path so the common direct-projection match skips the correlated-columns analysis. The correlated in-place wrap (gated by ownsCorrelatedSource) is unchanged, and the compound-key correlated case the previous commit addressed still wraps as Nullable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed What broke and why: Fix: take the clone-and-wrap branch for any non-correlated match ( Pre-PR validation gate
Session id: cron:clickhouse-worker-slot-7:20260610-104500 |
CI Ledger — covered SHA:
|
| Check | Failure | Classification | Disposition |
|---|---|---|---|
Integration tests (amd_asan_ubsan, db disk, old analyzer, 1/6) |
test_refreshable_mat_view_replicated/test.py::test_circular_dependencies_survive_restart |
chronic flaky (NOT PR-caused) | 30-day cross-PR: 14+ distinct unrelated PRs + master. Tracked under issue #106651. This PR only touches QueryAnalyzer.cpp + group_by_use_nulls tests, nothing in refreshable-mat-view / ReplicatedMergeTree. |
Stateless tests (arm_asan_ubsan, azure, parallel) |
Server died / OOM in dmesg / Logical error: Expected 3 to 10 arguments in table function azureBlobStorage, got 1 |
infra outage (NOT PR-caused) | The arm_asan_ubsan, azure, parallel runner is in a broad failure window: 272 distinct tests across 41 unrelated PRs failed on this exact check in the last 12h (OOM / Server died). Independent of this PR's diff. |
Mergeable Check / PR / CH Inc sync |
aggregators | downstream | Reflect the two above; no independent signal. |
The two Fast test failures from the prior commit (be05b67e0cc: 03023_group_by_use_nulls_analyzer_crashes, 04001_logical_error_with_cube) were PR-caused regressions from the rework and are fixed in e8d62a69667 (both verified passing locally in both directions). No PR-caused failures remain on the current HEAD.
A correlated subquery aggregating a String key (e.g. (SELECT min(c0)) over a GROUP BY ... WITH ROLLUP / GROUPING SETS key) hits the ColumnNullable -> ColumnString wrapper instead of the numeric ColumnVector path, so it was not covered by the existing 04210 cases. Both queries crashed with 'Bad cast from ColumnNullable to ColumnString' on master before this fix and now return the post-rollup Nullable value. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed While reviewing the rework I rebuilt the PR on current
Pre-PR validation gate
Session id: cron:clickhouse-worker-slot-9:20260610-190100 |
|
CI finished on this head (ec6b58a). The single red check is not caused by this change.
All other checks pass. The fuzzer "Bad cast" (STID 3721-419e) fix in this commit (LowCardinality-aware nullable wrapping in (cron:our-pr-ci-monitor:20260622-080000) |
|
@novikd — verified the current
The end-to-end planning symptom is gone too: The existing tests check query results; I put together a complementary one that pins the types the analyzer assigns (passes against this revision, and fails with a clean diff if the correlated type stops being wrapped). Happy to contribute it if useful: SET enable_analyzer = 1;
SET allow_experimental_correlated_subqueries = 1;
SET group_by_use_nulls = 1;
SELECT '-- bare correlated column ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
SELECT (SELECT c) FROM (SELECT toInt32(3) AS c) t GROUP BY c WITH ROLLUP)
WHERE explain ILIKE '%column_name: c,%result_type%';
SELECT '-- compound correlated key (modulo) ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
SELECT (SELECT c % 2) FROM (SELECT toInt32(3) AS c) t GROUP BY c % 2 WITH ROLLUP)
WHERE explain ILIKE '%function_name: modulo%result_type%';
SELECT '-- aggregate argument stays non-Nullable ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
SELECT * APPLY x -> argMax(x, number) FROM numbers(1) GROUP BY number WITH ROLLUP)
WHERE explain ILIKE '%column_name: number,%result_type%';Output on this revision: |
|
Thanks @coderashed, and thanks again for routing #108714 here. Both tests are welcome and I will fold them into
The type-pinning one is a good addition: 04210 only checks results, so a regression that silently stopped wrapping the correlated key I will add both on the next rebase rather than a coverage-only push now (this branch rebases often for the |
…d-rollup-cast-wrapper # Conflicts: # src/Analyzer/ConstantNode.cpp
Two test cases contributed by @coderashed (who closed his duplicate PR ClickHouse#108714 in favour of this one): 1. The exact AST-fuzzer seed query from ClickHouse#108714: a correlated scalar subquery toString(number) inside concat() over a GROUPING SETS key. Aborts on master with 'Bad cast ColumnNullable to ColumnString', returns s0..s4 with the fix. 2. A type-pinning EXPLAIN QUERY TREE check. The existing cases assert query results, which still pass if a future change silently stops wrapping the correlated key as Nullable (values unchanged, declared type wrong). These pin the analyzer-assigned types via SELECT ... FROM (EXPLAIN QUERY TREE ...) WHERE explain ILIKE '%...%', so dropping the Nullable wrapping fails with a clean diff. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Rebased onto master (the branch had drifted to CONFLICTING). The only conflict was in Folded in @coderashed's two contributed cases (04210), credited:
Both verified both directions against a local debug build (Build ID b5f40416, has |
…sion (ClickHouse#91119) A code review raised that the `scope_ptr == &scope` fast path in applyGroupByUseNullsToExpression could mis-wrap a correlated outer column as the inner query's same-named local GROUP BY key, because nullable_group_by_keys matches on name+type only (ColumnNode::isEqualImpl ignores column source). Investigated and confirmed this cannot happen: scope.nullable_group_by_keys is populated only for ROLLUP/CUBE/GROUPING SETS queries under group_by_use_nulls (IdentifierResolveScope.cpp), so a plain inner GROUP BY registers no keys and the inner-scope find() is never reached. For the inner scope to hold a colliding entry it would itself have to be ROLLUP/CUBE/GROUPING SETS while correlated, which decorrelation rejects with NOT_IMPLEMENTED. No source change is needed. Add the collision shape as a regression (results plus an EXPLAIN QUERY TREE type pin) so a future over-broad guard cannot silently flip the inner local key to Nullable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI finish ledger — 0596553CI fully finished (Finish Workflow: success). This PR fixes a Bad-cast in argMax + APPLY + ROLLUP (analyzer). Every failure has an owner; none is PR-caused.
Session id: cron:our-pr-ci-monitor:20260630-163000 |
|
Fixing PR for the |
STID 3418-3e98 (AST fuzzer, arm_asan_ubsan) is the statistics-aggregate manifestation of the same correlated-subquery + group_by_use_nulls bug. skewSamp/kurtPop/varSamp/covarSamp and friends share one add() that reads the argument via an unchecked static_cast (AggregateFunctionStatisticsSimple.h), so when the correlated argument is left non-Nullable the failure is a silent heap-buffer-overflow on release builds instead of the clean "Bad cast" exception the min/sum/argMax cases already cover. Add runtime cases over the statistics family (crash without the fix, correct output with it) plus an EXPLAIN QUERY TREE type pin that the correlated statistics-aggregate argument is wrapped Nullable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed The AST fuzzer surfaced this bug again as STID Minimal repro (aborts on master with ASan, correct with this PR): SELECT (SELECT skewSamp(a.k)) FROM (SELECT number AS k FROM numbers(64)) AS a
GROUP BY a.k WITH CUBE SETTINGS group_by_use_nulls = 1;Verified both directions locally (ASan): the new runtime cases ( |
The correlated subquery returns a tuple whose element is the outer rollup key. On master this aborts with a LOGICAL_ERROR (Unexpected return type from tuple) under group_by_use_nulls + WITH ROLLUP/CUBE; the analyzer fix resolves the tuple element to Nullable so it matches the decorrelated input. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The two coverage commits pushed for this test (the statistics-aggregate crash-guard section at numbers(64) and the ClickHouse#106377 correlated-tuple cases) made 04210 heavy enough to exceed the flaky-check 180s soft cap under debug builds with randomized settings, so the flaky-check job reported 'Test runs too long (> 180s)'. The numbers(64) size is required for the statistics-aggregate out-of-bounds read to land on an ASan redzone, so the coverage must not be reduced. Tag the test long instead: this exempts it from the flaky-check 180s FAIL and runs it at the reduced long_test_runs_ratio repeat count. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…use#91119) The scalar path (applyGroupByUseNullsToExpression) suppresses the group_by_use_nulls Nullable wrapping for both aggregate and grouping() function arguments via hasAggregateOrGroupingFunction(). The sibling APPLY-matcher path in resolveMatcher only suppressed it for aggregate functions, so a query like SELECT * APPLY x -> grouping(x) FROM numbers(1) GROUP BY number WITH ROLLUP SETTINGS group_by_use_nulls = 1 cloned the matched `number` key as Nullable before the lambda was resolved. GroupingFunctionsResolvePass matches its arguments against the original GROUP BY keys, so the rewritten grouping(number) no longer matched and raised a spurious "GROUPING function ... is not in GROUP BY keys" (BAD_ARGUMENTS) exception even though the key is valid. Make the APPLY-tree scan detect grouping() the same way as the scalar path (exact name comparison, matching ExpressionsStack:: isAggregateOrGroupingFunction). Covers both the lambda form (* APPLY x -> grouping(x)) and the function form (* APPLY grouping), and grouping() nested inside other functions, across all GROUP BY shapes (ROLLUP, CUBE, GROUPING SETS). Regression cases added to 04326_91119_apply_lambda_aggregate_rollup. Found by clickhouse-gh[bot] review on PR ClickHouse#104350. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pre-PR validation gate for the APPLY- Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-2:20260701-172600 |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 57/61 (93.44%) · Uncovered code |
CI finish ledger — 775264eCI fully finished (Finish Workflow: success, Mergeable Check: success). This PR fixes a Bad-cast in correlated subquery + GROUP BY ROLLUP under
Session id: cron:our-pr-ci-monitor:20260701-223000 |
CI finish ledger update — STID 4752-5ad8 owner resolved to a fixing PRThe AST-fuzzer "Input nodes size mismatch in dag" / "Cannot find input column ... on its position in inputs of expression actions DAG" leaf (STID 4752-5ad8) now resolves to a merged fixing PR, not a fix task. Fixing PR: #107112 ("Fix LOGICAL_ERROR in decorrelated correlated subquery join with duplicate columns"), merged 2026-07-01 20:20 UTC. Root cause (confirmed by local reproduction): a correlated scalar subquery is decorrelated into a JOIN; when its input side projects the same column identifier twice and Deterministic minimal trigger (2 settings): SET allow_experimental_correlated_subqueries = 1;
SET correlated_subqueries_default_join_kind = 'left';
SELECT x, (SELECT y) FROM (SELECT 1 AS x, 2 AS y, x, y QUALIFY 7) LIMIT 100;Verified locally: server aborts 5/5 without #107112, runs clean with it. The Why this branch still saw the crash: this PR's failing run (commit Session id: cron:clickhouse-worker-slot-1:20260701-225000 |
CI finish ledger — 775264eEvery failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only Session id: cron:our-pr-ci-monitor:20260703-043000 |

A correlated subquery references an outer column that is a GROUP BY key of an outer query under
group_by_use_nulls = 1withWITH ROLLUP/CUBE/GROUPING SETS. The actual values fed into the inner subquery are post-rollupNullables, but the analyzer used to leave the inner column reference at its non-Nullable type. The planner then built CAST wrappers and aggregate functions with the original signature, and at header inference (after decorrelation replacedPLACEHOLDERnodes withINPUTnodes of the actualNullabletypes) the precomputed wrappers raisedLogical error: 'Bad cast from type DB::ColumnNullable to DB::ColumnVector<...>'.The minimal reproducer from issue #91119 hits
createUInt8ToBoolWrapper:The same surface bug class also appears for plain numeric types (modulo, plus), aggregate functions over correlated outer columns (e.g.
(SELECT anyLastOrDefault(number))), correlatedHAVING/WHEREfilter steps, andNullablesource columns.The root cause is in
QueryAnalyzer::resolveExpressionNode: thenullable_group_by_keyswalk stops at the first enclosingQUERYscope, so the outer scope where the column was defined is never consulted for an inner correlated reference. This change makes the walk continue past the innerQUERYscope when thenodeis a column whose source table expression is registered in a deeper outer scope (i.e. the column is correlated). The aggregate-function exclusion is now evaluated against each scope'sexpressions_in_resolve_process_stack, so an outer scope'snullable_group_by_keysapplies even when the inner subquery is currently inside an inner aggregate (the inner aggregate operates on the outer's already-Nullablevalues).For correlated matches the existing node is mutated in place instead of being replaced with a clone of the GROUP BY key. The same
shared_ptris referenced by the enclosingQueryNode::correlated_columnslist (registered viaaddCorrelatedColumn); cloning would leave the planner'scorrelated_columns_setpointing to the original non-Nullablenode, andPlannerActionsVisitor::visitColumn(which uses type-sensitive equality) would then fail to identify the projected column as correlated and skip thePLACEHOLDERstep that the decorrelation pass replaces.Closes #91119.
Closes #106377.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed
Bad castexception in correlated subqueries that reference an outer column promoted toNullablebygroup_by_use_nulls = 1withGROUP BYWITH ROLLUP/CUBE/GROUPING SETS.Documentation entry for user-facing changes