Fix correlated subquery + GROUP BY ROLLUP under group_by_use_nulls by groeneai · Pull Request #104350 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix correlated subquery + GROUP BY ROLLUP under group_by_use_nulls#104350

Open
groeneai wants to merge 21 commits into
ClickHouse:masterfrom
groeneai:fix-91119-correlated-rollup-cast-wrapper
Open

Fix correlated subquery + GROUP BY ROLLUP under group_by_use_nulls#104350
groeneai wants to merge 21 commits into
ClickHouse:masterfrom
groeneai:fix-91119-correlated-rollup-cast-wrapper

Conversation

@groeneai

@groeneai groeneai commented May 8, 2026

Copy link
Copy Markdown
Contributor

A correlated subquery references an outer column that is a GROUP BY key of an outer query under group_by_use_nulls = 1 with WITH ROLLUP/CUBE/GROUPING SETS. The actual values fed into the inner subquery are post-rollup Nullables, but the analyzer used to leave the inner column reference at its non-Nullable type. The planner then built CAST wrappers and aggregate functions with the original signature, and at header inference (after decorrelation replaced PLACEHOLDER nodes with INPUT nodes of the actual Nullable types) the precomputed wrappers raised Logical error: 'Bad cast from type DB::ColumnNullable to DB::ColumnVector<...>'.

The minimal reproducer from issue #91119 hits createUInt8ToBoolWrapper:

SELECT (SELECT c0)
FROM (SELECT 1::Bool) t0(c0)
GROUP BY c0 WITH ROLLUP
SETTINGS group_by_use_nulls = 1;

The same surface bug class also appears for plain numeric types (modulo, plus), aggregate functions over correlated outer columns (e.g. (SELECT anyLastOrDefault(number))), correlated HAVING/WHERE filter steps, and Nullable source columns.

The root cause is in QueryAnalyzer::resolveExpressionNode: the nullable_group_by_keys walk stops at the first enclosing QUERY scope, so the outer scope where the column was defined is never consulted for an inner correlated reference. This change makes the walk continue past the inner QUERY scope when the node is a column whose source table expression is registered in a deeper outer scope (i.e. the column is correlated). The aggregate-function exclusion is now evaluated against each scope's expressions_in_resolve_process_stack, so an outer scope's nullable_group_by_keys applies even when the inner subquery is currently inside an inner aggregate (the inner aggregate operates on the outer's already-Nullable values).

For correlated matches the existing node is mutated in place instead of being replaced with a clone of the GROUP BY key. The same shared_ptr is referenced by the enclosing QueryNode::correlated_columns list (registered via addCorrelatedColumn); cloning would leave the planner's correlated_columns_set pointing to the original non-Nullable node, and PlannerActionsVisitor::visitColumn (which uses type-sensitive equality) would then fail to identify the projected column as correlated and skip the PLACEHOLDER step that the decorrelation pass replaces.

Closes #91119.
Closes #106377.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed Bad cast exception in correlated subqueries that reference an outer column promoted to Nullable by group_by_use_nulls = 1 with GROUP BY WITH ROLLUP/CUBE/GROUPING SETS.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

A correlated column reference resolved inside a subquery used to keep its
original (non-Nullable) type even when the column is a GROUP BY key of an
outer query under `group_by_use_nulls = 1` with `WITH ROLLUP`/`CUBE`/
`GROUPING SETS`. The analyzer's `nullable_group_by_keys` walk stopped at
the first enclosing `QUERY` scope, so the outer scope where the column was
defined was never consulted for the inner reference.

The planner then built the inner subquery's `ActionsDAG` (CAST wrappers,
aggregate functions, etc.) with the original argument types. After the
decorrelation pass replaced `PLACEHOLDER` nodes with `INPUT` nodes whose
actual types were Nullable (because the outer GROUP BY made the keys
Nullable), the precomputed wrappers received columns of an unexpected type
and threw `Logical error: 'Bad cast from type DB::ColumnNullable to
DB::ColumnVector<...>'` during header inference.

The fix lifts the walk so it also visits enclosing outer-query scopes for
column nodes whose source table expression is not registered in the
current scope's FROM (i.e. correlated). At each scope, the
`expressions_in_resolve_process_stack.hasAggregateFunction()` exclusion is
evaluated against THAT scope's stack, so an outer scope's
`nullable_group_by_keys` correctly applies even when the inner subquery
is currently inside an inner aggregate function call (the inner aggregate
operates on the outer's already-Nullable values).

For correlated matches the existing column node is mutated in place
instead of being replaced with a clone of the GROUP BY key node. The same
`shared_ptr` is referenced by the enclosing `QueryNode`'s
`correlated_columns` list (registered via `addCorrelatedColumn`); cloning
would leave the planner's `correlated_columns_set` pointing to the
original non-Nullable node, and `PlannerActionsVisitor::visitColumn`
(which uses type-sensitive equality) would then fail to identify the
projected column as correlated and skip the `PLACEHOLDER` step that the
decorrelation pass replaces.

Closes ClickHouse#91119.
@groeneai

groeneai commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

@groeneai

groeneai commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

cc @novikd @KochetovNicolai — could you review this? It fixes #91119 by extending the analyzer's nullable_group_by_keys walk so that correlated columns referenced from inside a subquery pick up the Nullable type promotion from the outer scope's GROUP BY ROLLUP/CUBE/GROUPING SETS, which was previously silently lost and caused Bad cast ColumnNullable -> ColumnVector exceptions during planner header inference.

@groeneai

groeneai commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

cc @PedroTadim — invited me here. The reproducer from #91119 (and related cases — CUBE, GROUPING SETS, aggregate function over correlated outer column, HAVING/WHERE with correlated subqueries, Nullable source) all return correct results with this fix. Also non-regressed against the existing 02343_group_by_use_nulls / 02535_analyzer_group_by_use_nulls / 03014_analyzer_group_by_use_nulls / 03315_analyzer_correlated_subqueries and the four 03408/03411 correlated subquery test suites.

@Algunenano Algunenano added the can be tested Allows running workflows for external contributors label May 8, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [775264e]

Summary:

job_name test_name status info comment
AST fuzzer (amd_debug, targeted, old_compatibility) FAIL
Logical error: Input nodes size mismatch in dag: A, expected: [B], [C] (STID: 4752-5ad8) FAIL cidb

AI Review

Summary

This PR reworks group_by_use_nulls handling in QueryAnalyzer so correlated references can inherit the owning outer query's nullable GROUP BY key type, while still preserving the existing aggregate / grouping suppression rules for same-query resolution and * APPLY. After reviewing the current PR head, the full thread history, and the updated regression coverage, I did not find a remaining correctness issue that warrants a new review comment.

Final Verdict

✅ No new findings on the current PR head.

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 8, 2026
Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated
Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated
@groeneai

groeneai commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

CI Status — covered SHA 30c60275

Master-merge commit 30c60275 brings the analyzer correlated-rollup fix up to current trunk. CI fully finished. Sole "code" failure is a chronic Fast-test infra hang signature, not PR-caused.

Check Test Disposition
Fast test Server died Chronic Fast-test infra flake. The Fast-test runner reports WARNING: Timeout exceeded [300], sending SIGTERM followed by Hung check failed: server is not responding. CIDB 30d shows Server died is endemic across 100+ PRs (top: PR #104754=33, PR #100185=83) and master (20 hits). Last-test-before-hang on this run was 01548_uncomparable_columns_in_keys at 06:13:31, then a 42-second silence before the timeout — clearly not produced by the analyzer changes in this PR (analyzer/group-by-use-nulls).
Mergeable Check / PR / CH Inc sync n/a Aggregator/infra rollup of the above.

Code change unchanged since 2026-05-08. The 22 related existing tests still pass; the comprehensive new test 04210_91119_correlated_subquery_group_by_use_nulls.sql (12 query shapes) verifies the analyzer-level fix. Awaiting @novikd / @KochetovNicolai re-review.

cron:our-pr-ci-monitor:20260604-073000

The previous commit replaced the OR-accumulated aggregate-function check in
`resolveExpressionNode` with a per-scope check, intending to allow the walk to
continue past the enclosing `QUERY` scope for correlated subqueries. That
broke `* APPLY x -> agg(x, key)` queries under `group_by_use_nulls = 1`:

  * The LAMBDA scope sees the aggregate on its `expressions_in_resolve_process_stack`
    but its own `nullable_group_by_keys` is empty, so the per-scope check skips
    that scope's lookup.
  * The outer `QUERY` scope holds the keys but its own stack does not see the
    aggregate (the aggregate lives inside the lambda body), so the per-scope
    check finds the key and wraps the inner reference as `Nullable`.
  * The aggregate's `addBatchSinglePlace` then runs on a `ColumnNullable` where
    its signature expects `ColumnVector<UInt64>` (or `ColumnString` for the
    `min` family), and the `assert_cast` faults inside `SingleValueDataFixed::set`.

Fast test on PR ClickHouse#104350 SHA `30c602753b` reproduced this on
`04067_group_by_use_nulls_apply_aggregate.sql` (server died, signal 11 in
`AggregateFunctionNullVariadic::addBatchSinglePlace`). The exact same shape
appears in the AST-fuzzer findings tracked under STID `0250-30be`,
`1499-22cb`, `3336-34e8`, and `3336-2f87`.

Restore the OR-accumulated `in_aggregate_function_scope` flag for scopes
within the same query nesting level, and reset it when crossing into an
outer `QUERY` scope. The reset preserves the new correlated-subquery
behaviour: an outer scope's GROUP BY keys are evaluated relative to the
outer scope's own resolution context, so an aggregate in an inner subquery
must not shield against the outer scope's `nullable_group_by_keys`.

Adds a regression test
`04326_91119_apply_lambda_aggregate_rollup.sql` covering nine variants of
`* APPLY <lambda>` plus an aggregate plus `ROLLUP` / `CUBE` / `GROUPING SETS`
under `group_by_use_nulls = 1`, including the regex-filtered APPLY shape from
issue ClickHouse#98190 (STID `1499-22cb`). The existing
`04067_group_by_use_nulls_apply_aggregate` continues to pass.

Local verification: master HEAD passes; PR HEAD without this fix aborts on
`04067` query 1; PR HEAD with this fix passes `04067`, `04210` (PR's own
test), and `04326` (new regression test).

Related: ClickHouse#91119
Related: ClickHouse#98190

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

groeneai commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate

# Question Answer
a Deterministic repro? Yes. build/programs/clickhouse local -q "SET group_by_use_nulls = 1; SELECT * APPLY x -> argMax(x, number) FROM numbers(1) GROUP BY GROUPING SETS ((materialize(65537)), (*));" aborts with SIGABRT on PR HEAD 30c602753be (without the new commit), passes on master HEAD 7a725b4ef59, and passes on PR HEAD + the new commit. Single command, every run.
b Root cause explained? Yes. The previous commit replaced the OR-accumulated in_aggregate_function_scope check in QueryAnalyzer::resolveExpressionNode with a per-scope check, intending to allow the walk to continue past the enclosing QUERY scope for correlated subqueries. For * APPLY x -> agg(x, key) queries the LAMBDA scope sees the aggregate on its expressions_in_resolve_process_stack but its own nullable_group_by_keys is empty (per-scope check skips); the outer QUERY scope holds the keys but its own stack does not see the aggregate (per-scope check finds the key and wraps the inner reference as Nullable). The aggregate's addBatchSinglePlace then runs on a ColumnNullable where the function signature expects ColumnVector<UInt64> (or ColumnString for the min family), and the assert_cast inside SingleValueDataFixed::set faults. The Fast test crash on 04067_group_by_use_nulls_apply_aggregate (signal 11 in AggregateFunctionNullVariadic::addBatchSinglePlace) and the AST-fuzzer findings under STID 0250-30be, 1499-22cb, 3336-34e8, 3336-2f87 are all the same shape.
c Fix matches root cause? Yes. Restore the OR-accumulated in_aggregate_function_scope flag for scopes within the same query nesting level, and reset it when crossing into an outer QUERY scope. The OR within a query reproduces the original (pre-correlated) suppression that the aggregate "consumes" the GROUP BY nullability boundary regardless of which scope recorded it; the reset across QUERY boundaries preserves the new correlated-subquery behaviour: an aggregate in an inner subquery must not shield against an outer scope's nullable_group_by_keys. The fix targets the analyzer-level scope walk, not the cast site.
d Test intent preserved / new tests added? Yes. Existing test 04067_group_by_use_nulls_apply_aggregate (the one Fast test failed on) continues to pass without modification. PR's own new test 04210_91119_correlated_subquery_group_by_use_nulls continues to pass without modification. New regression test 04326_91119_apply_lambda_aggregate_rollup.sql added, covering nine variants of * APPLY <lambda> plus an aggregate plus ROLLUP / CUBE / GROUPING SETS under group_by_use_nulls = 1, including the regex-filtered APPLY shape from issue #98190 (STID 1499-22cb), nested aggregate inside a function, function-form APPLY, and a non-aggregate APPLY sanity row that confirms the suppression is genuinely scoped to the aggregate-argument path and not a blanket disable.
e Both directions demonstrated? Yes. Built two binaries from the same PR HEAD on the same machine. Without the new commit: 04067 query 1 aborts with SIGABRT (core dumped); 04326 aborts on its first query. With the new commit: 04067, 04210, and 04326 all run to completion and produce output matching their .reference files.
f Fix is general, not a narrow patch? Yes. The fix is at the analyzer-level scope walk (QueryAnalyzer::resolveExpressionNode), not at any cast site. All paths that resolve column references under group_by_use_nulls = 1 benefit. The new test exercises 9 distinct shapes (regular ROLLUP/CUBE/GROUPING SETS, multiple aggregate functions including argMax, sum, min-via-function-form-APPLY, regex-filtered APPLY, nested aggregate-inside-toString, deeply-nested aggregate, and a non-aggregate APPLY sanity row). The fix traces the bad value to its origin (the analyzer setting up a wrong scope walk) instead of guarding against the symptom downstream.

Session id: cron:clickhouse-ci-task-worker:20260608-134700

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated
@groeneai

groeneai commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

CI Ledger - covered SHA: 6f52db2

Workflow PR finished 2026-06-08T18:52:53Z. Triage of the four reds:

Check Failure Classification Disposition
BuzzHouse (amd_debug) Logical error: Mutation of \Memory` table produced incomplete output (STID: 2380-2e8c)` chronic trunk bug Tracked. Active fix PR #106615. 30-day cross-PR: 14 master hits, 30+ unrelated PRs. PR diff is src/Analyzer/Resolve/QueryAnalyzer.cpp + a stateless test only - completely unrelated to Memory table mutation pipeline.
BuzzHouse (amd_tsan) Same STID family (2380-3043) chronic trunk bug Same as above.
Mergeable Check aggregator reflects the two BuzzHouse fails clears when STID 2380 family clears.
PR aggregator reflects above same.

The Fast-test failure on 04067_group_by_use_nulls_apply_aggregate from the prior HEAD has been fixed by this commit (analyzer per-scope check now restores OR-accumulated in_aggregate_function_scope for scopes within the same query nesting level, matching the original semantic). All other 143 checks pass.

Session: cron:our-pr-ci-monitor:20260608-193000

alexey-milovidov and others added 2 commits June 9, 2026 08:07
…lls to a correlated column

`nullable_group_by_keys` matching uses query-tree equality, and
`ColumnNode::isEqualImpl` compares only the column name and type, not the
column source. For a correlated reference this meant an intermediate query with
a same-named, same-typed `GROUP BY` key (under `WITH ROLLUP`/`CUBE`/`GROUPING
SETS`) could match the outer column and wrap it as `Nullable`, even though that
intermediate query's rollup does not produce `NULL`s for the outer column.

Require that the scope whose `nullable_group_by_keys` is being applied actually
owns the column's source table expression (the same ownership test already used
at the `QUERY`-boundary break). Otherwise keep walking outwards to the owning
scope. Single-level correlated cases (the only ones currently reachable; deeper
correlation through an intermediate grouped query is `NOT_IMPLEMENTED` in the
planner) are unaffected, as confirmed by the existing tests.

Also update the stale block comment to describe the actual
OR-accumulate-within-query / reset-at-`QUERY`-boundary rule for the
aggregate-function exclusion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged master (branch was ~1547 commits / ~7 days behind and red) — clean merge, no conflicts.

Addressed both review threads in 62084af:

  • Correctness (QueryAnalyzer.cpp:3440): an outer scope's nullable_group_by_keys is now applied to a correlated column only when that scope actually owns the column's source table expression (registered_table_expression_nodes.contains(column_source)), mirroring the ownership test already used at the QUERY-boundary break. This prevents an intermediate query with a same-named, same-typed GROUP BY key from matching a correlated outer column (ColumnNode::isEqualImpl compares only name + type, not source). The deeper multi-level scenario is currently NOT_IMPLEMENTED in the planner, so this is a defensive guard; single-level correlated cases are unaffected (existing tests pass).
  • Stale comment (QueryAnalyzer.cpp:3406): rewritten to describe the actual OR-accumulate-within-query / reset-at-QUERY-boundary rule.

Verified locally: the #91119 reproducer now returns Nullable values instead of the Bad cast exception, and both 04210/04326 tests pass against the rebuilt binary.

Remaining CI reds are all the BuzzHouse Logical error: Mutation of Memory table produced incomplete output finding (STID 2380) in StorageMemory::mutate on ALTER ... APPLY PATCHES/MODIFY TTL — unrelated to this analyzer change.

@groeneai

groeneai commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

CI Ledger — covered SHA: 62084af978d84d75e9f27292fa61bbcd70c1cb42

Workflow PR finished 2026-06-09T13:10:28Z (success); Finish Workflow and Mergeable Check pass; gate satisfied (≥20 min CIDB ingestion buffer met, no jobs queued/in-progress). CIDB reports zero FAIL/ERROR rows on this HEAD across all 162 completed jobs — every Stateless / Integration / AST fuzzer / BuzzHouse / Stress / Unit / Performance / SQLLogic / Compatibility / Install variant is green.

The previous ledger covered 6f52db2ce0c (analyzer OR-accumulated in_aggregate_function_scope restore). HEAD advanced to 62084af978d after @ alexey-milovidov pushed the master merge plus the correctness guard at QueryAnalyzer.cpp:3440 and the comment rewrite at :3406 to address his own earlier review threads — both are resolution commits, not new directives. Local verification (issue #91119 reproducer, 04067_group_by_use_nulls_apply_aggregate, the new 04326_91119_apply_lambda_aggregate_rollup regression test) confirmed the fix held.

No outstanding CI items. Awaiting human review on the analyzer change.

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated
Comment on lines +3423 to +3425
QueryTreeNodePtr column_source;
if (auto * column_node = node->as<ColumnNode>())
column_source = column_node->getColumnSourceOrNull();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it won't work in cases like GROUP BY c % 2

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed locally. With

SET enable_analyzer=1, allow_experimental_correlated_subqueries=1, group_by_use_nulls=1;
EXPLAIN QUERY TREE
SELECT (SELECT c % 2)
FROM (SELECT toInt32(3) AS c) t
GROUP BY c % 2 WITH ROLLUP;

the inner query tree is

QUERY id: 2, alias: __table1, is_subquery: 1, is_correlated: 1
  PROJECTION COLUMNS
    modulo(c, 2) Int16            <-- should be Nullable(Int16)
  PROJECTION
    FUNCTION id: 7, function_name: modulo, result_type: Int16
      ARGUMENTS
        COLUMN id: 4, column_name: c, result_type: Int32, source_id: 5  <-- should be Nullable(Int32)

The inner expression is left non-Nullable because the walk only captures column_source for a bare ColumnNode, so a FunctionNode like c % 2 (or any compound expression that contains a correlated column) hits if (!column_source) break; at the inner QUERY boundary and never reaches the outer scope's nullable_group_by_keys, which stores c % 2 itself, not bare c.

The planner currently rejects this shape with NOT_IMPLEMENTED: can't find correlated column ... in current header: __grouping_set, modulo(__table3.c, 2_UInt8), so it does not surface to users today, but the analyzer is producing the wrong type. The right shape of the fix is to drive the walk by "this expression contains a correlated column whose source lives in an outer scope" rather than by node->as<ColumnNode>()->getColumnSourceOrNull(). I will pivot in that direction as part of addressing C3 / C4 below.

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated
Comment on lines +3443 to +3461
bool is_correlated_to_outer_scope = column_source
&& !scope.registered_table_expression_nodes.contains(column_source)
&& scope_ptr != &scope;
if (is_correlated_to_outer_scope)
{
/// `nullable_group_by_keys` matching uses query-tree equality, which for
/// columns (`ColumnNode::isEqualImpl`) compares only the column name and
/// type, not the column source. For a correlated reference this means an
/// intermediate query with a same-named, same-typed GROUP BY key could
/// match `node` even though that query's rollup does not produce `NULL`s
/// for the outer column. Only the outer query that actually owns the
/// column's source table expression may apply its `group_by_use_nulls`
/// nullability, so require source ownership before wrapping; otherwise keep
/// walking outwards to the owning scope.
if (scope_ptr->registered_table_expression_nodes.contains(column_source))
{
node->convertToNullable();
break;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same problem

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same root cause as C1. The branch wraps the bare-column path: when is_correlated_to_outer_scope is true (so node is a bare ColumnNode) AND we have reached the source-owning scope, the column is mutated in place and we break out. For the bare-c shape this works (e.g. (SELECT c) FROM t GROUP BY c WITH ROLLUP, group_by_use_nulls=1 correctly produces Nullable(Int32) on the inner column). The issue is the surrounding loop, not this branch in isolation: once we leave the bare-column case (e.g. the inner expression is c % 2), column_source is null and the walk stops at the inner QUERY boundary, so we never get the chance to apply node->convertToNullable() to the compound expression here. C3 covers the actual stop condition; once that's fixed the in-place mutation here needs to be widened to compound expressions too (mutate node rather than clone, otherwise the outer QueryNode::correlated_columns_list and the planner's correlated_columns_set end up referencing the original non-Nullable column).

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated
Comment on lines +3480 to +3481
if (!column_source)
break;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition does not make sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. if (!column_source) break; is what makes the C1 / C2 case slip through: a non-ColumnNode (e.g. a FunctionNode for c % 2) carries no column_source, so we stop at the inner QUERY boundary even when the expression is correlated through child columns whose real source lives further out. The condition was a literal port of the previous column_source design, but the design itself is too narrow.

What we actually want at this point is "keep walking iff the expression we are resolving is correlated relative to the current scope", which is a property of the expression's correlated descendants, not just its top-level node. Concretely: walk node's subtree once before the loop, collect all ColumnNode sources that are NOT registered in &scope's registered_table_expression_nodes, and use that set to drive both (a) whether we walk past the inner QUERY and (b) which outer scope is the source-owning one. For a bare-column node, the set is just {column_source} and the behaviour is unchanged; for a compound expression like c % 2, the set is {c} and the walk continues to the outer scope where nullable_group_by_keys.find(c % 2) matches. A query whose expression mixes correlated columns from different outer scopes already triggers NOT_IMPLEMENTED upstream in checkCorrelatedColumn / decorrelation, so a single owning-source assumption is safe to make here.

I will rework the condition along those lines and re-test the cases that were the original reason for the walk-past-QUERY (the (SELECT c) shape, 04210, 04326).

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated
if (scope_ptr->registered_table_expression_nodes.contains(column_source))
break;

in_aggregate_function_scope = false;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be incorrect.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and reworked. In the rewritten applyGroupByUseNullsToExpression, the reset only happens when we deliberately continue the walk into an outer query for a correlated expression that has not yet reached its source-owning scope:

if (!is_correlated() || ownsCorrelatedSource(*scope_ptr, *correlated_columns))
    break;

is_aggregate_argument = false;

So we never reset (and never walk outward) for a non-correlated expression, and we stop as soon as we reach the scope that owns the correlated column's source. The reset itself is correct for the cases that do continue: the inner subquery's aggregate context (e.g. the inner argMax in * APPLY x -> argMax(x, key)) constrains only the inner query's own keys; it must not suppress wrapping of a reference that is correlated to an outer query's GROUP BY key, because the values fed into the inner query are the outer query's post-rollup Nullables. The previous unconditional reset at every QUERY boundary was the part that did not hold up.

Addresses review feedback on the group_by_use_nulls Nullable-wrapping block in
QueryAnalyzer::resolveExpressionNode. The previous version decided whether to walk
past an enclosing QUERY scope for a correlated reference using
node->as<ColumnNode>()->getColumnSourceOrNull(). That only recognises a bare column
as correlated, so a compound expression such as `c % 2` (the GROUP BY key being
`c % 2`) was left non-Nullable, and the stop condition `if (!column_source) break;`
conflated "not a column" with "not correlated".

Extract the logic into applyGroupByUseNullsToExpression and drive it by whether the
expression is correlated relative to the current scope, computed with the existing
CorrelatedColumnsCollector (the same mechanism used elsewhere in the resolver). The
walk now continues past a QUERY boundary only for a genuinely correlated expression
and only up to the scope that owns the correlated column's source, so an intermediate
query with a same-named GROUP BY key cannot match. A correlated match wraps the node
in place to preserve the shared_ptr identity that the planner's correlated-columns set
relies on.

A fast path returns early when no enclosing scope has any nullable_group_by_keys, and
the correlated-columns analysis is computed lazily, so queries without group_by_use_nulls
pay only a cheap scope-chain check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Pushed be05b67e0cc reworking the group_by_use_nulls correlated-wrapping logic along the lines discussed in the C1/C3 threads. Extracted it into applyGroupByUseNullsToExpression and made it driven by whether the resolved expression is correlated, using the existing CorrelatedColumnsCollector instead of node->as<ColumnNode>()->getColumnSourceOrNull().

What changed per comment:

  • C1 / C2 (compound key c % 2): correlation is now a property of the expression's columns, not its top-level node type. CorrelatedColumnsCollector collects correlated columns from the whole subtree, so c % 2 is recognised and wrapped in place. EXPLAIN QUERY TREE for SELECT (SELECT c0 % 2) FROM (SELECT toInt32(3)) t0(c0) GROUP BY c0 % 2 WITH ROLLUP now shows modulo(c0, 2) Nullable(Int16) instead of Int16.
  • C3 (if (!column_source) break;): replaced with if (!is_correlated() || ownsCorrelatedSource(...)) break;. The walk continues past a QUERY boundary only for a genuinely correlated expression and only until the scope that owns the correlated source, so an intermediate query with a same-named key cannot match.
  • C4 (in_aggregate_function_scope = false;): the reset now happens only on the deliberate continuation into an outer query for a not-yet-resolved correlated expression (see the thread reply).

One honesty note on scope: the compound-key correlated shapes (C1/C2) are currently rejected by the planner before execution, on both master and this branch:

SELECT (SELECT c0 % 2) FROM (SELECT toInt32(3)) t0(c0) GROUP BY c0 % 2 WITH ROLLUP
-> NOT_IMPLEMENTED: can't find correlated column '__table3.c0' in current header: __grouping_set, modulo(__table3.c0, 2_UInt8)

So this part of the change fixes the analyzer's type, which is the correctness issue you flagged, but there is no runnable query that demonstrates a row-level difference for it yet. The reachable shapes (bare GROUP BY key, with bare or compound correlated references, plus the original * APPLY ... argMax cases) are covered by the runtime regression tests 04210 and 04326, which still pass, and I confirmed no regressions across the *group_by_use_nulls* and *correlated* stateless suites. Happy to add an EXPLAIN QUERY TREE test that locks in the c % 2 Nullable type if you want one, or to hold it until the planner supports the shape.

Pre-PR validation gate

# Question Answer
a Deterministic repro? Yes. The original crash family (* APPLY x -> argMax(x, key) + ROLLUP under group_by_use_nulls) is covered by 04326; the reviewed type bug is EXPLAIN QUERY TREE SELECT (SELECT c0 % 2) FROM (SELECT toInt32(3)) t0(c0) GROUP BY c0 % 2 WITH ROLLUP showing Int16 before / Nullable(Int16) after.
b Root cause explained? The walk decided "is this correlated, keep walking outward" from node->as<ColumnNode>(), so a compound expression (no top-level column source) stopped at the inner QUERY boundary and never consulted the owning outer scope's nullable_group_by_keys. Correlation is a property of the subtree's columns, not the top-level node.
c Fix matches root cause? Yes. The walk is now driven by CorrelatedColumnsCollector (subtree correlation) and source ownership, so it handles bare columns and compound expressions uniformly and stops at the owning scope.
d Test intent preserved / new tests added? 04210 and 04326 unchanged and still pass. No test weakened. The fix is a refactor of the same wrapping logic with broader correctness; the compound-key case has no runnable test (planner NOT_IMPLEMENTED, see above).
e Both directions demonstrated? Yes. 04326 aborts on the no-fix branch and passes with the fix; the c % 2 type is Int16 without the rework and Nullable(Int16) with it.
f Fix is general, not a narrow patch? Yes. Drives off the canonical correlation mechanism rather than a special-case top-level-column check, covering all expression shapes, and adds a fast path so non-group_by_use_nulls queries pay only a cheap scope-chain check.

Session id: cron:clickhouse-worker-slot-5:20260610-090200

The previous commit drove the group_by_use_nulls Nullable-wrapping in
applyGroupByUseNullsToExpression by expression correlation, but used
`scope_ptr == &scope` as the discriminator for the non-correlated
clone-and-wrap branch. A GROUP BY key referenced inside a lambda body, e.g.
`arrayMap(x -> intDiv(x, key), ...)` where `key` is a GROUP BY key of the
surrounding query, is resolved with `&scope` pointing at the LAMBDA scope while
the key lives in the enclosing QUERY scope. Such a reference is not correlated,
so it matched neither the `scope_ptr == &scope` clone branch nor the correlated
`ownsCorrelatedSource` branch, and was left non-Nullable.

That produced two regressions visible in the Fast test:

- 04001_logical_error_with_cube: the lambda-body key kept its non-Nullable
  type, so arrayMap returned Array(UInt8) instead of Array(Nullable(UInt8)),
  which changed DISTINCT semantics and emitted an extra row.
- 03023_group_by_use_nulls_analyzer_crashes: the type mismatch between the
  wrapped projection column and the unwrapped lambda-captured column produced
  `Cannot capture column N because it has incompatible type: got Nullable(UInt8),
  but UInt8 is expected` (LOGICAL_ERROR).

The matched key belongs to the same query as the reference whenever the
expression is not correlated (whether resolved directly in the query or inside
one of its lambda scopes), so take the clone-and-wrap branch for any
non-correlated match. The `scope_ptr == &scope` check is kept as a fast path so
the common direct-projection match skips the correlated-columns analysis. The
correlated in-place wrap (gated by ownsCorrelatedSource) is unchanged, and the
compound-key correlated case the previous commit addressed still wraps as
Nullable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Pushed e8d62a69667 to fix two Fast test regressions that the previous commit (be05b67e0cc) introduced. The previous commit drove the wrapping by expression correlation (addressing the c % 2 / source-ownership review feedback) but regressed non-correlated GROUP BY keys referenced inside lambda bodies.

What broke and why: applyGroupByUseNullsToExpression used scope_ptr == &scope as the discriminator for the non-correlated clone-and-wrap branch. A GROUP BY key referenced inside a lambda body, e.g. arrayMap(x -> intDiv(x, key), ...) where key is a GROUP BY key of the surrounding query, is resolved with &scope pointing at the LAMBDA scope while the key lives in the enclosing QUERY scope. That reference is not correlated, so it matched neither the scope_ptr == &scope clone branch nor the correlated ownsCorrelatedSource branch and was left non-Nullable. Result: 04001_logical_error_with_cube emitted an extra row (arrayMap returned Array(UInt8) instead of Array(Nullable(UInt8)), changing DISTINCT semantics) and 03023_group_by_use_nulls_analyzer_crashes hit Cannot capture column N ... got Nullable(UInt8), but UInt8 is expected (LOGICAL_ERROR).

Fix: take the clone-and-wrap branch for any non-correlated match (scope_ptr == &scope || !is_correlated()), not just the scope_ptr == &scope case. The scope_ptr == &scope check is kept as a fast path so the common direct-projection match skips the correlated-columns analysis. The correlated in-place wrap (gated by ownsCorrelatedSource) and the compound-key correlated case are unchanged.

Pre-PR validation gate

# Question Answer
a Deterministic repro? Yes. 04001 reproduces via clickhouse-local (extra 1 [] row). 03023 reproduces via the test runner against a server built from be05b67e0cc: deterministic LOGICAL_ERROR Cannot capture column 2 ... got Nullable(UInt8), but UInt8 is expected, server SIGABRT (exit 134).
b Root cause explained? Yes. The previous commit used scope_ptr == &scope to decide the non-correlated clone branch. A non-correlated key referenced inside a lambda body has &scope = LAMBDA scope and the key in the enclosing QUERY scope (scope_ptr != &scope), and CorrelatedColumnsCollector returns empty for it, so it fell through both branches and stayed non-Nullable. The resulting type mismatch (wrapped projection vs unwrapped lambda capture) causes the extra DISTINCT row (04001) and the capture LOGICAL_ERROR (03023).
c Fix matches root cause? Yes. The discriminator becomes "is this reference correlated?": scope_ptr == &scope || !is_correlated() takes the clone-and-wrap branch for any non-correlated match (direct or lambda-nested), exactly the case the previous commit dropped. Not a band-aid: the EXPLAIN QUERY TREE for 04001 is now byte-identical to the green parent 62084af978d.
d Test intent preserved / new tests added? Preserved. No tests modified. 03023 and 04001 are the existing regression tests that now pass; 04210 and 04326 (this PR's own tests) continue to pass; 04067_group_by_use_nulls_apply_aggregate (the earlier Fast test crash) passes. novikd's c % 2 compound-key correlated case still wraps to Nullable.
e Both directions demonstrated? Yes. be05b67e0cc: 04001 emits extra row, 03023 LOGICAL_ERROR + SIGABRT. e8d62a69667: both pass via the test runner; 04001 tree identical to green parent.
f Fix is general, not a narrow patch? Yes. The fix corrects the wrapping decision at its source (the scope-walk discriminator), covering direct-projection, lambda-body, CUBE/ROLLUP/GROUPING SETS, and correlated shapes alike. Verified against the full group_by_use_nulls suite (16/16 non-stateful pass) and 75 correlated/cube/rollup/grouping_sets tests (0 unrelated failures; the only two skips were a stateful test needing test.hits and a test hardcoding HTTP port 8123, both passing once the local env matches CI).

Session id: cron:clickhouse-worker-slot-7:20260610-104500

@groeneai

Copy link
Copy Markdown
Contributor Author

CI Ledger — covered SHA: e8d62a696678d0d70504e09e17a60b1ac636ea21

CI fully finished on this HEAD (Finish Workflow = success; no checks queued/in-progress). Triaging the two remaining red checks; neither is caused by this PR.

Check Failure Classification Disposition
Integration tests (amd_asan_ubsan, db disk, old analyzer, 1/6) test_refreshable_mat_view_replicated/test.py::test_circular_dependencies_survive_restart chronic flaky (NOT PR-caused) 30-day cross-PR: 14+ distinct unrelated PRs + master. Tracked under issue #106651. This PR only touches QueryAnalyzer.cpp + group_by_use_nulls tests, nothing in refreshable-mat-view / ReplicatedMergeTree.
Stateless tests (arm_asan_ubsan, azure, parallel) Server died / OOM in dmesg / Logical error: Expected 3 to 10 arguments in table function azureBlobStorage, got 1 infra outage (NOT PR-caused) The arm_asan_ubsan, azure, parallel runner is in a broad failure window: 272 distinct tests across 41 unrelated PRs failed on this exact check in the last 12h (OOM / Server died). Independent of this PR's diff.
Mergeable Check / PR / CH Inc sync aggregators downstream Reflect the two above; no independent signal.

The two Fast test failures from the prior commit (be05b67e0cc: 03023_group_by_use_nulls_analyzer_crashes, 04001_logical_error_with_cube) were PR-caused regressions from the rework and are fixed in e8d62a69667 (both verified passing locally in both directions). No PR-caused failures remain on the current HEAD.

A correlated subquery aggregating a String key (e.g. (SELECT min(c0))
over a GROUP BY ... WITH ROLLUP / GROUPING SETS key) hits the
ColumnNullable -> ColumnString wrapper instead of the numeric
ColumnVector path, so it was not covered by the existing 04210 cases.
Both queries crashed with 'Bad cast from ColumnNullable to ColumnString'
on master before this fix and now return the post-rollup Nullable value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Pushed fd5dfbf70a1 adding two String-aggregate cases to 04210 ((SELECT min(c0)) over a String GROUP BY key with WITH ROLLUP, and minOrDefault with GROUPING SETS). These exercise the ColumnNullable -> ColumnString wrapper, distinct from the numeric ColumnVector path the other cases cover. Both crashed with Bad cast from ColumnNullable to ColumnString on master and now return the post-rollup Nullable value.

While reviewing the rework I rebuilt the PR on current master (which now contains the multistage_distributed_queries rewrite #106020, merged after this branch's last rebase) to confirm the analyzer-level fix still applies after the planner changes. Built from a fresh checkout rooted at the worktree sources (not a single-TU relink), the four headline shapes behave as expected:

case master this PR on master
(SELECT c0) Bool (#91119 headline) crash true, \N
(SELECT c0) numeric ok ok
(SELECT sum(number)) over correlated key crash ok
(SELECT min(c0)) String crash a, \N

Pre-PR validation gate

# Question Answer
a Deterministic repro? Yes. clickhouse-local --enable_analyzer=1 --allow_experimental_correlated_subqueries=1 --group_by_use_nulls=1 -q "SELECT (SELECT min(c0)) FROM (SELECT 'a'::String) t0(c0) GROUP BY c0 WITH ROLLUP ORDER BY c0 NULLS LAST" aborts (exit 134, Bad cast from ColumnNullable to ColumnString) on master; same for the GROUPING SETS variant.
b Root cause explained? Same root cause as the rest of the PR: the correlated String key kept its non-Nullable type, so the planner built aggregate/CAST wrappers for ColumnString while decorrelation fed in the post-rollup Nullable(String). The fix wraps the correlated reference to Nullable in the analyzer, so the wrapper signatures match.
c Fix matches root cause? Yes. No code change in this commit; it adds coverage for the SingleValueDataString aggregate branch the existing cases did not reach.
d Test intent preserved / new tests added? New cases only; no existing assertion changed.
e Both directions demonstrated? Yes. Both queries crash on master and pass on this branch; the full 04210 reference matches.
f Fix is general, not a narrow patch? The underlying fix is type-agnostic (convertToNullable); these cases verify the String manifestation alongside the numeric and Bool ones. Regression set re-run green: 03023, 04001, 02343(+distributed), 02535, 04210, 04326 all pass.

Session id: cron:clickhouse-worker-slot-9:20260610-190100

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on this head (ec6b58a). The single red check is not caused by this change.

All other checks pass. The fuzzer "Bad cast" (STID 3721-419e) fix in this commit (LowCardinality-aware nullable wrapping in ConstantNode::convertToNullable, plus the regression cases in 04210) is green.

(cron:our-pr-ci-monitor:20260622-080000)

@coderashed

Copy link
Copy Markdown

@novikd — verified the current CorrelatedColumnsCollector-driven revision against each of the cases you flagged on the earlier one. Type-level evidence from EXPLAIN QUERY TREE (group_by_use_nulls = 1, allow_experimental_correlated_subqueries = 1):

Your concern Now
GROUP BY c % 2 wouldn't wrap (compound key) (SELECT c % 2) … GROUP BY c % 2 WITH ROLLUP → correlated modulo(c, 2) resolves to Nullable(Int16)
Bare correlated column (SELECT c) … GROUP BY c WITH ROLLUP → correlated c resolves to Nullable(Int32); the outer GROUP BY key node stays Int32
Aggregate-argument reset looked incorrect * APPLY x -> argMax(x, number) … WITH ROLLUPnumber inside the aggregate stays UInt64 (non-Nullable) ✓
if (!column_source) break; No longer present — the walk is now driven by CorrelatedColumnsCollector rather than a bare ColumnNode source, which is also what fixes the compound-key case

The end-to-end planning symptom is gone too: SELECT concat('s', (SELECT toString(number))) FROM numbers(5) GROUP BY GROUPING SETS ((number)) executes cleanly. 04210_91119, 04326_91119, and 04067 all pass on a single-node build.

The existing tests check query results; I put together a complementary one that pins the types the analyzer assigns (passes against this revision, and fails with a clean diff if the correlated type stops being wrapped). Happy to contribute it if useful:

SET enable_analyzer = 1;
SET allow_experimental_correlated_subqueries = 1;
SET group_by_use_nulls = 1;

SELECT '-- bare correlated column ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
    SELECT (SELECT c) FROM (SELECT toInt32(3) AS c) t GROUP BY c WITH ROLLUP)
WHERE explain ILIKE '%column_name: c,%result_type%';

SELECT '-- compound correlated key (modulo) ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
    SELECT (SELECT c % 2) FROM (SELECT toInt32(3) AS c) t GROUP BY c % 2 WITH ROLLUP)
WHERE explain ILIKE '%function_name: modulo%result_type%';

SELECT '-- aggregate argument stays non-Nullable ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
    SELECT * APPLY x -> argMax(x, number) FROM numbers(1) GROUP BY number WITH ROLLUP)
WHERE explain ILIKE '%column_name: number,%result_type%';

Output on this revision:

-- bare correlated column ---
COLUMN id: N, column_name: c, result_type: Nullable(Int32), source_id: N
COLUMN id: N, column_name: c, result_type: Nullable(Int32), source_id: N
COLUMN id: N, column_name: c, result_type: Int32, source_id: N
-- compound correlated key (modulo) ---
FUNCTION id: N, function_name: modulo, function_type: ordinary, result_type: Nullable(Int16)
FUNCTION id: N, function_name: modulo, function_type: ordinary, result_type: Int16
-- aggregate argument stays non-Nullable ---
COLUMN id: N, column_name: number, result_type: UInt64, source_id: N
COLUMN id: N, column_name: number, result_type: UInt64, source_id: N
COLUMN id: N, column_name: number, result_type: UInt64, source_id: N

@groeneai

Copy link
Copy Markdown
Contributor Author

Thanks @coderashed, and thanks again for routing #108714 here.

Both tests are welcome and I will fold them into 04210_91119_correlated_subquery_group_by_use_nulls.{sql,reference}, credited to you:

  1. the fuzzer-seed shape SELECT concat('s', (SELECT toString(number))) FROM numbers(5) GROUP BY GROUPING SETS ((number)) (a concat over a correlated scalar subquery on a GROUPING SETS key, which 04210 did not cover yet), and
  2. the type-pinning EXPLAIN QUERY TREE cases.

The type-pinning one is a good addition: 04210 only checks results, so a regression that silently stopped wrapping the correlated key Nullable would still pass there but fail your type diff. The SELECT ... FROM (EXPLAIN QUERY TREE ...) WHERE explain ILIKE '%...%' form (rather than a full-plan reference diff) is also exactly the robust shape we prefer, so I will take it largely as written.

I will add both on the next rebase rather than a coverage-only push now (this branch rebases often for the QueryAnalyzer.cpp area, and the PR is in review). Appreciate the help.

groeneai and others added 2 commits June 30, 2026 11:31
…d-rollup-cast-wrapper

# Conflicts:
#	src/Analyzer/ConstantNode.cpp
Two test cases contributed by @coderashed (who closed his duplicate PR
ClickHouse#108714 in favour of this one):

1. The exact AST-fuzzer seed query from ClickHouse#108714: a correlated scalar
   subquery toString(number) inside concat() over a GROUPING SETS key.
   Aborts on master with 'Bad cast ColumnNullable to ColumnString',
   returns s0..s4 with the fix.

2. A type-pinning EXPLAIN QUERY TREE check. The existing cases assert
   query results, which still pass if a future change silently stops
   wrapping the correlated key as Nullable (values unchanged, declared
   type wrong). These pin the analyzer-assigned types via
   SELECT ... FROM (EXPLAIN QUERY TREE ...) WHERE explain ILIKE '%...%',
   so dropping the Nullable wrapping fails with a clean diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Rebased onto master (the branch had drifted to CONFLICTING). The only conflict was in src/Analyzer/ConstantNode.cpp::convertToNullable: master independently landed the same LowCardinality-constant-key fix in a8afbdc ("Fix Bad cast with group_by_use_nulls over a LowCardinality key constant"), so I dropped our copy and kept master's. Net diff is now exactly the correlated-walk in QueryAnalyzer.cpp plus the regression tests. mergeable is back to MERGEABLE.

Folded in @coderashed's two contributed cases (04210), credited:

  1. The Fix logical error in correlated subqueries with group_by_use_nulls #108714 fuzzer-seed shape concat('s', (SELECT toString(number))) ... GROUP BY GROUPING SETS ((number)). Aborts on master (Bad cast ColumnNullable to ColumnString), returns s0..s4 with the fix.
  2. A type-pinning EXPLAIN QUERY TREE check. The existing cases assert results, which still pass if a future change silently stops wrapping the correlated key as Nullable; these pin the analyzer types (SELECT ... FROM (EXPLAIN QUERY TREE ...) WHERE explain ILIKE '%...%'), so a regression fails with a clean diff. On master the bare correlated c shows Int32; with the fix Nullable(Int32).

Both verified both directions against a local debug build (Build ID b5f40416, has applyGroupByUseNullsToExpression) vs a master snapshot (no symbol): crash/wrong-type without, correct with.

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp
…sion (ClickHouse#91119)

A code review raised that the `scope_ptr == &scope` fast path in
applyGroupByUseNullsToExpression could mis-wrap a correlated outer column as
the inner query's same-named local GROUP BY key, because
nullable_group_by_keys matches on name+type only (ColumnNode::isEqualImpl
ignores column source).

Investigated and confirmed this cannot happen: scope.nullable_group_by_keys is
populated only for ROLLUP/CUBE/GROUPING SETS queries under group_by_use_nulls
(IdentifierResolveScope.cpp), so a plain inner GROUP BY registers no keys and
the inner-scope find() is never reached. For the inner scope to hold a
colliding entry it would itself have to be ROLLUP/CUBE/GROUPING SETS while
correlated, which decorrelation rejects with NOT_IMPLEMENTED. No source change
is needed.

Add the collision shape as a regression (results plus an EXPLAIN QUERY TREE
type pin) so a future over-broad guard cannot silently flip the inner local
key to Nullable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 0596553

CI fully finished (Finish Workflow: success). This PR fixes a Bad-cast in argMax + APPLY + ROLLUP (analyzer). Every failure has an owner; none is PR-caused.

Check / test Reason Owner
Stress test (amd_tsan) + (arm_tsan) / "Hung check failed, possible deadlock found" deadlock (chronic hung-check family, 600+ distinct PRs / 30d) #108212 (ours, merged 2026-06-28) / #105905 (ours, open)
Docker server image / distroless-amd64 + alpine-amd64 image test infra: docker.io base-image resolve/pull failure; fails on master too (1 master + 2 unrelated PRs / 7d) a fix task is created (investigating at full effort; fixing-PR link to follow here)
AST fuzzer (amd_debug, targeted, old_compatibility) / "Input nodes size mismatch in dag" (STID 4752-5ad8) [prior SHA 9a32bb9] crash (chronic trunk AST-fuzzer finding; 8 PRs + 2 master / 30d; not in this PR's analyzer path) #103088 (ours, open)
Mergeable Check / PR aggregator rollup of the rows above

Session id: cron:our-pr-ci-monitor:20260630-163000

@groeneai

Copy link
Copy Markdown
Contributor Author

Fixing PR for the Docker server image ledger line created: #108969. Root cause was a transient docker.io registry outage (2026-06-30 15:00-15:06 UTC) failing docker buildx while resolving the SBOM scanner image; it recovered on its own and is unrelated to this PR's diff. #108969 hardens the Docker server image job with retries against transient registry errors.

STID 3418-3e98 (AST fuzzer, arm_asan_ubsan) is the statistics-aggregate
manifestation of the same correlated-subquery + group_by_use_nulls bug.
skewSamp/kurtPop/varSamp/covarSamp and friends share one add() that reads
the argument via an unchecked static_cast (AggregateFunctionStatisticsSimple.h),
so when the correlated argument is left non-Nullable the failure is a silent
heap-buffer-overflow on release builds instead of the clean "Bad cast" exception
the min/sum/argMax cases already cover. Add runtime cases over the statistics
family (crash without the fix, correct output with it) plus an EXPLAIN QUERY TREE
type pin that the correlated statistics-aggregate argument is wrapped Nullable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Pushed c32cf6e46be: added statistics-aggregate coverage to 04210.

The AST fuzzer surfaced this bug again as STID 3418-3e98 (AST fuzzer (arm_asan_ubsan), report) as a heap-buffer-overflow in AggregateFunctionVarianceSimple::add, i.e. the statistics-aggregate manifestation of this same correlated-subquery + group_by_use_nulls bug. skewSamp/kurtPop/varSamp/covarSamp etc. share one add() that reads the argument via an unchecked static_cast (AggregateFunctionStatisticsSimple.h), unlike the min/sum/argMax cases already here which assert_cast and raise a clean Bad cast. So the same non-Nullable-argument bug is a silent memory-safety fault there rather than an exception.

Minimal repro (aborts on master with ASan, correct with this PR):

SELECT (SELECT skewSamp(a.k)) FROM (SELECT number AS k FROM numbers(64)) AS a
GROUP BY a.k WITH CUBE SETTINGS group_by_use_nulls = 1;

Verified both directions locally (ASan): the new runtime cases (numbers(64), numbers(8) uses the ASan-redzone size, not the PODArray SIMD padding) abort on the master snapshot and return correct results on this branch; the new EXPLAIN QUERY TREE type pin shows the correlated statistics-aggregate argument as Nullable(Int32) with the fix and Int32 on master. Test-only change; the analyzer fix is unchanged.

groeneai and others added 2 commits July 1, 2026 14:03
The correlated subquery returns a tuple whose element is the outer rollup key.
On master this aborts with a LOGICAL_ERROR (Unexpected return type from tuple)
under group_by_use_nulls + WITH ROLLUP/CUBE; the analyzer fix resolves the tuple
element to Nullable so it matches the decorrelated input.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The two coverage commits pushed for this test (the statistics-aggregate
crash-guard section at numbers(64) and the ClickHouse#106377 correlated-tuple cases)
made 04210 heavy enough to exceed the flaky-check 180s soft cap under debug
builds with randomized settings, so the flaky-check job reported
'Test runs too long (> 180s)'.

The numbers(64) size is required for the statistics-aggregate out-of-bounds
read to land on an ASan redzone, so the coverage must not be reduced. Tag the
test long instead: this exempts it from the flaky-check 180s FAIL and runs it
at the reduced long_test_runs_ratio repeat count.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp
…use#91119)

The scalar path (applyGroupByUseNullsToExpression) suppresses the
group_by_use_nulls Nullable wrapping for both aggregate and grouping()
function arguments via hasAggregateOrGroupingFunction(). The sibling
APPLY-matcher path in resolveMatcher only suppressed it for aggregate
functions, so a query like

  SELECT * APPLY x -> grouping(x)
  FROM numbers(1) GROUP BY number WITH ROLLUP
  SETTINGS group_by_use_nulls = 1

cloned the matched `number` key as Nullable before the lambda was
resolved. GroupingFunctionsResolvePass matches its arguments against the
original GROUP BY keys, so the rewritten grouping(number) no longer
matched and raised a spurious "GROUPING function ... is not in GROUP BY
keys" (BAD_ARGUMENTS) exception even though the key is valid.

Make the APPLY-tree scan detect grouping() the same way as the scalar
path (exact name comparison, matching ExpressionsStack::
isAggregateOrGroupingFunction). Covers both the lambda form
(* APPLY x -> grouping(x)) and the function form (* APPLY grouping),
and grouping() nested inside other functions, across all GROUP BY
shapes (ROLLUP, CUBE, GROUPING SETS). Regression cases added to
04326_91119_apply_lambda_aggregate_rollup.

Found by clickhouse-gh[bot] review on PR ClickHouse#104350.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate for the APPLY-grouping fix (commit 775264e, addressing the bot review on resolveMatcher):

Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. SELECT * APPLY x -> grouping(x) FROM numbers(1) GROUP BY number WITH ROLLUP SETTINGS enable_analyzer=1, group_by_use_nulls=1 -> Code 36 BAD_ARGUMENTS: GROUPING function argument number is not in GROUP BY keys, on demand, every run. Same for * APPLY grouping and for CUBE / GROUPING SETS.
b Root cause explained? The APPLY matcher in resolveMatcher scans APPLY expressions and suppresses the group_by_use_nulls Nullable-wrapping only when it finds an AGGREGATE function name (AggregateFunctionFactory::isAggregateFunctionName). It misses grouping, so * APPLY x -> grouping(x) does not set the suppression flag; the matched number key is cloned Nullable before the lambda resolves. GroupingFunctionsResolvePass then compares the rewritten grouping(number) argument against the ORIGINAL non-Nullable GROUP BY keys, no longer matches, and raises the spurious BAD_ARGUMENTS.
c Fix matches root cause? Yes. The APPLY-tree scan now also detects grouping (exact name comparison), so the suppression flag is set for grouping() args exactly as for aggregate args, and the key is left non-Nullable for GroupingFunctionsResolvePass to match. It directly closes the aggregate-vs-grouping asymmetry the bot identified, not a symptom.
d Test intent preserved / new tests added? Yes. No existing assertion weakened. Added 5 focused regression cases to 04326_91119_apply_lambda_aggregate_rollup (lambda grouping(x), function-form grouping, CUBE, GROUPING SETS, nested grouping(x)+1), plus the pre-existing non-aggregate-APPLY sanity case still asserts Nullable wrapping after ROLLUP.
e Both directions demonstrated? Yes. Pre-fix binary (Build ID b5f4041648): all cases crash with BAD_ARGUMENTS. Post-fix binary (Build ID 42b4af186875): correct grouping bits (0/1, nested 1/2), verified against a real memory-capped server and matching the .reference exactly.
f Fix is general across code paths? Yes. The scalar path already used hasAggregateOrGroupingFunction(); this change brings the sibling APPLY-matcher path to parity. Both APPLY forms (lambda and function) and grouping nested inside other functions are covered because the scan walks the whole expression tree. The detection now matches the canonical ExpressionsStack::isAggregateOrGroupingFunction.
g Fix generalizes across inputs? Yes. Verified across all GROUP BY shapes (ROLLUP / CUBE / GROUPING SETS), both APPLY spellings, top-level and nested grouping, and multi-key groups. Not type-specific (grouping returns a fixed integer bitmask); no type-wrapper matrix applies. Non-grouping/non-aggregate APPLY still wraps Nullable, confirming the suppression is scoped, not blanket.
h Backward compatible? Yes. No setting, on-disk/wire format, or default changed. This makes a previously-erroring valid query succeed; behavior when group_by_use_nulls=0 is unchanged. No SettingsChangesHistory entry needed.
i Invariants and contracts preserved? Yes. The invariant is that a GROUP BY key referenced by grouping() must reach GroupingFunctionsResolvePass in its original (non-Nullable) form; the fix upholds it on the APPLY path. The Nullable wrapping of genuine projection keys after ROLLUP/CUBE is unchanged (non-aggregate APPLY sanity case still returns the \N totals row). No concurrency or ownership contract touched (analyzer-only, single-threaded resolution).

Session id: cron:clickhouse-worker-slot-2:20260701-172600

@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.50% +0.10%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.70% +0.10%

Changed lines: Changed C/C++ lines covered: 57/61 (93.44%) · Uncovered code

Full report · Diff report

@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 775264e

CI fully finished (Finish Workflow: success, Mergeable Check: success). This PR fixes a Bad-cast in correlated subquery + GROUP BY ROLLUP under group_by_use_nulls (analyzer). The one red leaf is a chronic trunk AST-fuzzer finding, not caused by this PR, and it has an owner.

Check / test Reason Owner
AST fuzzer (amd_debug, targeted, old_compatibility) / "Input nodes size mismatch in dag" (STID 4752-5ad8) chronic trunk AST-fuzzer finding in the join-order optimizer DAG (JoinExpressionActions), surfaced when the fuzzer mutates this PR's new correlated-subquery test queries into GROUPING SETS shapes. 6 unrelated PRs + 1 master hit / 30d; not in this PR's analyzer diff. #106418 (merged 2026-06-04, present on this branch) did not seal this manifestation. a fix task is moved to pending (investigating at full effort; fixing-PR link to follow here)
PR / Mergeable Check aggregator rollup of the row above
CH Inc sync private sync CH Inc sync (private, not actionable)

Session id: cron:our-pr-ci-monitor:20260701-223000

@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

CI finish ledger update — STID 4752-5ad8 owner resolved to a fixing PR

The AST-fuzzer "Input nodes size mismatch in dag" / "Cannot find input column ... on its position in inputs of expression actions DAG" leaf (STID 4752-5ad8) now resolves to a merged fixing PR, not a fix task.

Fixing PR: #107112 ("Fix LOGICAL_ERROR in decorrelated correlated subquery join with duplicate columns"), merged 2026-07-01 20:20 UTC.

Root cause (confirmed by local reproduction): a correlated scalar subquery is decorrelated into a JOIN; when its input side projects the same column identifier twice and correlated_subqueries_default_join_kind = 'left' swaps that side onto the left join child, the join's ActionsDAG keeps a single input for the duplicated name (the unreferenced duplicate input is pruned) while the child step still produces two same-named columns. buildPhysicalJoinImpl (JoinStepLogical.cpp:1313) popped from a per-name deque, underflowed on the second column, and aborted the server. #107112 keeps the last remaining DAG input node when the deque is exhausted.

Deterministic minimal trigger (2 settings):

SET allow_experimental_correlated_subqueries = 1;
SET correlated_subqueries_default_join_kind = 'left';
SELECT x, (SELECT y) FROM (SELECT 1 AS x, 2 AS y, x, y QUALIFY 7) LIMIT 100;

Verified locally: server aborts 5/5 without #107112, runs clean with it. The old_compatibility AST fuzzer variant sets --compatibility=24.3, which flips correlated_subqueries_default_join_kind to left, which is why this surfaced when the fuzzer mutated this PR's correlated-subquery test queries.

Why this branch still saw the crash: this PR's failing run (commit 775264e6b39, 2026-07-01 18:50 UTC) predates the #107112 merge (20:20 UTC), so the build did not yet contain the fix. #106418 (the earlier overlapping-column-names guard) does not cover this path because buildPhysicalJoin runs unconditionally, not only on the reorder path it guards. Merging master into this branch picks up #107112; the leaf should not recur on the next CI run.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104350&sha=775264e6b397184709c2c985f9115416572c8740&name_0=PR&name_1=AST%20fuzzer%20%28amd_debug%2C%20targeted%2C%20old_compatibility%29

Session id: cron:clickhouse-worker-slot-1:20260701-225000

@groeneai

groeneai commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 775264e

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
AST fuzzer (amd_debug, targeted, old_compatibility) / "Input nodes size mismatch in dag" (STID 4752-5ad8) chronic trunk analyzer DAG bug (30d: 5 unrelated PRs, unrelated to this PR's correlated-subquery/ROLLUP fix) #103088 (ours, open)

Session id: cron:our-pr-ci-monitor:20260703-043000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

5 participants