iframe-proxy

groeneai · 2026-05-08T08:49:47Z

A correlated subquery references an outer column that is a GROUP BY key of an outer query under group_by_use_nulls = 1 with WITH ROLLUP/CUBE/GROUPING SETS. The actual values fed into the inner subquery are post-rollup Nullables, but the analyzer used to leave the inner column reference at its non-Nullable type. The planner then built CAST wrappers and aggregate functions with the original signature, and at header inference (after decorrelation replaced PLACEHOLDER nodes with INPUT nodes of the actual Nullable types) the precomputed wrappers raised Logical error: 'Bad cast from type DB::ColumnNullable to DB::ColumnVector<...>'.

The minimal reproducer from issue #91119 hits createUInt8ToBoolWrapper:

SELECT (SELECT c0)
FROM (SELECT 1::Bool) t0(c0)
GROUP BY c0 WITH ROLLUP
SETTINGS group_by_use_nulls = 1;

The same surface bug class also appears for plain numeric types (modulo, plus), aggregate functions over correlated outer columns (e.g. (SELECT anyLastOrDefault(number))), correlated HAVING/WHERE filter steps, and Nullable source columns.

The root cause is in QueryAnalyzer::resolveExpressionNode: the nullable_group_by_keys walk stops at the first enclosing QUERY scope, so the outer scope where the column was defined is never consulted for an inner correlated reference. This change makes the walk continue past the inner QUERY scope when the node is a column whose source table expression is registered in a deeper outer scope (i.e. the column is correlated). The aggregate-function exclusion is now evaluated against each scope's expressions_in_resolve_process_stack, so an outer scope's nullable_group_by_keys applies even when the inner subquery is currently inside an inner aggregate (the inner aggregate operates on the outer's already-Nullable values).

For correlated matches the existing node is mutated in place instead of being replaced with a clone of the GROUP BY key. The same shared_ptr is referenced by the enclosing QueryNode::correlated_columns list (registered via addCorrelatedColumn); cloning would leave the planner's correlated_columns_set pointing to the original non-Nullable node, and PlannerActionsVisitor::visitColumn (which uses type-sensitive equality) would then fail to identify the projected column as correlated and skip the PLACEHOLDER step that the decorrelation pass replaces.

Closes #91119.
Closes #106377.

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed Bad cast exception in correlated subqueries that reference an outer column promoted to Nullable by group_by_use_nulls = 1 with GROUP BY WITH ROLLUP/CUBE/GROUPING SETS.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

A correlated column reference resolved inside a subquery used to keep its original (non-Nullable) type even when the column is a GROUP BY key of an outer query under `group_by_use_nulls = 1` with `WITH ROLLUP`/`CUBE`/ `GROUPING SETS`. The analyzer's `nullable_group_by_keys` walk stopped at the first enclosing `QUERY` scope, so the outer scope where the column was defined was never consulted for the inner reference. The planner then built the inner subquery's `ActionsDAG` (CAST wrappers, aggregate functions, etc.) with the original argument types. After the decorrelation pass replaced `PLACEHOLDER` nodes with `INPUT` nodes whose actual types were Nullable (because the outer GROUP BY made the keys Nullable), the precomputed wrappers received columns of an unexpected type and threw `Logical error: 'Bad cast from type DB::ColumnNullable to DB::ColumnVector<...>'` during header inference. The fix lifts the walk so it also visits enclosing outer-query scopes for column nodes whose source table expression is not registered in the current scope's FROM (i.e. correlated). At each scope, the `expressions_in_resolve_process_stack.hasAggregateFunction()` exclusion is evaluated against THAT scope's stack, so an outer scope's `nullable_group_by_keys` correctly applies even when the inner subquery is currently inside an inner aggregate function call (the inner aggregate operates on the outer's already-Nullable values). For correlated matches the existing column node is mutated in place instead of being replaced with a clone of the GROUP BY key node. The same `shared_ptr` is referenced by the enclosing `QueryNode`'s `correlated_columns` list (registered via `addCorrelatedColumn`); cloning would leave the planner's `correlated_columns_set` pointing to the original non-Nullable node, and `PlannerActionsVisitor::visitColumn` (which uses type-sensitive equality) would then fail to identify the projected column as correlated and skip the `PLACEHOLDER` step that the decorrelation pass replaces. Closes ClickHouse#91119.

groeneai · 2026-05-08T08:50:12Z

Pre-PR validation gate (TASK.md Phase 4 Step 9):

a) Deterministic repro? Yes. The reproducer in issue #91119 (SELECT (SELECT c0) FROM (SELECT 1::Bool) t0(c0) GROUP BY c0 WITH ROLLUP SETTINGS group_by_use_nulls = 1) deterministically aborts under debug builds.

b) Root cause explained? Yes. The analyzer's nullable_group_by_keys walk in QueryAnalyzer::resolveExpressionNode stops at the first enclosing QUERY scope, so the outer scope's GROUP BY-key set is never consulted for column references inside a correlated subquery. The inner column reference keeps its non-Nullable type, the planner builds CAST wrappers and aggregate functions for that signature, and the decorrelation pass replaces PLACEHOLDER nodes with INPUT nodes whose actual types are Nullable (because the outer GROUP BY made the keys Nullable post-rollup). The precomputed wrappers then raise Bad cast from type ColumnNullable to ColumnVector<...> during header inference.

c) Fix matches root cause? Yes — the fix is in the analyzer, not the planner. PR #100365 attempted a planner-side reconciliation (reconcileInputTypesAfterDecorrelation rebuilding function_base after types change) and was rejected by @alexey-milovidov as a band-aid ("we should fix the root cause, why the types were incorrect in the first place"). The fix here lifts the analyzer walk so it consults outer scopes for correlated column references and converts them in place.

d) Test intent preserved? / New tests added? Both. The new test 04210_91119_correlated_subquery_group_by_use_nulls.sql covers WITH ROLLUP/CUBE/GROUPING SETS, Bool (the original CAST wrapper case), numeric types, complex correlated expressions (c0 + 1, c0 % 2), aggregate function over correlated outer column (anyLastOrDefault), correlated HAVING, correlated WHERE, Nullable source, and a sanity check that a non-correlated subquery is NOT affected. I also re-ran 22 existing group_by_use_nulls and correlated-subquery tests against this change — all pass. The existing aggregate-function exclusion (max(c0) inside the same query keeps c0 non-Nullable) is preserved by evaluating it per-scope rather than upfront.

e) Demonstrated in both directions? Yes. Without the fix the server crashes with Logical error: 'Bad cast from type DB::ColumnNullable to DB::ColumnVector<char8_t>'. With the fix the original reproducer returns true \N and the broader test set (12 query shapes) passes.

f) Fix is general, not a narrow patch? Yes. The fix is in resolveExpressionNode, which is recursively called for every expression node, so the type propagation handles columns wrapped in functions, aggregates over correlated columns, lambda-wrapped expressions, and HAVING/WHERE filter contexts uniformly. Symmetric callers (resolveFunction, resolveExpressionNodeList, etc.) all dispatch through this entry point. The match is gated on the column's source table expression so unrelated identically-named columns in non-correlated subqueries are not affected.

Session: cron:clickhouse-ci-task-worker:20260508-074500

groeneai · 2026-05-08T08:50:35Z

cc @novikd @KochetovNicolai — could you review this? It fixes #91119 by extending the analyzer's nullable_group_by_keys walk so that correlated columns referenced from inside a subquery pick up the Nullable type promotion from the outer scope's GROUP BY ROLLUP/CUBE/GROUPING SETS, which was previously silently lost and caused Bad cast ColumnNullable -> ColumnVector exceptions during planner header inference.

groeneai · 2026-05-08T08:50:36Z

cc @PedroTadim — invited me here. The reproducer from #91119 (and related cases — CUBE, GROUPING SETS, aggregate function over correlated outer column, HAVING/WHERE with correlated subqueries, Nullable source) all return correct results with this fix. Also non-regressed against the existing 02343_group_by_use_nulls / 02535_analyzer_group_by_use_nulls / 03014_analyzer_group_by_use_nulls / 03315_analyzer_correlated_subqueries and the four 03408/03411 correlated subquery test suites.

clickhouse-gh · 2026-05-08T10:15:56Z

Workflow [PR], commit [775264e]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info	comment
AST fuzzer (amd_debug, targeted, old_compatibility)		FAIL
	Logical error: Input nodes size mismatch in dag: A, expected: [B], [C] (STID: 4752-5ad8)	FAIL	cidb

AI Review

Summary

This PR reworks group_by_use_nulls handling in QueryAnalyzer so correlated references can inherit the owning outer query's nullable GROUP BY key type, while still preserving the existing aggregate / grouping suppression rules for same-query resolution and * APPLY. After reviewing the current PR head, the full thread history, and the updated regression coverage, I did not find a remaining correctness issue that warrants a new review comment.

Final Verdict

✅ No new findings on the current PR head.

groeneai · 2026-06-04T07:47:27Z

CI Status — covered SHA `30c60275`

Master-merge commit 30c60275 brings the analyzer correlated-rollup fix up to current trunk. CI fully finished. Sole "code" failure is a chronic Fast-test infra hang signature, not PR-caused.

Check	Test	Disposition
`Fast test`	`Server died`	Chronic Fast-test infra flake. The Fast-test runner reports `WARNING: Timeout exceeded [300], sending SIGTERM` followed by `Hung check failed: server is not responding`. CIDB 30d shows `Server died` is endemic across 100+ PRs (top: PR #104754=33, PR #100185=83) and master (20 hits). Last-test-before-hang on this run was `01548_uncomparable_columns_in_keys` at 06:13:31, then a 42-second silence before the timeout — clearly not produced by the analyzer changes in this PR (analyzer/group-by-use-nulls).
`Mergeable Check` / `PR` / `CH Inc sync`	n/a	Aggregator/infra rollup of the above.

Code change unchanged since 2026-05-08. The 22 related existing tests still pass; the comprehensive new test 04210_91119_correlated_subquery_group_by_use_nulls.sql (12 query shapes) verifies the analyzer-level fix. Awaiting @novikd / @KochetovNicolai re-review.

— cron:our-pr-ci-monitor:20260604-073000

The previous commit replaced the OR-accumulated aggregate-function check in `resolveExpressionNode` with a per-scope check, intending to allow the walk to continue past the enclosing `QUERY` scope for correlated subqueries. That broke `* APPLY x -> agg(x, key)` queries under `group_by_use_nulls = 1`: * The LAMBDA scope sees the aggregate on its `expressions_in_resolve_process_stack` but its own `nullable_group_by_keys` is empty, so the per-scope check skips that scope's lookup. * The outer `QUERY` scope holds the keys but its own stack does not see the aggregate (the aggregate lives inside the lambda body), so the per-scope check finds the key and wraps the inner reference as `Nullable`. * The aggregate's `addBatchSinglePlace` then runs on a `ColumnNullable` where its signature expects `ColumnVector<UInt64>` (or `ColumnString` for the `min` family), and the `assert_cast` faults inside `SingleValueDataFixed::set`. Fast test on PR ClickHouse#104350 SHA `30c602753b` reproduced this on `04067_group_by_use_nulls_apply_aggregate.sql` (server died, signal 11 in `AggregateFunctionNullVariadic::addBatchSinglePlace`). The exact same shape appears in the AST-fuzzer findings tracked under STID `0250-30be`, `1499-22cb`, `3336-34e8`, and `3336-2f87`. Restore the OR-accumulated `in_aggregate_function_scope` flag for scopes within the same query nesting level, and reset it when crossing into an outer `QUERY` scope. The reset preserves the new correlated-subquery behaviour: an outer scope's GROUP BY keys are evaluated relative to the outer scope's own resolution context, so an aggregate in an inner subquery must not shield against the outer scope's `nullable_group_by_keys`. Adds a regression test `04326_91119_apply_lambda_aggregate_rollup.sql` covering nine variants of `* APPLY <lambda>` plus an aggregate plus `ROLLUP` / `CUBE` / `GROUPING SETS` under `group_by_use_nulls = 1`, including the regex-filtered APPLY shape from issue ClickHouse#98190 (STID `1499-22cb`). The existing `04067_group_by_use_nulls_apply_aggregate` continues to pass. Local verification: master HEAD passes; PR HEAD without this fix aborts on `04067` query 1; PR HEAD with this fix passes `04067`, `04210` (PR's own test), and `04326` (new regression test). Related: ClickHouse#91119 Related: ClickHouse#98190 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-08T14:53:18Z

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. `build/programs/clickhouse local -q "SET group_by_use_nulls = 1; SELECT * APPLY x -> argMax(x, number) FROM numbers(1) GROUP BY GROUPING SETS ((materialize(65537)), (*));"` aborts with SIGABRT on PR HEAD `30c602753be` (without the new commit), passes on master HEAD `7a725b4ef59`, and passes on PR HEAD + the new commit. Single command, every run.
b	Root cause explained?	Yes. The previous commit replaced the OR-accumulated `in_aggregate_function_scope` check in `QueryAnalyzer::resolveExpressionNode` with a per-scope check, intending to allow the walk to continue past the enclosing `QUERY` scope for correlated subqueries. For `* APPLY x -> agg(x, key)` queries the LAMBDA scope sees the aggregate on its `expressions_in_resolve_process_stack` but its own `nullable_group_by_keys` is empty (per-scope check skips); the outer `QUERY` scope holds the keys but its own stack does not see the aggregate (per-scope check finds the key and wraps the inner reference as `Nullable`). The aggregate's `addBatchSinglePlace` then runs on a `ColumnNullable` where the function signature expects `ColumnVector<UInt64>` (or `ColumnString` for the `min` family), and the `assert_cast` inside `SingleValueDataFixed::set` faults. The Fast test crash on `04067_group_by_use_nulls_apply_aggregate` (signal 11 in `AggregateFunctionNullVariadic::addBatchSinglePlace`) and the AST-fuzzer findings under STID `0250-30be`, `1499-22cb`, `3336-34e8`, `3336-2f87` are all the same shape.
c	Fix matches root cause?	Yes. Restore the OR-accumulated `in_aggregate_function_scope` flag for scopes within the same query nesting level, and reset it when crossing into an outer `QUERY` scope. The OR within a query reproduces the original (pre-correlated) suppression that the aggregate "consumes" the GROUP BY nullability boundary regardless of which scope recorded it; the reset across `QUERY` boundaries preserves the new correlated-subquery behaviour: an aggregate in an inner subquery must not shield against an outer scope's `nullable_group_by_keys`. The fix targets the analyzer-level scope walk, not the cast site.
d	Test intent preserved / new tests added?	Yes. Existing test `04067_group_by_use_nulls_apply_aggregate` (the one Fast test failed on) continues to pass without modification. PR's own new test `04210_91119_correlated_subquery_group_by_use_nulls` continues to pass without modification. New regression test `04326_91119_apply_lambda_aggregate_rollup.sql` added, covering nine variants of `* APPLY <lambda>` plus an aggregate plus `ROLLUP` / `CUBE` / `GROUPING SETS` under `group_by_use_nulls = 1`, including the regex-filtered APPLY shape from issue #98190 (STID `1499-22cb`), nested aggregate inside a function, function-form APPLY, and a non-aggregate APPLY sanity row that confirms the suppression is genuinely scoped to the aggregate-argument path and not a blanket disable.
e	Both directions demonstrated?	Yes. Built two binaries from the same PR HEAD on the same machine. Without the new commit: `04067` query 1 aborts with SIGABRT (core dumped); `04326` aborts on its first query. With the new commit: `04067`, `04210`, and `04326` all run to completion and produce output matching their `.reference` files.
f	Fix is general, not a narrow patch?	Yes. The fix is at the analyzer-level scope walk (`QueryAnalyzer::resolveExpressionNode`), not at any cast site. All paths that resolve column references under `group_by_use_nulls = 1` benefit. The new test exercises 9 distinct shapes (regular `ROLLUP`/`CUBE`/`GROUPING SETS`, multiple aggregate functions including `argMax`, `sum`, `min`-via-function-form-APPLY, regex-filtered APPLY, nested aggregate-inside-`toString`, deeply-nested aggregate, and a non-aggregate APPLY sanity row). The fix traces the bad value to its origin (the analyzer setting up a wrong scope walk) instead of guarding against the symptom downstream.

Session id: cron:clickhouse-ci-task-worker:20260608-134700

groeneai · 2026-06-08T19:38:53Z

CI Ledger - covered SHA: `6f52db2`

Workflow PR finished 2026-06-08T18:52:53Z. Triage of the four reds:

Check	Failure	Classification	Disposition
BuzzHouse (amd_debug)	`Logical error: Mutation of \`Memory` table produced incomplete output (STID: 2380-2e8c)`	chronic trunk bug	Tracked. Active fix PR #106615. 30-day cross-PR: 14 master hits, 30+ unrelated PRs. PR diff is `src/Analyzer/Resolve/QueryAnalyzer.cpp` + a stateless test only - completely unrelated to `Memory` table mutation pipeline.
BuzzHouse (amd_tsan)	Same STID family (2380-3043)	chronic trunk bug	Same as above.
Mergeable Check	aggregator	reflects the two BuzzHouse fails	clears when STID 2380 family clears.
PR	aggregator	reflects above	same.

The Fast-test failure on 04067_group_by_use_nulls_apply_aggregate from the prior HEAD has been fixed by this commit (analyzer per-scope check now restores OR-accumulated in_aggregate_function_scope for scopes within the same query nesting level, matching the original semantic). All other 143 checks pass.

Session: cron:our-pr-ci-monitor:20260608-193000

…d-rollup-cast-wrapper

…lls to a correlated column `nullable_group_by_keys` matching uses query-tree equality, and `ColumnNode::isEqualImpl` compares only the column name and type, not the column source. For a correlated reference this meant an intermediate query with a same-named, same-typed `GROUP BY` key (under `WITH ROLLUP`/`CUBE`/`GROUPING SETS`) could match the outer column and wrap it as `Nullable`, even though that intermediate query's rollup does not produce `NULL`s for the outer column. Require that the scope whose `nullable_group_by_keys` is being applied actually owns the column's source table expression (the same ownership test already used at the `QUERY`-boundary break). Otherwise keep walking outwards to the owning scope. Single-level correlated cases (the only ones currently reachable; deeper correlation through an intermediate grouped query is `NOT_IMPLEMENTED` in the planner) are unaffected, as confirmed by the existing tests. Also update the stale block comment to describe the actual OR-accumulate-within-query / reset-at-`QUERY`-boundary rule for the aggregate-function exclusion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-06-09T08:19:11Z

Merged master (branch was ~1547 commits / ~7 days behind and red) — clean merge, no conflicts.

Addressed both review threads in 62084af:

Correctness (QueryAnalyzer.cpp:3440): an outer scope's nullable_group_by_keys is now applied to a correlated column only when that scope actually owns the column's source table expression (registered_table_expression_nodes.contains(column_source)), mirroring the ownership test already used at the QUERY-boundary break. This prevents an intermediate query with a same-named, same-typed GROUP BY key from matching a correlated outer column (ColumnNode::isEqualImpl compares only name + type, not source). The deeper multi-level scenario is currently NOT_IMPLEMENTED in the planner, so this is a defensive guard; single-level correlated cases are unaffected (existing tests pass).
Stale comment (QueryAnalyzer.cpp:3406): rewritten to describe the actual OR-accumulate-within-query / reset-at-QUERY-boundary rule.

Verified locally: the #91119 reproducer now returns Nullable values instead of the Bad cast exception, and both 04210/04326 tests pass against the rebuilt binary.

Remaining CI reds are all the BuzzHouse Logical error: Mutation of Memory table produced incomplete output finding (STID 2380) in StorageMemory::mutate on ALTER ... APPLY PATCHES/MODIFY TTL — unrelated to this analyzer change.

groeneai · 2026-06-09T13:37:23Z

CI Ledger — covered SHA: `62084af978d84d75e9f27292fa61bbcd70c1cb42`

Workflow PR finished 2026-06-09T13:10:28Z (success); Finish Workflow and Mergeable Check pass; gate satisfied (≥20 min CIDB ingestion buffer met, no jobs queued/in-progress). CIDB reports zero FAIL/ERROR rows on this HEAD across all 162 completed jobs — every Stateless / Integration / AST fuzzer / BuzzHouse / Stress / Unit / Performance / SQLLogic / Compatibility / Install variant is green.

The previous ledger covered 6f52db2ce0c (analyzer OR-accumulated in_aggregate_function_scope restore). HEAD advanced to 62084af978d after @ alexey-milovidov pushed the master merge plus the correctness guard at QueryAnalyzer.cpp:3440 and the comment rewrite at :3406 to address his own earlier review threads — both are resolution commits, not new directives. Local verification (issue #91119 reproducer, 04067_group_by_use_nulls_apply_aggregate, the new 04326_91119_apply_lambda_aggregate_rollup regression test) confirmed the fix held.

No outstanding CI items. Awaiting human review on the analyzer change.

novikd · 2026-06-09T15:55:21Z

+    QueryTreeNodePtr column_source;
+    if (auto * column_node = node->as<ColumnNode>())
+        column_source = column_node->getColumnSourceOrNull();


I guess it won't work in cases like GROUP BY c % 2

Confirmed locally. With

SET enable_analyzer=1, allow_experimental_correlated_subqueries=1, group_by_use_nulls=1; EXPLAIN QUERY TREE SELECT (SELECT c % 2) FROM (SELECT toInt32(3) AS c) t GROUP BY c % 2 WITH ROLLUP;

the inner query tree is

QUERY id: 2, alias: __table1, is_subquery: 1, is_correlated: 1 PROJECTION COLUMNS modulo(c, 2) Int16 <-- should be Nullable(Int16) PROJECTION FUNCTION id: 7, function_name: modulo, result_type: Int16 ARGUMENTS COLUMN id: 4, column_name: c, result_type: Int32, source_id: 5 <-- should be Nullable(Int32)

The inner expression is left non-Nullable because the walk only captures column_source for a bare ColumnNode, so a FunctionNode like c % 2 (or any compound expression that contains a correlated column) hits if (!column_source) break; at the inner QUERY boundary and never reaches the outer scope's nullable_group_by_keys, which stores c % 2 itself, not bare c.

The planner currently rejects this shape with NOT_IMPLEMENTED: can't find correlated column ... in current header: __grouping_set, modulo(__table3.c, 2_UInt8), so it does not surface to users today, but the analyzer is producing the wrong type. The right shape of the fix is to drive the walk by "this expression contains a correlated column whose source lives in an outer scope" rather than by node->as<ColumnNode>()->getColumnSourceOrNull(). I will pivot in that direction as part of addressing C3 / C4 below.

novikd · 2026-06-09T15:55:41Z

+                bool is_correlated_to_outer_scope = column_source
+                    && !scope.registered_table_expression_nodes.contains(column_source)
+                    && scope_ptr != &scope;
+                if (is_correlated_to_outer_scope)
+                {
+                    /// `nullable_group_by_keys` matching uses query-tree equality, which for
+                    /// columns (`ColumnNode::isEqualImpl`) compares only the column name and
+                    /// type, not the column source. For a correlated reference this means an
+                    /// intermediate query with a same-named, same-typed GROUP BY key could
+                    /// match `node` even though that query's rollup does not produce `NULL`s
+                    /// for the outer column. Only the outer query that actually owns the
+                    /// column's source table expression may apply its `group_by_use_nulls`
+                    /// nullability, so require source ownership before wrapping; otherwise keep
+                    /// walking outwards to the owning scope.
+                    if (scope_ptr->registered_table_expression_nodes.contains(column_source))
+                    {
+                        node->convertToNullable();
+                        break;
+                    }


Same problem

Same root cause as C1. The branch wraps the bare-column path: when is_correlated_to_outer_scope is true (so node is a bare ColumnNode) AND we have reached the source-owning scope, the column is mutated in place and we break out. For the bare-c shape this works (e.g. (SELECT c) FROM t GROUP BY c WITH ROLLUP, group_by_use_nulls=1 correctly produces Nullable(Int32) on the inner column). The issue is the surrounding loop, not this branch in isolation: once we leave the bare-column case (e.g. the inner expression is c % 2), column_source is null and the walk stops at the inner QUERY boundary, so we never get the chance to apply node->convertToNullable() to the compound expression here. C3 covers the actual stop condition; once that's fixed the in-place mutation here needs to be widened to compound expressions too (mutate node rather than clone, otherwise the outer QueryNode::correlated_columns_list and the planner's correlated_columns_set end up referencing the original non-Nullable column).

novikd · 2026-06-09T15:56:40Z

+        if (!column_source)
+            break;


This condition does not make sense.

Agreed. if (!column_source) break; is what makes the C1 / C2 case slip through: a non-ColumnNode (e.g. a FunctionNode for c % 2) carries no column_source, so we stop at the inner QUERY boundary even when the expression is correlated through child columns whose real source lives further out. The condition was a literal port of the previous column_source design, but the design itself is too narrow.

What we actually want at this point is "keep walking iff the expression we are resolving is correlated relative to the current scope", which is a property of the expression's correlated descendants, not just its top-level node. Concretely: walk node's subtree once before the loop, collect all ColumnNode sources that are NOT registered in &scope's registered_table_expression_nodes, and use that set to drive both (a) whether we walk past the inner QUERY and (b) which outer scope is the source-owning one. For a bare-column node, the set is just {column_source} and the behaviour is unchanged; for a compound expression like c % 2, the set is {c} and the walk continues to the outer scope where nullable_group_by_keys.find(c % 2) matches. A query whose expression mixes correlated columns from different outer scopes already triggers NOT_IMPLEMENTED upstream in checkCorrelatedColumn / decorrelation, so a single owning-source assumption is safe to make here.

I will rework the condition along those lines and re-test the cases that were the original reason for the walk-past-QUERY (the (SELECT c) shape, 04210, 04326).

novikd · 2026-06-09T15:57:30Z

+        if (scope_ptr->registered_table_expression_nodes.contains(column_source))
+            break;
+
+        in_aggregate_function_scope = false;


This seems to be incorrect.

Agreed, and reworked. In the rewritten applyGroupByUseNullsToExpression, the reset only happens when we deliberately continue the walk into an outer query for a correlated expression that has not yet reached its source-owning scope:

if (!is_correlated() || ownsCorrelatedSource(*scope_ptr, *correlated_columns)) break; is_aggregate_argument = false;

So we never reset (and never walk outward) for a non-correlated expression, and we stop as soon as we reach the scope that owns the correlated column's source. The reset itself is correct for the cases that do continue: the inner subquery's aggregate context (e.g. the inner argMax in * APPLY x -> argMax(x, key)) constrains only the inner query's own keys; it must not suppress wrapping of a reference that is correlated to an outer query's GROUP BY key, because the values fed into the inner query are the outer query's post-rollup Nullables. The previous unconditional reset at every QUERY boundary was the part that did not hold up.

Addresses review feedback on the group_by_use_nulls Nullable-wrapping block in QueryAnalyzer::resolveExpressionNode. The previous version decided whether to walk past an enclosing QUERY scope for a correlated reference using node->as<ColumnNode>()->getColumnSourceOrNull(). That only recognises a bare column as correlated, so a compound expression such as `c % 2` (the GROUP BY key being `c % 2`) was left non-Nullable, and the stop condition `if (!column_source) break;` conflated "not a column" with "not correlated". Extract the logic into applyGroupByUseNullsToExpression and drive it by whether the expression is correlated relative to the current scope, computed with the existing CorrelatedColumnsCollector (the same mechanism used elsewhere in the resolver). The walk now continues past a QUERY boundary only for a genuinely correlated expression and only up to the scope that owns the correlated column's source, so an intermediate query with a same-named GROUP BY key cannot match. A correlated match wraps the node in place to preserve the shared_ptr identity that the planner's correlated-columns set relies on. A fast path returns early when no enclosing scope has any nullable_group_by_keys, and the correlated-columns analysis is computed lazily, so queries without group_by_use_nulls pay only a cheap scope-chain check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-10T09:57:03Z

Pushed be05b67e0cc reworking the group_by_use_nulls correlated-wrapping logic along the lines discussed in the C1/C3 threads. Extracted it into applyGroupByUseNullsToExpression and made it driven by whether the resolved expression is correlated, using the existing CorrelatedColumnsCollector instead of node->as<ColumnNode>()->getColumnSourceOrNull().

What changed per comment:

C1 / C2 (compound key c % 2): correlation is now a property of the expression's columns, not its top-level node type. CorrelatedColumnsCollector collects correlated columns from the whole subtree, so c % 2 is recognised and wrapped in place. EXPLAIN QUERY TREE for SELECT (SELECT c0 % 2) FROM (SELECT toInt32(3)) t0(c0) GROUP BY c0 % 2 WITH ROLLUP now shows modulo(c0, 2) Nullable(Int16) instead of Int16.
C3 (if (!column_source) break;): replaced with if (!is_correlated() || ownsCorrelatedSource(...)) break;. The walk continues past a QUERY boundary only for a genuinely correlated expression and only until the scope that owns the correlated source, so an intermediate query with a same-named key cannot match.
C4 (in_aggregate_function_scope = false;): the reset now happens only on the deliberate continuation into an outer query for a not-yet-resolved correlated expression (see the thread reply).

One honesty note on scope: the compound-key correlated shapes (C1/C2) are currently rejected by the planner before execution, on both master and this branch:

SELECT (SELECT c0 % 2) FROM (SELECT toInt32(3)) t0(c0) GROUP BY c0 % 2 WITH ROLLUP
-> NOT_IMPLEMENTED: can't find correlated column '__table3.c0' in current header: __grouping_set, modulo(__table3.c0, 2_UInt8)

So this part of the change fixes the analyzer's type, which is the correctness issue you flagged, but there is no runnable query that demonstrates a row-level difference for it yet. The reachable shapes (bare GROUP BY key, with bare or compound correlated references, plus the original * APPLY ... argMax cases) are covered by the runtime regression tests 04210 and 04326, which still pass, and I confirmed no regressions across the *group_by_use_nulls* and *correlated* stateless suites. Happy to add an EXPLAIN QUERY TREE test that locks in the c % 2 Nullable type if you want one, or to hold it until the planner supports the shape.

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. The original crash family (`* APPLY x -> argMax(x, key)` + ROLLUP under `group_by_use_nulls`) is covered by `04326`; the reviewed type bug is `EXPLAIN QUERY TREE SELECT (SELECT c0 % 2) FROM (SELECT toInt32(3)) t0(c0) GROUP BY c0 % 2 WITH ROLLUP` showing `Int16` before / `Nullable(Int16)` after.
b	Root cause explained?	The walk decided "is this correlated, keep walking outward" from `node->as<ColumnNode>()`, so a compound expression (no top-level column source) stopped at the inner QUERY boundary and never consulted the owning outer scope's `nullable_group_by_keys`. Correlation is a property of the subtree's columns, not the top-level node.
c	Fix matches root cause?	Yes. The walk is now driven by `CorrelatedColumnsCollector` (subtree correlation) and source ownership, so it handles bare columns and compound expressions uniformly and stops at the owning scope.
d	Test intent preserved / new tests added?	`04210` and `04326` unchanged and still pass. No test weakened. The fix is a refactor of the same wrapping logic with broader correctness; the compound-key case has no runnable test (planner NOT_IMPLEMENTED, see above).
e	Both directions demonstrated?	Yes. `04326` aborts on the no-fix branch and passes with the fix; the `c % 2` type is `Int16` without the rework and `Nullable(Int16)` with it.
f	Fix is general, not a narrow patch?	Yes. Drives off the canonical correlation mechanism rather than a special-case top-level-column check, covering all expression shapes, and adds a fast path so non-`group_by_use_nulls` queries pay only a cheap scope-chain check.

Session id: cron:clickhouse-worker-slot-5:20260610-090200

The previous commit drove the group_by_use_nulls Nullable-wrapping in applyGroupByUseNullsToExpression by expression correlation, but used `scope_ptr == &scope` as the discriminator for the non-correlated clone-and-wrap branch. A GROUP BY key referenced inside a lambda body, e.g. `arrayMap(x -> intDiv(x, key), ...)` where `key` is a GROUP BY key of the surrounding query, is resolved with `&scope` pointing at the LAMBDA scope while the key lives in the enclosing QUERY scope. Such a reference is not correlated, so it matched neither the `scope_ptr == &scope` clone branch nor the correlated `ownsCorrelatedSource` branch, and was left non-Nullable. That produced two regressions visible in the Fast test: - 04001_logical_error_with_cube: the lambda-body key kept its non-Nullable type, so arrayMap returned Array(UInt8) instead of Array(Nullable(UInt8)), which changed DISTINCT semantics and emitted an extra row. - 03023_group_by_use_nulls_analyzer_crashes: the type mismatch between the wrapped projection column and the unwrapped lambda-captured column produced `Cannot capture column N because it has incompatible type: got Nullable(UInt8), but UInt8 is expected` (LOGICAL_ERROR). The matched key belongs to the same query as the reference whenever the expression is not correlated (whether resolved directly in the query or inside one of its lambda scopes), so take the clone-and-wrap branch for any non-correlated match. The `scope_ptr == &scope` check is kept as a fast path so the common direct-projection match skips the correlated-columns analysis. The correlated in-place wrap (gated by ownsCorrelatedSource) is unchanged, and the compound-key correlated case the previous commit addressed still wraps as Nullable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-10T11:29:46Z

Pushed e8d62a69667 to fix two Fast test regressions that the previous commit (be05b67e0cc) introduced. The previous commit drove the wrapping by expression correlation (addressing the c % 2 / source-ownership review feedback) but regressed non-correlated GROUP BY keys referenced inside lambda bodies.

What broke and why: applyGroupByUseNullsToExpression used scope_ptr == &scope as the discriminator for the non-correlated clone-and-wrap branch. A GROUP BY key referenced inside a lambda body, e.g. arrayMap(x -> intDiv(x, key), ...) where key is a GROUP BY key of the surrounding query, is resolved with &scope pointing at the LAMBDA scope while the key lives in the enclosing QUERY scope. That reference is not correlated, so it matched neither the scope_ptr == &scope clone branch nor the correlated ownsCorrelatedSource branch and was left non-Nullable. Result: 04001_logical_error_with_cube emitted an extra row (arrayMap returned Array(UInt8) instead of Array(Nullable(UInt8)), changing DISTINCT semantics) and 03023_group_by_use_nulls_analyzer_crashes hit Cannot capture column N ... got Nullable(UInt8), but UInt8 is expected (LOGICAL_ERROR).

Fix: take the clone-and-wrap branch for any non-correlated match (scope_ptr == &scope || !is_correlated()), not just the scope_ptr == &scope case. The scope_ptr == &scope check is kept as a fast path so the common direct-projection match skips the correlated-columns analysis. The correlated in-place wrap (gated by ownsCorrelatedSource) and the compound-key correlated case are unchanged.

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. `04001` reproduces via `clickhouse-local` (extra `1 []` row). `03023` reproduces via the test runner against a server built from `be05b67e0cc`: deterministic LOGICAL_ERROR `Cannot capture column 2 ... got Nullable(UInt8), but UInt8 is expected`, server SIGABRT (exit 134).
b	Root cause explained?	Yes. The previous commit used `scope_ptr == &scope` to decide the non-correlated clone branch. A non-correlated key referenced inside a lambda body has `&scope` = LAMBDA scope and the key in the enclosing QUERY scope (`scope_ptr != &scope`), and `CorrelatedColumnsCollector` returns empty for it, so it fell through both branches and stayed non-Nullable. The resulting type mismatch (wrapped projection vs unwrapped lambda capture) causes the extra DISTINCT row (04001) and the capture LOGICAL_ERROR (03023).
c	Fix matches root cause?	Yes. The discriminator becomes "is this reference correlated?": `scope_ptr == &scope \|\| !is_correlated()` takes the clone-and-wrap branch for any non-correlated match (direct or lambda-nested), exactly the case the previous commit dropped. Not a band-aid: the EXPLAIN QUERY TREE for 04001 is now byte-identical to the green parent `62084af978d`.
d	Test intent preserved / new tests added?	Preserved. No tests modified. `03023` and `04001` are the existing regression tests that now pass; `04210` and `04326` (this PR's own tests) continue to pass; `04067_group_by_use_nulls_apply_aggregate` (the earlier Fast test crash) passes. novikd's `c % 2` compound-key correlated case still wraps to Nullable.
e	Both directions demonstrated?	Yes. `be05b67e0cc`: 04001 emits extra row, 03023 LOGICAL_ERROR + SIGABRT. `e8d62a69667`: both pass via the test runner; 04001 tree identical to green parent.
f	Fix is general, not a narrow patch?	Yes. The fix corrects the wrapping decision at its source (the scope-walk discriminator), covering direct-projection, lambda-body, CUBE/ROLLUP/GROUPING SETS, and correlated shapes alike. Verified against the full `group_by_use_nulls` suite (16/16 non-stateful pass) and 75 correlated/cube/rollup/grouping_sets tests (0 unrelated failures; the only two skips were a `stateful` test needing `test.hits` and a test hardcoding HTTP port 8123, both passing once the local env matches CI).

Session id: cron:clickhouse-worker-slot-7:20260610-104500

groeneai · 2026-06-10T16:45:35Z

CI Ledger — covered SHA: `e8d62a696678d0d70504e09e17a60b1ac636ea21`

CI fully finished on this HEAD (Finish Workflow = success; no checks queued/in-progress). Triaging the two remaining red checks; neither is caused by this PR.

Check	Failure	Classification	Disposition
`Integration tests (amd_asan_ubsan, db disk, old analyzer, 1/6)`	`test_refreshable_mat_view_replicated/test.py::test_circular_dependencies_survive_restart`	chronic flaky (NOT PR-caused)	30-day cross-PR: 14+ distinct unrelated PRs + master. Tracked under issue #106651. This PR only touches `QueryAnalyzer.cpp` + `group_by_use_nulls` tests, nothing in refreshable-mat-view / ReplicatedMergeTree.
`Stateless tests (arm_asan_ubsan, azure, parallel)`	`Server died` / `OOM in dmesg` / `Logical error: Expected 3 to 10 arguments in table function azureBlobStorage, got 1`	infra outage (NOT PR-caused)	The `arm_asan_ubsan, azure, parallel` runner is in a broad failure window: 272 distinct tests across 41 unrelated PRs failed on this exact check in the last 12h (OOM / Server died). Independent of this PR's diff.
`Mergeable Check` / `PR` / `CH Inc sync`	aggregators	downstream	Reflect the two above; no independent signal.

The two Fast test failures from the prior commit (be05b67e0cc: 03023_group_by_use_nulls_analyzer_crashes, 04001_logical_error_with_cube) were PR-caused regressions from the rework and are fixed in e8d62a69667 (both verified passing locally in both directions). No PR-caused failures remain on the current HEAD.

A correlated subquery aggregating a String key (e.g. (SELECT min(c0)) over a GROUP BY ... WITH ROLLUP / GROUPING SETS key) hits the ColumnNullable -> ColumnString wrapper instead of the numeric ColumnVector path, so it was not covered by the existing 04210 cases. Both queries crashed with 'Bad cast from ColumnNullable to ColumnString' on master before this fix and now return the post-rollup Nullable value. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-10T19:45:18Z

Pushed fd5dfbf70a1 adding two String-aggregate cases to 04210 ((SELECT min(c0)) over a String GROUP BY key with WITH ROLLUP, and minOrDefault with GROUPING SETS). These exercise the ColumnNullable -> ColumnString wrapper, distinct from the numeric ColumnVector path the other cases cover. Both crashed with Bad cast from ColumnNullable to ColumnString on master and now return the post-rollup Nullable value.

While reviewing the rework I rebuilt the PR on current master (which now contains the multistage_distributed_queries rewrite #106020, merged after this branch's last rebase) to confirm the analyzer-level fix still applies after the planner changes. Built from a fresh checkout rooted at the worktree sources (not a single-TU relink), the four headline shapes behave as expected:

case	master	this PR on master
`(SELECT c0)` Bool (#91119 headline)	crash	`true`, `\N`
`(SELECT c0)` numeric	ok	ok
`(SELECT sum(number))` over correlated key	crash	ok
`(SELECT min(c0))` String	crash	`a`, `\N`

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. `clickhouse-local --enable_analyzer=1 --allow_experimental_correlated_subqueries=1 --group_by_use_nulls=1 -q "SELECT (SELECT min(c0)) FROM (SELECT 'a'::String) t0(c0) GROUP BY c0 WITH ROLLUP ORDER BY c0 NULLS LAST"` aborts (exit 134, `Bad cast from ColumnNullable to ColumnString`) on master; same for the `GROUPING SETS` variant.
b	Root cause explained?	Same root cause as the rest of the PR: the correlated `String` key kept its non-Nullable type, so the planner built aggregate/CAST wrappers for `ColumnString` while decorrelation fed in the post-rollup `Nullable(String)`. The fix wraps the correlated reference to `Nullable` in the analyzer, so the wrapper signatures match.
c	Fix matches root cause?	Yes. No code change in this commit; it adds coverage for the `SingleValueDataString` aggregate branch the existing cases did not reach.
d	Test intent preserved / new tests added?	New cases only; no existing assertion changed.
e	Both directions demonstrated?	Yes. Both queries crash on master and pass on this branch; the full `04210` reference matches.
f	Fix is general, not a narrow patch?	The underlying fix is type-agnostic (`convertToNullable`); these cases verify the `String` manifestation alongside the numeric and `Bool` ones. Regression set re-run green: `03023`, `04001`, `02343`(+distributed), `02535`, `04210`, `04326` all pass.

Session id: cron:clickhouse-worker-slot-9:20260610-190100

groeneai · 2026-06-22T08:18:22Z

CI finished on this head (ec6b58a). The single red check is not caused by this change.

Integration tests (amd_tsan, 2/6): test_storage_kafka/test_batch_fast.py::test_system_kafka_consumers FAIL (assert_eq_with_retry on system.kafka_consumers content). Chronic flaky, unrelated to this Analyzer change: 8 master + 5 unrelated PR hits in the last 30 days. Tracked separately. Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104350&sha=ec6b58aa965359ca27fcfdd1ec100ddf37c6fce5&name_0=PR&name_1=Integration%20tests%20%28amd_tsan%2C%202%2F6%29

All other checks pass. The fuzzer "Bad cast" (STID 3721-419e) fix in this commit (LowCardinality-aware nullable wrapping in ConstantNode::convertToNullable, plus the regression cases in 04210) is green.

(cron:our-pr-ci-monitor:20260622-080000)

coderashed · 2026-06-28T02:08:14Z

@novikd — verified the current CorrelatedColumnsCollector-driven revision against each of the cases you flagged on the earlier one. Type-level evidence from EXPLAIN QUERY TREE (group_by_use_nulls = 1, allow_experimental_correlated_subqueries = 1):

Your concern	Now
`GROUP BY c % 2` wouldn't wrap (compound key)	`(SELECT c % 2) … GROUP BY c % 2 WITH ROLLUP` → correlated `modulo(c, 2)` resolves to `Nullable(Int16)` ✓
Bare correlated column	`(SELECT c) … GROUP BY c WITH ROLLUP` → correlated `c` resolves to `Nullable(Int32)`; the outer GROUP BY key node stays `Int32` ✓
Aggregate-argument reset looked incorrect	`* APPLY x -> argMax(x, number) … WITH ROLLUP` → `number` inside the aggregate stays `UInt64` (non-`Nullable`) ✓
`if (!column_source) break;`	No longer present — the walk is now driven by `CorrelatedColumnsCollector` rather than a bare `ColumnNode` source, which is also what fixes the compound-key case

The end-to-end planning symptom is gone too: SELECT concat('s', (SELECT toString(number))) FROM numbers(5) GROUP BY GROUPING SETS ((number)) executes cleanly. 04210_91119, 04326_91119, and 04067 all pass on a single-node build.

The existing tests check query results; I put together a complementary one that pins the types the analyzer assigns (passes against this revision, and fails with a clean diff if the correlated type stops being wrapped). Happy to contribute it if useful:

SET enable_analyzer = 1;
SET allow_experimental_correlated_subqueries = 1;
SET group_by_use_nulls = 1;

SELECT '-- bare correlated column ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
    SELECT (SELECT c) FROM (SELECT toInt32(3) AS c) t GROUP BY c WITH ROLLUP)
WHERE explain ILIKE '%column_name: c,%result_type%';

SELECT '-- compound correlated key (modulo) ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
    SELECT (SELECT c % 2) FROM (SELECT toInt32(3) AS c) t GROUP BY c % 2 WITH ROLLUP)
WHERE explain ILIKE '%function_name: modulo%result_type%';

SELECT '-- aggregate argument stays non-Nullable ---';
SELECT replaceRegexpAll(trim(explain), 'id: [0-9]+', 'id: N') AS line
FROM (EXPLAIN QUERY TREE
    SELECT * APPLY x -> argMax(x, number) FROM numbers(1) GROUP BY number WITH ROLLUP)
WHERE explain ILIKE '%column_name: number,%result_type%';

Output on this revision:

-- bare correlated column ---
COLUMN id: N, column_name: c, result_type: Nullable(Int32), source_id: N
COLUMN id: N, column_name: c, result_type: Nullable(Int32), source_id: N
COLUMN id: N, column_name: c, result_type: Int32, source_id: N
-- compound correlated key (modulo) ---
FUNCTION id: N, function_name: modulo, function_type: ordinary, result_type: Nullable(Int16)
FUNCTION id: N, function_name: modulo, function_type: ordinary, result_type: Int16
-- aggregate argument stays non-Nullable ---
COLUMN id: N, column_name: number, result_type: UInt64, source_id: N
COLUMN id: N, column_name: number, result_type: UInt64, source_id: N
COLUMN id: N, column_name: number, result_type: UInt64, source_id: N

groeneai · 2026-06-28T02:31:52Z

Thanks @coderashed, and thanks again for routing #108714 here.

Both tests are welcome and I will fold them into 04210_91119_correlated_subquery_group_by_use_nulls.{sql,reference}, credited to you:

the fuzzer-seed shape SELECT concat('s', (SELECT toString(number))) FROM numbers(5) GROUP BY GROUPING SETS ((number)) (a concat over a correlated scalar subquery on a GROUPING SETS key, which 04210 did not cover yet), and
the type-pinning EXPLAIN QUERY TREE cases.

The type-pinning one is a good addition: 04210 only checks results, so a regression that silently stopped wrapping the correlated key Nullable would still pass there but fail your type diff. The SELECT ... FROM (EXPLAIN QUERY TREE ...) WHERE explain ILIKE '%...%' form (rather than a full-plan reference diff) is also exactly the robust shape we prefer, so I will take it largely as written.

I will add both on the next rebase rather than a coverage-only push now (this branch rebases often for the QueryAnalyzer.cpp area, and the PR is in review). Appreciate the help.

…d-rollup-cast-wrapper # Conflicts: # src/Analyzer/ConstantNode.cpp

@coderashed

Two test cases contributed by @coderashed (who closed his duplicate PR ClickHouse#108714 in favour of this one): 1. The exact AST-fuzzer seed query from ClickHouse#108714: a correlated scalar subquery toString(number) inside concat() over a GROUPING SETS key. Aborts on master with 'Bad cast ColumnNullable to ColumnString', returns s0..s4 with the fix. 2. A type-pinning EXPLAIN QUERY TREE check. The existing cases assert query results, which still pass if a future change silently stops wrapping the correlated key as Nullable (values unchanged, declared type wrong). These pin the analyzer-assigned types via SELECT ... FROM (EXPLAIN QUERY TREE ...) WHERE explain ILIKE '%...%', so dropping the Nullable wrapping fails with a clean diff. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-30T11:42:41Z

Rebased onto master (the branch had drifted to CONFLICTING). The only conflict was in src/Analyzer/ConstantNode.cpp::convertToNullable: master independently landed the same LowCardinality-constant-key fix in a8afbdc ("Fix Bad cast with group_by_use_nulls over a LowCardinality key constant"), so I dropped our copy and kept master's. Net diff is now exactly the correlated-walk in QueryAnalyzer.cpp plus the regression tests. mergeable is back to MERGEABLE.

Folded in @coderashed's two contributed cases (04210), credited:

The Fix logical error in correlated subqueries with group_by_use_nulls #108714 fuzzer-seed shape concat('s', (SELECT toString(number))) ... GROUP BY GROUPING SETS ((number)). Aborts on master (Bad cast ColumnNullable to ColumnString), returns s0..s4 with the fix.
A type-pinning EXPLAIN QUERY TREE check. The existing cases assert results, which still pass if a future change silently stops wrapping the correlated key as Nullable; these pin the analyzer types (SELECT ... FROM (EXPLAIN QUERY TREE ...) WHERE explain ILIKE '%...%'), so a regression fails with a clean diff. On master the bare correlated c shows Int32; with the fix Nullable(Int32).

Both verified both directions against a local debug build (Build ID b5f40416, has applyGroupByUseNullsToExpression) vs a master snapshot (no symbol): crash/wrong-type without, correct with.

…sion (ClickHouse#91119) A code review raised that the `scope_ptr == &scope` fast path in applyGroupByUseNullsToExpression could mis-wrap a correlated outer column as the inner query's same-named local GROUP BY key, because nullable_group_by_keys matches on name+type only (ColumnNode::isEqualImpl ignores column source). Investigated and confirmed this cannot happen: scope.nullable_group_by_keys is populated only for ROLLUP/CUBE/GROUPING SETS queries under group_by_use_nulls (IdentifierResolveScope.cpp), so a plain inner GROUP BY registers no keys and the inner-scope find() is never reached. For the inner scope to hold a colliding entry it would itself have to be ROLLUP/CUBE/GROUPING SETS while correlated, which decorrelation rejects with NOT_IMPLEMENTED. No source change is needed. Add the collision shape as a regression (results plus an EXPLAIN QUERY TREE type pin) so a future over-broad guard cannot silently flip the inner local key to Nullable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-30T17:23:19Z

CI finish ledger — `0596553`

CI fully finished (Finish Workflow: success). This PR fixes a Bad-cast in argMax + APPLY + ROLLUP (analyzer). Every failure has an owner; none is PR-caused.

Check / test	Reason	Owner
Stress test (amd_tsan) + (arm_tsan) / "Hung check failed, possible deadlock found"	deadlock (chronic hung-check family, 600+ distinct PRs / 30d)	#108212 (ours, merged 2026-06-28) / #105905 (ours, open)
Docker server image / distroless-amd64 + alpine-amd64 image test	infra: docker.io base-image resolve/pull failure; fails on master too (1 master + 2 unrelated PRs / 7d)	a fix task is created (investigating at full effort; fixing-PR link to follow here)
AST fuzzer (amd_debug, targeted, old_compatibility) / "Input nodes size mismatch in dag" (STID 4752-5ad8) [prior SHA `9a32bb9`]	crash (chronic trunk AST-fuzzer finding; 8 PRs + 2 master / 30d; not in this PR's analyzer path)	#103088 (ours, open)
Mergeable Check / PR	aggregator	rollup of the rows above

Session id: cron:our-pr-ci-monitor:20260630-163000

groeneai · 2026-06-30T17:48:56Z

Fixing PR for the Docker server image ledger line created: #108969. Root cause was a transient docker.io registry outage (2026-06-30 15:00-15:06 UTC) failing docker buildx while resolving the SBOM scanner image; it recovered on its own and is unrelated to this PR's diff. #108969 hardens the Docker server image job with retries against transient registry errors.

STID 3418-3e98 (AST fuzzer, arm_asan_ubsan) is the statistics-aggregate manifestation of the same correlated-subquery + group_by_use_nulls bug. skewSamp/kurtPop/varSamp/covarSamp and friends share one add() that reads the argument via an unchecked static_cast (AggregateFunctionStatisticsSimple.h), so when the correlated argument is left non-Nullable the failure is a silent heap-buffer-overflow on release builds instead of the clean "Bad cast" exception the min/sum/argMax cases already cover. Add runtime cases over the statistics family (crash without the fix, correct output with it) plus an EXPLAIN QUERY TREE type pin that the correlated statistics-aggregate argument is wrapped Nullable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-07-01T06:48:52Z

Pushed c32cf6e46be: added statistics-aggregate coverage to 04210.

The AST fuzzer surfaced this bug again as STID 3418-3e98 (AST fuzzer (arm_asan_ubsan), report) as a heap-buffer-overflow in AggregateFunctionVarianceSimple::add, i.e. the statistics-aggregate manifestation of this same correlated-subquery + group_by_use_nulls bug. skewSamp/kurtPop/varSamp/covarSamp etc. share one add() that reads the argument via an unchecked static_cast (AggregateFunctionStatisticsSimple.h), unlike the min/sum/argMax cases already here which assert_cast and raise a clean Bad cast. So the same non-Nullable-argument bug is a silent memory-safety fault there rather than an exception.

Minimal repro (aborts on master with ASan, correct with this PR):

SELECT (SELECT skewSamp(a.k)) FROM (SELECT number AS k FROM numbers(64)) AS a
GROUP BY a.k WITH CUBE SETTINGS group_by_use_nulls = 1;

Verified both directions locally (ASan): the new runtime cases (numbers(64), numbers(8) uses the ASan-redzone size, not the PODArray SIMD padding) abort on the master snapshot and return correct results on this branch; the new EXPLAIN QUERY TREE type pin shows the correlated statistics-aggregate argument as Nullable(Int32) with the fix and Int32 on master. Test-only change; the analyzer fix is unchanged.

The correlated subquery returns a tuple whose element is the outer rollup key. On master this aborts with a LOGICAL_ERROR (Unexpected return type from tuple) under group_by_use_nulls + WITH ROLLUP/CUBE; the analyzer fix resolves the tuple element to Nullable so it matches the decorrelated input. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The two coverage commits pushed for this test (the statistics-aggregate crash-guard section at numbers(64) and the ClickHouse#106377 correlated-tuple cases) made 04210 heavy enough to exceed the flaky-check 180s soft cap under debug builds with randomized settings, so the flaky-check job reported 'Test runs too long (> 180s)'. The numbers(64) size is required for the statistics-aggregate out-of-bounds read to land on an ASan redzone, so the coverage must not be reduced. Tag the test long instead: this exempts it from the flaky-check 180s FAIL and runs it at the reduced long_test_runs_ratio repeat count. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…use#91119) The scalar path (applyGroupByUseNullsToExpression) suppresses the group_by_use_nulls Nullable wrapping for both aggregate and grouping() function arguments via hasAggregateOrGroupingFunction(). The sibling APPLY-matcher path in resolveMatcher only suppressed it for aggregate functions, so a query like SELECT * APPLY x -> grouping(x) FROM numbers(1) GROUP BY number WITH ROLLUP SETTINGS group_by_use_nulls = 1 cloned the matched `number` key as Nullable before the lambda was resolved. GroupingFunctionsResolvePass matches its arguments against the original GROUP BY keys, so the rewritten grouping(number) no longer matched and raised a spurious "GROUPING function ... is not in GROUP BY keys" (BAD_ARGUMENTS) exception even though the key is valid. Make the APPLY-tree scan detect grouping() the same way as the scalar path (exact name comparison, matching ExpressionsStack:: isAggregateOrGroupingFunction). Covers both the lambda form (* APPLY x -> grouping(x)) and the function form (* APPLY grouping), and grouping() nested inside other functions, across all GROUP BY shapes (ROLLUP, CUBE, GROUPING SETS). Regression cases added to 04326_91119_apply_lambda_aggregate_rollup. Found by clickhouse-gh[bot] review on PR ClickHouse#104350. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-07-01T17:48:29Z

Pre-PR validation gate for the APPLY-grouping fix (commit 775264e, addressing the bot review on resolveMatcher):

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. `SELECT * APPLY x -> grouping(x) FROM numbers(1) GROUP BY number WITH ROLLUP SETTINGS enable_analyzer=1, group_by_use_nulls=1` -> `Code 36 BAD_ARGUMENTS: GROUPING function argument number is not in GROUP BY keys`, on demand, every run. Same for `* APPLY grouping` and for CUBE / GROUPING SETS.
b	Root cause explained?	The APPLY matcher in `resolveMatcher` scans APPLY expressions and suppresses the `group_by_use_nulls` Nullable-wrapping only when it finds an AGGREGATE function name (`AggregateFunctionFactory::isAggregateFunctionName`). It misses `grouping`, so `* APPLY x -> grouping(x)` does not set the suppression flag; the matched `number` key is cloned Nullable before the lambda resolves. `GroupingFunctionsResolvePass` then compares the rewritten `grouping(number)` argument against the ORIGINAL non-Nullable GROUP BY keys, no longer matches, and raises the spurious BAD_ARGUMENTS.
c	Fix matches root cause?	Yes. The APPLY-tree scan now also detects `grouping` (exact name comparison), so the suppression flag is set for grouping() args exactly as for aggregate args, and the key is left non-Nullable for `GroupingFunctionsResolvePass` to match. It directly closes the aggregate-vs-grouping asymmetry the bot identified, not a symptom.
d	Test intent preserved / new tests added?	Yes. No existing assertion weakened. Added 5 focused regression cases to `04326_91119_apply_lambda_aggregate_rollup` (lambda `grouping(x)`, function-form `grouping`, CUBE, GROUPING SETS, nested `grouping(x)+1`), plus the pre-existing non-aggregate-APPLY sanity case still asserts Nullable wrapping after ROLLUP.
e	Both directions demonstrated?	Yes. Pre-fix binary (Build ID b5f4041648): all cases crash with BAD_ARGUMENTS. Post-fix binary (Build ID 42b4af186875): correct grouping bits (`0`/`1`, nested `1`/`2`), verified against a real memory-capped server and matching the `.reference` exactly.
f	Fix is general across code paths?	Yes. The scalar path already used `hasAggregateOrGroupingFunction()`; this change brings the sibling APPLY-matcher path to parity. Both APPLY forms (lambda and function) and grouping nested inside other functions are covered because the scan walks the whole expression tree. The detection now matches the canonical `ExpressionsStack::isAggregateOrGroupingFunction`.
g	Fix generalizes across inputs?	Yes. Verified across all GROUP BY shapes (ROLLUP / CUBE / GROUPING SETS), both APPLY spellings, top-level and nested grouping, and multi-key groups. Not type-specific (grouping returns a fixed integer bitmask); no type-wrapper matrix applies. Non-grouping/non-aggregate APPLY still wraps Nullable, confirming the suppression is scoped, not blanket.
h	Backward compatible?	Yes. No setting, on-disk/wire format, or default changed. This makes a previously-erroring valid query succeed; behavior when `group_by_use_nulls=0` is unchanged. No `SettingsChangesHistory` entry needed.
i	Invariants and contracts preserved?	Yes. The invariant is that a GROUP BY key referenced by `grouping()` must reach `GroupingFunctionsResolvePass` in its original (non-Nullable) form; the fix upholds it on the APPLY path. The Nullable wrapping of genuine projection keys after ROLLUP/CUBE is unchanged (non-aggregate APPLY sanity case still returns the `\N` totals row). No concurrency or ownership contract touched (analyzer-only, single-threaded resolution).

Session id: cron:clickhouse-worker-slot-2:20260701-172600

clickhouse-gh · 2026-07-01T21:31:59Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	85.40%	85.50%	+0.10%
Functions	92.60%	92.60%	+0.00%
Branches	77.60%	77.70%	+0.10%

Changed lines: Changed C/C++ lines covered: 57/61 (93.44%) · Uncovered code

Full report · Diff report

groeneai · 2026-07-01T22:48:11Z

CI finish ledger — `775264e`

CI fully finished (Finish Workflow: success, Mergeable Check: success). This PR fixes a Bad-cast in correlated subquery + GROUP BY ROLLUP under group_by_use_nulls (analyzer). The one red leaf is a chronic trunk AST-fuzzer finding, not caused by this PR, and it has an owner.

Check / test	Reason	Owner
AST fuzzer (amd_debug, targeted, old_compatibility) / "Input nodes size mismatch in dag" (STID 4752-5ad8)	chronic trunk AST-fuzzer finding in the join-order optimizer DAG (JoinExpressionActions), surfaced when the fuzzer mutates this PR's new correlated-subquery test queries into GROUPING SETS shapes. 6 unrelated PRs + 1 master hit / 30d; not in this PR's analyzer diff. #106418 (merged 2026-06-04, present on this branch) did not seal this manifestation.	a fix task is moved to pending (investigating at full effort; fixing-PR link to follow here)
PR / Mergeable Check	aggregator	rollup of the row above
CH Inc sync	private sync	CH Inc sync (private, not actionable)

Session id: cron:our-pr-ci-monitor:20260701-223000

groeneai · 2026-07-01T23:34:11Z

CI finish ledger update — STID 4752-5ad8 owner resolved to a fixing PR

The AST-fuzzer "Input nodes size mismatch in dag" / "Cannot find input column ... on its position in inputs of expression actions DAG" leaf (STID 4752-5ad8) now resolves to a merged fixing PR, not a fix task.

Fixing PR: #107112 ("Fix LOGICAL_ERROR in decorrelated correlated subquery join with duplicate columns"), merged 2026-07-01 20:20 UTC.

Root cause (confirmed by local reproduction): a correlated scalar subquery is decorrelated into a JOIN; when its input side projects the same column identifier twice and correlated_subqueries_default_join_kind = 'left' swaps that side onto the left join child, the join's ActionsDAG keeps a single input for the duplicated name (the unreferenced duplicate input is pruned) while the child step still produces two same-named columns. buildPhysicalJoinImpl (JoinStepLogical.cpp:1313) popped from a per-name deque, underflowed on the second column, and aborted the server. #107112 keeps the last remaining DAG input node when the deque is exhausted.

Deterministic minimal trigger (2 settings):

SET allow_experimental_correlated_subqueries = 1;
SET correlated_subqueries_default_join_kind = 'left';
SELECT x, (SELECT y) FROM (SELECT 1 AS x, 2 AS y, x, y QUALIFY 7) LIMIT 100;

Verified locally: server aborts 5/5 without #107112, runs clean with it. The old_compatibility AST fuzzer variant sets --compatibility=24.3, which flips correlated_subqueries_default_join_kind to left, which is why this surfaced when the fuzzer mutated this PR's correlated-subquery test queries.

Why this branch still saw the crash: this PR's failing run (commit 775264e6b39, 2026-07-01 18:50 UTC) predates the #107112 merge (20:20 UTC), so the build did not yet contain the fix. #106418 (the earlier overlapping-column-names guard) does not cover this path because buildPhysicalJoin runs unconditionally, not only on the reorder path it guards. Merging master into this branch picks up #107112; the leaf should not recur on the next CI run.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104350&sha=775264e6b397184709c2c985f9115416572c8740&name_0=PR&name_1=AST%20fuzzer%20%28amd_debug%2C%20targeted%2C%20old_compatibility%29

Session id: cron:clickhouse-worker-slot-1:20260701-225000

groeneai · 2026-07-03T05:01:46Z

CI finish ledger — `775264e`

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Session id: cron:our-pr-ci-monitor:20260703-043000

Algunenano added the can be tested Allows running workflows for external contributors label May 8, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 8, 2026

Merge commit '3fa1ed2d9d9cc4f472cd43e2e4eaabac12ec36db' into HEAD

30c6027

clickhouse-gh Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated

clickhouse-gh Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated

groeneai mentioned this pull request Jun 8, 2026

Aggregate function combinator -Tuple #98190

Open

1 task

clickhouse-gh Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp Outdated

groeneai mentioned this pull request Jun 9, 2026

AST JSON serialization #100412

Open

alexey-milovidov and others added 2 commits June 9, 2026 08:07

Merge remote-tracking branch 'origin/master' into fix-91119-correlate…

e487412

…d-rollup-cast-wrapper

novikd reviewed Jun 9, 2026

View reviewed changes

novikd requested changes Jun 9, 2026

View reviewed changes

This was referenced Jun 26, 2026

Fix JIT crash on arithmetic in correlated subquery with group_by_use_nulls #108558

Closed

Fix NOT_FOUND_COLUMN_IN_BLOCK for virtual columns under parallel replicas #108451

Merged

coderashed mentioned this pull request Jun 27, 2026

Fix logical error in correlated subqueries with group_by_use_nulls #108714

Closed

This was referenced Jun 28, 2026

Fix use-after-free in SYSTEM FLUSH DISTRIBUTED #108684

Merged

Remove obsolete CustomResourceManager in favor of WorkloadResourceManager #108286

Open

coderashed mentioned this pull request Jun 28, 2026

Add skewPopStable, skewSampStable, kurtPopStable, kurtSampStable aggregate functions #108685

Open

groeneai mentioned this pull request Jun 29, 2026

Fix server crash on lazy remote source with unbuilt materialized CTE #108547

Merged

groeneai and others added 2 commits June 30, 2026 11:31

Merge remote-tracking branch 'origin/master' into fix-91119-correlate…

6fdb42b

…d-rollup-cast-wrapper # Conflicts: # src/Analyzer/ConstantNode.cpp

clickhouse-gh Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp

groeneai mentioned this pull request Jun 30, 2026

Logical error: Bad cast from type A to B (STID: 3721-40ef) #108931

Open

groeneai mentioned this pull request Jun 30, 2026

Inherit engine and merge settings for CREATE TABLE AS with storage clauses #108085

Merged

groeneai and others added 2 commits July 1, 2026 14:03

clickhouse-gh Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

groeneai commented May 8, 2026

Uh oh!

groeneai commented May 8, 2026

Uh oh!

groeneai commented May 8, 2026

Uh oh!

clickhouse-gh Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

Uh oh!

groeneai commented Jun 4, 2026

CI Status — covered SHA 30c60275

Uh oh!

groeneai commented Jun 8, 2026

Pre-PR validation gate

Uh oh!

Uh oh!

groeneai commented Jun 8, 2026

CI Ledger - covered SHA: 6f52db2

Uh oh!

alexey-milovidov commented Jun 9, 2026

Uh oh!

groeneai commented Jun 9, 2026

CI Ledger — covered SHA: 62084af978d84d75e9f27292fa61bbcd70c1cb42

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

groeneai commented Jun 10, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 10, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 10, 2026

CI Ledger — covered SHA: e8d62a696678d0d70504e09e17a60b1ac636ea21

Uh oh!

groeneai commented Jun 10, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 22, 2026

Uh oh!

coderashed commented Jun 28, 2026

Uh oh!

groeneai commented Jun 28, 2026

Uh oh!

groeneai commented Jun 30, 2026

Uh oh!

Uh oh!

groeneai commented Jun 30, 2026

CI finish ledger — 0596553

Uh oh!

groeneai commented Jun 30, 2026

groeneai commented May 8, 2026 •

edited

Loading

clickhouse-gh Bot commented May 8, 2026 •

edited

Loading

CI Status — covered SHA `30c60275`

CI Ledger - covered SHA: `6f52db2`

CI Ledger — covered SHA: `62084af978d84d75e9f27292fa61bbcd70c1cb42`

CI Ledger — covered SHA: `e8d62a696678d0d70504e09e17a60b1ac636ea21`

CI finish ledger — `0596553`

CI finish ledger — `775264e`

CI finish ledger — `775264e`