Use exact comparison for nullable GROUP BY keys, remove compare_types#107050
Conversation
When `group_by_use_nulls` is enabled, expressions equal to GROUP BY keys were matched with a comparison that ignored types (`QueryTreeNodePtrWithHashIgnoreTypes`), because by the time an expression is checked against `nullable_group_by_keys`, its subexpressions that are themselves GROUP BY keys have already been converted to `Nullable`, so the types no longer match the stored original key. Instead, store in `nullable_group_by_keys` every shape in which an expression equal to a key can arrive at the lookup: the original form, the form with every maximal proper sub-key converted to `Nullable` (with the types of the nodes above recomputed via `rerunFunctionResolve`), and the key converted to `Nullable` itself. All shapes map to the original key node, and the lookup uses exact comparison, including types. This allows removing the `compare_types` option from `IQueryTreeNode::CompareOptions` entirely, together with the half-dead ignore-types branch in `ConstantNode::isEqualImpl`. The type-blind comparison could conflate constants with equal values but different types or source expressions. For example, in `04001_logical_error_with_cube` the constant `1 AS a` inside a lambda was replaced with a clone of a different key (`isNullable(2) IS NOT NULL`, which folds to the same constant `1`), and the mangled projection then constant-folded instead of reading the per-grouping-set key column. The test reference captured this incorrect result; with exact matching the projection correctly reads the key column, producing an additional row with the default key value for the grouping set where the key is not aggregated. The exact matching now correctly replaces such projections with a clone of the original key, which exposed a gap in the planner: the already-computed-expression check in `PlannerActionsVisitor::visitFunction` recognized only `INPUT` nodes, while a constant-foldable GROUP BY key column arrives as a constant `COLUMN` node (constant inputs are duplicated as `COLUMN` nodes by the `ActionsDAG` constructor). Visiting the arguments of such a function builds a lambda capture over the original-typed constant, which collides with the `Nullable`-wrapped key constant of the same name and fails with a `Cannot capture column ... incompatible type` exception. Extend the check to also recognize constant `COLUMN` source nodes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
aae3c17 to
59583e1
Compare
The check_log retry waited only for any span with parent_span_id=0x73 to appear before running the assertions. The ===http=== flow produces two such gateway spans (a local HTTPHandler and a remote TCPHandler), each flushed to opentelemetry_span_log by an independent background thread. Under the amd_tsan parallel runner the remote TCPHandler could be flushed first, satisfying the retry condition while the HTTPHandler and the initial 'select 1 from remote(...)' query span it parents were not yet in the table. The assertions then observed 'total spans': 2 instead of 3, or 'initial query spans with proper parent': 1 instead of 2. Poll instead until the exact quantities the assertions read are present: the total number of 'query' spans and the number of initial query spans with a proper parent, passed per flow (3 2 for HTTP, 3 1 for native). The asserted output is unchanged, so the .reference file is untouched. Confirmed root cause from CI failures on PRs ClickHouse#107050 and ClickHouse#101769 (Stateless tests (amd_tsan, parallel, 2/2)). Follow-up to ClickHouse#98552, which fixes issues ClickHouse#67108 and ClickHouse#93452. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…y_use_nulls A projection equal to a GROUP BY key is replaced with a Nullable clone of the key from scope.nullable_group_by_keys. Two constants with equal value and type but different source expressions collide in that map, since ConstantNode equality and hashing ignore the source expression. The lookup cloned the stored key, which could be a different colliding constant, giving the projection the wrong source expression - and the source expression determines the planner action node name, hence which aggregation key column the projection reads. Clone the matched node itself for constants, preserving its own source expression; non-constant keys still clone the stored canonical key so a partially Nullable-converted shape keeps the matching action node name. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| } | ||
|
|
||
| auto node_type = node->getNodeType(); | ||
| if (node_type != QueryTreeNodeType::FUNCTION && node_type != QueryTreeNodeType::LAMBDA && node_type != QueryTreeNodeType::LIST) |
There was a problem hiding this comment.
convertNestedGroupByKeysToNullable stops at COLUMN nodes unless the column itself is a GROUP BY key, but ColumnNode can carry an expression child for table ALIAS columns. That expression is resolved after nullable_group_by_keys has been registered, so a nested key inside the alias expression is converted to Nullable and produces a shape that was never registered here. A concrete case is an ALIAS column a ALIAS number + 1 with SELECT a + 1 FROM t GROUP BY number, a + 1 WITH ROLLUP SETTINGS group_by_use_nulls = 1: the projection resolves a with number converted to Nullable, while the registered a + 1 key still contains the original non-nullable alias expression, so the exact lookup misses and the expression is no longer treated as the grouping key. Please recurse through ColumnNode expressions, or otherwise register this converted alias-column shape, and add a regression test for a table ALIAS key that contains another key.
There was a problem hiding this comment.
Looks like it was already an existing issue. Claude tells:
Good catch — this is a real bug, but it predates this PR. The alias-column case fails identically on master (verified on a near-master build, 26.6.1.1), so it isn't introduced by the switch to exact matching here.
Minimal repro (throws NOT_AN_AGGREGATE: Column 'default.t.y' is not under aggregate function and not in GROUP BY keys on both master and this branch):
SET enable_analyzer = 1, group_by_use_nulls = 1;
CREATE TABLE t (x UInt64, y UInt64 ALIAS x + 1) ENGINE = MergeTree ORDER BY x;
INSERT INTO t (x) VALUES (10);
SELECT x, y FROM t GROUP BY x, y WITH CUBE;
The function-key analog (GROUP BY x, x + 1 WITH CUBE) works on both, so this is specifically about an ALIAS column key whose expression contains another grouped key: under group_by_use_nulls the projection's copy gets the nested key
converted to Nullable, yielding a shape that isn't registered (the converted-sub-keys shape is only built for FunctionNode keys, and convertNestedGroupByKeysToNullable doesn't recurse into ColumnNode expressions). A proper fix also
needs consistent alias-column type handling, which is orthogonal to this PR's "exact comparison" change.
I'd prefer to keep this PR focused and track the alias-column case as a separate issue (with the repro above) rather than expand scope here.
When the initialization of a TCP connection failed — for example, when the allocation of the connection buffers in `TCPHandler::runImpl` threw `MEMORY_LIMIT_EXCEEDED` because the total server memory limit was reached — the exception propagated out of the handler and the socket was closed with the client's `Hello` packet still unread in the receive queue. Closing a socket with pending unread data makes the kernel send RST, so the client observed `Connection reset by peer` without any explanation (and RST can also discard data written to the socket earlier). Now the exception is written into the socket directly, without the connection buffers: the buffer for the exception packet is allocated on the stack, and the error handling path is exempt from the memory limit (it still allocates the strings of the exception message). The client has not received `Hello` yet, so the communication is not chunked, and the client handles an exception packet in place of the `Hello` response (this is also how authentication errors are delivered). The pending input is read out before closing the socket, so the connection is terminated gracefully with FIN and the client reliably receives the exception. The stack trace is not sent: the message already contains the relevant details, such as the current memory limits. The failure is reproduced with the new `tcp_handler_fail_connection_setup` fail point. An example of such a failure in CI (the test failed with `Code: 210. DB::NetException: Connection reset by peer` while the server hit the total memory limit during connection setup): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107050&sha=59583e112fa696c67340daaf3137e40bb8f63982&name_0=PR&name_1=Stateless+tests+%28arm_asan_ubsan%2C+azure%2C+parallel%29 (from ClickHouse#107050) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
| */ | ||
| void QueryAnalyzer::resolveGroupByNode(QueryNode & query_node_typed, IdentifierResolveScope & scope) | ||
| { | ||
| QueryTreeNodes nullable_group_by_keys; |
| auto not_nullable_rhs_column = removeNullable(rhs_column); | ||
|
|
||
| return not_nullable_column->compareAt(0, 0, *not_nullable_rhs_column, 1) == 0; | ||
| return constant_value.getType()->equals(*rhs_typed.constant_value.getType()) |
There was a problem hiding this comment.
Check for NULL is missing now. Is it safe?
| */ | ||
| bool is_input_node = function_node.isAggregateFunction() || function_node.isWindowFunction() | ||
| || actions_stack.front().containsInputNode(function_node_name); | ||
| || actions_stack.front().containsInputOrConstantNode(function_node_name); |
There was a problem hiding this comment.
Seems to be solving a separate problem
| /// in value and type but with different source expressions share a single map entry, | ||
| /// and the source expression determines the action node name (hence which aggregation | ||
| /// key column the projection reads), so the matched node's own one must be preserved. | ||
| node = (node->getNodeType() == QueryTreeNodeType::CONSTANT ? node : it->second)->clone(); |
There was a problem hiding this comment.
This preserves the matched node only when the top-level node is a CONSTANT, but the same source-expression collision can happen below a non-constant key. For example, GROUP BY keys like number + toUInt8(1) and number + CAST(1, 'UInt8') have FunctionNode roots; their constant children compare/hash equal in ConstantNode::isEqualImpl because source_expression is ignored, so registerNullableGroupByKeys keeps only the first key in the map. When the projection matching the second key reaches this line, it clones it->second because the root is a FUNCTION, replacing the matched key's constant source expression with the first key's source. On paths where planner action names use constant source expressions (secondary queries or disabled AST-level optimizations), the projection can then read the wrong per-grouping-set key column, which is the same invariant the top-level constant special case is trying to protect.
Please preserve source-expression identity for the whole matched key, not only when the root is a CONSTANT (or include the source expression in the key used by nullable_group_by_keys), and add a regression test with same-typed folded constants inside a non-constant GROUP BY key.
This reverts commit d5a026d.
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered by tests: 103/114 (90.35%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 5 line(s) · Uncovered code |

Claude
When
group_by_use_nullsis enabled, expressions equal to GROUP BY keys were matched with a comparison that ignored types(
QueryTreeNodePtrWithHashIgnoreTypes), because by the time an expression is checked againstnullable_group_by_keys, its subexpressions that are themselves GROUP BY keys have already been converted toNullable, so the types no longer match the stored original key.Instead, store in
nullable_group_by_keysevery shape in which an expression equal to a key can arrive at the lookup: the original form, the form with every maximal proper sub-key converted toNullable(with the types of the nodes above recomputed viarerunFunctionResolve), and the key converted toNullableitself. All shapes map to the original key node, and the lookup uses exact comparison, including types. This allows removing thecompare_typesoption fromIQueryTreeNode::CompareOptionsentirely, together with the half-dead ignore-types branch inConstantNode::isEqualImpl.The type-blind comparison could conflate constants with equal values but different types or source expressions. For example, in
04001_logical_error_with_cubethe constant1 AS ainside a lambda was replaced with a clone of a different key (isNullable(2) IS NOT NULL, which folds to the same constant1), and the mangled projection then constant-folded instead of reading the per-grouping-set key column. The test reference captured this incorrect result; with exact matching the projection correctly reads the key column, producing an additional row with the default key value for the grouping set where the key is not aggregated.The exact matching now correctly replaces such projections with a clone of the original key, which exposed a gap in the planner: the already-computed-expression check in
PlannerActionsVisitor::visitFunctionrecognized onlyINPUTnodes, while a constant-foldable GROUP BY key column arrives as a constantCOLUMNnode (constant inputs are duplicated asCOLUMNnodes by theActionsDAGconstructor). Visiting the arguments of such a function builds a lambda capture over the original-typed constant, which collides with theNullable-wrapped key constant of the same name and fails with aCannot capture column ... incompatible typeexception. Extend the check to also recognize constantCOLUMNsource nodes.Removing
QueryTreeNodePtrWithHashIgnoreTypesandignore_typecomparison.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Version info
26.6.1.1033