Use exact comparison for nullable GROUP BY keys, remove `compare_types` by KochetovNicolai · Pull Request #107050 · ClickHouse/ClickHouse · GitHub
Skip to content

Use exact comparison for nullable GROUP BY keys, remove compare_types#107050

Merged
KochetovNicolai merged 7 commits into
masterfrom
group-by-use-nulls-typed-comparison
Jun 19, 2026
Merged

Use exact comparison for nullable GROUP BY keys, remove compare_types#107050
KochetovNicolai merged 7 commits into
masterfrom
group-by-use-nulls-typed-comparison

Conversation

@KochetovNicolai

@KochetovNicolai KochetovNicolai commented Jun 10, 2026

Copy link
Copy Markdown
Member
Claude

When group_by_use_nulls is enabled, expressions equal to GROUP BY keys were matched with a comparison that ignored types
(QueryTreeNodePtrWithHashIgnoreTypes), because by the time an expression is checked against nullable_group_by_keys, its subexpressions that are themselves GROUP BY keys have already been converted to Nullable, so the types no longer match the stored original key.

Instead, store in nullable_group_by_keys every shape in which an expression equal to a key can arrive at the lookup: the original form, the form with every maximal proper sub-key converted to Nullable (with the types of the nodes above recomputed via rerunFunctionResolve), and the key converted to Nullable itself. All shapes map to the original key node, and the lookup uses exact comparison, including types. This allows removing the compare_types option from IQueryTreeNode::CompareOptions entirely, together with the half-dead ignore-types branch in ConstantNode::isEqualImpl.

The type-blind comparison could conflate constants with equal values but different types or source expressions. For example, in 04001_logical_error_with_cube the constant 1 AS a inside a lambda was replaced with a clone of a different key (isNullable(2) IS NOT NULL, which folds to the same constant 1), and the mangled projection then constant-folded instead of reading the per-grouping-set key column. The test reference captured this incorrect result; with exact matching the projection correctly reads the key column, producing an additional row with the default key value for the grouping set where the key is not aggregated.

The exact matching now correctly replaces such projections with a clone of the original key, which exposed a gap in the planner: the already-computed-expression check in PlannerActionsVisitor::visitFunction recognized only INPUT nodes, while a constant-foldable GROUP BY key column arrives as a constant COLUMN node (constant inputs are duplicated as COLUMN nodes by the ActionsDAG constructor). Visiting the arguments of such a function builds a lambda capture over the original-typed constant, which collides with the Nullable-wrapped key constant of the same name and fails with a Cannot capture column ... incompatible type exception. Extend the check to also recognize constant COLUMN source nodes.

Removing QueryTreeNodePtrWithHashIgnoreTypes and ignore_type comparison.

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Version info

  • Merged into: 26.6.1.1033

@clickhouse-gh

clickhouse-gh Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 10, 2026
Comment thread src/Analyzer/ConstantNode.cpp
When `group_by_use_nulls` is enabled, expressions equal to GROUP BY keys
were matched with a comparison that ignored types
(`QueryTreeNodePtrWithHashIgnoreTypes`), because by the time an expression
is checked against `nullable_group_by_keys`, its subexpressions that are
themselves GROUP BY keys have already been converted to `Nullable`, so the
types no longer match the stored original key.

Instead, store in `nullable_group_by_keys` every shape in which an
expression equal to a key can arrive at the lookup: the original form, the
form with every maximal proper sub-key converted to `Nullable` (with the
types of the nodes above recomputed via `rerunFunctionResolve`), and the
key converted to `Nullable` itself. All shapes map to the original key
node, and the lookup uses exact comparison, including types. This allows
removing the `compare_types` option from `IQueryTreeNode::CompareOptions`
entirely, together with the half-dead ignore-types branch in
`ConstantNode::isEqualImpl`.

The type-blind comparison could conflate constants with equal values but
different types or source expressions. For example, in
`04001_logical_error_with_cube` the constant `1 AS a` inside a lambda was
replaced with a clone of a different key (`isNullable(2) IS NOT NULL`,
which folds to the same constant `1`), and the mangled projection then
constant-folded instead of reading the per-grouping-set key column. The
test reference captured this incorrect result; with exact matching the
projection correctly reads the key column, producing an additional row
with the default key value for the grouping set where the key is not
aggregated.

The exact matching now correctly replaces such projections with a clone of
the original key, which exposed a gap in the planner: the
already-computed-expression check in `PlannerActionsVisitor::visitFunction`
recognized only `INPUT` nodes, while a constant-foldable GROUP BY key
column arrives as a constant `COLUMN` node (constant inputs are duplicated
as `COLUMN` nodes by the `ActionsDAG` constructor). Visiting the arguments
of such a function builds a lambda capture over the original-typed
constant, which collides with the `Nullable`-wrapped key constant of the
same name and fails with a `Cannot capture column ... incompatible type`
exception. Extend the check to also recognize constant `COLUMN` source
nodes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@KochetovNicolai KochetovNicolai force-pushed the group-by-use-nulls-typed-comparison branch from aae3c17 to 59583e1 Compare June 11, 2026 10:57
sebver pushed a commit to sebver/ClickHouse that referenced this pull request Jun 12, 2026
The check_log retry waited only for any span with parent_span_id=0x73 to
appear before running the assertions. The ===http=== flow produces two such
gateway spans (a local HTTPHandler and a remote TCPHandler), each flushed to
opentelemetry_span_log by an independent background thread. Under the amd_tsan
parallel runner the remote TCPHandler could be flushed first, satisfying the
retry condition while the HTTPHandler and the initial 'select 1 from
remote(...)' query span it parents were not yet in the table. The assertions
then observed 'total spans': 2 instead of 3, or 'initial query spans with
proper parent': 1 instead of 2.

Poll instead until the exact quantities the assertions read are present: the
total number of 'query' spans and the number of initial query spans with a
proper parent, passed per flow (3 2 for HTTP, 3 1 for native). The asserted
output is unchanged, so the .reference file is untouched.

Confirmed root cause from CI failures on PRs ClickHouse#107050 and ClickHouse#101769
(Stateless tests (amd_tsan, parallel, 2/2)). Follow-up to ClickHouse#98552, which fixes
issues ClickHouse#67108 and ClickHouse#93452.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…y_use_nulls

A projection equal to a GROUP BY key is replaced with a Nullable clone of the
key from scope.nullable_group_by_keys. Two constants with equal value and type
but different source expressions collide in that map, since ConstantNode
equality and hashing ignore the source expression. The lookup cloned the stored
key, which could be a different colliding constant, giving the projection the
wrong source expression - and the source expression determines the planner
action node name, hence which aggregation key column the projection reads.

Clone the matched node itself for constants, preserving its own source
expression; non-constant keys still clone the stored canonical key so a
partially Nullable-converted shape keeps the matching action node name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
}

auto node_type = node->getNodeType();
if (node_type != QueryTreeNodeType::FUNCTION && node_type != QueryTreeNodeType::LAMBDA && node_type != QueryTreeNodeType::LIST)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convertNestedGroupByKeysToNullable stops at COLUMN nodes unless the column itself is a GROUP BY key, but ColumnNode can carry an expression child for table ALIAS columns. That expression is resolved after nullable_group_by_keys has been registered, so a nested key inside the alias expression is converted to Nullable and produces a shape that was never registered here. A concrete case is an ALIAS column a ALIAS number + 1 with SELECT a + 1 FROM t GROUP BY number, a + 1 WITH ROLLUP SETTINGS group_by_use_nulls = 1: the projection resolves a with number converted to Nullable, while the registered a + 1 key still contains the original non-nullable alias expression, so the exact lookup misses and the expression is no longer treated as the grouping key. Please recurse through ColumnNode expressions, or otherwise register this converted alias-column shape, and add a regression test for a table ALIAS key that contains another key.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it was already an existing issue. Claude tells:

  Good catch — this is a real bug, but it predates this PR. The alias-column case fails identically on master (verified on a near-master build, 26.6.1.1), so it isn't introduced by the switch to exact matching here.
  
  Minimal repro (throws NOT_AN_AGGREGATE: Column 'default.t.y' is not under aggregate function and not in GROUP BY keys on both master and this branch):
  
  SET enable_analyzer = 1, group_by_use_nulls = 1;
  CREATE TABLE t (x UInt64, y UInt64 ALIAS x + 1) ENGINE = MergeTree ORDER BY x;
  INSERT INTO t (x) VALUES (10);
  SELECT x, y FROM t GROUP BY x, y WITH CUBE; 
  
  The function-key analog (GROUP BY x, x + 1 WITH CUBE) works on both, so this is specifically about an ALIAS column key whose expression contains another grouped key: under group_by_use_nulls the projection's copy gets the nested key
  converted to Nullable, yielding a shape that isn't registered (the converted-sub-keys shape is only built for FunctionNode keys, and convertNestedGroupByKeysToNullable doesn't recurse into ColumnNode expressions). A proper fix also
  needs consistent alias-column type handling, which is orthogonal to this PR's "exact comparison" change.
  
  I'd prefer to keep this PR focused and track the alias-column case as a separate issue (with the repro above) rather than expand scope here.

@KochetovNicolai KochetovNicolai marked this pull request as ready for review June 12, 2026 18:37
@novikd novikd self-assigned this Jun 12, 2026
rorylshanks pushed a commit to rorylshanks/ClickHouse that referenced this pull request Jun 16, 2026
When the initialization of a TCP connection failed — for example, when
the allocation of the connection buffers in `TCPHandler::runImpl` threw
`MEMORY_LIMIT_EXCEEDED` because the total server memory limit was
reached — the exception propagated out of the handler and the socket
was closed with the client's `Hello` packet still unread in the receive
queue. Closing a socket with pending unread data makes the kernel send
RST, so the client observed `Connection reset by peer` without any
explanation (and RST can also discard data written to the socket
earlier).

Now the exception is written into the socket directly, without the
connection buffers: the buffer for the exception packet is allocated on
the stack, and the error handling path is exempt from the memory limit
(it still allocates the strings of the exception message). The client
has not received `Hello` yet, so the communication is not chunked, and
the client handles an exception packet in place of the `Hello` response
(this is also how authentication errors are delivered). The pending
input is read out before closing the socket, so the connection is
terminated gracefully with FIN and the client reliably receives the
exception. The stack trace is not sent: the message already contains
the relevant details, such as the current memory limits.

The failure is reproduced with the new `tcp_handler_fail_connection_setup`
fail point.

An example of such a failure in CI (the test failed with `Code: 210.
DB::NetException: Connection reset by peer` while the server hit the
total memory limit during connection setup):
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107050&sha=59583e112fa696c67340daaf3137e40bb8f63982&name_0=PR&name_1=Stateless+tests+%28arm_asan_ubsan%2C+azure%2C+parallel%29
(from ClickHouse#107050)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread src/Analyzer/Resolve/QueryAnalyzer.cpp
*/
void QueryAnalyzer::resolveGroupByNode(QueryNode & query_node_typed, IdentifierResolveScope & scope)
{
QueryTreeNodes nullable_group_by_keys;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reserve can be called here

auto not_nullable_rhs_column = removeNullable(rhs_column);

return not_nullable_column->compareAt(0, 0, *not_nullable_rhs_column, 1) == 0;
return constant_value.getType()->equals(*rhs_typed.constant_value.getType())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for NULL is missing now. Is it safe?

*/
bool is_input_node = function_node.isAggregateFunction() || function_node.isWindowFunction()
|| actions_stack.front().containsInputNode(function_node_name);
|| actions_stack.front().containsInputOrConstantNode(function_node_name);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be solving a separate problem

/// in value and type but with different source expressions share a single map entry,
/// and the source expression determines the action node name (hence which aggregation
/// key column the projection reads), so the matched node's own one must be preserved.
node = (node->getNodeType() == QueryTreeNodeType::CONSTANT ? node : it->second)->clone();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This preserves the matched node only when the top-level node is a CONSTANT, but the same source-expression collision can happen below a non-constant key. For example, GROUP BY keys like number + toUInt8(1) and number + CAST(1, 'UInt8') have FunctionNode roots; their constant children compare/hash equal in ConstantNode::isEqualImpl because source_expression is ignored, so registerNullableGroupByKeys keeps only the first key in the map. When the projection matching the second key reaches this line, it clones it->second because the root is a FUNCTION, replacing the matched key's constant source expression with the first key's source. On paths where planner action names use constant source expressions (secondary queries or disabled AST-level optimizations), the projection can then read the wrong per-grouping-set key column, which is the same invariant the top-level constant special case is trying to protect.

Please preserve source-expression identity for the whole matched key, not only when the root is a CONSTANT (or include the source expression in the key used by nullable_group_by_keys), and add a regression test with same-typed folded constants inside a non-constant GROUP BY key.

This reverts commit d5a026d.
@clickhouse-gh

clickhouse-gh Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.20% 85.20% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.50% 77.50% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 103/114 (90.35%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 5 line(s) · Uncovered code

Full report · Diff report

@KochetovNicolai KochetovNicolai added this pull request to the merge queue Jun 19, 2026
Merged via the queue into master with commit 157acd8 Jun 19, 2026
326 of 328 checks passed
@KochetovNicolai KochetovNicolai deleted the group-by-use-nulls-typed-comparison branch June 19, 2026 13:16
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants