Fix JIT crash on arithmetic in correlated subquery with group_by_use_nulls by groeneai · Pull Request #108558 · ClickHouse/ClickHouse · GitHub

Fix JIT crash on arithmetic in correlated subquery with group_by_use_nulls#108558

Closed
groeneai wants to merge 1 commit into ClickHouse:master from groeneai:fix-jit-correlated-subquery-group-by-use-nulls

@groeneai

Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a crash (Cannot create binary operator with two operands of differing type) when JIT expression compilation (compile_expressions=1) is used for a correlated scalar subquery whose body has an arithmetic expression over the outer GROUP BY key, evaluated with group_by_use_nulls=1 and WITH CUBE/ROLLUP.

Description

Found by the AST fuzzer (STID 1176-28cc) in the AST fuzzer (amd_debug) CI job.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/107318/7aa7dc7aa719bcad6ad9683348286fec0b6f07f0/ast_fuzzer_amd_debug/fatal.log

Reproducer:

SELECT (SELECT 1 - (number - 2)) FROM numbers(1) GROUP BY number WITH CUBE
SETTINGS compile_expressions = 1, min_count_to_compile_expression = 0, group_by_use_nulls = 1, allow_experimental_correlated_subqueries = 1;

Root cause: under group_by_use_nulls the outer GROUP BY key becomes Nullable, and correlated-subquery decorrelation feeds that Nullable(UInt64) value into the subquery's minus nodes, which were resolved (and keep their result type) for the non-Nullable UInt64 key. The resulting ActionsDAG is internally inconsistent: it contains a function node whose child type no longer matches the type its function_base was resolved for. The interpreter tolerates this and returns the correct result, but the expression JIT bakes in the declared node types and emits an LLVM binary operator with mismatched operand types, which trips the assertion in llvm::BinaryOperator::Create in debug builds.

Fix: in isCompilableFunction, skip JIT for a function node when a child's actual result_type differs from the function's resolved getArgumentTypes()[i]. Such a node cannot be faithfully JIT-compiled; it falls back to the interpreter, which is correct. JIT for well-formed expressions is unaffected. Added a regression test (04365_jit_correlated_subquery_group_by_use_nulls).
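For illustration only, the shape of the guard described above can be sketched on a toy model. This is not the code from the PR: ToyNode, argumentTypesMatchResolution and isCompilableToy are hypothetical names, types are compared as plain strings, and the real check operates on ActionsDAG::Node and the function's getArgumentTypes().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a DAG node; not ClickHouse's ActionsDAG::Node.
struct ToyNode
{
    std::string result_type;                           // type the node actually produces
    std::vector<const ToyNode *> children;             // argument nodes
    std::vector<std::string> resolved_argument_types;  // types the function was resolved for
};

// Returns true only when every child's actual result type equals the type
// the function was resolved for at the same argument position.
bool argumentTypesMatchResolution(const ToyNode & node)
{
    if (node.children.size() != node.resolved_argument_types.size())
        return false;

    for (std::size_t i = 0; i < node.children.size(); ++i)
    {
        assert(node.children[i] != nullptr);
        if (node.children[i]->result_type != node.resolved_argument_types[i])
            return false;
    }
    return true;
}

// Toy compilability check: a node whose children no longer match the resolved
// argument types is rejected, i.e. it would be left to the interpreter.
bool isCompilableToy(const ToyNode & node)
{
    return argumentTypesMatchResolution(node);
}
```

In this model, a minus node resolved for (UInt64, UInt8) is accepted while its first child still produces UInt64, and rejected once that child's type is changed to Nullable(UInt64) after resolution.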

Note: the underlying decorrelation type inconsistency has a second, JIT-independent symptom. For example, SELECT (SELECT toUInt32(number) - toFloat64(number - 1)) ... GROUP BY number WITH CUBE SETTINGS group_by_use_nulls = 1, allow_experimental_correlated_subqueries = 1 raises LOGICAL_ERROR: Unexpected return type from toUInt32. Expected UInt32. Got Nullable(UInt32), even with compile_expressions = 0. That is a separate planner/analyzer issue in the experimental correlated-subqueries feature and is out of scope for this JIT-crash fix.

A correlated scalar subquery whose body contains an arithmetic expression
over the outer GROUP BY key, evaluated under group_by_use_nulls with
WITH CUBE/ROLLUP, produces an internally inconsistent ActionsDAG:
decorrelation feeds a Nullable correlated-column INPUT into minus/plus/etc.
nodes that were resolved (and keep their result type) for the non-Nullable
key. The interpreter tolerates this, but the expression JIT bakes the
declared node types and aborts in llvm::BinaryOperator::Create with
"Cannot create binary operator with two operands of differing type"
(found by the AST fuzzer).

Guard isCompilableFunction against this: when a child's actual result type
differs from the type the function was resolved for
(function.getArgumentTypes()[i]), the node cannot be faithfully JIT-compiled,
so skip JIT for it and let it run in the interpreter, which produces the
correct result.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @KochetovNicolai @novikd: could you review this? It fixes an AST-fuzzer JIT crash (Cannot create binary operator with two operands of differing type). Under group_by_use_nulls, correlated-subquery decorrelation produces an ActionsDAG whose minus node keeps a non-Nullable result type while its decorrelated INPUT becomes Nullable. The guard skips JIT for such inconsistent nodes; the interpreter result is correct. The underlying decorrelation type inconsistency also has a separate non-JIT LOGICAL_ERROR symptom, noted in the PR description, which I left out of scope.

@Algunenano self-assigned this Jun 26, 2026
@@ -388,6 +388,21 @@ static bool isCompilableFunction(const ActionsDAG::Node & node, const std::unord
{
return false;
}

Member

@groeneai You are making the same mistake that you did in #107263, again.

And it looks, again, related to #106377.

This is a problem in the caller, not in the JIT code.

Contributor Author

@Algunenano You are right, same mistake as #107263. The JIT guard is the wrong layer.

Root cause is the caller: under group_by_use_nulls, decorrelation feeds the correlated subquery INPUT as Nullable(UInt64) (from the outer Nullable GROUP BY key), but the consuming minus nodes keep the non-nullable Int64 they were resolved for. The ActionsDAG is internally inconsistent. The interpreter tolerates it, the JIT bakes the declared type and aborts. This is the #106377 family.

It is already being fixed in the right layer: the comprehensive analyzer fix #100365 makes a correlated column Nullable at resolution time when it references an outer Nullable group-by key, so the subquery functions are rebuilt as Nullable(Int64) from the start and the DAG stays consistent. Its own code comment describes exactly this shape ("the inner expression DAG, function bindings ... built with the pre-Nullable type and later fail with a type mismatch at runtime"). #104350 is related. novikd already directed on #102597 that this belongs in QueryAnalyzer, not the planner or JIT.
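To make the difference between the two layers concrete, here is a minimal sketch of resolving a function after the key has been made Nullable, so that the result type is Nullable from the start. It is a toy model under stated assumptions, not the analyzer code from #100365: all function names are hypothetical, types are plain strings rather than ClickHouse data types, and the Int64 result of UInt64 minus UInt8 is hard-coded.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helpers over type names; not ClickHouse's IDataType API.
bool isNullableName(const std::string & type)
{
    return type.rfind("Nullable(", 0) == 0;  // prefix check
}

std::string makeNullableName(const std::string & type)
{
    return isNullableName(type) ? type : "Nullable(" + type + ")";
}

// Toy return-type rule for an arithmetic function: the base result type,
// wrapped in Nullable when any argument is Nullable.
std::string resolveArithmeticResult(const std::string & base_result, const std::vector<std::string> & argument_types)
{
    assert(!base_result.empty());
    for (const auto & arg : argument_types)
        if (isNullableName(arg))
            return makeNullableName(base_result);
    return base_result;
}

// Resolves `key - 2` for a correlated key. The key is made Nullable *before*
// resolution when it refers to an outer Nullable GROUP BY key, so the function
// is never bound to the pre-Nullable type.
std::string resolveMinusOverKey(const std::string & key_type, bool outer_key_is_nullable)
{
    const std::string effective_key = outer_key_is_nullable ? makeNullableName(key_type) : key_type;
    return resolveArithmeticResult("Int64", {effective_key, "UInt8"});
}
```

In this model the minus node over a UInt64 key resolves to Int64 normally and to Nullable(Int64) when the outer key is Nullable, so no node is left holding a type that disagrees with its input.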

Closing this PR. I will not add a competing fix. I am adding the JIT-abort reproducer and a sibling JIT-independent variant (LOGICAL_ERROR: Unexpected return type from toUInt32 ... Got Nullable(UInt32), which reproduces with compile_expressions=0) to #106377, so that the analyzer fix is validated against both shapes.

@groeneai groeneai closed this Jun 26, 2026