Fix logical error in disjunction filter push-down through JOIN with type-changing USING key by groeneai · Pull Request #107407 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix logical error in disjunction filter push-down through JOIN with type-changing USING key#107407

Merged
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-join-disjunction-pushdown-using-type-mismatch
Jun 18, 2026
Merged

Fix logical error in disjunction filter push-down through JOIN with type-changing USING key#107407
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-join-disjunction-pushdown-using-type-mismatch

Conversation

@groeneai

@groeneai groeneai commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a logical error (Unexpected return type from equals. Expected Nullable(UInt8). Got UInt8) when the disjunction (partial predicate) push-down optimization pushes a condition over a USING key whose type is widened by the JOIN. Also fix a server crash (segmentation fault) in the analyzer when resolving identifiers in JOIN ... USING queries that contain constant-foldable if/multiIf branches referencing unknown identifiers.

Description

Found by the AST fuzzer on PR #68978: report, check AST fuzzer (arm_asan_ubsan), STID 3262-3b6a. The crash is a pre-existing trunk bug unrelated to that PR. Same class as the open fuzz issue #89802.

Minimal reproducer:

CREATE TABLE nr (x Nullable(UInt32), s Nullable(String)) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE t  (x UInt32, s LowCardinality(String)) ENGINE = MergeTree ORDER BY tuple();

SELECT t.x FROM t AS l SEMI LEFT JOIN nr AS r USING (x)
WHERE and(or(and(and(equals(or(divide(NULL, '922337203685477580.7'), toString(NULL, NULL)), r.s), equals(r.s, '127.0.0.1')), equals(t.x, toLowCardinality(-9223372036854775808))), greater(l.s, toString(NULL, 0.0001))), greater(r.s, toString(10.0001, NULL)))
QUALIFY 1024 LIMIT 0;

Root cause: the USING (x) key is UInt32 on the left and Nullable(UInt32) on the right, so the column is widened to Nullable in the JOIN output. A predicate over x is built against that widened type (so equals returns Nullable(UInt8)). The main filter push-down path repairs such type changes for JoinStepLogical via fix_predicate_for_join_logical_step, and get_available_columns_for_filter therefore intentionally allows type-changing columns through. But the disjunction (partial predicate) push-down path (tryToExtractPartialPredicate + addFilterOnTop) reuses the same column lists and builds a FilterStep directly against the non-widened JOIN input header, with no type-fixup. ActionsDAG::updateHeader then re-evaluates equals as UInt8 while the node still claims Nullable(UInt8), tripping the result-type check in executeActionForPartialResult and aborting on debug/sanitizer builds.

Fix: restrict the disjunction push-down to columns whose type is stable across the JOIN boundary (input type equals JOIN-output type), via a new require_stable_types flag on get_available_columns_for_filter. Partial push-down is an opportunistic broadening pre-filter (the full filter still runs above the JOIN), so skipping type-changing columns is safe and only forgoes a pre-filter that cannot be correctly typed without the fixup the main path applies. Push-down for stable-type columns is unchanged.

A regression test is added in tests/queries/0_stateless/04330_join_disjunctions_pushdown_using_type_mismatch.sql. It fails (aborts) without the fix and passes with it, and also asserts the optimization still fires for a stable-type disjunction.

Additional fix in this PR: analyzer NULL deref in JOIN USING identifier resolution

While fuzzing the new 04330 seed query, the AST fuzzer (amd_debug, targeted) hit a separate, pre-existing crash: a NULL pointer read (SIGSEGV) in IdentifierResolver::tryBindIdentifierToJoinUsingColumn. STID 5737-6ff0, check AST fuzzer (amd_debug, targeted): report.

Root cause: scope.join_using_columns is a list of raw pointers to a stack-local map that tryResolveIdentifierFromJoin pushes while recursing into one JOIN side and pops afterwards. The push/pop wrapped the recursive tryResolveIdentifierFromJoinTreeNode call without an exception guard. The if/multiIf constant-fold special cases in resolveFunction.cpp resolve statically-dead branches inside a try/catch and swallow the exception; when a dead branch referenced an unknown identifier, the resulting UNKNOWN_IDENTIFIER unwound past the pop, the stack-local map was destroyed, and the dangling pointer was left in scope.join_using_columns. A later tryBindIdentifierToJoinUsingColumn dereferenced it and crashed.

Fix: run the pop on every exit path with SCOPE_EXIT, balanced against a captured flag so each push is matched by exactly one pop. Pre-existing bug, independent of the push-down change above. A regression query is added to the same 04330 test (crashes without the fix, returns a result with it).

Version info

  • Merged into: 26.6.1.960
  • Backported to: 26.5.5.8, 26.4.5.86

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @vdimir @davenger — could you review? This fixes a logical error (Unexpected return type from equals) in the disjunction (partial predicate) filter push-down: it built a FilterStep against the non-widened JOIN input header without the type-fixup the main push-down path applies (fix_predicate_for_join_logical_step), so a predicate over a type-changing USING key kept its JOIN-output Nullable result type. The fix restricts the disjunction path to columns with stable types across the JOIN. Related to your work on #99278 / the legacy-path removal discussion there.

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 13, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [6f21d80]

Summary:


AI Review

Summary

This PR fixes two JOIN ... USING analyzer/query-plan failures: unsafe disjunction partial push-down when a USING key changes type, and a dangling join_using_columns pointer after an exception during identifier resolution. The code changes and regression coverage look sound, but the PR metadata still describes only the first fix while the diff now also fixes a SIGSEGV.

PR Metadata
  • Changelog category should be Critical Bug Fix (crash, data loss, RBAC) because the PR includes an analyzer SIGSEGV fix in IdentifierResolver.
  • Changelog entry is required and should mention both user-visible fixes. Suggested replacement:
    Fix an exception in disjunction filter push-down over type-widened JOIN ... USING keys, and fix a SIGSEGV in analyzer identifier resolution for JOIN ... USING queries with statically-dead if/multiIf branches that reference unknown identifiers.
Findings
  • 💡 Nits
    • [tests/queries/0_stateless/04330_join_disjunctions_pushdown_using_type_mismatch.sql:79] The added analyzer regression covers a second user-visible SIGSEGV fix, but the PR metadata remains categorized as Bug Fix and the changelog entry only describes the disjunction-pushdown exception. Update the category and changelog entry as suggested above so the release note reflects the critical analyzer fix too.
Final Verdict

Status: ⚠️ Request changes

Minimum required action: update the changelog category and entry to cover both the disjunction-pushdown exception and the analyzer JOIN ... USING SIGSEGV.

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 13, 2026
…ype-changing USING key

The disjunction (partial predicate) push-down path built a FilterStep directly
against the JOIN input header without the type-fixup that the main push-down path
applies via fix_predicate_for_join_logical_step. When a USING key is widened by the
JOIN (e.g. UInt32 vs Nullable(UInt32)), a pushed predicate over that column kept the
JOIN-output result type while its arguments were re-evaluated against the narrower
input type, tripping the result-type check in executeActionForPartialResult:
"Unexpected return type from equals. Expected Nullable(UInt8). Got UInt8".

Restrict the disjunction push-down to columns whose type is stable across the JOIN
boundary via a require_stable_types flag on get_available_columns_for_filter. Partial
push-down is an opportunistic broadening pre-filter (the full filter still runs above
the JOIN), so skipping type-changing columns is safe; push-down for stable-type
columns is unchanged.

Found by the AST fuzzer (arm_asan_ubsan, STID 3262-3b6a). Same class as issue ClickHouse#89802.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the fix-join-disjunction-pushdown-using-type-mismatch branch from c04071a to 03a72c1 Compare June 13, 2026 22:30
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate

# Question Answer
a Deterministic repro? Yes. clickhouse local --enable_analyzer=1 --use_join_disjunctions_push_down=0 on the original crash query aborted (RC=134, assertBlocksHaveEqualStructure ... ForPartialResult / "Unexpected return type from equals"); the test reproduces deterministically.
b Root cause explained? The disjunction (partial-predicate) push-down built a per-side FilterStep against the JOIN input header without the type-fixup (fix_predicate_for_join_logical_step) the main push-down applies. With a USING key widened by the JOIN (UInt32 vs Nullable(UInt32)), the pushed predicate kept the JOIN-output result type while its arguments were re-evaluated against the narrower input type, tripping the result-type check.
c Fix matches root cause? Yes. Restrict the disjunction push-down to columns whose input type equals the JOIN-output type (require_stable_types flag), mirroring the existing legacy-path guard. The partial path has no type-fixup mechanism, so skipping type-changing columns is the correct scope.
d Test intent preserved / new tests added? Strengthened per review. The stable-type block is now an EXPLAIN actions=1 plan assertion (modeled on 03610): push-down enabled asserts the two per-side pre-join filters plus the outer filter; disabled asserts only the outer filter. This genuinely fails if push-down stops firing (the result-only check did not). The crash-regression query and a result-safety SELECT are retained.
e Both directions demonstrated? Yes. Crash query: aborts without fix, passes with fix. Plan assertion: 3 Filter column: lines with push-down on, 1 with it off (verified) so a silent push-down regression collapses the enabled block to one line and fails the test. 50/50 with full randomization, 5/5 with --no-random-settings.
f Fix is general, not a narrow patch? Yes. The guard is applied in get_available_columns_for_filter (the single column-availability source for the disjunction/partial path), so all pushed atoms over type-changing columns are skipped, not just the one the fuzzer hit. It does not guard a symptom: it stops type-unstable columns from entering push-down at the source, matching the legacy path's existing invariant.

Session id: cron:clickhouse-worker-slot-2:20260613-220600

scope.join_using_columns is a list of raw pointers to a stack-local map
(join_using_column_name_to_column_node) that tryResolveIdentifierFromJoin
pushes while recursing into one JOIN side and pops afterwards. The push/pop
wrapped the recursive tryResolveIdentifierFromJoinTreeNode call without an
exception guard, so when that recursion threw the pop was skipped.

The throw is reachable from normal analysis: the if/multiIf constant-fold
special cases in resolveFunction.cpp resolve statically-dead branches inside
a try/catch and swallow the exception. If a dead branch references an unknown
identifier (UNKNOWN_IDENTIFIER), the throw unwound past the pop, the stack-local
map was destroyed, and its pointer was left dangling in scope.join_using_columns.
A later tryBindIdentifierToJoinUsingColumn iterated the list and dereferenced the
dangling pointer, segfaulting (NULL pointer read) in the analyzer.

Make the pop run on every exit path with SCOPE_EXIT, and balance it against a
captured flag so a push is always matched by exactly one pop.

Found by the AST fuzzer (amd_debug, targeted, STID 5737-6ff0) on PR ClickHouse#107407,
which mutated the new 04330 JOIN USING / EXPLAIN seed query into a form with a
dead if/multiIf branch referencing a non-existent qualifier. Pre-existing bug,
independent of the disjunction push-down change in this PR. Regression query
added to 04330_join_disjunctions_pushdown_using_type_mismatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate (analyzer NULL deref fix)

# Question Answer
a Deterministic repro? Yes. clickhouse local on the verbatim fuzzed query (and a reduced viewExplain(... SEMI LEFT JOIN ... USING ...) form) segfaults 100% of the time on the unfixed binary (exit 139). Saved as the new 04330 regression query.
b Root cause explained? scope.join_using_columns holds raw pointers to a stack-local map that tryResolveIdentifierFromJoin pushes before recursing into a JOIN side and pops after. The push/pop did not guard the recursive tryResolveIdentifierFromJoinTreeNode call. The if/multiIf constant-fold path in resolveFunction.cpp resolves statically-dead branches in a try/catch and swallows the exception; a dead branch with an unknown identifier throws UNKNOWN_IDENTIFIER, unwinding past the pop. The stack-local map is destroyed but its pointer stays in the list, so a later tryBindIdentifierToJoinUsingColumn dereferences a dangling pointer (NULL read).
c Fix matches root cause? Yes. The pop now runs on every exit path via SCOPE_EXIT, so the dangling pointer is never left behind when the recursive resolve throws. Not a symptom guard in the deref loop.
d Test intent preserved / new tests added? Yes. Added a regression query to 04330_join_disjunctions_pushdown_using_type_mismatch that exercises the exact crash path. Existing push-down assertions are unchanged.
e Both directions demonstrated? Yes. Unfixed binary: SIGSEGV 5/5 (exit 139). Fixed binary: clean result 5/5 (exit 0). Full 04330 test passes, including 20/20 runs with randomized prewhere/threads/block_size/join_use_nulls.
f Fix is general, not a narrow patch? Yes. join_using_columns has exactly one push/pop pair (the one fixed). The flag is captured at push time so the pop is balanced even if the map were mutated mid-resolution (the original re-checked .empty() at pop time, a latent imbalance). No sibling code path pushes to this list.

Session id: cron:clickhouse-worker-slot-0:20260613-234100

ORDER BY l.x
FORMAT TSV;

-- Analyzer identifier-resolution crash (NULL deref) reachable from this query family.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR now includes a second user-visible fix covered by this regression block: a SIGSEGV in analyzer identifier resolution for JOIN ... USING. The PR metadata is still categorized as Bug Fix and the changelog entry only describes the disjunction-pushdown exception, so the release note will miss the critical analyzer fix. Please switch the category to Critical Bug Fix (crash, data loss, RBAC) and include both fixes in the changelog entry, e.g.:

Fix an exception in disjunction filter push-down over type-widened JOIN ... USING keys, and fix a SIGSEGV in analyzer identifier resolution for JOIN ... USING queries with statically-dead if/multiIf branches that reference unknown identifiers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the changelog entry to cover both fixes (the disjunction push-down exception and the analyzer SIGSEGV in JOIN ... USING resolution).

On the category: I have kept Bug Fix rather than self-selecting Critical Bug Fix. Per our contribution policy I do not pick the Critical Bug Fix category on my own; a maintainer applies it. Both fixes here are pre-existing trunk crashes surfaced by the AST fuzzer (not regressions in a released version), so a maintainer is best placed to decide whether the crash class warrants Critical Bug Fix. Happy to switch immediately if a reviewer confirms.

@clickhouse-gh

clickhouse-gh Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.80% 84.70% -0.10%
Functions 92.40% 92.30% -0.10%
Branches 77.30% 77.30% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 26/26 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on 6f21d8020bf5. No PR-caused failures.

  • AST fuzzer (all variants incl. amd_debug/targeted): success. The earlier segfault (STID 5737-6ff0) was on the prior commit 03a72c1e177 and is fixed by the analyzer exception-safety change in 6f21d8020bf; it does not recur on this HEAD.
  • Bugfix validation (functional tests): job status success. The "Lost forever for SharedMergeTree" / "Lost s3 keys" / "S3_ERROR No such key" / "OOM in dmesg" / "Exception in test runner" rows are CI-harness server-log-scan pseudo-tests, the same set surfacing on ~10 unrelated PRs over the last 7 days inside jobs that pass; unrelated to this change.
  • CH Inc sync: non-gating internal mirror check.

@vdimir vdimir self-assigned this Jun 15, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This is a correctness fix: it stops the join disjunction (partial-predicate) push-down from applying a predicate built for a type-widened join output to a non-widened input column, and adds a scope guard so JOIN identifier resolution doesn't leave a dangling pointer when a dead branch throws. The change only alters behavior for joins with type-changing USING keys. All 8 flagged ClickBench improvements are on join-free single-table queries the PR cannot touch, and the 2 TPC-H improvements are on intrinsically noisy queries whose stable join types never hit the narrow type-mismatch path. All flagged improvements were downgraded to not-sure as run-to-run/environment variance; note the source build ran only 6 iterations versus 30 on base.

clickbench

⚠️ 8 inconclusive

Flagged queries (8 of 43)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
⚠️ 4 not_sure 264 214 -18.9% <0.0001 ClickBench Q4 is a join-free single-table scan; PR only touches join push-down/resolution. Off-path, variance.
⚠️ 15 not_sure 246 206 -16.3% <0.0001 Join-free single-table query; PR's join-only code path is never exercised here. Improvement is run-to-run variance.
⚠️ 18 not_sure 1403 1282 -8.6% <0.0001 No joins in this query; PR changes only join disjunction push-down. Delta is environment/variance, not the PR.
⚠️ 22 not_sure 334 232 -30.5% <0.0001 Join-free query, very noisy underlying runs (high spread); PR can't touch this path. Unrelated to the PR.
⚠️ 23 not_sure 297 194 -34.7% <0.0001 Clean 34% delta but on a join-free query the PR's join-only code cannot affect. Off-path environment effect.
⚠️ 28 not_sure 7016 6562 -6.5% <0.0001 Single-table query, no joins; PR only changes join push-down/identifier resolution. Off-path, variance.
⚠️ 33 not_sure 1488 1402 -5.8% <0.0001 Join-free query untouched by the PR's join-only changes. Improvement is run-to-run variance.
⚠️ 34 not_sure 1486 1388 -6.6% <0.0001 No joins here; PR's join disjunction fix cannot affect this query. Off-path delta.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

⚠️ 2 inconclusive

Flagged queries (2 of 22)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
⚠️ 7 not_sure 113 71 -37.2% <0.0001 join_operator: PR restricts disjunction push-down only for Nullable USING type mismatch; standard TPC-H schema doesn't hit it. Noisy hi
⚠️ 20 not_sure 248 122 -50.8% <0.0001 join_operator: Narrow type-mismatch push-down fix doesn't apply to this query's stable join types; intrinsically noisy. Variance.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

Debug info
  • StressHouse run: cfb25c5f-31f2-47b8-b1e6-7fc3c1a2d81d
  • MIRAI run: cf4a9b8e-1817-4664-bc77-683ce67e9202
  • PR check IDs:
    • clickbench_179908_1781536423
    • clickbench_179914_1781536422
    • clickbench_179933_1781536425
    • tpch_adapted_1_official_179954_1781536421
    • tpch_adapted_1_official_179975_1781536421
    • tpch_adapted_1_official_180333_1781536447

@alexey-milovidov alexey-milovidov merged commit 0f837e1 into ClickHouse:master Jun 18, 2026
165 of 166 checks passed
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 18, 2026
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Jun 28, 2026
robot-clickhouse added a commit that referenced this pull request Jun 30, 2026
Cherry pick #107407 to 26.4: Fix logical error in disjunction filter push-down through JOIN with type-changing USING key
robot-clickhouse added a commit that referenced this pull request Jun 30, 2026
…h-down through JOIN with type-changing USING key
robot-clickhouse added a commit that referenced this pull request Jun 30, 2026
Cherry pick #107407 to 26.5: Fix logical error in disjunction filter push-down through JOIN with type-changing USING key
robot-clickhouse added a commit that referenced this pull request Jun 30, 2026
…h-down through JOIN with type-changing USING key
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jun 30, 2026
vdimir added a commit that referenced this pull request Jun 30, 2026
Backport #107407 to 26.4: Fix logical error in disjunction filter push-down through JOIN with type-changing USING key
vdimir added a commit that referenced this pull request Jun 30, 2026
Backport #107407 to 26.5: Fix logical error in disjunction filter push-down through JOIN with type-changing USING key
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo v26.4-must-backport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants