Fix join reorder dropping outer-join rows in OUTER JOIN + comma queries by groeneai · Pull Request #101684 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix join reorder dropping outer-join rows in OUTER JOIN + comma queries#101684

Merged
nikitamikhaylov merged 4 commits into
ClickHouse:masterfrom
groeneai:fix/join-order-outer-comma-constraint
Jun 11, 2026
Merged

Fix join reorder dropping outer-join rows in OUTER JOIN + comma queries#101684
nikitamikhaylov merged 4 commits into
ClickHouse:masterfrom
groeneai:fix/join-order-outer-comma-constraint

Conversation

@groeneai

@groeneai groeneai commented Apr 3, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix join reordering silently dropping unmatched rows of a RIGHT/LEFT JOIN when it is comma-joined (cross) with another table, e.g. t1 RIGHT JOIN t2 ON t1.c = t2.c, t3. The query previously returned the inner-join result instead of the outer-join one.

Motivation

buildQueryGraph stored a Cross marker in join_kinds[0] for every Cross/Comma join. join_kinds is otherwise reserved for the OuterJoinRestriction of a null-supplying relation. When the left side of the cross was itself an outer join, this marker overwrote the relation-0 RIGHT/LEFT restriction. With the restriction gone, isValidJoinOrder returned Cross for the pair, which solveGreedy downgraded to Inner (the pair is connected), turning the outer join into an inner join and dropping the null-supplied rows.

This is the same code path that produced the Cannot fold actions for projection logical error (STID 3673-56a1). The crash was fixed by #102516 (the typeChangingSides guard), but the silent wrong-result for the non-type-changing case remained.

Approach

Stop multiplexing the cartesian signal into join_kinds. Cross vs Inner is already derivable from edge connectivity: a pair with no connecting predicate is a cross join. solveGreedy now assigns Cross to an unconnected Inner pair instead of relying on the stored marker, so the outer restriction on relation 0 survives and Cross/Comma semantics are preserved.

Addresses @vdimir's review: the Cross kind is no longer stored on relation 0 (or any relation), so it never collides with an outer-join restriction.

Closes: #103045

Test plan

  • Regression cases added to 04277_join_reorder_constraints (RIGHT/LEFT JOIN + comma). They fail on master HEAD and pass with this change.
  • All existing join-reorder tests pass (03604, 03760, 03822, 04075, 04095, 04277, 04303).
  • The cross-join tests that previously regressed when the marker was fully removed pass: 00202_cross_join, 01109_inflating_cross_join, 03595, 03302, 03701, 01236_graphite_mt.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.6.1.682

@groeneai

groeneai commented Apr 3, 2026

Copy link
Copy Markdown
Contributor Author

@groeneai

groeneai commented Apr 3, 2026

Copy link
Copy Markdown
Contributor Author

cc @vdimir @novikd — could you review this? This fixes a crash in foldActionsByProjection (STID 3673-56a1) caused by buildQueryGraph() unconditionally overwriting join_kinds[0] for Cross/Comma joins, which destroys valid constraints from nested OUTER joins flattened via uniteGraphs. The fix simply removes the overwrite — the absence of a constraint already implies "any order valid" for Cross joins.

@vdimir this is particularly relevant since the regression was introduced by PR #99096 (recursive join optimization in addChildQueryGraph) and interacts with the new query_plan_optimize_join_order_randomize setting from your PR #100643.

@groeneai

groeneai commented Apr 4, 2026

Copy link
Copy Markdown
Contributor Author

Could someone add the can be tested label to enable CI? This fixes a P0 AST fuzzer crash — "Cannot fold actions for projection" (STID 3673-56a1, 9 hits). Thanks!

@groeneai

groeneai commented Apr 7, 2026

Copy link
Copy Markdown
Contributor Author

@alexey-milovidov — thanks for labeling #101683! Could you also add can be tested here and on #101690?

This one fixes the recurring "Cannot fold actions for projection" crash (STID 3673-56a1) — a regression from #99096. Hit rate accelerated after #100643 (join order randomize) merged: 7 hits in the last 7 days (4 on Apr 5 alone), all stress tests across unrelated PRs. Root cause: buildQueryGraph() at optimizeJoin.cpp:660 unconditionally overwrites join_kinds[0] for Cross/Comma joins, destroying valid constraints from nested OUTER joins. Fix removes the overwrite (2-line deletion) + regression test.

#101690 fixes the "ColumnUnique can't contain null values" crash (STID 1503-2bec) — a 1-line guard in FunctionsComparison.h + regression test.

@vdimir vdimir added the can be tested Allows running workflows for external contributors label Apr 7, 2026
@clickhouse-gh

clickhouse-gh Bot commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [65627b3]

Summary:


AI Review

Summary

This PR fixes a real JOIN reordering correctness issue by removing the overloaded Cross marker from join_kinds and deriving cartesian JOIN semantics from missing connectivity in the greedy planner. The current code preserves the outer-join restriction for the null-supplying relation and keeps the surrounding comma/Cross join as cartesian; the added regression cases cover the affected RIGHT JOIN and LEFT JOIN shapes. I found no unresolved correctness, safety, or test-coverage findings that need inline comments.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Apr 7, 2026
Comment thread tests/queries/0_stateless/04080_join_reorder_outer_comma.sql Outdated
@groeneai

groeneai commented Apr 7, 2026

Copy link
Copy Markdown
Contributor Author

Fix for Fast test failures

The previous commit removed the join_kinds[0] Cross entry entirely, which broke 7 Fast tests because Inner and Cross are not semantically equivalent — downstream code uses the distinction for join algorithm selection and condition handling.

Updated fix (596beca): The Cross entry is now set conditionally — only when join_kinds.find(0) == join_kinds.end(). This preserves Cross semantics for simple cross/comma joins while protecting nested OUTER join constraints from being overwritten.

Pre-PR validation (session cron:clickhouse-ci-task-worker:20260407-134500)

a) Deterministic repro: The original crash requires stress test + randomized join orders. Crash mechanism confirmed via static analysis: join_kinds[0] overwrite at line 659 destroys RIGHT/LEFT constraints from nested joins propagated via uniteGraphs, allowing isValidJoinOrder to accept invalid orders.

b) Root cause: When buildQueryGraph processes a Comma/Cross join whose LHS is a flattened RIGHT/LEFT join subgraph, uniteGraphs correctly places the OUTER constraint at index 0. The unconditional overwrite then destroys it, allowing invalid join orders that crash in foldActionsByProjection.

c) Fix matches root cause: The conditional check prevents the overwrite when a nested OUTER constraint already exists at index 0, while preserving Cross semantics for all other cases.

d) Test intent preserved: All previously failing Fast tests pass:

  • 00202_cross_join ✅ (20/20)
  • 01109_inflating_cross_join ✅ (20/20)
  • 03595_convert_any_join_to_semi_or_anti
  • 03302_analyzer_join_filter_push_down_bug
  • 03701_analyzer_correlated_subquery_plan_reference
  • 01236_graphite_mt ✅ (20/20)
  • 04080_join_reorder_outer_comma ✅ (regression test, 20/20)
  • All 10 join reorder tests ✅

e) Demonstrated in both directions: Cross join tests fail without the conditional (old complete removal), pass with conditional. OUTER+COMMA regression test passes with conditional (and would crash on trunk without any fix).

f) Fix scope: Minimal 3-line change — single conditional guard. No sibling code paths affected (the join_kinds[0] assignment only exists at this one location for Cross/Comma).

@vdimir vdimir self-assigned this Jun 3, 2026
query_graph.join_kinds[0] = QueryGraph::OuterJoinRestriction{BitSet{}, BitSet{}, JoinKind::Cross};
{
if (query_graph.join_kinds.find(0) == query_graph.join_kinds.end())
query_graph.join_kinds[0] = QueryGraph::OuterJoinRestriction{BitSet{}, BitSet{}, JoinKind::Cross};

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use different index than 0 here then?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Keeping the marker on 0 only works when relation 0 has no other restriction, but the failing shape is exactly the case where relation 0 already carries the RIGHT/LEFT restriction. For t1 RIGHT JOIN t2 ON ... , t3, this branch skips the Cross marker, the optimizer first builds {0,1} as the required outer join, and the remaining unconnected join {0,1} x {2} is then seen by isValidJoinOrder as Inner because the left mask is no longer a single relation. buildPhysicalJoinImpl turns that into a constant-key inner join instead of a Cross join, so the Comma/Cross semantics that this patch is trying to preserve are still lost for the nested-outer case.

The Cross marker needs to be stored on a relation that is not already carrying an outer restriction, or represented separately from join_kinds, so preserving the outer constraint does not drop the surrounding Cross join kind.

A query that comma-joins (cross) an outer join with another table, e.g.
`t1 RIGHT JOIN t2 ON t1.c = t2.c, t3`, could silently drop the unmatched
rows of the RIGHT/LEFT join when join reordering was enabled (the default
greedy algorithm). It returned the inner-join result instead of the
outer-join one.

Root cause: buildQueryGraph stored a Cross marker in join_kinds[0] for every
Cross/Comma join. join_kinds is otherwise reserved for the OuterJoinRestriction
of a null-supplying relation. When the left side of the cross was itself an
outer join, this marker overwrote the relation-0 RIGHT/LEFT restriction set a
few lines later. With the restriction gone, isValidJoinOrder returned Cross for
the pair, which solveGreedy then downgraded to Inner because the pair was
connected, turning the RIGHT/LEFT join into an INNER join and dropping the
null-supplied rows.

This is the same code path that previously produced the "Cannot fold actions
for projection" logical error; that crash was fixed by ClickHouse#102516
(typeChangingSides guard), but the silent wrong-result for the
non-type-changing case remained.

Fix: stop multiplexing the cartesian signal into join_kinds. The Cross vs Inner
distinction is already derivable from edge connectivity: a pair with no
connecting predicate is a cross join. solveGreedy now sets Cross for an
unconnected Inner pair instead of relying on the stored marker, so the outer
restriction on relation 0 is preserved and Cross/Comma semantics are kept.

Regression cases added to 04277_join_reorder_constraints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the fix/join-order-outer-comma-constraint branch from 1cedaa1 to 1564d50 Compare June 10, 2026 08:40
@groeneai groeneai changed the title Fix join_kinds constraint overwrite in Cross/Comma joins Fix join reorder dropping outer-join rows in OUTER JOIN + comma queries Jun 10, 2026
@groeneai

Copy link
Copy Markdown
Contributor Author

Reworked per the review. The previous conditional approach was a half-fix: when relation 0 already carried a RIGHT/LEFT restriction, it skipped the Cross marker, so the surrounding cartesian join silently became a constant-key inner join and Cross/Comma semantics were still lost for the nested-outer case.

This version stops storing the Cross marker in join_kinds entirely. The cartesian distinction is derived from edge connectivity in solveGreedy (an unconnected Inner pair is a Cross join), so the outer-join restriction on relation 0 is preserved and the cross kind is kept. For t1 RIGHT JOIN t2 ON t1.c = t2.c, t3 the plan is now an inner RIGHT join {0,1} and a CROSS join with {2}, instead of dropping the unmatched rows.

The crash this PR originally targeted is already handled by #102516 (typeChangingSides guard); what remained was a silent wrong-result, which is what the updated regression cases in 04277_join_reorder_constraints cover.

Pre-PR validation gate

# Question Answer
a Deterministic repro? Yes. t1 RIGHT JOIN t2 ON t1.c = t2.c, t3 (t1=1,2,3; t2=1,10,20; t3=99) returns 1 row on master across all 30 query_plan_optimize_join_order_randomize seeds; correct is 3 rows.
b Root cause explained? The join_kinds[0] = {empty, Cross} write in buildQueryGraph overwrites the relation-0 RIGHT/LEFT OuterJoinRestriction. With it gone, isValidJoinOrder returns Cross, solveGreedy downgrades the connected pair to Inner, and the outer join becomes an inner join, dropping null-supplied rows.
c Fix matches root cause? Yes. Removes the marker write and derives Cross from connectivity, so the restriction is never destroyed. Not a band-aid.
d Test intent preserved / new tests added? Yes. Added RIGHT/LEFT + comma cases to 04277_join_reorder_constraints. No existing test weakened.
e Both directions demonstrated? Yes. The new cases fail on master HEAD (missing 0 10 99, 0 20 99) and pass with the fix.
f Fix is general, not a narrow patch? Yes. Fixes at the source (the marker write), covering RIGHT, LEFT, and FULL nested-outer + comma. solveDPsize is unaffected (it skips non-Inner kinds and disconnected pairs). Verified no regression in 00202_cross_join, 01109, 03595, 03302, 03701, 01236_graphite_mt and all *join_reorder* tests.

Session id: cron:clickhouse-worker-slot-0:20260610-072600

@groeneai

Copy link
Copy Markdown
Contributor Author

@vdimir ready for another look — the marker is no longer stored on any relation, so it can't collide with the outer-join restriction. Thanks!

The cartesian (implicit comma) join in Case 13a is now labeled Type: CROSS
in EXPLAIN actions output instead of Type: INNER with a synthetic
Clauses: [(__lhs_const) = (__rhs_const)] line. This follows from deriving
cartesian-ness from edge connectivity in solveGreedy rather than storing a
Cross marker in join_kinds. Data results are unchanged; only the plan label
differs. Drop the obsolete synthetic-clause line from the reference.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Pushed 03a7463: fix the Fast test failure on 03835_join_transitive_predicates_edge_cases.

Root cause: the test was not modified by this PR, but its EXPLAIN actions=1 reference (Case 13a) hard-coded a synthetic Clauses: [(__lhs_const) = (__rhs_const)] line. That line was an artifact of the old code path, which represented an implicit cartesian (comma) join as Type: INNER plus an always-true constant clause. With this PR, the cartesian distinction is derived from edge connectivity in solveGreedy instead of a stored marker, so the join is now correctly labeled Type: CROSS with no synthetic clause.

Verified locally (debug, clickhouse local):

  • Case 13a data results are byte-identical before/after; only the plan label changed (INNER+const-clause -> CROSS).
  • The single reference line was dropped to match. Test passes on the fixed binary and fails on master HEAD by exactly that line, confirming the change is tied to this fix.
  • 03405_merge_filter_into_join (the only other test referencing __lhs_const) is unaffected: its joins use an explicit ON 1 / CROSS JOIN whose constant clause is materialized before reordering, so the !connected relabel does not fire.
  • All join/cross/reorder tests pass: 00202_cross_join, 01109_inflating_cross_join, 02000_join_on_const, 03595, 03302, 04277_join_reorder_constraints.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on 03a746352a5f. The three red checks are pre-existing chronic trunk failures unrelated to this join-reorder change, not regressions from this PR:

  • Stress test (amd_debug) and Stress test (arm_asan_ubsan): Logical error: Table ... not found in catalog (STID: 3215-370f). This is a widespread chronic stress catalog flake that began ~2026-06-09: 9 master hits plus 20+ unrelated PRs in the last 24h. Tracked separately (it is the subject of a separate fix effort). This PR only touches join-order optimization (optimizeJoin.cpp, joinOrder.cpp), not the table catalog / drop path.
  • BuzzHouse (arm_asan_ubsan): std::length_error vector (STID: 2508-3132). This is the chronic StorageURL unbounded reserve() bug in the urlCluster distributed path; a separate fix (PR Clamp num_streams in the urlCluster distributed read path #106946) is in flight. Unrelated to join reordering.

The only PR-caused failure was the earlier Fast test reference diff on 03835_join_transitive_predicates_edge_cases, fixed in 03a7463 (see prior comment). The fix itself is green: all 11 builds, Fast test, and the join-reorder stateless tests pass.

Session: cron:our-pr-ci-monitor:20260610-143000

@groeneai

Copy link
Copy Markdown
Contributor Author

CI fully finished on HEAD 6c0a93a (the 2026-06-10 master-merge). Finish Workflow passed; all checks completed. Two failures, both pre-existing and unrelated to this PR's join-reorder diff (optimizeJoin.cpp / joinOrder.cpp):

Mergeable Check is the aggregator reflecting the two above; no Fast test / Bugfix / Style failure. The earlier Fast test failure on 03835_join_transitive_predicates_edge_cases was resolved in 03a7463 (reference updated for the correct CROSS label; data byte-identical). Approved by @vdimir. Ready for merge.

@vdimir

vdimir commented Jun 11, 2026

Copy link
Copy Markdown
Member

Ready for merge.

we cannot merge with red ci

@groeneai

Copy link
Copy Markdown
Contributor Author

we cannot merge with red ci

Thanks @vdimir. Both red checks on the previous head were pre-existing flaky failures unrelated to this diff (which only touches joinOrder.cpp and optimizeJoin.cpp):

I merged current master (65627b3) to re-trigger CI. Will confirm once it is green.

@clickhouse-gh

clickhouse-gh Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.60% 84.60% +0.00%
Functions 92.30% 92.30% +0.00%
Branches 77.30% 77.30% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 10/10 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@groeneai

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

4 participants