Fix Virtual row with union produces wrong result#107053
Conversation
| else | ||
| checkVirtualRowBoundary(cursors[source_num], virtual_row_boundary[source_num], description, source_num); | ||
| } | ||
| if (is_virtual_row && !has_collation) |
There was a problem hiding this comment.
This debug check is now running even when apply_virtual_row_conversions is false, which is a valid mode for existing read-in-order preliminary merges. In that mode setVirtualRow deliberately does not execute the conversion DAG, so for ORDER BY expressions the virtual chunk can contain default values for the computed sort columns instead of the announced boundary. The next real chunk then trips checkVirtualRowBoundary in debug builds for a valid plan.
For example, ReadFromMergeTree can still emit VirtualRowTransform chunks before the preliminary MergingSortedTransform created with apply_virtual_row_conversions=false when read_in_order_two_level_merge_threshold is exceeded. Please keep the apply_virtual_row_conversions guard around both remembering and checking the boundary, or otherwise guarantee that no virtual-row chunks can reach this algorithm when conversions are disabled.
| if (!apply_virtual_row) | ||
| { | ||
| for (auto * reader : virtual_row_readers) | ||
| reader->resetVirtualRowConversions(); |
There was a problem hiding this comment.
This reset can leave a child optimized through an INNER JOIN in a state that buildInputOrderInfo explicitly rejects. The helper only allows non-LEFT joins when uses_virtual_row is true; it then calls requestReadingInOrder and keepLeftPipelineInOrder. If a later union-wide decision sets apply_virtual_row=false, this loop clears the reader conversion, so that same join branch keeps the read-in-order plan but no longer emits virtual rows.
That loses the boundary signal needed for the join-through-order optimization and can make ORDER BY ... LIMIT read far more left-table rows, the case covered by 03711_read_in_order_through_join. Please either defer mutating the reader until the union decision is known, or mark such a child as not eligible when the union cannot keep virtual rows, and add a union + INNER JOIN regression case.
|
@groeneai Find appropriate reviewer for this PR |
|
cc @KochetovNicolai @nihalzp - could you review this? It fixes wrong ORDER BY results over UNION ALL by restricting UnionStep pipeline narrowing and adjusting virtual-row boundary handling (optimizeReadInOrder.cpp, MergingSortedAlgorithm.cpp). Picking you two as the recent authors of the read-in-order / buildInputOrderFromUnorderedKeys path and the ReadInOrder merge-table fix. The clickhouse-gh bot also flagged a possible INNER JOIN read-in-order-through-join eligibility regression (03711_read_in_order_through_join) worth a look. |
Some analysis: Failure: test_parallel_quorum_actually_quorum failed at test.py:178 — node2.query("SYSTEM SYNC REPLICA q", timeout=10) → client timeout. Server actually finished the SYNC in 12.24s, just past the 10s client deadline. Root cause chain:
Structural gap: the pool's staleness check isStale() = socket().poll(SELECT_READ) can't detect a DROP-blackholed peer (socket looks healthy locally), so it always reuses such a connection. Why now (from your CI data): regression is PR #103786 "Try the next resolved address if HTTP connect fails" (merged 2026-06-25). Evidence: its own pre-merge CI failed this exact test on 06-13; a cross-PR burst (14 then 11 failures across ~unrelated PRs) brackets its merge-queue/merge on 06-24/25. Before that, only rare failures on slow (coverage/tsan) builds. Honest caveat: the one log I have fails via the reuse-hang, which is a path PR #103786 doesn't touch — so the correlation nails the culprit, but the exact causal mechanism isn't proven by static analysis. So, root-cause not yet clear, we can try increasing timeout form SYNC query, but it's 10s for quite a while so not sure. |
|
@vdimir Thanks for the deep analysis. This On your open question (just bump the 10s SYNC timeout): we found that alone is not enough. The deterministic part of the root cause is the ReplicatedMergeTree fetch backoff, not only the client deadline. While node2 is partitioned off node1/node3 on 9009, its
I confirmed (1) alone is insufficient: with only backoff=0 the Your The structural gap you raise ( |
|
@groeneai all CI failures have linked issue, but also check that that issue have any fixes in flight and if not prepare a fix |
|
@vdimir Checked each current failure on the head build (
Note |
|
@groeneai what is the issue with |
|
@vdimir Not related to this PR. Base-skew between two PRs merged the same day, neither seeing the other in CI:
The guard was already on master when #105591 merged, so every config failed deterministically with Fix: revert #108812, merged 06-29 14:13 UTC. Master clean since (06-29: 46 fails; 06-30: 0/163 over the day). This PR head |
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 39/39 (100.00%) · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
ORDER BYqueries overUNION ALLwithoptimize_read_in_orderandread_in_order_use_virtual_rowenabled, close Virtual row with union produces wrong result #106879