Fix Virtual row with union produces wrong result by vdimir · Pull Request #107053 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix Virtual row with union produces wrong result#107053

Open
vdimir wants to merge 8 commits into
masterfrom
vdimir/virtual-row-union-bug
Open

Fix Virtual row with union produces wrong result#107053
vdimir wants to merge 8 commits into
masterfrom
vdimir/virtual-row-union-bug

Conversation

@vdimir

@vdimir vdimir commented Jun 10, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

@clickhouse-gh

clickhouse-gh Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 10, 2026
else
checkVirtualRowBoundary(cursors[source_num], virtual_row_boundary[source_num], description, source_num);
}
if (is_virtual_row && !has_collation)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debug check is now running even when apply_virtual_row_conversions is false, which is a valid mode for existing read-in-order preliminary merges. In that mode setVirtualRow deliberately does not execute the conversion DAG, so for ORDER BY expressions the virtual chunk can contain default values for the computed sort columns instead of the announced boundary. The next real chunk then trips checkVirtualRowBoundary in debug builds for a valid plan.

For example, ReadFromMergeTree can still emit VirtualRowTransform chunks before the preliminary MergingSortedTransform created with apply_virtual_row_conversions=false when read_in_order_two_level_merge_threshold is exceeded. Please keep the apply_virtual_row_conversions guard around both remembering and checking the boundary, or otherwise guarantee that no virtual-row chunks can reach this algorithm when conversions are disabled.

if (!apply_virtual_row)
{
for (auto * reader : virtual_row_readers)
reader->resetVirtualRowConversions();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reset can leave a child optimized through an INNER JOIN in a state that buildInputOrderInfo explicitly rejects. The helper only allows non-LEFT joins when uses_virtual_row is true; it then calls requestReadingInOrder and keepLeftPipelineInOrder. If a later union-wide decision sets apply_virtual_row=false, this loop clears the reader conversion, so that same join branch keeps the read-in-order plan but no longer emits virtual rows.

That loses the boundary signal needed for the join-through-order optimization and can make ORDER BY ... LIMIT read far more left-table rows, the case covered by 03711_read_in_order_through_join. Please either defer mutating the reader until the union decision is known, or mark such a child as not eligible when the union cannot keep virtual rows, and add a union + INNER JOIN regression case.

@vdimir

vdimir commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

@groeneai Find appropriate reviewer for this PR

@groeneai

Copy link
Copy Markdown
Contributor

cc @KochetovNicolai @nihalzp - could you review this? It fixes wrong ORDER BY results over UNION ALL by restricting UnionStep pipeline narrowing and adjusting virtual-row boundary handling (optimizeReadInOrder.cpp, MergingSortedAlgorithm.cpp). Picking you two as the recent authors of the read-in-order / buildInputOrderFromUnorderedKeys path and the ReadInOrder merge-table fix. The clickhouse-gh bot also flagged a possible INNER JOIN read-in-order-through-join eligibility regression (03711_read_in_order_through_join) worth a look.

@vdimir vdimir requested a review from nihalzp June 18, 2026 14:42
@nihalzp nihalzp self-assigned this Jun 18, 2026
@vdimir

vdimir commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

Integration tests (amd_asan_ubsan, db disk, old analyzer, 2/6) FAIL
test_quorum_inserts_parallel/test.py::test_parallel_quorum_actually_quorum

Some analysis:

Failure: test_parallel_quorum_actually_quorum failed at test.py:178 — node2.query("SYSTEM SYNC REPLICA q", timeout=10) → client timeout. Server actually finished the SYNC in 12.24s, just past the 10s client deadline.

Root cause chain:

  • SYNC waits for node2's whole replication queue. The blocking entry was GET_PART all_0_0_0 — the part from the first, deliberately-failed quorum insert (INSERT (1,'Hello'), quorum=3).
  • Node2's fetch of all_0_0_0 reused a pooled keep-alive interserver connection opened before the partition. A reused connection skips the 1s connect timeout and instead blocks on the 30s http_receive_timeout while the partition swallows the request — it hung ~26s ("No message received") before failing, holding the queue entry hostage. (Fresh connects fail in 1s and recover fast — that's why all_1_1_0/all_2_2_0 synced fine.)
  • As a side effect, node2 then took the quorum-failed fallback and abandoned the Hello row (would also fail the next count==3 assert).

Structural gap: the pool's staleness check isStale() = socket().poll(SELECT_READ) can't detect a DROP-blackholed peer (socket looks healthy locally), so it always reuses such a connection.

Why now (from your CI data): regression is PR #103786 "Try the next resolved address if HTTP connect fails" (merged 2026-06-25). Evidence: its own pre-merge CI failed this exact test on 06-13; a cross-PR burst (14 then 11 failures across ~unrelated PRs) brackets its merge-queue/merge on 06-24/25. Before that, only rare failures on slow (coverage/tsan) builds.

Honest caveat: the one log I have fails via the reuse-hang, which is a path PR #103786 doesn't touch — so the correlation nails the culprit, but the exact causal mechanism isn't proven by static analysis.


So, root-cause not yet clear, we can try increasing timeout form SYNC query, but it's 10s for quite a while so not sure.

@groeneai

Copy link
Copy Markdown
Contributor

@vdimir Thanks for the deep analysis. This test_parallel_quorum_actually_quorum FAIL is not from your virtual-row diff: it is a pre-existing flaky on the slow sanitizer integration shards (36 hits / 28 distinct PRs / 5 on master over the last 30 days, all on amd_asan_ubsan, db disk, old analyzer, e.g. master report). It is already fixed and up for review in #108433.

On your open question (just bump the 10s SYNC timeout): we found that alone is not enough. The deterministic part of the root cause is the ReplicatedMergeTree fetch backoff, not only the client deadline. While node2 is partitioned off node1/node3 on 9009, its GET_PART entries fail repeatedly; each failure bumps num_tries, and ReplicatedMergeTreeQueue::getPostponeTimeMsForEntry schedules the next try 1 << num_tries ms out (capped at max_postpone_time_for_failed_replicated_fetches_ms, default 60000). After the partition heals, SYSTEM SYNC REPLICA has to wait out that accumulated backoff, which on the slow shards crosses 10s. So #108433 does both:

  1. SETTINGS max_postpone_time_for_failed_replicated_fetches_ms = 0 on table q (removes the deterministic backoff stall in the fault window), and
  2. raises the final SYSTEM SYNC REPLICA q barrier 10s -> 60s (the value 13 other integration tests already use) for residual parallel-sanitizer slowness.

I confirmed (1) alone is insufficient: with only backoff=0 the Integration tests (..., flaky) check still failed under --dist=each ASAN load, which is what drove adding (2). Neither change weakens the gate: the assert_eq_with_retry(node2, "SELECT COUNT() FROM q", "3") that follows still fails a genuine "node2 never syncs" regression.

Your #103786 attribution matches our data: the burst starts 2026-06-13 (its pre-merge CI date) and clusters around the 06-24/25 merge window, and we have it on record as a cause of connection/fetch test flakiness. Your reused-interserver-connection hang (~26s on http_receive_timeout) and our fetch-backoff are complementary paths to the same >10s SYNC: the hung fetch is itself a failed GET_PART that feeds num_tries, and either way the 10->60 barrier in #108433 absorbs your observed 12-26s case.

The structural gap you raise (isStale() = socket().poll(SELECT_READ) cannot detect a DROP-blackholed peer, so the pool keeps reusing a dead interserver connection) is a real but larger change in the HTTPConnectionPool / #103786 area. I would leave the approach to you and the connection-pool owners rather than patch it speculatively; happy to help reproduce or test if you want it pursued separately.

@vdimir

vdimir commented Jun 26, 2026

Copy link
Copy Markdown
Member Author

@groeneai all CI failures have linked issue, but also check that that issue have any fixes in flight and if not prepare a fix

@groeneai

Copy link
Copy Markdown
Contributor

@vdimir Checked each current failure on the head build (982a1a03). Every one has a fix in flight; nothing new to prepare.

Failure (check) Linked issue Fix in flight
Block structure mismatch in UnionStep STID 0993-27f0 (AST fuzzer amd_debug) #108142 #107719 (ours, open). Plan-time UnionStep::updateOutputHeader rejects a const-folded branch vs a full sibling; this PR makes the header check tolerant the same way updatePipeline already is. Closes #106956; sparse facet was #107041 (merged).
TSan data race STID 4071-3348 + Server died (amd_tsan, s3) #107487 #108391 by serxa, merged 2026-06-25 23:49Z. BlockIO::onFinish released the MemoryReservation before pipeline finalize, racing executor threads. Verified: 0 master hits after the merge (7 in the prior 2 days). 982a1a03 predates it, so a rebase clears it.
Hung check ... possible deadlock (Stress arm_tsan) #107941 Chronic stress hung-check bucket (1042 hits / 464 PRs / 330 master in 30d), unrelated to the virtual-row diff. Tracked under our #105905 and #101680.
test_parallel_quorum_actually_quorum (Integration, asan db disk old analyzer) (per your 06-25 note) #108433 (ours, open) as discussed above.

Note #107487/#107941 are generic auto-match buckets keyed by the test_name string, so the precise owner is the STID/signature, not the issue text. Net: all linked, all covered. The only one in this set blocking a clean run on rebase is the UnionStep const-fold mismatch (#107719); the rest are either already merged (#108391) or pre-existing flaky unrelated to your change.

@vdimir

vdimir commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

@groeneai what is the issue with 02354_vector_search_rescoring in last CI run?

@groeneai

Copy link
Copy Markdown
Contributor

@vdimir Not related to this PR. 02354_vector_search_rescoring was a master regression, already reverted.

Base-skew between two PRs merged the same day, neither seeing the other in CI:

The guard was already on master when #105591 merged, so every config failed deterministically with Code 44 ... The _distance column is an internal virtual column ... cannot be referenced directly at useVectorSearch.cpp:237 (267/267 reruns, CI diagnosis "not a randomization issue").

Fix: revert #108812, merged 06-29 14:13 UTC. Master clean since (06-29: 46 fails; 06-30: 0/163 over the day). This PR head 83e5e7f2 last ran CI ~12:00 UTC 06-29, so it carries the #105591 merge but predates the revert. A rebase onto master clears it.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107053&sha=83e5e7f2177d4db99ccfb15b3ccb011aa046b98f&name_0=PR&name_1=Stateless%20tests%20%28amd_debug%2C%20parallel%29

@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.50% +0.10%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.70% +0.10%

Changed lines: Changed C/C++ lines covered: 39/39 (100.00%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Virtual row with union produces wrong result

3 participants