Fix PartLoadingTree to correctly identify the rolled-back part when intersecting parts have transaction metadata#100992
Conversation
c1fb07d to
f88571b
Compare
83ccc10 to
d3af6b4
Compare
d3af6b4 to
e1b5cbb
Compare
7e69815 to
664d85c
Compare
1e9b03c to
2bdfe94
Compare
2bdfe94 to
20bdf6b
Compare
`PartLoadingTree::add` builds a containment tree of MergeTree parts during `loadDataParts`. Parts are sorted by `(level, mutation)` descending so that covering parts are inserted before the parts they contain. When a rolled-back part in the tree intersects an incoming committed part two bugs were present: **Bug 1 — wrong skip direction.** The old code (`has_transaction_metadata`) checked whether either part has a `txn_version.txt` file and unconditionally skipped the *incoming* part. When the already-resident part was the rolled-back one, the committed incoming part was silently discarded instead of replacing it. **Bug 2 — orphans dropped after eviction.** When the rolled-back resident is evicted, its children are orphaned. They must be re-inserted via `add()` — together with the incoming committed part — sorted by `(level, mutation)` descending so that the incoming part (typically higher level) is placed into the tree *before* any orphan that belongs inside it. Without this ordering an orphan at sibling level causes a spurious `LOGICAL_ERROR` when the incoming part is re-inserted. The crash observed in the stress test CI was caused by a third issue: the old helper (`has_transaction_metadata`) returned `bool` — a plain file-existence check. It could not distinguish between a definitively rolled-back part, an in-flight transaction (`Tx::UnknownCSN`), and a read error (e.g. an empty file read between shell `truncate` and `write`). When neither part was definitively rolled-back, the code fell through to `LOGICAL_ERROR`, which aborts the server in sanitizer builds. CI: ClickHouse#100992 Replace the `bool`-returning helper with a 5-state `enum class RollbackStatus` and rename the lambda to `read_txn_status`: `RolledBack` — `creation_csn == Tx::RolledBackCSN` on disk `Committed` — `creation_csn` is a known committed CSN on disk `NoMetadata` — `txn_version.txt` absent (prehistoric/non-transactional part) `UnknownCSN` — file present, `creation_csn == Tx::UnknownCSN` (in-flight) `Unreadable` — file present but could not be parsed (corruption/partial write) Intersection decision table (applied to both prev/next branches): `Unreadable` on either side → `CORRUPTED_DATA` (not a code bug) incoming `RolledBack` → skip incoming existing `RolledBack` → evict existing; collect orphans + incoming sorted by `(level, mutation)` desc; re-add all via `add()`; `return` either `UnknownCSN` → skip incoming (`loadDataPart` resolves later) both `NoMetadata` or `Committed` → `LOGICAL_ERROR` (genuine bug / manual edit) The `traverse` template body is moved inline into the header and `PartLoadingTree` is made `public` so external tests can exercise it directly. Add `tests/queries/0_stateless/04063_part_loading_tree_rollback_reparent.sh`. The test manually constructs four part directories: `all_1_2_2_1` (1-2, L2, M1) — rolled-back parent; `txn_version.txt` holds `Tx::RolledBackCSN`; `creation_tid` uses `start_csn=2` (not `Tx::PrehistoricCSN`) so `wasInvolvedInTransaction` returns true `all_1_1_1_1` (1-1, L1, M1) — committed child; contained in 1-2 → inserted as a child of `all_1_2_2_1` during build `all_2_3_1_0` (2-3, L1, M0) — committed part; intersects `all_1_2_2_1`, triggering the eviction and re-parenting path After `ATTACH TABLE`: `all_1_1_1_1` must be active (orphan successfully re-parented) and `all_1_2_2_1` must not be active (rolled-back part discarded). `txn_version.txt` is written atomically (`printf` to `.tmp` then `mv`) to eliminate a stress-test race where the server reads the file in the window between shell truncation and the actual write, producing an empty file and a parse error (`Unreadable` state → `CORRUPTED_DATA` instead of server abort). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
20bdf6b to
f2a0273
Compare
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
`PartLoadingTree::add` builds a containment tree of MergeTree parts during `loadDataParts`. Parts are sorted by `(level, mutation)` descending so that covering parts are inserted before the parts they contain. When a rolled-back part in the tree intersects an incoming committed part two bugs were present: **Bug 1 — wrong skip direction.** The old code (`has_transaction_metadata`) checked whether either part has a `txn_version.txt` file and unconditionally skipped the *incoming* part. When the already-resident part was the rolled-back one, the committed incoming part was silently discarded instead of replacing it. **Bug 2 — orphans dropped after eviction.** When the rolled-back resident is evicted, its children are orphaned. They must be re-inserted via `add()` — together with the incoming committed part — sorted by `(level, mutation)` descending so that the incoming part (typically higher level) is placed into the tree *before* any orphan that belongs inside it. Without this ordering an orphan at sibling level causes a spurious `LOGICAL_ERROR` when the incoming part is re-inserted. The crash observed in the stress test CI was caused by a third issue: the old helper (`has_transaction_metadata`) returned `bool` — a plain file-existence check. It could not distinguish between a definitively rolled-back part, an in-flight transaction (`Tx::UnknownCSN`), and a read error (e.g. an empty file read between shell `truncate` and `write`). When neither part was definitively rolled-back, the code fell through to `LOGICAL_ERROR`, which aborts the server in sanitizer builds. CI: ClickHouse#100992 Replace the `bool`-returning helper with a 5-state `enum class RollbackStatus` and rename the lambda to `read_txn_status`: `RolledBack` — `creation_csn == Tx::RolledBackCSN` on disk `Committed` — `creation_csn` is a known committed CSN on disk `NoMetadata` — `txn_version.txt` absent (prehistoric/non-transactional part) `UnknownCSN` — file present, `creation_csn == Tx::UnknownCSN` (in-flight) `Unreadable` — file present but could not be parsed (corruption/partial write) Intersection decision table (applied to both prev/next branches): `Unreadable` on either side → `CORRUPTED_DATA` (not a code bug) incoming `RolledBack` → skip incoming existing `RolledBack` → evict existing; collect orphans + incoming sorted by `(level, mutation)` desc; re-add all via `add()`; `return` either `UnknownCSN` → skip incoming (`loadDataPart` resolves later) both `NoMetadata` or `Committed` → `LOGICAL_ERROR` (genuine bug / manual edit) The `traverse` template body is moved inline into the header and `PartLoadingTree` is made `public` so external tests can exercise it directly. Add `tests/queries/0_stateless/04063_part_loading_tree_rollback_reparent.sh`. The test manually constructs four part directories: `all_1_2_2_1` (1-2, L2, M1) — rolled-back parent; `txn_version.txt` holds `Tx::RolledBackCSN`; `creation_tid` uses `start_csn=2` (not `Tx::PrehistoricCSN`) so `wasInvolvedInTransaction` returns true `all_1_1_1_1` (1-1, L1, M1) — committed child; contained in 1-2 → inserted as a child of `all_1_2_2_1` during build `all_2_3_1_0` (2-3, L1, M0) — committed part; intersects `all_1_2_2_1`, triggering the eviction and re-parenting path After `ATTACH TABLE`: `all_1_1_1_1` must be active (orphan successfully re-parented) and `all_1_2_2_1` must not be active (rolled-back part discarded). `txn_version.txt` is written atomically (`printf` to `.tmp` then `mv`) to eliminate a stress-test race where the server reads the file in the window between shell truncation and the actual write, producing an empty file and a parse error (`Unreadable` state → `CORRUPTED_DATA` instead of server abort). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nfo` format The test writes `txn_version.txt` directly to mark `all_1_2_2_1` as rolled back. Master switched to `VersionInfo::readFromMultiLineBuffer`, which has a silent fallback for the old format that overrides `creation_csn` with `Tx::NonTransactionalCSN`. As a result the test's `creation_csn: 18446744073709551615` (`Tx::RolledBackCSN`) was ignored, `read_txn_status` returned `Committed` for both intersecting parts, and `ATTACH TABLE` threw `LOGICAL_ERROR` again. Write the full new format including `storing_version`, `removal_tid` and `removal_csn` so the rolled-back state is detected. Also make the test stricter: - `ATTACH TABLE` exit code is now checked (previously stderr was suppressed and a thrown exception was silently ignored). - The presence-of-active checks for `all_1_1_1_1`, `all_1_2_2_1` and `all_2_3_1_0` are now explicit `count` comparisons that exit with a clear `FAIL:` message when they don't match. CI report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/100992/c7f836579347baebec2d4a153b9c4a540d236ac0/fast_test/
c7f8365 to
4ff63d0
Compare
…loss Addresses ClickHouse#100992 (comment) Previously, when an intersecting tree-resident part had `creation_csn = Tx::UnknownCSN` on disk, the incoming committed peer was dropped from the loading tree. The plan was that `loadDataPart` would resolve the CSN afterwards. But `loadDataPart` only runs on the top-level nodes of the loading tree, and the dropped peer was never re-introduced — so if the in-flight transaction turned out to be rolled back, the committed peer was permanently absent for that startup (data only became visible after a second restart, once `Tx::RolledBackCSN` was persisted to disk). `read_txn_status` now consults `TransactionLog` when the on-disk CSN is unresolved, mirroring the logic of `VersionMetadata::tryGetCSN`: if the tid has no CSN and no running transaction, it's treated as rolled back. The `UnknownCSN` `RollbackStatus` arm now only fires for transactions that are genuinely still running (an unusual case during `loadDataParts`), and the rolled-back arm correctly evicts the dead part and keeps the committed peer.
…rtLoadingTree::add` Addresses ClickHouse#100992 (comment) A committed part can be silently buried under a rolled-back ancestor when the ancestor contains it (rather than just intersects). The previous code only checked rollback status in the `isDisjoint == false` (intersection) branch, so when an incoming committed part descended through the `contains` branch into a rolled-back ancestor it became a "covered" child and was loaded as `Outdated` instead of active — invisible to queries until the next restart resolved the ancestor's `creation_csn` to `Tx::RolledBackCSN` on disk. Add the same eviction check in both `contains` branches (`prev` and `next`). On match, the helper `evict_rolled_back_and_reinsert` collects the rolled-back node's orphaned descendants, sorts them with the incoming part by `(level, mutation)` descending, and re-inserts the lot — so any new top-level ancestor is added before its descendants and re-parenting happens cleanly. Also extract the inline eviction code from the two intersection branches into the same helper, removing the duplicated logic. New regression test `04241_part_loading_tree_rollback_contains` exercises the `R-contains-{C, N}` topology that triggered a `LOGICAL_ERROR` (or silent data loss, depending on the levels) before this fix.
…NFO` `02497_source_part_is_intact_when_mutation` failed in Fast test (https://s3.amazonaws.com/clickhouse-test-reports/PRs/100992/7587b5c5101032dfd1b873e780b42e1230af45d1/fast_test/) because the test's `stderr` check tripped on `LOG_WARNING` messages emitted when `DETACH` + `ATTACH` reloads the rolled-back mutation parts written by the test. With the eviction now firing in the `contains` branches (added in the previous commit), this happens for every rolled-back mutation chain — which is routine cleanup, not a warning condition. Also rewrite the eviction message to drop the misleading "committed part Y will be used instead" clause, since the incoming part can itself be rolled back (and immediately re-evicted by the recursive `add` call).
Addresses https://github.com/ClickHouse/ClickHouse/actions/runs/26031517236 (`test_transactions/test.py::test_rollback_unfinished_on_restart{1,2}` fail with `len(res) == 5` because the rolled-back mutation parts are not in `system.parts`) and ClickHouse#100992 (comment) (bot flags an `UnknownCSN` race in the same branch). The prior commit (`7587b5c5101`) added a rollback check in the `contains` branches and evicted the ancestor when found rolled-back. But the rest of the load path already handles this — `MergeTreeData::loadDataPartsFromDisk` detects that a top-level part was demoted to `Outdated` (e.g. by `loadDataPart`'s `creation_csn == Tx::RolledBackCSN` check) and pushes its children onto `parts_to_load` to be loaded as `Active`. So the eviction in `PartLoadingTree::add` was redundant, and worse it removed the rolled-back parts from `system.parts` entirely, which the integration test relies on. This commit keeps: - the intersection-branch eviction (the original PR's fix — there the promotion mechanism does not apply because the committed peer is a sibling, not a child, of the rolled-back tree-resident); - the eager `UnknownCSN` resolution in `read_txn_status`; - the helper `evict_rolled_back_and_reinsert` (still used by both intersection branches); - `04241_part_loading_tree_rollback_contains` — the test now passes via the existing promotion mechanism rather than tree-level eviction, but the assertion remains the same (committed children are active).
…nfo` format The rolled-back ancestor's `creation_tid` was `(2, 1, ...)`, but `local_tid=1` is `Tx::NonTransactionalLocalTID`, so `isNonTransactional()` returned true and `wasInvolvedInTransaction()` returned false. As a result `loadDataPart` skipped the `creation_csn == Tx::RolledBackCSN` check at `src/Storages/MergeTree/MergeTreeData.cpp` line 2227, the ancestor stayed `Active`, and the line 2374-2381 promotion never fired — so the committed descendants remained `Outdated` and the test's `check_active_count` assertions failed in CI Fast test: https://s3.amazonaws.com/clickhouse-test-reports/PRs/100992/a59409363103147b8348d73ae8d7a919ef6e67e8/fast_test/ Use `local_tid=33` (above `Tx::MaxReservedLocalTID=32`) so the TID looks transactional. Also add a comment explaining the constraint so the next person who copies this idiom doesn't repeat the mistake. `04063` happens to pass with the same bad TID because there the intersection-branch eviction uses `read_txn_status` (reads `creation_csn` directly from disk) and never goes through `wasInvolvedInTransaction`.
|
📊 Cloud Performance Report
The change here only rewrites how the server decides which on-disk parts to load at startup/ATTACH when rolled-back transaction metadata is present — code that runs once during part loading and never during query execution. The 12 flagged ClickBench regressions are all read queries that don't touch that path, and they shift in the same direction by ~11-49% across the board, which (with only 5 source runs vs 30 base) reads as a slower/noisier measurement environment rather than a real effect of this change. Nine off-path queries are downgraded to not-sure. Q16 (+27%), Q31 (+25%), and Q32 (+49%) have deltas too large to dismiss automatically and are kept for a human to confirm against the run environment, but no part of this diff can plausibly explain them. clickbench🔴 3 regressed · Flagged queries (12 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🟢 No significant changes Debug info
|
…ee-rollback-detection
…ee-rollback-detection
…gTree::add When an incoming committed (or non-transactional) part intersects a tree-resident part whose creating transaction is still in flight (`UnknownCSN`), `PartLoadingTree::add` previously skipped the incoming part. If that in-flight transaction later rolled back, the resident part became `Outdated` while the committed peer had already been discarded for this load pass, losing committed data. Two intersecting parts can never both commit, so an in-flight part that intersects an already-committed peer is bound to roll back. Evict the in-flight resident and keep the committed incoming part instead of discarding it. The symmetric case (incoming in flight, resident committed) and the both-in-flight case still skip the incoming part, which is safe. The eviction helper is renamed `evict_and_reinsert` and takes a reason string so the log message stays accurate for both eviction causes. This case is not reachable during normal part loading (no transaction can be creating parts in a table that is still loading its parts, so the eager `TransactionLog` resolution turns every on-disk `UnknownCSN` into `RolledBack`/`Committed`), but the branch now fails safe toward keeping committed data. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The rolled-back fixture wrote `creation_tid: (2, 1, ...)`, but `local_tid = 1` is `Tx::NonTransactionalLocalTID`. With `start_csn = 2` this violates the `TransactionID` invariant `(local_tid == NonTransactionalLocalTID) == (start_csn == NonTransactionalCSN)`, which trips a `chassert` in `TransactionID::isNonTransactional` in debug/sanitizer builds, and also makes `wasInvolvedInTransaction` treat the part as non-transactional. The test only passed because the part is evicted before that path runs. Use a transactional `local_tid` outside the reserved range (33), matching test 04241, so the fixture is well-formed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When `PartLoadingTree::add` evicts a rolled-back part that intersects an incoming committed part, it re-inserts the orphaned descendants and the incoming part sorted by (level, mutation) descending. The new test covers the case the sort exists for: one re-inserted committed part contains another. Because `MergeTreePartInfo::contains` requires a strictly higher level for a wider block range, the container is always re-added before the part it contains, so the contained part descends into it instead of hitting the generic intersection branch (which does not handle `incoming.contains(existing)`) and throwing a `LOGICAL_ERROR`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ee-rollback-detection
| return RollbackStatus::NoMetadata; | ||
|
|
||
| auto version_path = fs::path(relative_data_path) / part_name / VersionMetadata::TXN_VERSION_METADATA_FILE_NAME; | ||
| if (!part_disk->existsFile(version_path)) |
There was a problem hiding this comment.
VersionMetadataOnDisk::loadMetadata already treats a lone txn_version.txt.tmp as a rolled-back part (the interrupted-rename / lost-fsync cases), but this fast path only checks for the final txn_version.txt and returns NoMetadata otherwise.
That means a tmp-only transactional leftover is misclassified as a non-transactional part during tree construction. If it intersects a committed peer on startup or ATTACH, we still fall through to the generic LOGICAL_ERROR exception path instead of evicting the rolled-back leftover, so the interrupted-write state handled later by VersionMetadataOnDisk can still fail loading here.
Could we mirror the .tmp handling from VersionMetadataOnDisk::loadMetadata (or share a helper) so tmp-only metadata is treated as rolled back?
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 56/123 (45.53%) · Uncovered code |

During
loadDataParts,PartLoadingTree::addbuilds a containment tree of on-disk parts sorted by(level, mutation)descending. When a rolled-back part already in the tree intersects an incoming committed part, the previous implementation had two bugs:Bug 1 — wrong skip direction. The old code (
has_transaction_metadata) checked whether either intersecting part has atxn_version.txtfile and unconditionally skipped the incoming part. When the tree-resident part was the rolled-back one, the committed incoming part was silently discarded instead of replacing it.Bug 2 — orphans dropped after eviction. When the rolled-back resident is evicted, its children become orphaned. They must be re-inserted via
add()— together with the incoming committed part — sorted by(level, mutation)descending so that the incoming part is placed into the tree before any orphan that belongs inside it. Without this ordering an orphan sitting at sibling level causes a spuriousLOGICAL_ERRORwhen the incoming part is re-inserted.The
LOGICAL_ERRORthat caused the fatal in the stress test CI was triggered because the old helper (has_transaction_metadata) returnedbool— a plain file-existence check. It could not distinguish between a definitively rolled-back part, an in-flight transaction (Tx::UnknownCSN), and a read error (e.g. an empty file read between shelltruncateandwrite). When neither part was definitively rolled-back the code fell through toLOGICAL_ERROR, which aborts the server in sanitizer builds. The newenum class RollbackStatuswith five states (RolledBack,Committed,NoMetadata,UnknownCSN,Unreadable) lets each case be handled precisely.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a server abort (in sanitizer builds) and silent data loss during table startup when a rolled-back transactional MergeTree part intersects a committed part.
PartLoadingTree::addnow correctly identifies which of the two intersecting parts was rolled back, evicts it, and re-parents its orphaned children under the committed part in the correct order.Documentation entry for user-facing changes