groeneai · 2026-06-10T15:00:23Z

test_lost_part_other_replica (and siblings test_lost_part_same_replica, test_lost_part_mutation) drain the replication queue with a tight 10x1s poll loop after lost-part recovery. Recovery produces queue entries that fail transiently: a GET_PART for the recovered part can hit PART_IS_TEMPORARILY_LOCKED while the outdated duplicate is cleaned up, and the dependent MERGE_PARTS hits NO_REPLICA_HAS_PART until the fetch lands. Each failure increments num_tries, and ReplicatedMergeTreeQueue::getPostponeTimeMsForEntry parks the entry for 1 << num_tries ms capped at max_postpone_time_for_failed_replicated_{fetches,merges}_ms (default 60000). After ~14 retries the backoff exceeds the 10s window, the queue is still non-empty, and the test fails with "Still have something in replication queue" (seen in CI with num_postponed=156, 29s backoff).
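
The backoff arithmetic above can be sketched in a few lines. This is a minimal Python model of the rule as the description states it (`1 << num_tries` ms, capped), not the C++ of `getPostponeTimeMsForEntry`; the helper names `postpone_ms` and `first_try_exceeding` are hypothetical and exist only for illustration.

```python
DEFAULT_CAP_MS = 60_000   # default cap quoted in the description
DRAIN_WINDOW_MS = 10_000  # 10 polls x 1 s in the test


def postpone_ms(num_tries: int, cap_ms: int = DEFAULT_CAP_MS) -> int:
    """Model of the postpone time: 1 << num_tries ms, capped at cap_ms."""
    return min(1 << num_tries, cap_ms)


def first_try_exceeding(window_ms: int, cap_ms: int = DEFAULT_CAP_MS):
    """Smallest num_tries whose postpone time exceeds window_ms, or None."""
    if cap_ms <= window_ms:
        return None  # the cap keeps every postpone inside the window
    num_tries = 0
    while postpone_ms(num_tries, cap_ms) <= window_ms:
        num_tries += 1
    return num_tries


if __name__ == "__main__":
    print(postpone_ms(13))                       # 8192 ms, inside the 10 s window
    print(postpone_ms(14))                       # 16384 ms, past the window
    print(first_try_exceeding(DRAIN_WINDOW_MS))  # 14
    print(postpone_ms(14, cap_ms=0))             # 0: the entry is not parked
```

Under this model the 14th failure is the first one that parks an entry for longer than the whole drain window, which is consistent with the "~14 retries" figure above.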

Fix: set max_postpone_time_for_failed_replicated_fetches_ms=0 and max_postpone_time_for_failed_replicated_merges_ms=0 in these tests, mirroring the existing max_postpone_time_for_failed_mutations_ms=0 in test_lost_part_mutation. Entries then retry immediately and the queue drains in time. test_lost_last_part is unchanged (it tolerates pending entries, 50s wait).
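
As an illustration only, the two settings could be attached to a test table roughly like this. The sketch assumes they are accepted as table-level settings in `CREATE TABLE ... SETTINGS`, in the same way as the mutation setting they mirror; the table schema, ZooKeeper path, and the helpers `settings_clause` and `create_table_sql` are hypothetical and are not taken from the PR diff.

```python
# Settings named in the PR description; values are milliseconds.
NO_BACKOFF_SETTINGS = {
    "max_postpone_time_for_failed_replicated_fetches_ms": 0,
    "max_postpone_time_for_failed_replicated_merges_ms": 0,
}


def settings_clause(settings: dict[str, int]) -> str:
    """Render a clause such as 'SETTINGS a = 0, b = 0'."""
    pairs = ", ".join(f"{name} = {value}" for name, value in settings.items())
    return "SETTINGS " + pairs


def create_table_sql(table: str, replica: str) -> str:
    """Build a CREATE TABLE statement for one replica (hypothetical schema)."""
    return (
        f"CREATE TABLE {table} (id UInt64) "
        f"ENGINE = ReplicatedMergeTree('/clickhouse/tables/{table}', '{replica}') "
        "ORDER BY tuple() "
        + settings_clause(NO_BACKOFF_SETTINGS)
    )


if __name__ == "__main__":
    # The same statement would be issued on every replica, with its own replica name.
    for replica in ("r1", "r2"):
        print(create_table_sql("mt", replica))
```

Keeping the settings in one dictionary means each replica gets an identical clause, so no replica is left with the default 60000 ms cap.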

Reproduced deterministically with the replicated_queue_fail_next_entry failpoint: forcing num_tries>=14 leaves the queue non-empty after 10s without the setting and empty with it. Fixed suite: 12/12 (4 tests x 3 runs).

No related open issue found.

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

test_lost_part_other_replica (and the sibling same_replica and mutation cases) drain the replication queue with a tight loop of 10 one-second polls after lost-part recovery. Recovery produces queue entries that fail transiently: a GET_PART for the recovered part can hit PART_IS_TEMPORARILY_LOCKED while the outdated duplicate is being cleaned up, and the dependent MERGE_PARTS hits NO_REPLICA_HAS_PART until the fetch lands.

Each failure increments num_tries, and ReplicatedMergeTreeQueue::getPostponeTimeMsForEntry parks the entry for 1 << num_tries ms, capped at max_postpone_time_for_failed_replicated_{fetches,merges}_ms (default 60000). After ~14 retries the backoff exceeds the 10s drain window, so the queue is still non-empty when the loop ends and the test fails with "Still have something in replication queue" (observed in CI with num_postponed=156 and a 29s backoff).

Disable the fetch and merge backoff in these tests by setting max_postpone_time_for_failed_replicated_fetches_ms and max_postpone_time_for_failed_replicated_merges_ms to 0, mirroring the existing max_postpone_time_for_failed_mutations_ms=0 in test_lost_part_mutation. Entries then retry immediately and the queue drains within the window. test_lost_last_part is left unchanged: it tolerates pending entries and waits up to 50s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-10T15:00:41Z

Pre-PR validation gate

| # | Question | Answer |
|---|----------|--------|
| a | Deterministic repro? | Yes. Using the `replicated_queue_fail_next_entry` failpoint (same tool as `02943_rmt_alter_metadata_merge_checksum_mismatch`), I forced a queue entry to fail until `num_tries>=14` (backoff `1<<14`=16.4s). With default settings the replication queue is still non-empty after the test's 10x1s drain window; with the postpone caps set to 0 it drains. This matches the CI failure (`num_postponed=156`, backoff 29383ms, `assert False` at test.py:160). |
| b | Root cause explained? | Yes. Lost-part recovery produces transiently-failing entries (`GET_PART` -> `PART_IS_TEMPORARILY_LOCKED` on the just-recovered outdated dup; dependent `MERGE_PARTS` -> `NO_REPLICA_HAS_PART`). `getPostponeTimeMsForEntry` parks each failed entry for `1<<num_tries` ms capped at `max_postpone_time_for_failed_replicated_{fetches,merges}_ms` (default 60000). Once backoff exceeds the test's fixed 10s drain loop, the queue is non-empty when the loop ends -> failure. |
| c | Fix matches root cause? | Yes. The fix disables exactly the two backoff caps that govern these entry types (`...replicated_fetches_ms=0`, `...replicated_merges_ms=0`), so failed entries retry immediately instead of being parked past the drain window. It does not widen any assertion, reduce data, or add a blanket `no-random-*` tag. |
| d | Test intent preserved / new tests added? | Yes. The tests still verify lost-part recovery (empty-part creation, row counts, queue eventually empties). Only the per-entry retry cadence changes; nothing about what the test asserts is weakened. No new test needed (this is a test-only timing fix, no engine change). |
| e | Both directions demonstrated? | Yes. Failpoint demo: queue NOT drained in 10s without the setting (bug reproduced), drained in 10s with the setting (fix confirmed). Fixed suite: 12/12 passes (test_lost_part_other_replica / same_replica / mutation / last_part, 3 runs each) on `Integration tests (amd_tsan, 1/6)`. |
| f | Fix is general, not a narrow patch? | Yes. The same tight 10x1s drain-after-recovery pattern exists in `test_lost_part_same_replica`, `test_lost_part_other_replica`, and `test_lost_part_mutation` (which already disabled the mutation backoff but not fetch/merge), so the caps are applied to all three. `test_lost_last_part` uses a different, tolerant structure (50s wait, `<=2` assertion) and is intentionally left unchanged. This is a test-only fix; no engine code path is involved. |
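
The "both directions" check in row e can be illustrated with a small simulation. This is a simplified model rather than the test code: it assumes a single entry parked at time 0 that succeeds on its next attempt, and a loop that sleeps one second before each check. The function name `drains_within_window` is hypothetical.

```python
def drains_within_window(
    num_tries: int,
    cap_ms: int,
    polls: int = 10,
    poll_interval_ms: int = 1000,
) -> bool:
    """Return True if a parked entry becomes runnable before the polls run out.

    Models one failed entry parked at t=0 for min(1 << num_tries, cap_ms) ms
    and assumes it succeeds as soon as it is retried.
    """
    ready_at_ms = min(1 << num_tries, cap_ms)
    for poll in range(1, polls + 1):
        now_ms = poll * poll_interval_ms  # sleep one interval, then check
        if now_ms >= ready_at_ms:
            return True
    return False


if __name__ == "__main__":
    # Default cap: an entry with 14 failures stays parked past the 10 s window.
    print(drains_within_window(num_tries=14, cap_ms=60_000))  # False
    # Cap set to 0: the entry is never parked, so the first poll sees it done.
    print(drains_within_window(num_tries=14, cap_ms=0))       # True
```

The model ignores how long the retry itself takes, so it only shows why the postpone cap decides the outcome, not the exact timing seen in CI.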

Session id: cron:clickhouse-worker-slot-7:20260610-141800

groeneai · 2026-06-10T15:00:58Z

cc @tavplubix could you review this? Test-only fix: test_lost_part drains the replication queue with a fixed 10x1s loop, but lost-part recovery produces GET_PART/MERGE_PARTS entries that fail transiently (PART_IS_TEMPORARILY_LOCKED, NO_REPLICA_HAS_PART) and get parked by exponential backoff (1<<num_tries, capped at 60s) past that window. Sets max_postpone_time_for_failed_replicated_{fetches,merges}_ms=0 in the three affected cases, mirroring the existing max_postpone_time_for_failed_mutations_ms=0.

clickhouse-gh · 2026-06-10T16:59:19Z

Workflow [PR], commit [d1bb3d1]

Summary: ✅

AI Review

Summary

This PR is a test-only stabilization for test_lost_part: it disables failed replicated fetch and merge backoff in the three tests that require the replication queue to drain inside a fixed short window. The change matches the described failure mode and applies the settings to both replicas before the affected queue entries are produced. I found no actionable line-level issues.

Final Verdict

Status: ✅ Approve

clickhouse-gh · 2026-06-10T19:26:53Z

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|--------|----------|---------|---|
| Lines | 84.60% | 84.60% | +0.00% |
| Functions | 92.30% | 92.30% | +0.00% |
| Branches | 77.20% | 77.30% | +0.10% |

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 459 line(s), 27 function(s) across 130 file(s) · Details

Top files

src/Databases/DataLake/HiveCatalog.cpp: 45 line(s), 10 function(s)
src/Storages/StorageReplicatedMergeTree.cpp: 40 line(s)
src/Storages/WindowView/StorageWindowView.cpp: 22 line(s), 1 function(s)
src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.cpp: 15 line(s)
src/Common/ZooKeeper/ZooKeeper.cpp: 13 line(s)

groeneai · 2026-06-10T20:50:23Z

alexey-milovidov

Reasonable.

groeneai mentioned this pull request Jun 10, 2026

Fix exception when Iceberg format version is upgraded by external tool #100407

Merged

1 task

nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 10, 2026

clickhouse-gh Bot added the pr-ci label Jun 10, 2026

alexey-milovidov reviewed Jun 11, 2026

alexey-milovidov self-assigned this Jun 11, 2026

alexey-milovidov added this pull request to the merge queue Jun 11, 2026

Merged via the queue into ClickHouse:master with commit 8bc677b Jun 11, 2026
166 checks passed

robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 11, 2026

Fix flaky test_lost_part by disabling failed-fetch/merge backoff #106984

alexey-milovidov merged 1 commit into ClickHouse:master from groeneai:fix-flaky-test-lost-part-backoff

4 participants