Fix flaky test_lost_part by disabling failed-fetch/merge backoff #106984
test_lost_part_other_replica (and the sibling same_replica and mutation
cases) drain the replication queue with a tight loop of 10 one-second
polls after lost-part recovery. Recovery produces queue entries that fail
transiently: a GET_PART for the recovered part can hit
PART_IS_TEMPORARILY_LOCKED while the outdated duplicate is being cleaned
up, and the dependent MERGE_PARTS hits NO_REPLICA_HAS_PART until the
fetch lands. Each failure increments num_tries, and
ReplicatedMergeTreeQueue::getPostponeTimeMsForEntry parks the entry for
1 << num_tries ms, capped at max_postpone_time_for_failed_replicated_
{fetches,merges}_ms (default 60000). After ~14 retries the backoff
exceeds the 10s drain window, so the queue is still non-empty when the
loop ends and the test fails with "Still have something in replication
queue" (observed in CI with num_postponed=156 and a 29s backoff).
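The backoff arithmetic above can be checked with a small model. This is a minimal Python sketch, assuming the delay is exactly min(1 << num_tries, cap) as described. postpone_ms and first_try_exceeding are hypothetical names used for illustration, not ClickHouse functions, and the real getPostponeTimeMsForEntry may differ in detail.

```python
DEFAULT_CAP_MS = 60_000   # default max_postpone_time_for_failed_replicated_{fetches,merges}_ms
DRAIN_WINDOW_MS = 10_000  # the test's drain loop: 10 polls x 1 second


def postpone_ms(num_tries: int, cap_ms: int = DEFAULT_CAP_MS) -> int:
    """Simplified model: exponential backoff of 1 << num_tries ms, capped at cap_ms."""
    return min(1 << num_tries, cap_ms)


def first_try_exceeding(window_ms: int, cap_ms: int = DEFAULT_CAP_MS):
    """Smallest num_tries whose backoff exceeds window_ms, or None if the cap prevents it."""
    if cap_ms <= window_ms:
        return None  # the cap keeps every delay inside the window
    num_tries = 0
    while postpone_ms(num_tries, cap_ms) <= window_ms:
        num_tries += 1
    return num_tries
```

Under this model, first_try_exceeding(DRAIN_WINDOW_MS) returns 14 (a 16384 ms delay), which matches the ~14 retries quoted above. With cap_ms=0 every delay is 0, so no entry can outlast the window.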
Disable the fetch and merge backoff in these tests by setting
max_postpone_time_for_failed_replicated_fetches_ms and
max_postpone_time_for_failed_replicated_merges_ms to 0, mirroring the
existing max_postpone_time_for_failed_mutations_ms=0 in
test_lost_part_mutation. Entries then retry immediately and the queue
drains within the window. test_lost_last_part is left unchanged: it
tolerates pending entries and waits up to 50s.
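A rough Python sketch of how the two settings and the drain loop fit together. The setting names are taken from the description above; settings_clause and wait_queue_drained are hypothetical helpers, not the code in the test files, and queue_size is assumed to be any callable that returns the current number of queue entries (for example, one wrapping a count over system.replication_queue).

```python
import time

# Setting names come from the description above; 0 is the value the fix applies.
NO_BACKOFF_SETTINGS = {
    "max_postpone_time_for_failed_replicated_fetches_ms": 0,
    "max_postpone_time_for_failed_replicated_merges_ms": 0,
}


def settings_clause(settings: dict) -> str:
    """Render a dict as a SQL SETTINGS clause, e.g. 'SETTINGS a=0, b=0'."""
    return "SETTINGS " + ", ".join(f"{name}={value}" for name, value in settings.items())


def wait_queue_drained(queue_size, attempts: int = 10, delay_s: float = 1.0) -> bool:
    """Poll queue_size() up to `attempts` times; return True once the queue is empty."""
    for _ in range(attempts):
        if queue_size() == 0:
            return True
        time.sleep(delay_s)
    return False
```

The clause would be appended to a table definition in the test, and a False result from wait_queue_drained corresponds to the "Still have something in replication queue" failure.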
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cc @tavplubix could you review this? Test-only fix:
Workflow [PR], commit [d1bb3d1]
Summary: ✅
AI Review
Summary: This PR is a test-only stabilization for
Final Verdict
Status: ✅ Approve
LLVM Coverage Report
Changed lines: No C/C++ source files changed; skipping uncovered code analysis.
Newly covered by added/modified tests: 459 line(s), 27 function(s) across 130 file(s)

test_lost_part_other_replica (and siblings test_lost_part_same_replica, test_lost_part_mutation) drain the replication queue with a tight 10x1s poll loop after lost-part recovery. Recovery produces queue entries that fail transiently: a GET_PART for the recovered part can hit PART_IS_TEMPORARILY_LOCKED while the outdated duplicate is cleaned up, and the dependent MERGE_PARTS hits NO_REPLICA_HAS_PART until the fetch lands. Each failure increments num_tries, and ReplicatedMergeTreeQueue::getPostponeTimeMsForEntry parks the entry for 1 << num_tries ms capped at max_postpone_time_for_failed_replicated_{fetches,merges}_ms (default 60000). After ~14 retries the backoff exceeds the 10s window, the queue is still non-empty, and the test fails with "Still have something in replication queue" (seen in CI with num_postponed=156, 29s backoff).

Fix: set max_postpone_time_for_failed_replicated_fetches_ms=0 and max_postpone_time_for_failed_replicated_merges_ms=0 in these tests, mirroring the existing max_postpone_time_for_failed_mutations_ms=0 in test_lost_part_mutation. Entries then retry immediately and the queue drains in time. test_lost_last_part is unchanged (it tolerates pending entries, 50s wait).

Reproduced deterministically with the replicated_queue_fail_next_entry failpoint: forcing num_tries>=14 leaves the queue non-empty after 10s without the setting and empty with it. Fixed suite: 12/12 (4 tests x 3 runs).

No related open issue found.
Changelog category (leave one):