Fix flaky test_lost_part by disabling failed-fetch/merge backoff by groeneai · Pull Request #106984 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix flaky test_lost_part by disabling failed-fetch/merge backoff#106984

Merged
alexey-milovidov merged 1 commit into
ClickHouse:masterfrom
groeneai:fix-flaky-test-lost-part-backoff
Jun 11, 2026
Merged

Fix flaky test_lost_part by disabling failed-fetch/merge backoff#106984
alexey-milovidov merged 1 commit into
ClickHouse:masterfrom
groeneai:fix-flaky-test-lost-part-backoff

Conversation

@groeneai

Copy link
Copy Markdown
Contributor

test_lost_part_other_replica (and siblings test_lost_part_same_replica, test_lost_part_mutation) drain the replication queue with a tight 10x1s poll loop after lost-part recovery. Recovery produces queue entries that fail transiently: a GET_PART for the recovered part can hit PART_IS_TEMPORARILY_LOCKED while the outdated duplicate is cleaned up, and the dependent MERGE_PARTS hits NO_REPLICA_HAS_PART until the fetch lands. Each failure increments num_tries, and ReplicatedMergeTreeQueue::getPostponeTimeMsForEntry parks the entry for 1 << num_tries ms capped at max_postpone_time_for_failed_replicated_{fetches,merges}_ms (default 60000). After ~14 retries the backoff exceeds the 10s window, the queue is still non-empty, and the test fails with "Still have something in replication queue" (seen in CI with num_postponed=156, 29s backoff).

Fix: set max_postpone_time_for_failed_replicated_fetches_ms=0 and max_postpone_time_for_failed_replicated_merges_ms=0 in these tests, mirroring the existing max_postpone_time_for_failed_mutations_ms=0 in test_lost_part_mutation. Entries then retry immediately and the queue drains in time. test_lost_last_part is unchanged (it tolerates pending entries, 50s wait).

Reproduced deterministically with the replicated_queue_fail_next_entry failpoint: forcing num_tries>=14 leaves the queue non-empty after 10s without the setting and empty with it. Fixed suite: 12/12 (4 tests x 3 runs).

No related open issue found.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

test_lost_part_other_replica (and the sibling same_replica and mutation
cases) drain the replication queue with a tight loop of 10 one-second
polls after lost-part recovery. Recovery produces queue entries that fail
transiently: a GET_PART for the recovered part can hit
PART_IS_TEMPORARILY_LOCKED while the outdated duplicate is being cleaned
up, and the dependent MERGE_PARTS hits NO_REPLICA_HAS_PART until the
fetch lands. Each failure increments num_tries, and
ReplicatedMergeTreeQueue::getPostponeTimeMsForEntry parks the entry for
1 << num_tries ms, capped at max_postpone_time_for_failed_replicated_
{fetches,merges}_ms (default 60000). After ~14 retries the backoff
exceeds the 10s drain window, so the queue is still non-empty when the
loop ends and the test fails with "Still have something in replication
queue" (observed in CI with num_postponed=156 and a 29s backoff).

Disable the fetch and merge backoff in these tests by setting
max_postpone_time_for_failed_replicated_fetches_ms and
max_postpone_time_for_failed_replicated_merges_ms to 0, mirroring the
existing max_postpone_time_for_failed_mutations_ms=0 in
test_lost_part_mutation. Entries then retry immediately and the queue
drains within the window. test_lost_last_part is left unchanged: it
tolerates pending entries and waits up to 50s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @tavplubix could you review this? Test-only fix: test_lost_part drains the replication queue with a fixed 10x1s loop, but lost-part recovery produces GET_PART/MERGE_PARTS entries that fail transiently (PART_IS_TEMPORARILY_LOCKED, NO_REPLICA_HAS_PART) and get parked by exponential backoff (1<<num_tries, capped at 60s) past that window. Sets max_postpone_time_for_failed_replicated_{fetches,merges}_ms=0 in the three affected cases, mirroring the existing max_postpone_time_for_failed_mutations_ms=0.

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 10, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [d1bb3d1]

Summary:


AI Review

Summary

This PR is a test-only stabilization for test_lost_part: it disables failed replicated fetch and merge backoff in the three tests that require the replication queue to drain inside a fixed short window. The change matches the described failure mode and applies the settings to both replicas before the affected queue entries are produced. I found no actionable line-level issues.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 10, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.60% 84.60% +0.00%
Functions 92.30% 92.30% +0.00%
Branches 77.20% 77.30% +0.10%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 459 line(s), 27 function(s) across 130 file(s) · Details

Top files
  • src/Databases/DataLake/HiveCatalog.cpp: 45 line(s), 10 function(s)
  • src/Storages/StorageReplicatedMergeTree.cpp: 40 line(s)
  • src/Storages/WindowView/StorageWindowView.cpp: 22 line(s), 1 function(s)
  • src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.cpp: 15 line(s)
  • src/Common/ZooKeeper/ZooKeeper.cpp: 13 line(s)

Full report

@groeneai

Copy link
Copy Markdown
Contributor Author

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasonable.

@alexey-milovidov alexey-milovidov self-assigned this Jun 11, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 11, 2026
Merged via the queue into ClickHouse:master with commit 8bc677b Jun 11, 2026
166 checks passed
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants