Retry integration-test image pre-pull with exponential backoff by groeneai · Pull Request #108975 · ClickHouse/ClickHouse · GitHub

Retry integration-test image pre-pull with exponential backoff#108975

Closed
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/ci-prefetch-images-retry-backoff
Conversation

@groeneai

Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Integration test jobs pre-pull all Docker images for the batch before tests start (ci/jobs/scripts/prefetch-integration-test-images). When the registry returns a transient error for even one image, the whole job fails at the pre-pull stage with the message "Failed to pre-pull Docker images needed by the test batch" and produces no per-test result, so an unrelated PR's check goes red.

Example: PR #108550, Integration tests (arm_binary, distributed plan, 4/4), CI report. clickhouse/test-mysql57 and clickhouse/kerberos-kdc each returned "received unexpected HTTP status: 500 Internal Server Error" on all 3 attempts within a ~13s window, while ~18 other images pulled fine. This pattern recurs across many unrelated PRs (e.g. a registry outage on 2026-06-15 produced ~295 such job-level failures across 22 PRs in one day).

The previous retry policy was 3 attempts with a fixed 5s sleep. That gives a ~13s window, which is too narrow to ride out a registry hiccup, and because all images pull in parallel, the retries hit the registry in lockstep.
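The arithmetic behind the ~13s window and the lockstep effect can be sketched as follows. This is an illustration only: the function name is hypothetical, and the 1s duration of a failed pull is an assumption chosen because it reproduces the ~13s figure quoted above.

```python
def fixed_retry_schedule(attempts: int, sleep_s: float, pull_s: float) -> list[float]:
    """Return the start time (in seconds) of each attempt under a fixed-sleep policy."""
    starts = []
    t = 0.0
    for _ in range(attempts):
        starts.append(t)
        t += pull_s + sleep_s  # the pull fails after pull_s, then we sleep
    return starts


ASSUMED_PULL_S = 1.0  # assumption: a failing pull takes about 1s to return

# Old policy: 3 attempts with a fixed 5s sleep between them.
starts = fixed_retry_schedule(attempts=3, sleep_s=5.0, pull_s=ASSUMED_PULL_S)
window = starts[-1] + ASSUMED_PULL_S  # moment the last attempt gives up

# Two images pulled in parallel follow the same schedule, so their
# retries reach the registry at the same instants (lockstep).
other_image_starts = fixed_retry_schedule(attempts=3, sleep_s=5.0, pull_s=ASSUMED_PULL_S)
in_lockstep = starts == other_image_starts
```

Under these assumptions the attempts start at 0s, 6s, and 12s, and the last one fails at 13s. Any registry disruption longer than that fails the job.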

This change:

  • raises the default retries to 5 (matching the buildx retry count introduced in "Retry docker buildx on transient docker.io registry errors", #108969),
  • replaces the fixed sleep with exponential backoff (PULL_BACKOFF_BASE doubled each retry, capped at PULL_BACKOFF_MAX) plus a small random jitter that de-synchronizes the parallel retries.
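A minimal sketch of the delay computation described in the list above, written in Python rather than the shell of the actual script. The numeric defaults below are hypothetical: this page names PULL_BACKOFF_BASE and PULL_BACKOFF_MAX but does not show their values, and MAX_JITTER and the function name are illustrative. Adding the jitter after the cap is also an assumption.

```python
import random

# Hypothetical values; the real defaults are not shown in this PR page.
PULL_BACKOFF_BASE = 5.0
PULL_BACKOFF_MAX = 30.0
MAX_JITTER = 3.0  # assumed upper bound for the "small random jitter"


def backoff_delay(retry: int, rng: random.Random,
                  base: float = PULL_BACKOFF_BASE,
                  cap: float = PULL_BACKOFF_MAX,
                  max_jitter: float = MAX_JITTER) -> float:
    """Delay before retry number `retry` (1-based): base doubled each retry, capped, plus jitter."""
    capped = min(base * 2 ** (retry - 1), cap)
    return capped + rng.uniform(0.0, max_jitter)


# With 5 attempts there are 4 sleeps. Without jitter the schedule is deterministic.
rng = random.Random(0)
no_jitter = [backoff_delay(r, rng, max_jitter=0.0) for r in range(1, 5)]
# no_jitter == [5.0, 10.0, 20.0, 30.0]; the fourth delay would be 40s but is capped

# Two parallel pulls with independent random streams get different delays,
# so their retries no longer land on the registry at the same instant.
rng_a, rng_b = random.Random(1), random.Random(2)
worker_a = [backoff_delay(r, rng_a) for r in range(1, 5)]
worker_b = [backoff_delay(r, rng_b) for r in range(1, 5)]
```

With these assumed values the retry window grows from roughly 13s to more than a minute, and the jitter spreads the parallel retries apart.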

Genuine errors still fail fast: arch-missing manifests are skipped on the first attempt, and a persistent failure still exits non-zero after the retries are exhausted.
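The control flow described above might look roughly like the following sketch. Everything here is hypothetical scaffolding: the exception classes, the pull callable, and the function name are stand-ins, the real script is a shell script whose error detection is not shown on this page, and treating "5 retries" as 5 total attempts is an assumption.

```python
import time
from typing import Callable


class ManifestMissing(Exception):
    """Stand-in for 'no manifest for this architecture' (a genuine, non-transient error)."""


class TransientError(Exception):
    """Stand-in for a transient registry error such as HTTP 500."""


def pull_with_retry(image: str,
                    pull: Callable[[str], None],
                    delay_for: Callable[[int], float],
                    attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    for attempt in range(1, attempts + 1):
        try:
            pull(image)
            return "pulled"
        except ManifestMissing:
            return "skipped"  # fail fast: never retried
        except TransientError:
            if attempt == attempts:
                raise  # persistent failure: the caller exits non-zero
            sleep(delay_for(attempt))
    raise AssertionError("unreachable")


# Usage: a fake pull that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_pull(image: str) -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("500 Internal Server Error")


slept: list[float] = []
result = pull_with_retry("clickhouse/test-mysql57", flaky_pull,
                         delay_for=lambda a: 5.0 * 2 ** (a - 1),
                         sleep=slept.append)
```

Injecting `sleep` and `pull` keeps the sketch testable without Docker or real waiting.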

The pre-pull step in integration test jobs fails the whole batch when the
Docker registry returns a transient error (e.g. HTTP 500) for an image. The
previous policy was 3 attempts with a fixed 5s sleep, a ~13s window that is too
narrow to ride out a registry hiccup, and because all images pull in parallel
the retries hit the registry in lockstep.

Raise the default retries to 5 and replace the fixed sleep with exponential
backoff plus jitter (PULL_BACKOFF_BASE doubled each retry, capped at
PULL_BACKOFF_MAX, plus a small random jitter). The jitter de-synchronizes the
parallel retries. Genuine errors still fail fast: arch-missing manifests are
skipped on the first attempt, and a persistent failure still exits non-zero
after the retries are exhausted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Contributor Author

@groeneai

Contributor Author

cc @maxknv — could you review this? It makes the integration-test Docker image pre-pull resilient to transient registry errors (e.g. HTTP 500), which currently fail the whole batch with no per-test result. Raises retries 3->5 and replaces the fixed 5s sleep with exponential backoff + jitter (the parallel pulls were retrying in lockstep). Same direction as #108969 (buildx retry). Repro and validation in the gate comment above.

@alexey-milovidov added the "can be tested" label (Allows running workflows for external contributors) on Jun 30, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Contributor

Workflow [PR], commit [52b3927]


AI Review

Summary

This PR changes integration-test Docker image prefetching to use more retry attempts with exponential backoff and jitter, and keeps the Python prefetch_images default aligned with the shell script. I did not find correctness, safety, or CI-contract issues in the changed code. There are no existing inline review threads, and the available ClickHouse CI report for commit 52b39273756e76be9cb7c39b4feb004179982fe5 shows no failed tests.

Final Verdict

Status: ✅ Approve

@clickhouse-gh (Bot) added the "pr-ci" label on Jun 30, 2026
@maxknv

maxknv commented Jun 30, 2026

Member

@maxknv closed this on Jun 30, 2026

Labels

can be tested (Allows running workflows for external contributors), pr-ci


3 participants