Retry integration-test image pre-pull with exponential backoff#108975
Retry integration-test image pre-pull with exponential backoff#108975groeneai wants to merge 1 commit into
Conversation
The pre-pull step in integration test jobs fails the whole batch when the Docker registry returns a transient error (e.g. HTTP 500) for an image. The previous policy was 3 attempts with a fixed 5s sleep, a ~13s window that is too narrow to ride out a registry hiccup, and because all images pull in parallel the retries hit the registry in lockstep. Raise the default retries to 5 and replace the fixed sleep with exponential backoff plus jitter (PULL_BACKOFF_BASE doubled each retry, capped at PULL_BACKOFF_MAX, plus a small random jitter). The jitter de-synchronizes the parallel retries. Genuine errors still fail fast: arch-missing manifests are skipped on the first attempt, and a persistent failure still exits non-zero after the retries are exhausted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @maxknv — could you review this? It makes the integration-test Docker image pre-pull resilient to transient registry errors (e.g. HTTP 500), which currently fail the whole batch with no per-test result. Raises retries 3->5 and replaces the fixed 5s sleep with exponential backoff + jitter (the parallel pulls were retrying in lockstep). Same direction as #108969 (buildx retry). Repro and validation in the gate comment above. |
|
Workflow [PR], commit [52b3927] AI ReviewSummaryThis PR changes integration-test Docker image prefetching to use more retry attempts with exponential backoff and jitter, and keeps the Python Final VerdictStatus: ✅ Approve |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Description
Integration test jobs pre-pull all Docker images for the batch before tests start (
ci/jobs/scripts/prefetch-integration-test-images). When the registry returns a transient error for one image, the whole job fails at the pre-pull stage withFailed to pre-pull Docker images needed by the test batchand no per-test result, so an unrelated PR's check goes red.Example: PR #108550,
Integration tests (arm_binary, distributed plan, 4/4), CI report.clickhouse/test-mysql57andclickhouse/kerberos-kdceach returnedreceived unexpected HTTP status: 500 Internal Server Erroron all 3 attempts within a ~13s window, while ~18 other images pulled fine. This pattern recurs across many unrelated PRs (e.g. a registry outage on 2026-06-15 produced ~295 such job-level failures across 22 PRs in one day).The previous retry policy was 3 attempts with a fixed 5s sleep, a ~13s window that is too narrow to ride out a registry hiccup, and because all images pull in parallel the retries hit the registry in lockstep.
This change:
PULL_BACKOFF_BASEdoubled each retry, capped atPULL_BACKOFF_MAX) plus a small random jitter that de-synchronizes the parallel retries.Genuine errors still fail fast: arch-missing manifests are skipped on the first attempt, and a persistent failure still exits non-zero after the retries are exhausted.