
groeneai · 2026-06-30T18:28:18Z

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Integration test jobs pre-pull all Docker images for the batch before tests start (ci/jobs/scripts/prefetch-integration-test-images). When the registry returns a transient error for one image, the whole job fails at the pre-pull stage with Failed to pre-pull Docker images needed by the test batch and no per-test result, so an unrelated PR's check goes red.

Example: PR #108550, Integration tests (arm_binary, distributed plan, 4/4), CI report. clickhouse/test-mysql57 and clickhouse/kerberos-kdc each returned received unexpected HTTP status: 500 Internal Server Error on all 3 attempts within a ~13s window, while ~18 other images pulled fine. This pattern recurs across many unrelated PRs (e.g. a registry outage on 2026-06-15 produced ~295 such job-level failures across 22 PRs in one day).

The previous retry policy was 3 attempts with a fixed 5s sleep. That gives a window of about 13s, which is too narrow to ride out a registry hiccup. Because all images are pulled in parallel, the retries also hit the registry in lockstep.

This change:

- raises the default retries to 5, matching the buildx retry count introduced in #108969 ("Retry docker buildx on transient docker.io registry errors"),
- replaces the fixed sleep with exponential backoff (PULL_BACKOFF_BASE doubled each retry, capped at PULL_BACKOFF_MAX) plus a small random jitter that de-synchronizes the parallel retries.
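The backoff schedule described above can be sketched roughly as follows. This is an illustration only, not the script's actual code: the function names `backoff_delay` and `delays_from_env`, the default values (base 5s, cap 60s), and the jitter range (up to 1s) are assumptions. Only the knob names `PULL_BACKOFF_BASE` and `PULL_BACKOFF_MAX` come from the change itself.

```python
import os
import random


def backoff_delay(attempt, base=5.0, cap=60.0, jitter=1.0, rng=random):
    """Seconds to sleep after failed attempt number `attempt` (1-based).

    The base delay doubles after each failed attempt and is capped at `cap`.
    A random jitter of up to `jitter` seconds is added on top, so that
    parallel pulls do not all retry at the same instant.
    """
    delay = min(base * 2 ** (attempt - 1), cap)
    return delay + rng.uniform(0.0, jitter)


def delays_from_env(retries=5):
    """Build the sleep schedule, letting the environment override the knobs."""
    # Default values here are assumptions made for this sketch.
    base = float(os.environ.get("PULL_BACKOFF_BASE", "5"))
    cap = float(os.environ.get("PULL_BACKOFF_MAX", "60"))
    # With N attempts there are N - 1 sleeps between them.
    return [backoff_delay(a, base=base, cap=cap) for a in range(1, retries)]
```

With jitter disabled and the assumed defaults, the sleeps before attempts 2 to 5 would be 5s, 10s, 20s and 40s. Because each parallel pull draws its own jitter, their retries drift apart instead of arriving together.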

Genuine errors still fail fast: arch-missing manifests are skipped on the first attempt, and a persistent failure still exits non-zero after the retries are exhausted.
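A minimal sketch of how the fail-fast behavior could sit inside the retry loop is shown below. Everything in it is hypothetical: `pull_with_retries`, the `pull` callable returning `(success, output)`, the `delay_for` hook, and the `"no matching manifest"` substring used to detect an arch-missing image are assumptions for illustration, not the script's actual logic.

```python
import time


def pull_with_retries(image, pull, retries=5,
                      delay_for=lambda attempt: 0.0, sleep=time.sleep):
    """Return "ok", "skipped" or "failed" for one image.

    `pull(image)` is a hypothetical callable returning (success, output).
    `delay_for(attempt)` returns the sleep before the next attempt.
    """
    for attempt in range(1, retries + 1):
        ok, output = pull(image)
        if ok:
            return "ok"
        # Assumed marker: an image with no manifest for this architecture
        # can never succeed, so skip it on the first attempt.
        if "no matching manifest" in output:
            return "skipped"
        # Sleep only between attempts, not after the last one.
        if attempt < retries:
            sleep(delay_for(attempt))
    # Persistent failure: the caller is expected to exit non-zero.
    return "failed"
```

Separating the three outcomes keeps the exit-code contract simple: the caller exits non-zero only if some image ends up as `"failed"`.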

The pre-pull step in integration test jobs fails the whole batch when the Docker registry returns a transient error (e.g. HTTP 500) for an image. The previous policy was 3 attempts with a fixed 5s sleep, a ~13s window that is too narrow to ride out a registry hiccup, and because all images pull in parallel the retries hit the registry in lockstep.

Raise the default retries to 5 and replace the fixed sleep with exponential backoff plus jitter (PULL_BACKOFF_BASE doubled each retry, capped at PULL_BACKOFF_MAX, plus a small random jitter). The jitter de-synchronizes the parallel retries.

Genuine errors still fail fast: arch-missing manifests are skipped on the first attempt, and a persistent failure still exits non-zero after the retries are exhausted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-30T18:28:41Z

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. With a fake `docker` that returns HTTP 500 for a fixed outage window then succeeds: the old policy (3 attempts, fixed 5s) fails every time on a 30s outage with the exact CI message `ERROR: Failed to pull the following image(s)`; the new policy succeeds on the same outage.
b	Root cause explained?	The registry intermittently returns a transient error (HTTP 500) for an image. The pre-pull retries 3 times with a fixed 5s sleep (~13s total) and all images pull in parallel, so the retry window is too short to outlast a registry hiccup and the lockstep retries add load. The batch then fails before any test runs (no per-test row, framework label `INFRA`).
c	Fix matches root cause?	Yes. It widens the retry window (5 attempts, exponential backoff) so a transient burst is ridden out, and adds jitter so parallel retries de-synchronize instead of hammering the registry in lockstep. It does not mask any test failure.
d	Test intent preserved / new tests added?	N/A — CI infrastructure script, no test suite touched. Behavior verified with a fake-`docker` harness covering success-after-burst, arch-skip, and permanent-failure paths.
e	Both directions demonstrated?	Yes. Same simulated 30s transient-500 outage: OLD (master) script exits 1 (`Failed to pull ...`); NEW script exits 0 (`All images pre-fetched successfully`) because the 4th attempt lands after recovery.
f	Fix is general across code paths?	Yes. The script is the single pre-pull implementation; the Python caller (`prefetch_images`) default is bumped to match so both the script default and the orchestrator agree. Genuine errors still fail fast (arch-missing → skip on attempt 1; persistent error → exit non-zero after retries).
g	Fix generalizes across inputs?	N/A — not a datatype/value-dependent code path; retry policy applies uniformly to all images. Verified arch-missing, transient-then-recover, and permanent-failure inputs.
h	Backward compatible?	Yes. Defaults change only the retry policy (more resilient); all knobs remain overridable via env (`PULL_RETRIES`, `PULL_TIMEOUT`, `PULL_BACKOFF_BASE`, `PULL_BACKOFF_MAX`). No format/setting/data change.
i	Invariants and contracts preserved?	Yes. Script still returns 0 only when every image is present (or legitimately arch-skipped) and non-zero otherwise; the parallel-pull / fail-file collection logic is unchanged.
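The two-direction check in rows (a) and (e) can be approximated on a virtual clock, without a fake `docker` binary. The sketch below is not the harness used for the validation: the helper names, the 1s duration of a failing pull, and the 5s backoff base are assumptions, and jitter is left out to keep the result deterministic.

```python
def attempt_times(delays, pull_seconds=1.0):
    """Start time of each attempt, given the sleeps between attempts."""
    times, now = [], 0.0
    for i in range(len(delays) + 1):
        times.append(now)
        now += pull_seconds          # assumed duration of a failing pull
        if i < len(delays):
            now += delays[i]         # sleep before the next attempt
    return times


def survives(delays, outage_seconds, pull_seconds=1.0):
    """True if some attempt starts at or after the end of the outage."""
    return any(t >= outage_seconds
               for t in attempt_times(delays, pull_seconds))


OLD = [5.0, 5.0]               # 3 attempts, fixed 5s sleep
NEW = [5.0, 10.0, 20.0, 40.0]  # 5 attempts, assumed base 5s, no jitter

print(attempt_times(OLD))      # [0.0, 6.0, 12.0]
print(survives(OLD, 30.0))     # False
print(survives(NEW, 30.0))     # True
```

Under these assumptions the old policy makes its last attempt at 12s and gives up at about 13s, while the new policy's 4th attempt starts at 38s, after a 30s outage has ended.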

Session id: cron:clickhouse-worker-slot-5:20260630-181000

groeneai · 2026-06-30T18:29:08Z

cc @maxknv — could you review this? It makes the integration-test Docker image pre-pull resilient to transient registry errors (e.g. HTTP 500), which currently fail the whole batch with no per-test result. Raises retries 3->5 and replaces the fixed 5s sleep with exponential backoff + jitter (the parallel pulls were retrying in lockstep). Same direction as #108969 (buildx retry). Repro and validation in the gate comment above.

clickhouse-gh · 2026-06-30T19:40:57Z

Workflow [PR], commit [52b3927]

AI Review

Summary

This PR changes integration-test Docker image prefetching to use more retry attempts with exponential backoff and jitter, and keeps the Python prefetch_images default aligned with the shell script. I did not find correctness, safety, or CI-contract issues in the changed code. There are no existing inline review threads, and the available ClickHouse CI report for commit 52b39273756e76be9cb7c39b4feb004179982fe5 shows no failed tests.

Final Verdict

Status: ✅ Approve

maxknv · 2026-06-30T20:59:52Z

groeneai mentioned this pull request on Jun 30, 2026, in #108550 (Fix sort order violation for TTL GROUP BY with SET on a sorting key column; open)

alexey-milovidov added the "can be tested" label (Allows running workflows for external contributors) Jun 30, 2026

clickhouse-gh Bot added the pr-ci label Jun 30, 2026

maxknv closed this Jun 30, 2026

Retry integration-test image pre-pull with exponential backoff #108975

groeneai wants to merge 1 commit into ClickHouse:master from groeneai:groeneai/ci-prefetch-images-retry-backoff