Retry docker buildx on transient docker.io registry errors #108969
The "Docker server image" job builds the distroless/alpine/ubuntu images with `docker buildx build`. buildx resolves the SBOM scanner image (docker/buildkit-syft-scanner, pulled because of --sbom=true) and base images from docker.io. When docker.io has a transient outage, the build fails at:

    #2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1
    #2 ERROR: unexpected status from HEAD request to https://registry-1.docker.io/...

and the downstream "docker library image test" then reports "image does not exist!" as a secondary victim.

praktika's Shell.run / Result.from_commands_run already support retries with exponential backoff and an error allowlist, but the buildx build and the imagetools merge commands ran with no retries. Wire retries=5 and an allowlist of transient registry/network signatures into both. The allowlist keeps fail-fast behavior for genuine Dockerfile/build errors.

This mirrors the existing retry hardening for the in-Dockerfile wget downloads (ClickHouse#105139) and the keeper image S3 artifact fetch (ClickHouse#108675).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cc @maxknv, could you review this? It adds retries (praktika's native
Workflow [PR], commit [f0de148]
Summary: ✅

AI Review
Summary
This PR adds retry handling for transient registry/network failures around the
Final Verdict
Status: ✅ Approve
Minimum required actions: none.
AI review on ClickHouse#108969 flagged that BUILDX_RETRY_ERRORS matched progress-only text. buildx emits "resolve image config for docker-image://..." as normal --progress=plain output (--sbom/--provenance pull the syft scanner and base images on every build), and praktika's retry scans every stderr line for any allowlisted substring. As a result, a later RUN/COPY/package-install/Dockerfile failure that occurred after that progress line was retried 5x even though its error was not transient.

Drop the progress-only and broad substrings ("resolve image config", "registry-1.docker.io", "error from registry") and keep only actual registry/network failure signatures: "failed to do request", "unexpected status from HEAD request" (the motivating docker.io HEAD-500 outage), "TLS handshake timeout", "i/o timeout", "connection reset by peer", "connection refused". None of these appear in normal progress output, so genuine Dockerfile/build errors fail fast on the first attempt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
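To illustrate the false-retry problem described in this commit, here is a minimal sketch of substring matching over stderr lines. It is not praktika's implementation: `is_retryable`, `BROAD`, `NARROW`, and both sample logs are hypothetical names and data made up for the example. `BROAD` assumes the first revision of the list was the dropped substrings plus the kept ones, and the Dockerfile error line is a placeholder, not real buildx output.

```python
from typing import Sequence

# Signatures kept by this commit (taken from the commit message).
NARROW = (
    "failed to do request",
    "unexpected status from HEAD request",
    "TLS handshake timeout",
    "i/o timeout",
    "connection reset by peer",
    "connection refused",
)

# Assumption: the earlier list was the kept signatures plus the dropped ones.
BROAD = NARROW + (
    "resolve image config",
    "registry-1.docker.io",
    "error from registry",
)


def is_retryable(stderr: str, allowlist: Sequence[str]) -> bool:
    """Return True if any stderr line contains any allowlisted substring."""
    return any(sig in line for line in stderr.splitlines() for sig in allowlist)


# Normal progress line (from the PR) followed by a placeholder build error.
BUILD_ERROR_LOG = "\n".join([
    "#2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1",
    "ERROR: placeholder for a genuine Dockerfile/build failure",
])

# The motivating outage: same progress line, then the HEAD request error.
REGISTRY_OUTAGE_LOG = "\n".join([
    "#2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1",
    "#2 ERROR: unexpected status from HEAD request to https://registry-1.docker.io/...",
])

broad_retries_build_error = is_retryable(BUILD_ERROR_LOG, BROAD)
narrow_retries_build_error = is_retryable(BUILD_ERROR_LOG, NARROW)
narrow_retries_outage = is_retryable(REGISTRY_OUTAGE_LOG, NARROW)
```

With the broad list the progress line alone makes the build error look retryable. With the narrowed list only the outage log matches, so the build error fails fast.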

Related: #105139
Related: #108675
Changelog category (leave one):
Description
The `Docker server image` job builds the distroless/alpine/ubuntu images with `docker buildx build`. buildx resolves the SBOM scanner image (docker/buildkit-syft-scanner, pulled because of `--sbom=true`) and the base images from docker.io. A transient docker.io outage fails the build at the resolve step (`resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1`, followed by `ERROR: unexpected status from HEAD request to https://registry-1.docker.io/...`), and the downstream `docker library image test` then reports `image does not exist!` as a secondary victim.

praktika's `Shell.run` / `Result.from_commands_run` already support retries with exponential backoff plus an error allowlist, but the buildx build and the imagetools merge ran with no retries. This wires `retries=5` and an allowlist of transient registry/network signatures into both calls. The allowlist keeps fail-fast behavior for genuine Dockerfile/build errors. Mirrors the existing retry hardening for in-Dockerfile wget downloads (#105139) and the keeper image S3 artifact fetch (#108675).

Observed on master and PRs during a ~6 min docker.io blip on 2026-06-30 15:00-15:06 UTC (recovered on its own).

Report (master): https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=36f0c2d95475aafa597cacebace14f5888908dcd&name_0=MasterCI&name_1=Docker%20server%20image
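For readers unfamiliar with the pattern, the retry behavior described above can be sketched roughly as follows. This is a standalone illustration, not praktika's `Shell.run`: the function name `run_with_retries`, its parameters, the 1-second base delay, and the reading of `retries=5` as "five attempts in total" are all assumptions.

```python
import subprocess
import time
from typing import Callable, Sequence

# Transient signatures listed in this PR's allowlist.
TRANSIENT_ERRORS = (
    "failed to do request",
    "unexpected status from HEAD request",
    "TLS handshake timeout",
    "i/o timeout",
    "connection reset by peer",
    "connection refused",
)


def run_with_retries(
    command: Sequence[str],
    retries: int = 5,
    retry_errors: Sequence[str] = TRANSIENT_ERRORS,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run `command` up to `retries` times (hypothetical helper).

    A failed attempt is retried only if its stderr contains an allowlisted
    substring; any other failure is returned immediately (fail fast).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    attempt = 1
    while True:
        proc = subprocess.run(command, capture_output=True, text=True)
        if proc.returncode == 0:
            return proc
        transient = any(sig in proc.stderr for sig in retry_errors)
        if not transient or attempt == retries:
            return proc  # non-transient error, or attempts exhausted
        # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
        attempt += 1
```

The `sleep` parameter is injectable only so the backoff schedule can be checked without waiting. A caller would pass the full `docker buildx build ...` argument list as `command`.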
Version info
26.7.1.307