Retry docker buildx on transient docker.io registry errors by groeneai · Pull Request #108969 · ClickHouse/ClickHouse · GitHub
Retry docker buildx on transient docker.io registry errors#108969

Merged: maxknv merged 2 commits into ClickHouse:master from groeneai:fix-docker-server-image-buildx-registry-retry on Jun 30, 2026

@groeneai groeneai commented Jun 30, 2026

Related: #105139
Related: #108675

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Description

The Docker server image job builds the distroless/alpine/ubuntu images with docker buildx build. buildx resolves the SBOM scanner image (docker/buildkit-syft-scanner, pulled because of --sbom=true) and the base images from docker.io. A transient docker.io outage fails the build at the resolve step, e.g.:

#2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1
#2 ERROR: unexpected status from HEAD request to https://registry-1.docker.io/...

and the downstream "docker library image test" then reports "image does not exist!" as a secondary victim.

praktika's Shell.run / Result.from_commands_run already support retries with exponential backoff plus an error allowlist, but the buildx build and the imagetools merge ran with no retries. This wires retries=5 and an allowlist of transient registry/network signatures into both calls. The allowlist keeps fail-fast behavior for genuine Dockerfile/build errors. Mirrors the existing retry hardening for in-Dockerfile wget downloads (#105139) and the keeper image S3 artifact fetch (#108675).
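The retry behavior described above (bounded attempts, exponential backoff, and an allowlist that keeps other failures fail-fast) can be sketched in a few lines. This is a minimal illustration, not praktika's implementation: `run_shell`, `run_with_retries`, `base_delay`, and the injectable `runner`/`sleep` parameters are hypothetical names, and the sketch assumes `retries=5` means five attempts in total. The signature strings are the ones listed in this PR's follow-up commit message.

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple

# Transient registry/network signatures named in this PR's follow-up commit.
TRANSIENT_ERRORS = (
    "failed to do request",
    "unexpected status from HEAD request",
    "TLS handshake timeout",
    "i/o timeout",
    "connection reset by peer",
    "connection refused",
)


def run_shell(command: str) -> Tuple[int, str]:
    """Run a command through the shell and return (exit code, stderr text)."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stderr


def run_with_retries(
    command: str,
    retries: int = 5,
    retry_errors: Sequence[str] = TRANSIENT_ERRORS,
    runner: Callable[[str], Tuple[int, str]] = run_shell,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Hypothetical helper: return (final exit code, attempts used)."""
    attempts = max(1, retries)  # assumption: `retries` counts total attempts
    code = 1
    for attempt in range(1, attempts + 1):
        code, stderr = runner(command)
        if code == 0:
            break
        # Fail fast unless stderr carries an allowlisted transient signature.
        if not any(signature in stderr for signature in retry_errors):
            break
        if attempt < attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    return code, attempt
```

With this shape, a genuine Dockerfile error returns after one attempt, while a registry blip is retried until it clears or the attempt budget is exhausted.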

Observed on master and PRs during a ~6 min docker.io blip on 2026-06-30 15:00-15:06 UTC (recovered on its own). Report (master): https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=36f0c2d95475aafa597cacebace14f5888908dcd&name_0=MasterCI&name_1=Docker%20server%20image

Version info

  • Merged into: 26.7.1.307

The "Docker server image" job builds the distroless/alpine/ubuntu images with
`docker buildx build`. buildx resolves the SBOM scanner image
(docker/buildkit-syft-scanner, pulled because of --sbom=true) and base images
from docker.io. When docker.io has a transient outage, the build fails at:

  #2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1
  #2 ERROR: unexpected status from HEAD request to https://registry-1.docker.io/...

and the downstream "docker library image test" then reports "image does not
exist!" as a secondary victim.

praktika's Shell.run / Result.from_commands_run already support retries with
exponential backoff and an error allowlist, but the buildx build and the
imagetools merge commands ran with no retries. Wire retries=5 and an
allowlist of transient registry/network signatures into both. The allowlist
keeps fail-fast behavior for genuine Dockerfile/build errors. This mirrors the
existing retry hardening for the in-Dockerfile wget downloads (ClickHouse#105139) and the
keeper image S3 artifact fetch (ClickHouse#108675).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai commented Jun 30, 2026

cc @maxknv — could you review this? It adds retries (praktika's native retries/retry_errors with backoff + fail-fast) to the docker buildx build and imagetools-merge calls in the Docker server image job, so a transient docker.io registry blip (seen 2026-06-30 resolving buildkit-syft-scanner) is retried instead of failing the whole job.

clickhouse-gh Bot commented Jun 30, 2026

Workflow [PR], commit [f0de148]

AI Review

Summary

This PR adds retry handling for transient registry/network failures around the docker buildx build and docker buildx imagetools create calls in the Docker image CI job. The previous retry-filter issue has been fixed by removing progress-only and overly broad tokens, and I did not find any remaining correctness or CI-contract findings in the current diff.

Final Verdict

Status: ✅ Approve

Minimum required actions: none.

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 30, 2026
Comment thread ci/jobs/docker_server.py Outdated
AI review on ClickHouse#108969 flagged that BUILDX_RETRY_ERRORS matched progress-only
text. buildx emits "resolve image config for docker-image://..." as normal
--progress=plain output (--sbom/--provenance pull the syft scanner and base
images on every build), and praktika's retry logic checks every stderr line for
any allowlisted substring. So a later RUN/COPY/package-install/Dockerfile failure
that occurred after that progress line was retried 5x even though its error was
not transient.

Drop the progress-only and broad substrings ("resolve image config",
"registry-1.docker.io", "error from registry") and keep only actual
registry/network failure signatures: "failed to do request", "unexpected
status from HEAD request" (the motivating docker.io HEAD-500 outage),
"TLS handshake timeout", "i/o timeout", "connection reset by peer",
"connection refused". None of these appear in normal progress output, so
genuine Dockerfile/build errors fail fast on the first attempt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
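The false-retry problem described in this commit message can be reproduced with a small substring matcher. The sketch below is illustrative only: `should_retry` is a hypothetical stand-in for the line-by-line substring scan attributed to praktika above, the kept and dropped lists are taken from the commit message, and the `#9 ERROR` log line is an invented example of a non-transient build failure.

```python
from typing import Iterable

# Signatures the commit keeps (actual registry/network failures).
KEPT = (
    "failed to do request",
    "unexpected status from HEAD request",
    "TLS handshake timeout",
    "i/o timeout",
    "connection reset by peer",
    "connection refused",
)
# Substrings the commit drops (progress-only or too broad).
DROPPED = ("resolve image config", "registry-1.docker.io", "error from registry")


def should_retry(stderr: str, retry_errors: Iterable[str]) -> bool:
    """Return True if any stderr line contains any allowlisted substring."""
    signatures = tuple(retry_errors)
    return any(sig in line for line in stderr.splitlines() for sig in signatures)


# Normal progress output followed by a hypothetical package-install failure.
NON_TRANSIENT_LOG = "\n".join([
    "#2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1",
    '#9 ERROR: process "/bin/sh -c apt-get install -y no-such-package" did not complete successfully',
])
# The motivating registry failure quoted in the PR description.
TRANSIENT_LOG = "#2 ERROR: unexpected status from HEAD request to https://registry-1.docker.io/..."

retried_before_fix = should_retry(NON_TRANSIENT_LOG, KEPT + DROPPED)
retried_after_fix = should_retry(NON_TRANSIENT_LOG, KEPT)
outage_still_retried = should_retry(TRANSIENT_LOG, KEPT)
```

Under the broader list the progress line alone triggers a retry of the package-install failure; under the narrowed list only the registry outage matches.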
@clickhouse-gh clickhouse-gh Bot added the "manual approve" label (Manual approve required to run CI) Jun 30, 2026
@maxknv maxknv enabled auto-merge June 30, 2026 20:49
@maxknv maxknv self-requested a review June 30, 2026 20:49
@maxknv maxknv added this pull request to the merge queue Jun 30, 2026
Merged via the queue into ClickHouse:master with commit 1a00098 Jun 30, 2026
174 checks passed
@robot-ch-test-poll3 robot-ch-test-poll3 added the "pr-synced-to-cloud" label (The PR is synced to the cloud repo) Jun 30, 2026
Labels

  • can be tested: Allows running workflows for external contributors
  • manual approve: Manual approve required to run CI
  • pr-ci
  • pr-synced-to-cloud: The PR is synced to the cloud repo