Retry wget downloads in Keeper Dockerfiles #108675
The `Docker keeper image` job intermittently fails while fetching the prebuilt packages from `clickhouse-builds.s3.amazonaws.com`. Example from PR ClickHouse#108573 (`docker/keeper/Dockerfile.distroless`, amd64): the `clickhouse-common-static` deb downloads fine, then the `clickhouse-keeper` deb gets

    HTTP request sent, awaiting response... 503 Service Unavailable
    ERROR 503: Service Unavailable.

and the single-shot `wget ... || exit 1` kills the whole build, so the downstream `docker library image test` then reports `image does not exist!`.

The server Dockerfiles already wrap `wget` with a retry helper: `Dockerfile.alpine` since ClickHouse#100380 (tgz path) and `Dockerfile.ubuntu` / `Dockerfile.distroless` since ClickHouse#105139 (deb path). The Keeper Dockerfiles were left with the brittle first-attempt `wget`, so a single 500/503 from S3 fails the build.

This ports the same `wget_with_retry` wrapper to both Keeper files:

- `docker/keeper/Dockerfile.distroless` deb path (`DIRECT_DOWNLOAD_URLS`), mirroring `docker/server/Dockerfile.distroless`.
- `docker/keeper/Dockerfile` tgz path (`DIRECT_DOWNLOAD_URLS` and the repository fallback), mirroring `docker/server/Dockerfile.alpine`.

`WGET_RETRIES` (default 5) and `WGET_RETRY_DELAY` (default 1s) match the server Dockerfiles and stay overridable via `--build-arg`. A persistent error still fails the build after the retries are exhausted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
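For readers who have not seen the server Dockerfiles, the retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the helper from the ClickHouse repository: the function body is an assumption, and so is the reading of `WGET_RETRIES` as the total number of attempts. Only the names `wget_with_retry`, `WGET_RETRIES`, and `WGET_RETRY_DELAY` and their defaults come from the PR text.

```shell
# Hypothetical sketch of a wget retry wrapper (POSIX sh).
# Assumption: WGET_RETRIES counts total attempts, not extra retries.
WGET_RETRIES="${WGET_RETRIES:-5}"
WGET_RETRY_DELAY="${WGET_RETRY_DELAY:-1s}"

wget_with_retry() {
    attempt=1
    while :; do
        # Pass all arguments through to wget unchanged.
        if wget "$@"; then
            return 0
        fi
        # Give up once the attempt budget is spent, so a persistent
        # error still fails the build.
        if [ "$attempt" -ge "$WGET_RETRIES" ]; then
            echo "wget failed after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "wget attempt ${attempt}/${WGET_RETRIES} failed, retrying in ${WGET_RETRY_DELAY}" >&2
        attempt=$((attempt + 1))
        sleep "$WGET_RETRY_DELAY"
    done
}
```

A call site would then replace `wget ... || exit 1` with `wget_with_retry ... || exit 1`. Because both variables are read on each call, a Dockerfile that declares them as `ARG` values can be tuned at build time with `--build-arg WGET_RETRIES=...`.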
cc @Algunenano, could you review this? It ports the
Workflow [PR], commit [6a2d2ce]
Summary: ✅

AI Review
Summary: This PR ports the existing server Dockerfile
Final Verdict
Status: ✅ Approve
📊 Cloud Performance Report

✅ AI verdict: This PR only adds wget retry logic to the keeper Dockerfiles (both private and public diffs are identical and Docker-build-only). Nothing in it touches the server's query-execution path, so the flagged improvements on ClickBench Q4 (-13.88%) and Q15 (-14.35%) cannot be caused by this change. Both are downgraded to not-sure as run-to-run measurement variance. No action needed on these queries.

clickbench: Flagged queries (2 of 43). q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10), which is the value the verdict is based on.

tpch_adapted_1_official: 🟢 No significant changes
The "Docker server image" job builds the distroless/alpine/ubuntu images with `docker buildx build`. buildx resolves the SBOM scanner image (docker/buildkit-syft-scanner, pulled because of --sbom=true) and base images from docker.io. When docker.io has a transient outage, the build fails at:

    #2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1
    #2 ERROR: unexpected status from HEAD request to https://registry-1.docker.io/...

and the downstream "docker library image test" then reports "image does not exist!" as a secondary victim.

praktika's Shell.run / Result.from_commands_run already support retries with exponential backoff and an error allowlist, but the buildx build and the imagetools merge commands ran with no retries. Wire retries=5 and an allowlist of transient registry/network signatures into both. The allowlist keeps fail-fast behavior for genuine Dockerfile/build errors.

This mirrors the existing retry hardening for the in-Dockerfile wget downloads (ClickHouse#105139) and the keeper image S3 artifact fetch (ClickHouse#108675).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
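The combination of retries, exponential backoff, and an error allowlist can be illustrated with a small shell sketch. It is not praktika's implementation or API: `run_with_retry`, `RETRIES`, `BACKOFF_BASE`, and `TRANSIENT_ERRORS` are hypothetical names, and only the first pattern in the allowlist is taken from the log above. The other two are assumed examples of transient network errors.

```shell
# Hypothetical retry helper with an error allowlist (POSIX sh).
RETRIES="${RETRIES:-5}"
BACKOFF_BASE="${BACKOFF_BASE:-1}"   # first delay in whole seconds
# Extended regex of signatures treated as transient (partly assumed).
TRANSIENT_ERRORS="${TRANSIENT_ERRORS:-unexpected status from HEAD request|TLS handshake timeout|connection reset by peer}"

run_with_retry() {
    attempt=1
    delay="$BACKOFF_BASE"
    log="$(mktemp)"
    while :; do
        # Capture stdout and stderr so the failure text can be inspected.
        if "$@" >"$log" 2>&1; then
            cat "$log"
            rm -f "$log"
            return 0
        fi
        cat "$log" >&2
        # Fail fast unless the output matches a transient signature
        # and there are attempts left.
        if ! grep -Eq "$TRANSIENT_ERRORS" "$log" || [ "$attempt" -ge "$RETRIES" ]; then
            rm -f "$log"
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
        delay=$((delay * 2))   # exponential backoff: 1, 2, 4, ...
    done
}
```

The point of the allowlist is visible in the early return: a Dockerfile syntax error or a failing build step does not match any pattern, so it is reported after the first attempt instead of being retried five times.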

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
The `Docker keeper image` job intermittently fails while fetching the prebuilt packages from `clickhouse-builds.s3.amazonaws.com`. The `clickhouse-common-static` package downloads fine, then the `clickhouse-keeper` package gets a `503 Service Unavailable` and the single-shot `wget ... || exit 1` kills the whole build. The downstream `docker library image test` then reports `image does not exist!` (a secondary victim, not a second failure).

Recent example: PR #108573 (`docker/keeper/Dockerfile.distroless`, amd64).

The server Dockerfiles already wrap `wget` with a retry helper for exactly this transient-S3 class: `docker/server/Dockerfile.alpine` since #100380 (tgz path) and `docker/server/Dockerfile.ubuntu` / `docker/server/Dockerfile.distroless` since #105139 (deb path). The Keeper Dockerfiles were left calling `wget` on the first attempt only, so a single `500`/`503` from S3 fails the build.

This ports the same `wget_with_retry` wrapper to both Keeper Dockerfiles:

- `docker/keeper/Dockerfile.distroless`: the deb `DIRECT_DOWNLOAD_URLS` path (the one CI uses), mirroring `docker/server/Dockerfile.distroless`.
- `docker/keeper/Dockerfile` (alpine/ubuntu): the tgz `DIRECT_DOWNLOAD_URLS` path and the repository fallback, mirroring `docker/server/Dockerfile.alpine`.

`WGET_RETRIES` (default `5`) and `WGET_RETRY_DELAY` (default `1s`) match the server Dockerfiles and remain overridable via `--build-arg`. A persistent error still fails the build after the retries are exhausted, so a genuinely broken S3 is not masked.

Recurrence (30 days): `Docker keeper image` FAIL = 165 hits across 15 distinct PRs + 21 on master, all showing the same scattered transient-S3 signature and passing on re-run.

CI report (PR #108573): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108573&sha=1766fdb7f5dd9384c444db648ceaa4a4548dc4db&name_0=PR&name_1=Docker%20keeper%20image
Version info
26.7.1.196