Retry wget downloads in Keeper Dockerfiles by groeneai · Pull Request #108675 · ClickHouse/ClickHouse · GitHub

Retry wget downloads in Keeper Dockerfiles #108675

Merged: alexey-milovidov merged 1 commit into ClickHouse:master from groeneai:fix-keeper-docker-wget-retry on Jun 28, 2026

Conversation

@groeneai groeneai commented Jun 27, 2026

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

The Docker keeper image job intermittently fails while fetching the prebuilt packages from clickhouse-builds.s3.amazonaws.com. The clickhouse-common-static package downloads fine, then the clickhouse-keeper package gets a 503 Service Unavailable and the single-shot wget ... || exit 1 kills the whole build. The downstream docker library image test then reports image does not exist! (a secondary victim, not a second failure).

Recent example, PR #108573 (docker/keeper/Dockerfile.distroless, amd64):

--2026-06-26 22:05:41--  https://clickhouse-builds.s3.amazonaws.com/PRs/108573/.../build_amd_release/clickhouse-keeper_26.7.1.1_amd64.deb
HTTP request sent, awaiting response... 503 Service Unavailable
2026-06-26 22:05:41 ERROR 503: Service Unavailable.
Dockerfile.distroless:43
ERROR: failed to build: ... exit code: 1

The server Dockerfiles already wrap wget with a retry helper for exactly this class of transient S3 errors: docker/server/Dockerfile.alpine since #100380 (tgz path) and docker/server/Dockerfile.ubuntu / docker/server/Dockerfile.distroless since #105139 (deb path). The Keeper Dockerfiles still called wget once with no retry, so a single 500/503 from S3 fails the build.

This ports the same wget_with_retry wrapper to both Keeper Dockerfiles:

  • docker/keeper/Dockerfile.distroless: the deb DIRECT_DOWNLOAD_URLS path (the one CI uses), mirroring docker/server/Dockerfile.distroless.
  • docker/keeper/Dockerfile (alpine/ubuntu): the tgz DIRECT_DOWNLOAD_URLS path and the repository fallback, mirroring docker/server/Dockerfile.alpine.

WGET_RETRIES (default 5) and WGET_RETRY_DELAY (default 1s) match the server Dockerfiles and remain overridable via --build-arg. A persistent error still fails the build after the retries are exhausted, so a genuinely-broken S3 is not masked.
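
The diff itself is not reproduced on this page, so the following is only a minimal sketch of what a wrapper of this shape could look like, not the code that was merged. The names wget_with_retry, WGET_RETRIES and WGET_RETRY_DELAY come from the description above. The loop structure, the log message, and the reading of the delay as whole seconds are assumptions.

```shell
#!/bin/sh
# Illustrative sketch only; the real helper in docker/server/Dockerfile.* may differ.
# In a Dockerfile these defaults would typically come from ARG declarations
# (assumed here: ARG WGET_RETRIES=5 and ARG WGET_RETRY_DELAY=1).
WGET_RETRIES="${WGET_RETRIES:-5}"
WGET_RETRY_DELAY="${WGET_RETRY_DELAY:-1}"

# Run wget with the given arguments, retrying on any non-zero exit status.
# Returns 0 on the first success and 1 once every attempt has failed,
# so a persistent error still fails the calling build step.
wget_with_retry() {
    attempt=1
    while [ "$attempt" -le "$WGET_RETRIES" ]; do
        if wget "$@"; then
            return 0
        fi
        echo "wget failed (attempt $attempt of $WGET_RETRIES)" >&2
        # Do not sleep after the final attempt.
        if [ "$attempt" -lt "$WGET_RETRIES" ]; then
            sleep "$WGET_RETRY_DELAY"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage inside a download step (the url variable and target path are hypothetical):
#   wget_with_retry "$url" -O /tmp/package.deb || exit 1
```

With a helper like this, the call site keeps its `|| exit 1`, which is what preserves the fail-after-exhaustion behavior described above. Passing `--build-arg WGET_RETRIES=10` to the image build would raise the attempt count without editing the Dockerfile.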

Recurrence (30 days): Docker keeper image FAIL = 165 hits across 15 distinct PRs + 21 on master, all showing the same scattered transient-S3 signature and passing cleanly on re-run.

CI report (PR #108573): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108573&sha=1766fdb7f5dd9384c444db648ceaa4a4548dc4db&name_0=PR&name_1=Docker%20keeper%20image

Version info

  • Merged into: 26.7.1.196

The `Docker keeper image` job intermittently fails while fetching the
prebuilt packages from `clickhouse-builds.s3.amazonaws.com`. Example
from PR ClickHouse#108573 (`docker/keeper/Dockerfile.distroless`, amd64): the
`clickhouse-common-static` deb downloads fine, then the
`clickhouse-keeper` deb gets

    HTTP request sent, awaiting response... 503 Service Unavailable
    ERROR 503: Service Unavailable.

and the single-shot `wget ... || exit 1` kills the whole build, so the
downstream `docker library image test` then reports `image does not
exist!`.

The server Dockerfiles already wrap `wget` with a retry helper:
`Dockerfile.alpine` since ClickHouse#100380 (tgz path) and `Dockerfile.ubuntu` /
`Dockerfile.distroless` since ClickHouse#105139 (deb path). The Keeper
Dockerfiles were left with the brittle first-attempt `wget`, so a
single 500/503 from S3 fails the build.

This ports the same `wget_with_retry` wrapper to both Keeper files:
- `docker/keeper/Dockerfile.distroless` deb path (`DIRECT_DOWNLOAD_URLS`),
  mirroring `docker/server/Dockerfile.distroless`.
- `docker/keeper/Dockerfile` tgz path (`DIRECT_DOWNLOAD_URLS` and the
  repository fallback), mirroring `docker/server/Dockerfile.alpine`.

`WGET_RETRIES` (default 5) and `WGET_RETRY_DELAY` (default 1s) match the
server Dockerfiles and stay overridable via `--build-arg`. A persistent
error still fails the build after the retries are exhausted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@groeneai

cc @Algunenano — could you review this? It ports the wget_with_retry wrapper (already in the server Dockerfiles via #100380 and #105139) to the two Keeper Dockerfiles, so a transient S3 500/503 while fetching the keeper .deb/.tgz retries instead of failing the Docker keeper image build on the first attempt.

@alexey-milovidov added the "can be tested" label (allows running workflows for external contributors) on Jun 27, 2026
clickhouse-gh Bot commented Jun 27, 2026

Workflow [PR], commit [6a2d2ce]


AI Review

Summary

This PR ports the existing server Dockerfile wget_with_retry pattern to the Keeper image download paths that CI exercises. The changed DIRECT_DOWNLOAD_URLS and repository package fetches preserve the previous failure behavior after retry exhaustion while covering transient 500/503 download failures. I found no correctness, safety, or PR metadata issues that need inline comments.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 27, 2026
clickhouse-gh Bot commented Jun 28, 2026

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR only adds wget retry logic to the keeper Dockerfiles (both private and public diffs are identical and Docker-build-only). Nothing in it touches the server's query-execution path, so the flagged improvements on ClickBench Q4 (-13.88%) and Q15 (-14.35%) cannot be caused by this change. Both are downgraded to not-sure as run-to-run measurement variance. No action needed on these queries.

clickbench

⚠️ 2 inconclusive

Flagged queries (2 of 43)

| Query | Verdict | Baseline median (ms) | PR median (ms) | Change | q-value | Hint |
|---|---|---|---|---|---|---|
| ⚠️ 4 | not_sure | 263 | 226 | -13.9% | <0.0001 | PR only edits keeper Dockerfiles (wget retry); cannot touch query execution. -13.88% is run-to-run variance. |
| ⚠️ 15 | not_sure | 237 | 203 | -14.3% | <0.0001 | Same Docker-only change; nothing in this PR runs in Q15's path. -14.35% is infrastructure variance, not a real gain. |

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: 26b44bbd-0e69-4669-ada5-b47e5d212af5
  • MIRAI run: 9a54702f-535c-4c96-b097-1f06edbc7daf
  • PR check IDs:
    • clickbench_286902_1782627527
    • clickbench_286908_1782627527
    • clickbench_286918_1782627527
    • tpch_adapted_1_official_286926_1782627527
    • tpch_adapted_1_official_286937_1782627527
    • tpch_adapted_1_official_286947_1782627527

@alexey-milovidov alexey-milovidov self-assigned this Jun 28, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 28, 2026
Merged via the queue into ClickHouse:master with commit f1abdda Jun 28, 2026
169 checks passed
@robot-clickhouse-ci-2 added the "pr-synced-to-cloud" label (the PR is synced to the cloud repo) on Jun 28, 2026
pull Bot pushed a commit to admariner/ClickHouse that referenced this pull request Jun 30, 2026
The "Docker server image" job builds the distroless/alpine/ubuntu images with
`docker buildx build`. buildx resolves the SBOM scanner image
(docker/buildkit-syft-scanner, pulled because of --sbom=true) and base images
from docker.io. When docker.io has a transient outage, the build fails at:

  #2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1
  #2 ERROR: unexpected status from HEAD request to https://registry-1.docker.io/...

and the downstream "docker library image test" then reports "image does not
exist!" as a secondary victim.

praktika's Shell.run / Result.from_commands_run already support retries with
exponential backoff and an error allowlist, but the buildx build and the
imagetools merge commands ran with no retries. Wire retries=5 and an
allowlist of transient registry/network signatures into both. The allowlist
keeps fail-fast behavior for genuine Dockerfile/build errors. This mirrors the
existing retry hardening for the in-Dockerfile wget downloads (ClickHouse#105139) and the
keeper image S3 artifact fetch (ClickHouse#108675).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
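
The referenced commit describes retries with exponential backoff that apply only when the failure matches an allowlist of transient errors. praktika's actual Shell.run / Result.from_commands_run signatures are not shown here, so the sketch below illustrates the general idea in plain shell instead. The function name run_with_retry, the variables RETRIES, BASE_DELAY and RETRY_ERRORS, and the example error patterns are all hypothetical. Treating 5 as the total number of attempts is also an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of "retry only on allowlisted transient errors";
# this is not praktika's API.
RETRIES="${RETRIES:-5}"          # total attempts (assumed meaning of retries=5)
BASE_DELAY="${BASE_DELAY:-1}"    # first backoff in seconds, doubled each retry
# Extended regex of transient signatures (illustrative values only).
RETRY_ERRORS="${RETRY_ERRORS:-unexpected status from HEAD request|TLS handshake timeout|connection reset by peer}"

# Run a command; retry with exponential backoff only if its output matches
# the allowlist. Any other failure is returned immediately (fail fast).
run_with_retry() {
    attempt=1
    delay="$BASE_DELAY"
    log="$(mktemp)"
    while :; do
        if "$@" >"$log" 2>&1; then
            cat "$log"
            rm -f "$log"
            return 0
        fi
        cat "$log" >&2
        if ! grep -Eq "$RETRY_ERRORS" "$log"; then
            rm -f "$log"
            return 1    # genuine build error: do not retry
        fi
        if [ "$attempt" -ge "$RETRIES" ]; then
            rm -f "$log"
            return 1    # transient error persisted through every attempt
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Usage (the command line is illustrative):
#   run_with_retry docker buildx build --sbom=true .
```

The allowlist is what separates this from a blanket retry: a Dockerfile syntax error fails on the first attempt, while a registry hiccup gets further attempts with growing delays.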