Retry wget downloads in Ubuntu and distroless Dockerfiles by groeneai · Pull Request #105139 · ClickHouse/ClickHouse · GitHub

Retry wget downloads in Ubuntu and distroless Dockerfiles#105139

Merged
alexey-milovidov merged 1 commit into
ClickHouse:master from
groeneai:groeneai/fix-dockerfile-s3-retry
May 19, 2026

Conversation

@groeneai

@groeneai groeneai commented May 17, 2026

Contributor

The Docker server image job intermittently fails when fetching the prebuilt .deb packages from clickhouse-builds.s3.amazonaws.com:

ERROR 500: Internal Server Error.
ERROR 503: Service Unavailable.

Three unrelated PRs hit this S3 download flake over three days — #100173 on 2026-05-15 (500), #104694 on 2026-05-14 (503), #104853 on 2026-05-13 (503). The 14-day CIDB pattern is below.

day         master_hits  pr_hits  distinct_prs
2026-05-15           0        1            1
2026-05-14           0        2            1
2026-05-13           0        1            1
2026-05-08           0        6            1
2026-05-07           0        1            1
2026-05-05           0       12            5   ← apt mirror outage, different shape
2026-05-04           1       20           10   ← apt mirror outage, different shape

(Most hits on 2026-05-04 and 2026-05-05 came from a different failure mode, the apt-ca-certificates one; the 500/503 shape seen on May 13–15 is the one this PR addresses.)

docker/server/Dockerfile.alpine already wraps wget with a retry helper (added in #100380 to absorb the same kind of transient errors during Alpine cross-architecture builds via QEMU). The Ubuntu and distroless flavours still call wget ... || exit 1 on the very first attempt, so a single 500/503 from S3 kills the whole build.

This change ports the same wget_with_retry wrapper to docker/server/Dockerfile.ubuntu and docker/server/Dockerfile.distroless for the DIRECT_DOWNLOAD_URLS block (the path used by CI). WGET_RETRIES (default 5) and WGET_RETRY_DELAY (default 1s) match the Alpine version and remain overridable via --build-arg.
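
For readers who have not seen the Alpine helper, a retry wrapper of this kind might look roughly like the POSIX sh sketch below. It is an illustration, not the code from the Dockerfiles: only the names wget_with_retry, WGET_RETRIES and WGET_RETRY_DELAY and their defaults come from this PR, while the loop structure, the `attempt` variable and the log message are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of a wget retry wrapper (not copied from the ClickHouse Dockerfiles).
# Defaults mirror the values named in the PR; both can be overridden from the environment.
WGET_RETRIES="${WGET_RETRIES:-5}"            # total number of attempts
WGET_RETRY_DELAY="${WGET_RETRY_DELAY:-1s}"   # passed to sleep; the "s" suffix works with GNU sleep

wget_with_retry() {
    attempt=1
    while [ "$attempt" -le "$WGET_RETRIES" ]; do
        # Forward all arguments to wget unchanged and stop at the first success.
        if wget "$@"; then
            return 0
        fi
        echo "wget failed (attempt $attempt of $WGET_RETRIES)" >&2
        # Sleep only when another attempt will follow.
        if [ "$attempt" -lt "$WGET_RETRIES" ]; then
            sleep "$WGET_RETRY_DELAY"
        fi
        attempt=$((attempt + 1))
    done
    # Every attempt failed, so the caller's "|| exit 1" still aborts the build.
    return 1
}
```

Because the function forwards its arguments unchanged, a call site only has to swap `wget` for `wget_with_retry`. With the two values exposed as build arguments, a more patient build could be requested with something like `docker build --build-arg WGET_RETRIES=10 --build-arg WGET_RETRY_DELAY=5s ...`.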

Triggered by @alexey-milovidov's directive on #100173:

@groeneai, could you investigate the latter two (Docker server image S3 500 / Stress test (arm_tsan) global timeout) and provide a fix in a separate PR if a fix is needed?

CI report (#100173, sha 3ceac71): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100173&sha=3ceac71a43a5fd4fc1a7859937b23e71b8fe42ae&name_0=PR&name_1=Docker%20server%20image

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.797

The `Docker server image` job intermittently fails when fetching the
prebuilt `.deb` packages from `clickhouse-builds.s3.amazonaws.com`:

```
ERROR 500: Internal Server Error.
ERROR 503: Service Unavailable.
```

Three unrelated PRs hit this S3 download flake over three days
(`ClickHouse#100173` on 2026-05-15 with `500`, `ClickHouse#104694` on 2026-05-14 with `503`,
`ClickHouse#104853` on 2026-05-13 with `503`). In the last 14 days the
`Docker server image` job on master has had 1 failure of the same shape.

`docker/server/Dockerfile.alpine` already wraps `wget` with a retry
helper (added in ClickHouse#100380 to absorb the same kind of transient errors
during Alpine cross-architecture builds via QEMU). The Ubuntu and
distroless flavours still call `wget ... || exit 1` on the very first
attempt, so a single `500`/`503` from S3 kills the whole build.

This change ports the same `wget_with_retry` wrapper to
`docker/server/Dockerfile.ubuntu` and `docker/server/Dockerfile.distroless`
for the `DIRECT_DOWNLOAD_URLS` block (the path used by CI). `WGET_RETRIES`
(default `5`) and `WGET_RETRY_DELAY` (default `1s`) match the Alpine
version and remain overridable via `--build-arg`.

CI report (PR ClickHouse#100173, sha 3ceac71): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100173&sha=3ceac71a43a5fd4fc1a7859937b23e71b8fe42ae&name_0=PR&name_1=Docker%20server%20image

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@groeneai

Contributor Author


Pre-PR validation gate (per worker policy):

This is an infrastructure / Dockerfile fix, not a flaky-test fix, so the deterministic-repro gate doesn't apply 1:1. The relevant checks:

  • a) Deterministic repro? Not reproducible on demand (the trigger is an external S3 outage), but the same failure shape appeared 3× across 3 unrelated PRs in 3 days: #100173 "Add user_files_policy server setting for custom disks in user_files directory" (500), #104694 "Wire functions stress test to ProcessList for KILL-QUERY-style timeouts" (503), and #104853 "Fix block structure mismatch in UnionStep after filter push-down with NULL constants" (503). That is a concrete signal, not an anecdote.
  • b) Root cause explained? Yes. Dockerfile.ubuntu and Dockerfile.distroless run wget ... || exit 1 once. A single 5xx from clickhouse-builds.s3.amazonaws.com then fails the whole docker build step. Dockerfile.alpine already wraps wget with wget_with_retry (added in #100380, "Add wget retry logic to Alpine Docker image build"); Ubuntu/distroless were just never updated.
  • c) Fix matches root cause? Yes: it adds the same wget_with_retry wrapper around the DIRECT_DOWNLOAD_URLS wget call in both Dockerfiles. This is the same approach @alexey-milovidov used for Alpine in #100380.
  • d) Test intent preserved? No tests changed. Build behaviour on success is unchanged (wrapper succeeds on first attempt → same as before). On transient failure, wrapper retries up to WGET_RETRIES=5 times (overridable) instead of failing immediately.
  • e) Both directions demonstrated? Shell function syntax checked under sh/bash/dash (all parse clean). Retry behaviour validated locally against an unreachable URL — confirmed it retries WGET_RETRIES times with WGET_RETRY_DELAY between attempts, then exits non-zero.
  • f) Fix is general? Yes — wrapper is applied in both Ubuntu and distroless Dockerfiles in one go. Alpine already has it. The deb_location_url/single_binary_location_url blocks are untouched because CI doesn't take those paths (DIRECT_DOWNLOAD_URLS is what fails). Happy to extend the wrapper to those blocks too if you'd prefer one consistent style across all install paths.
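
The syntax check mentioned in (e) can be approximated with each shell's `-n` option, which parses a script without executing it. The helper below is a hypothetical sketch, not the script used for this PR; `check_syntax` and its output format are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: parse-check a shell snippet under every shell that is installed.
# "-n" asks the shell to read commands and report syntax errors without running anything.
check_syntax() {
    file="$1"
    rc=0
    for shell in sh bash dash; do
        # Skip shells that are not available on this machine.
        if ! command -v "$shell" >/dev/null 2>&1; then
            echo "skip: $shell not installed"
            continue
        fi
        if "$shell" -n "$file"; then
            echo "ok: $shell"
        else
            echo "FAIL: $shell"
            rc=1
        fi
    done
    # Non-zero if any installed shell rejected the file.
    return "$rc"
}
```

A typical use would be to copy the body of the Dockerfile `RUN` step into a temporary file and call `check_syntax /path/to/that/file`. This only proves that the snippet parses; the retry behaviour itself still needs a run against a failing URL, as described in (e).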

Cross-PR check at PR-creation time: no competing open PR touching Dockerfile.ubuntu/Dockerfile.distroless or mentioning wget_with_retry/DIRECT_DOWNLOAD_URLS.

@alexey-milovidov alexey-milovidov self-assigned this May 17, 2026
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 17, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 17, 2026


@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 17, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 19, 2026
Merged via the queue into ClickHouse:master with commit ab7131b May 19, 2026
167 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label May 19, 2026
pull Bot pushed a commit to Spencerx/ClickHouse that referenced this pull request Jun 28, 2026
The `Docker keeper image` job intermittently fails while fetching the
prebuilt packages from `clickhouse-builds.s3.amazonaws.com`. Example
from PR ClickHouse#108573 (`docker/keeper/Dockerfile.distroless`, amd64): the
`clickhouse-common-static` deb downloads fine, then the
`clickhouse-keeper` deb gets

    HTTP request sent, awaiting response... 503 Service Unavailable
    ERROR 503: Service Unavailable.

and the single-shot `wget ... || exit 1` kills the whole build, so the
downstream `docker library image test` then reports `image does not
exist!`.

The server Dockerfiles already wrap `wget` with a retry helper:
`Dockerfile.alpine` since ClickHouse#100380 (tgz path) and `Dockerfile.ubuntu` /
`Dockerfile.distroless` since ClickHouse#105139 (deb path). The Keeper
Dockerfiles were left with the brittle first-attempt `wget`, so a
single 500/503 from S3 fails the build.

This ports the same `wget_with_retry` wrapper to both Keeper files:
- `docker/keeper/Dockerfile.distroless` deb path (`DIRECT_DOWNLOAD_URLS`),
  mirroring `docker/server/Dockerfile.distroless`.
- `docker/keeper/Dockerfile` tgz path (`DIRECT_DOWNLOAD_URLS` and the
  repository fallback), mirroring `docker/server/Dockerfile.alpine`.

`WGET_RETRIES` (default 5) and `WGET_RETRY_DELAY` (default 1s) match the
server Dockerfiles and stay overridable via `--build-arg`. A persistent
error still fails the build after the retries are exhausted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
pull Bot pushed a commit to admariner/ClickHouse that referenced this pull request Jun 30, 2026
The "Docker server image" job builds the distroless/alpine/ubuntu images with
`docker buildx build`. buildx resolves the SBOM scanner image
(docker/buildkit-syft-scanner, pulled because of --sbom=true) and base images
from docker.io. When docker.io has a transient outage, the build fails at:

  #2 resolve image config for docker-image://docker.io/docker/buildkit-syft-scanner:stable-1
  #2 ERROR: unexpected status from HEAD request to https://registry-1.docker.io/...

and the downstream "docker library image test" then reports "image does not
exist!" as a secondary victim.

praktika's Shell.run / Result.from_commands_run already support retries with
exponential backoff and an error allowlist, but the buildx build and the
imagetools merge commands ran with no retries. Wire retries=5 and an
allowlist of transient registry/network signatures into both. The allowlist
keeps fail-fast behavior for genuine Dockerfile/build errors. This mirrors the
existing retry hardening for the in-Dockerfile wget downloads (ClickHouse#105139) and the
keeper image S3 artifact fetch (ClickHouse#108675).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
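
The retry-with-allowlist idea described in that commit message can be sketched in plain sh. This is not praktika's Shell.run or Result.from_commands_run, whose signatures are not shown on this page; `retry_transient`, `RETRIES`, `BASE_DELAY` and `TRANSIENT_PATTERNS` are hypothetical names, and the default patterns are simply the error strings quoted earlier in this conversation.

```shell
#!/bin/sh
# Hypothetical sketch: retry a command with exponential backoff, but only when its
# output matches an allowlist of transient-error patterns. Other failures fail fast.
RETRIES="${RETRIES:-5}"          # maximum number of attempts
BASE_DELAY="${BASE_DELAY:-1}"    # seconds before the second attempt; doubled after each failure
# Extended regex, matched case-insensitively. Illustrative values only.
TRANSIENT_PATTERNS="${TRANSIENT_PATTERNS:-unexpected status from HEAD request|ERROR 500|ERROR 503}"

retry_transient() {
    attempt=1
    delay="$BASE_DELAY"
    log="$(mktemp)"
    while :; do
        # Capture stdout and stderr together so the allowlist can inspect them.
        if "$@" >"$log" 2>&1; then
            cat "$log"
            rm -f "$log"
            return 0
        fi
        cat "$log" >&2
        # Not on the allowlist: treat it as a genuine build error and stop immediately.
        if ! grep -Eiq "$TRANSIENT_PATTERNS" "$log"; then
            rm -f "$log"
            return 1
        fi
        # Transient, but no attempts left.
        if [ "$attempt" -ge "$RETRIES" ]; then
            rm -f "$log"
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

With the defaults, a command that keeps failing with an allowlisted error is tried five times with pauses of 1, 2, 4 and 8 seconds, while a Dockerfile syntax error is reported after the first attempt. The sketch collapses every failure to exit status 1; a fuller version would probably preserve the command's own status.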