Detect Docker image pre-pull failures as infrastructure errors for auto-retry #101970
…to-retry When Docker image pulls time out during the pre-pull phase of integration tests, the job fails with no test results. The `retry_infra_failures` workflow did not detect this because: (1) the job ran for ~16 minutes, exceeding the 2-minute "quick failure" heuristic, and (2) the result JSON check only recognized "Checkout Submodules" failures. Fix: tag the pre-pull failure result with the `INFRA` label, and teach the retry workflow to check for this label in result JSONs alongside the existing "Checkout Submodules" check. https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100433&sha=e1442483394a39c78754a475408bc225d13532e6&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%203%2F4%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Didn't work...

It has to be merged for the job to work from master.
…ll-infra-failures
Depends on #101292

@groeneai, fix

@alexey-milovidov, I investigated; here's the current state.

Previous attempt: PR #101869 (closed)
I submitted a

Current failure breakdown (last 30 days, 19 distinct PRs)
Master had 1 failure in 30 days (April 6: the connection-timeout variant, not the MSan crash).

What changed in your branch today (April 26)
You pushed

What direction would you like?
The dominant failure (16/19) is fixed by your PR #101292 once it lands. Options I see for a separate test-only PR:

I lean toward (2) or (3) given Algunenano's earlier feedback, but I'm happy to do whatever you'd prefer. Let me know. (session: cron:clickhouse-ci-task-worker:20260426-124500)

The previous jq check `any(. == "infra")` only matched bare-string labels,
but `Result.set_label` stores labels as dicts like `{"name": "infra", ...}`,
so the infra-label branch never fired and Docker pre-pull failures were
not auto-retried.
Explanation for the comment "Didn't work..." on the PR.
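
The mismatch described in this commit message can be illustrated with a short Python sketch. The real check is a jq expression in the workflow; the helpers below (`has_label`, `has_label_strings_only`) are hypothetical names used only for illustration, and the only assumption about the data is the one stated above: a label may be a bare string or a dict with a `name` key.

```python
def has_label_strings_only(labels, wanted="infra"):
    """Mimic the old check: only bare-string entries can ever match."""
    return any(label == wanted for label in labels or [])


def has_label(labels, wanted="infra"):
    """Match both label shapes: "infra" and {"name": "infra", ...}."""
    for label in labels or []:
        if isinstance(label, str):
            name = label
        elif isinstance(label, dict):
            name = label.get("name")
        else:
            continue  # unknown shape, ignore it
        if name == wanted:
            return True
    return False


# Labels stored in dict form, as the commit message says Result.set_label does.
dict_labels = [{"name": "infra"}]

print(has_label_strings_only(dict_labels))  # False: a dict never equals "infra"
print(has_label(dict_labels))               # True
```

Accepting both shapes keeps the check working for any older result files that still carry bare-string labels, if such files exist.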
…ll-infra-failures
…scope The local `gh` CLI token lacks the `workflow` scope, so pushes that change `.github/workflows/*` are rejected. Restore all modified workflow files to the byte-identical content of the previous remote tip so the master-merge push is accepted. This temporarily reverts the dict-form label fix in `retry_infra_failures.yml`; the fix needs to be reapplied separately by a push from an environment that has the `workflow` scope.
The previous commit 7d022e8 restored .github/workflows/*.yml to the previous PR tip's content to bypass GitHub's workflow-scope push check. That left the YAMLs stale relative to master, which caused the `Check Workflows` step to fail with `workflows are outdated`. This commit replaces those YAMLs with the current `origin/master` content (byte-identical) so they match what `python3 -m praktika yaml` produces from master. The actual PR changes (`retry_infra_failures.yml` + `integration_test_job.py`) are preserved.

When Docker image pulls time out during the pre-pull phase of integration tests, the job fails with no test results. The `retry_infra_failures` workflow did not detect this because: (1) the job ran for ~16 minutes, exceeding the 2-minute "quick failure" heuristic, and (2) the result JSON check only recognized "Checkout Submodules" failures.

Fix: tag the pre-pull failure result with the `INFRA` label, and teach the retry workflow to check for this label in result JSONs alongside the existing "Checkout Submodules" check.

Example failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100433&sha=e1442483394a39c78754a475408bc225d13532e6&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%203%2F4%29
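
The three detection rules described above can be sketched as one decision function. This is a rough illustration in Python, not the workflow's actual code: the function name `is_infra_failure`, the field names `results`, `name`, `status` and `labels`, and the `"OK"` status value are all assumptions about the result JSON layout, and the quick-failure rule is assumed to mean "failed in two minutes or less".

```python
QUICK_FAILURE_SECONDS = 2 * 60  # the 2-minute "quick failure" heuristic


def is_infra_failure(duration_seconds, result):
    """Guess whether a failed job should be retried as an infra failure.

    `result` is a parsed result JSON; its field names are assumed here.
    """
    # Rule 1: the job died almost immediately.
    if duration_seconds <= QUICK_FAILURE_SECONDS:
        return True

    # Rule 2: the "Checkout Submodules" sub-result did not succeed.
    for sub in result.get("results", []):
        if sub.get("name") == "Checkout Submodules" and sub.get("status") != "OK":
            return True

    # Rule 3 (added by this PR): the result carries the infra label,
    # stored either as a bare string or as a dict with a "name" key.
    for label in result.get("labels", []):
        name = label.get("name") if isinstance(label, dict) else label
        if name == "infra":
            return True

    return False


# A pre-pull timeout: ran ~16 minutes, no sub-results, tagged as infra.
pre_pull_failure = {"status": "FAIL", "results": [], "labels": [{"name": "infra"}]}
print(is_infra_failure(16 * 60, pre_pull_failure))  # True
```

Without rule 3, the same input falls through rules 1 and 2 and is not retried, which matches the failure mode the description reports.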
#100433
Version info
26.5.1.396