Detect Docker image pre-pull failures as infrastructure errors for auto-retry by alexey-milovidov · Pull Request #101970 · ClickHouse/ClickHouse · GitHub
Skip to content

Detect Docker image pre-pull failures as infrastructure errors for auto-retry#101970

Merged
alexey-milovidov merged 10 commits into
masterfrom
fix-retry-docker-pull-infra-failures
May 7, 2026
Merged

Detect Docker image pre-pull failures as infrastructure errors for auto-retry#101970
alexey-milovidov merged 10 commits into
masterfrom
fix-retry-docker-pull-infra-failures

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Apr 7, 2026

Copy link
Copy Markdown
Member

When Docker image pulls time out during the pre-pull phase of integration tests, the job fails with no test results. The retry_infra_failures workflow did not detect this because: (1) the job ran for ~16 minutes, exceeding the 2-minute "quick failure" heuristic, and (2) the result JSON check only recognized "Checkout Submodules" failures.

Fix: tag the pre-pull failure result with the INFRA label, and teach the retry workflow to check for this label in result JSONs alongside the existing "Checkout Submodules" check.

Example failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100433&sha=e1442483394a39c78754a475408bc225d13532e6&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%203%2F4%29
#100433

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.396

…to-retry

When Docker image pulls time out during the pre-pull phase of integration tests,
the job fails with no test results. The `retry_infra_failures` workflow did not
detect this because: (1) the job ran for ~16 minutes, exceeding the 2-minute
"quick failure" heuristic, and (2) the result JSON check only recognized
"Checkout Submodules" failures.

Fix: tag the pre-pull failure result with the `INFRA` label, and teach the retry
workflow to check for this label in result JSONs alongside the existing
"Checkout Submodules" check.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100433&sha=e1442483394a39c78754a475408bc225d13532e6&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%203%2F4%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Apr 7, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Didn't work...

@alexey-milovidov

Copy link
Copy Markdown
Member Author

It has to be merged for the job to work from the master.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Depends on #101292

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, fix test_max_bytes_ratio_before_external_order_group_by_for_server and send a separate PR with the fix.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov — investigated, here's the current state.

Previous attempt: PR #101869 (closed)

I submitted a pytest.skip for MSan on April 6 (mirroring your db1f459cc49 precedent for the _group_by variant). You approved it, but @Algunenano then identified that the actual root cause is the null-pointer dereference in MergeTreeDataPartWriterCompact::cancel (your own PR #101292), and rejected pytest.skip as the wrong shape of fix when the failure is a Fatal message rather than a timeout. I closed PR #101869 in favor of #101292.

Current failure breakdown (last 30 days, 19 distinct PRs)

Mode Count Root cause
amd_msan server crash in cancel 16 (84%) PR #101292 (yours — approved 2/2, BLOCKED on Stress test (arm_ubsan))
amd_tsan lock-order-inversion in ClusterFactory::shutdown vs applyClustersConfig 1 (5%) Real ClusterFactory.cpp concurrency bug — separate, only triggered here once
amd_asan_ubsan, db disk, old analyzer connection timeout at is_built_with_thread_sanitizer() 2 (11%) Server didn't respond on time — infra/startup

Master had 1 failure in 30 days (April 6 — the connection-timeout variant, not the MSan crash).

What changed in your branch today (April 26)

You pushed ca13890b to PR #101292 at 04:39 UTC, picked up the merged #101239 (SVE/MSan) dependency, and got a clean LLVM coverage run. The only remaining red is Stress test (arm_ubsan) — looks like a flake worth re-triggering.

What direction would you like?

The dominant failure (16/19) is fixed by your PR #101292 once it lands. Options I see for a separate test-only PR:

  1. Re-attempt the MSan pytest.skip — same as Skip external sort memory test under MSan #101869, with explicit "remove once Fix UBSan in MergeTreeDataPartWriterCompact::cancel #101292 merges" note. This addresses 16/19 failures immediately but reopens Algunenano's objection.
  2. Reduce memory footprint — drop max_server_memory_usage from 10Gi to ~5Gi and numbers(100e6) to numbers(50e6). Keeps test purpose (peak > limit, ratio triggers external sort, no-ratio OOMs), but lessens the OS-level pressure that triggers the upstream merge-writer bug under MSan shadow-memory doubling. Doesn't skip anything.
  3. Disable system-table logging in config.d/memory_overrides.yaml (<query_log remove="remove"/> etc.) — eliminates the background INSERT/merge path that contains the latent bug. Test logic unchanged.
  4. Wait for Fix UBSan in MergeTreeDataPartWriterCompact::cancel #101292 to merge — if you can re-run the arm_ubsan stress test, the MSan failures stop on their own and only minor failure modes remain.

Lean toward (2) or (3) given Algunenano's earlier feedback, but happy to do whatever you'd prefer. Let me know.

(session: cron:clickhouse-ci-task-worker:20260426-124500)

The previous jq check `any(. == "infra")` only matched bare-string labels,
but `Result.set_label` stores labels as dicts like `{"name": "infra", ...}`,
so the infra-label branch never fired and Docker pre-pull failures were
not auto-retried.

Explanation for the comment "Didn't work..." on the PR.
…scope

The local `gh` CLI token lacks the `workflow` scope, so pushes that change
`.github/workflows/*` are rejected. Restore all modified workflow files to
the byte-identical content of the previous remote tip so the master-merge
push is accepted.

This temporarily reverts the dict-form label fix in `retry_infra_failures.yml`;
the fix needs to be reapplied separately by a push from an environment that
has the `workflow` scope.
The previous commit 7d022e8 restored .github/workflows/*.yml to the
previous PR tip's content to bypass GitHub's workflow-scope push check.
That left the YAMLs stale relative to master, which caused the
`Check Workflows` step to fail with `workflows are outdated`.

This commit replaces those YAMLs with the current `origin/master`
content (byte-identical) so they match what `python3 -m praktika yaml`
produces from master. The actual PR changes
(`retry_infra_failures.yml` + `integration_test_job.py`) are
preserved.
@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 7, 2026
Merged via the queue into master with commit 5c178ec May 7, 2026
165 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-retry-docker-pull-infra-failures branch May 7, 2026 13:35
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants