iframe-proxy

alexey-milovidov · 2026-04-07T16:22:13Z

When Docker image pulls time out during the pre-pull phase of integration tests, the job fails with no test results. The retry_infra_failures workflow did not detect this because: (1) the job ran for ~16 minutes, exceeding the 2-minute "quick failure" heuristic, and (2) the result JSON check only recognized "Checkout Submodules" failures.

Fix: tag the pre-pull failure result with the INFRA label, and teach the retry workflow to check for this label in result JSONs alongside the existing "Checkout Submodules" check.

Example failure: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100433&sha=e1442483394a39c78754a475408bc225d13532e6&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%203%2F4%29
#100433

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Version info

Merged into: 26.5.1.396

…to-retry When Docker image pulls time out during the pre-pull phase of integration tests, the job fails with no test results. The `retry_infra_failures` workflow did not detect this because: (1) the job ran for ~16 minutes, exceeding the 2-minute "quick failure" heuristic, and (2) the result JSON check only recognized "Checkout Submodules" failures. Fix: tag the pre-pull failure result with the `INFRA` label, and teach the retry workflow to check for this label in result JSONs alongside the existing "Checkout Submodules" check. https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100433&sha=e1442483394a39c78754a475408bc225d13532e6&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%203%2F4%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-04-07T16:22:52Z

alexey-milovidov · 2026-04-08T23:10:30Z

Didn't work...

alexey-milovidov · 2026-04-08T23:27:02Z

It has to be merged for the job to work from the master.

…ll-infra-failures

alexey-milovidov · 2026-04-12T15:53:14Z

Depends on #101292

alexey-milovidov · 2026-04-26T10:54:44Z

@groeneai, fix test_max_bytes_ratio_before_external_order_group_by_for_server and send a separate PR with the fix.

groeneai · 2026-04-26T13:02:45Z

@alexey-milovidov — investigated, here's the current state.

Previous attempt: PR #101869 (closed)

I submitted a pytest.skip for MSan on April 6 (mirroring your db1f459cc49 precedent for the _group_by variant). You approved it, but @Algunenano then identified that the actual root cause is the null-pointer dereference in MergeTreeDataPartWriterCompact::cancel (your own PR #101292), and rejected pytest.skip as the wrong shape of fix when the failure is a Fatal message rather than a timeout. I closed PR #101869 in favor of #101292.

Current failure breakdown (last 30 days, 19 distinct PRs)

Master had 1 failure in 30 days (April 6 — the connection-timeout variant, not the MSan crash).

What changed in your branch today (April 26)

You pushed ca13890b to PR #101292 at 04:39 UTC, picked up the merged #101239 (SVE/MSan) dependency, and got a clean LLVM coverage run. The only remaining red is Stress test (arm_ubsan) — looks like a flake worth re-triggering.

What direction would you like?

The dominant failure (16/19) is fixed by your PR #101292 once it lands. Options I see for a separate test-only PR:

Re-attempt the MSan pytest.skip — same as Skip external sort memory test under MSan #101869, with explicit "remove once Fix UBSan in MergeTreeDataPartWriterCompact::cancel #101292 merges" note. This addresses 16/19 failures immediately but reopens Algunenano's objection.
Reduce memory footprint — drop max_server_memory_usage from 10Gi to ~5Gi and numbers(100e6) to numbers(50e6). Keeps test purpose (peak > limit, ratio triggers external sort, no-ratio OOMs), but lessens the OS-level pressure that triggers the upstream merge-writer bug under MSan shadow-memory doubling. Doesn't skip anything.
Disable system-table logging in config.d/memory_overrides.yaml (<query_log remove="remove"/> etc.) — eliminates the background INSERT/merge path that contains the latent bug. Test logic unchanged.
Wait for Fix UBSan in MergeTreeDataPartWriterCompact::cancel #101292 to merge — if you can re-run the arm_ubsan stress test, the MSan failures stop on their own and only minor failure modes remain.

Lean toward (2) or (3) given Algunenano's earlier feedback, but happy to do whatever you'd prefer. Let me know.

(session: cron:clickhouse-ci-task-worker:20260426-124500)

The previous jq check `any(. == "infra")` only matched bare-string labels, but `Result.set_label` stores labels as dicts like `{"name": "infra", ...}`, so the infra-label branch never fired and Docker pre-pull failures were not auto-retried. Explanation for the comment "Didn't work..." on the PR.

…101292 fix

…ll-infra-failures

…scope The local `gh` CLI token lacks the `workflow` scope, so pushes that change `.github/workflows/*` are rejected. Restore all modified workflow files to the byte-identical content of the previous remote tip so the master-merge push is accepted. This temporarily reverts the dict-form label fix in `retry_infra_failures.yml`; the fix needs to be reapplied separately by a push from an environment that has the `workflow` scope.

The previous commit 7d022e8 restored .github/workflows/*.yml to the previous PR tip's content to bypass GitHub's workflow-scope push check. That left the YAMLs stale relative to master, which caused the `Check Workflows` step to fail with `workflows are outdated`. This commit replaces those YAMLs with the current `origin/master` content (byte-identical) so they match what `python3 -m praktika yaml` produces from master. The actual PR changes (`retry_infra_failures.yml` + `integration_test_job.py`) are preserved.

clickhouse-gh Bot added the pr-ci label Apr 7, 2026

divanik approved these changes Apr 7, 2026

View reviewed changes

clickhouse-gh Bot assigned divanik Apr 7, 2026

nikitamikhaylov mentioned this pull request Apr 7, 2026

feat(promql): Parse Prometheus Query API POST bodies as urlencoded form #101794

Merged

Merge remote-tracking branch 'origin/master' into fix-retry-docker-pu…

9cb75b9

…ll-infra-failures

alexey-milovidov added 8 commits May 3, 2026 05:49

Merge master into fix-retry-docker-pull-infra-failures to pick up PR #…

1aa2f83

…101292 fix

Merge remote-tracking branch 'origin/master' into fix-retry-docker-pu…

b1db8e6

…ll-infra-failures

Merge remote-tracking branch 'origin/master' into fix-retry-docker-pu…

329788a

…ll-infra-failures

Merge remote-tracking branch 'origin/master' into fix-retry-docker-pu…

a81c6ed

…ll-infra-failures

Merge remote-tracking branch 'origin/master' into fix-retry-docker-pu…

2b45bfb

…ll-infra-failures

alexey-milovidov added this pull request to the merge queue May 7, 2026

Merged via the queue into master with commit 5c178ec May 7, 2026
165 checks passed

alexey-milovidov deleted the fix-retry-docker-pull-infra-failures branch May 7, 2026 13:35

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label May 7, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Detect Docker image pre-pull failures as infrastructure errors for auto-retry#101970

Detect Docker image pre-pull failures as infrastructure errors for auto-retry#101970
alexey-milovidov merged 10 commits into
masterfrom
fix-retry-docker-pull-infra-failures

alexey-milovidov commented Apr 7, 2026 •

edited by robot-clickhouse

Loading

Uh oh!

clickhouse-gh Bot commented Apr 7, 2026 •

edited

Loading

Uh oh!

alexey-milovidov commented Apr 8, 2026

Uh oh!

alexey-milovidov commented Apr 8, 2026

Uh oh!

alexey-milovidov commented Apr 12, 2026

Uh oh!

alexey-milovidov commented Apr 26, 2026

Uh oh!

groeneai commented Apr 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

alexey-milovidov commented Apr 7, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Version info

Uh oh!

clickhouse-gh Bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

ClickHouse Rules

Final Verdict

Uh oh!

alexey-milovidov commented Apr 8, 2026

Uh oh!

alexey-milovidov commented Apr 8, 2026

Uh oh!

alexey-milovidov commented Apr 12, 2026

Uh oh!

alexey-milovidov commented Apr 26, 2026

Uh oh!

groeneai commented Apr 26, 2026

Previous attempt: PR #101869 (closed)

Current failure breakdown (last 30 days, 19 distinct PRs)

What changed in your branch today (April 26)

What direction would you like?

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alexey-milovidov commented Apr 7, 2026 •

edited by robot-clickhouse

Loading

clickhouse-gh Bot commented Apr 7, 2026 •

edited

Loading