Fix flaky timeouts in union race condition tests by antonio2368 · Pull Request #102372 · ClickHouse/ClickHouse · GitHub

Fix flaky timeouts in union race condition tests #102372

Merged
antonio2368 merged 1 commit into master from fix-race-test-timeouts
Apr 10, 2026
@antonio2368 antonio2368 commented Apr 10, 2026

Member

Tests 00090, 00091, 00093, and 00094 use fixed-iteration loops (10 x 100 queries) that cannot self-terminate before the test runner's 180-second TOO_LONG threshold. Under the flaky check, ThreadFuzzer enables pthread mutex wrapping (50% CPU-migration probability per mutex operation), which inflates per-query latency 4-6x. Combined with the 14 new randomized query-plan settings from #101638, this pushes the tests past the timeout on ASan/Debug/Binary builds (100% failure rate on ASan).
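
A rough back-of-the-envelope check of those numbers, assuming the 10 x 100 queries run one after another and dominate the test's runtime (an assumption for illustration, not something measured here). The variable names are hypothetical; only the 180-second threshold, the query count, and the 4-6x factor come from the description above.

```shell
#!/usr/bin/env bash
# Per-query time budget implied by the TOO_LONG threshold (illustrative only).
TOO_LONG_MS=$((180 * 1000))   # 180-second threshold, in milliseconds
QUERIES=$((10 * 100))         # fixed-iteration loop: 10 x 100 queries
BUDGET_MS=$((TOO_LONG_MS / QUERIES))
echo "per-query budget: ${BUDGET_MS} ms"

# With latency inflated 4-6x, the uninflated per-query latency has to stay
# below budget / factor for the fixed loop to finish in time.
for factor in 4 6; do
    echo "at ${factor}x inflation, baseline must stay under $((BUDGET_MS / factor)) ms"
done
```

Under these assumptions the budget is 180 ms per query, so a baseline latency above roughly 30-45 ms is already enough to hit the threshold once the inflation applies.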

Sibling test 00092 had the identical problem and was already fixed in #96976 by switching to a TIMELIMIT=$((SECONDS + 100)) wall-clock loop. This PR applies the same pattern to the remaining four tests.

Race detection is preserved: incorrect query results still cause grep to match, print Fail!, and break the loop, producing a reference file mismatch.
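
A minimal sketch of the wall-clock loop pattern described above, assuming bash (for its builtin `SECONDS` counter). `run_query` and `race_loop` are hypothetical names, and the echoed `1` stands in for real query output; the actual tests call the ClickHouse client and use a 100-second budget, so this is an illustration of the mechanism rather than the test code itself.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real query; prints the expected result.
run_query() {
    echo "1"
}

# Repeat the query until the time budget (in seconds) expires or a wrong
# result shows up, whichever comes first.
race_loop() {
    local budget=$1
    # SECONDS is a bash builtin: whole seconds since the shell started.
    local TIMELIMIT=$((SECONDS + budget))
    while [ "$SECONDS" -lt "$TIMELIMIT" ]; do
        # grep -v succeeds if any output line is NOT the expected value.
        if run_query | grep -qv '^1$'; then
            echo 'Fail!'
            return 1
        fi
    done
    return 0
}

# Short budget for demonstration; prints OK when no bad result was seen.
race_loop 1 && echo 'OK'
```

Because the loop exits on elapsed time rather than on an iteration count, a slow build simply runs fewer iterations instead of overrunning the runner's threshold, while a wrong result still ends the loop early and prints `Fail!`.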

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100773&sha=43fb80eb293d22cd90d58702f50a018bd651934f&name_0=PR&name_1=Stateless%20tests%20%28amd_debug%2C%20flaky%20check%29

Related: #100773, #95354

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.4.1.809

The fixed-iteration loops could take varying amounts of wall-clock time
depending on system load, causing flaky timeouts in CI. Replace them
with `while [ $SECONDS -lt $TIMELIMIT ]` loops capped at 100 seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Apr 10, 2026

Contributor

clickhouse-gh Bot added the pr-ci label Apr 10, 2026
kssenii self-assigned this Apr 10, 2026

Member

nit: TIMELIMIT is a bit difficult to read; maybe TIME_LIMIT is better

Member Author

It's the naming used in other tests with the exact same mechanic, so I would prefer keeping it for now.

antonio2368 added this pull request to the merge queue Apr 10, 2026
Merged via the queue into master with commit 7c4cf23 Apr 10, 2026
164 checks passed
antonio2368 deleted the fix-race-test-timeouts branch April 10, 2026 15:05
robot-ch-test-poll1 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Apr 10, 2026
antonio2368 mentioned this pull request Apr 13, 2026
1 task

Labels

pr-ci, pr-synced-to-cloud (The PR is synced to the cloud repo)

3 participants