Disable 03357_join_pk_sharding on TSan by groeneai · Pull Request #105902 · ClickHouse/ClickHouse · GitHub
Skip to content

Disable 03357_join_pk_sharding on TSan#105902

Merged
vdimir merged 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-03357-no-tsan
May 27, 2026
Merged

Disable 03357_join_pk_sharding on TSan#105902
vdimir merged 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-03357-no-tsan

Conversation

@groeneai

@groeneai groeneai commented May 26, 2026

Copy link
Copy Markdown
Contributor

Disable 03357_join_pk_sharding on TSan to stop chronic timeout flakes.

The test creates three 1M-row MergeTree tables and runs ten multi-table JOIN queries with EXPLAIN ACTIONS. On TSan it runs ~15-30x slower than on debug or release builds, occasionally exceeding the 600s per-test timeout budget for long-tagged tests under unfavorable randomized settings.

CIDB over the last 60 days for 03357_join_pk_sharding:

  • 5 timeout failures, all at exactly 600.06s.
  • All 5 on Stateless tests (amd_tsan, parallel, ...) jobs (4 amd_tsan parallel + 1 amd_tsan s3).
  • 0 timeouts on any non-TSan build.
  • TSan p95 = 120s, p99 = 178s, max = 469s for the passing runs.

5 unrelated PRs were affected: #103112 (vector_spann), #98939 (text index), #102444 (hash join prefetch), #103602, #102001. The test already excludes ASan and MSan with no-asan, no-msan for the same reason. TSan is typically the slowest sanitizer and belongs in the same list. Functional coverage of query_plan_join_shard_by_pk_ranges is preserved on debug, release, ARM, and ASan+UBSan (distributed plan) builds.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103112&sha=3cfc7fde0c115e6368c6f235eddb2cd112767c94&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is unchanged

Version info

  • Merged into: 26.6.1.189

The test creates three 1M-row MergeTree tables and runs ten multi-table
`JOIN` queries with `EXPLAIN ACTIONS`. On TSan the test runs ~15-30x
slower than on debug or release builds (CIDB shows TSan p95 = 120s,
p99 = 178s, max = 469s) and very occasionally exceeds the 600s per-test
timeout budget for `long`-tagged tests under unfavorable randomized
settings, producing chronic flaky CI failures across unrelated PRs.

CIDB (60 days): 5 timeout failures (all at exactly 600.06s), all five
on `Stateless tests (amd_tsan, parallel, ...)` jobs. Zero timeouts on
any non-TSan build over the same period.

The test already excludes ASan and MSan with `no-asan, no-msan` for the
same reason; TSan is the slowest sanitizer and belongs in the same
list. Functional coverage of `query_plan_join_shard_by_pk_ranges` is
preserved on debug, release, ARM, and ASan+UBSan (distributed plan)
builds.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103112&sha=3cfc7fde0c115e6368c6f235eddb2cd112767c94&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @vdimir — could you review this? Test-only change adding no-tsan to 03357_join_pk_sharding, consistent with the existing no-asan, no-msan tags, to stop chronic 600s timeouts that exclusively hit on amd_tsan jobs (CIDB: 5/5 timeouts on TSan, 0 on any other build over 60 days).

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label May 26, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [f8bd40c]


AI Review

Summary

This PR changes one line in tests/queries/0_stateless/03357_join_pk_sharding.sql, adding no-tsan to the test tags to stop known TSan-only timeout flakiness for this long-running test. The change is narrowly scoped, consistent with the existing no-asan/no-msan exclusions, and does not alter product behavior.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 26, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.20% +0.10%
Functions 91.40% 91.40% +0.00%
Branches 76.60% 76.60% +0.00%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 856 line(s), 67 function(s) across 117 file(s) · Details

Top files
  • src/Server/ACME/Client.cpp: 187 line(s), 9 function(s)
  • src/Server/ACME/API.cpp: 160 line(s), 15 function(s)
  • src/Common/OpenSSLHelpers.cpp: 52 line(s), 2 function(s)
  • src/Storages/StorageReplicatedMergeTree.cpp: 43 line(s)
  • src/Common/Crypto/KeyPair.cpp: 28 line(s), 5 function(s)

Full report

@vdimir vdimir enabled auto-merge May 27, 2026 10:32
@antaljanosbenjamin antaljanosbenjamin self-assigned this May 27, 2026
@antaljanosbenjamin

Copy link
Copy Markdown
Member

@vdimir vdimir added this pull request to the merge queue May 27, 2026
Merged via the queue into ClickHouse:master with commit f74c9de May 27, 2026
163 of 165 checks passed
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 27, 2026
groeneai added a commit to groeneai/ClickHouse that referenced this pull request May 27, 2026
The 2026-05-27 master merge into this PR branch accidentally undid the
master-side changes to three files unrelated to this PR's StorageMergeTree
ALTER fix:

  - src/Functions/tests/gtest_functions_stress.cpp: re-introduced the
    per-iteration max_execution_time / CancellationChecker /
    ProcessList wiring (originally landed in ClickHouse#105146, reverted by
    ClickHouse#105163, re-landed on master). Without this, function_prop_fuzzer
    iterations cannot be interrupted and the job regresses to long/stuck
    behavior.
  - src/Interpreters/CancellationChecker.cpp: re-introduced the
    'stop_thread = false' reset at the end of workerFunction so the
    singleton can be restarted in tests.
  - tests/queries/0_stateless/03357_join_pk_sharding.sql: re-added the
    'no-tsan' tag landed by ClickHouse#105902.

Reverting these hunks restores the master versions so this PR only ships
the intended StorageMergeTree / StorageReplicatedMergeTree areNonReplicatedAlterCommands
branch and the new 04248 regression test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants