Fix flaky timeout in test_profile_max_sessions_for_user (postgres)#107086
Conversation
threaded_run_test starts N session threads, each running an infinite SELECT ..., COUNT(*) FROM system.numbers, then relies on a single KILL QUERY WHERE user='test_user' SYNC to terminate them so the following thread.join() (no timeout) can return. That single KILL snapshots system.processes once. The test only waits for the "overflown session count" log, which is emitted at session acquisition (Session::trackSession), not at query execution. An accepted session whose SELECT has not yet registered in system.processes when the KILL runs is therefore not killed and runs forever; its thread blocks in the driver until pytest's 900s test-level timeout fires. This is timing sensitive and shows up almost exclusively on the postgres variant under MSan: psycopg2 has no client-side query timeout (the native clickhouse-client variants have the 600s timeout from helpers/client.py that masks the same race), and MSan widens the connect-to-execute window. Over 30 days CIDB shows 14 timeouts on the _postgres variant across 13 distinct PRs and master (11 of 14 on amd_msan) versus 1 each for the tcp/http/mysql/grpc variants, all with the identical stack: hang in thread.join() at test.py:98. Re-issue KILL QUERY ... SYNC in a loop until every worker thread has exited (bounded by a 60s deadline) so a late-registering query is caught by a subsequent iteration. The loop is driven by thread liveness, not by system.processes being empty: a process-count loop can break early on a transient gap (the fast query already killed, the slow query not yet registered) and reintroduces the same hang. The overflow assertion is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @vitlibar — could you review this? Test-only fix for the flaky 900s timeout in |
|
Workflow [PR], commit [5cd884a] Summary: ✅ AI ReviewSummaryThis PR makes Final VerdictStatus: ✅ Approve No further action required. |
The bounded KILL loop in threaded_run_test was followed by an unconditional thread.join(). If the 60s deadline expired with a worker still alive (a late-registering query that KILL never caught), control fell through to that join and blocked until pytest's 900s test-level timeout, reintroducing the exact hang this PR removes. The deadline only bounded the retry loop, not the final join. After the loop, fail fast with pytest.fail when any worker thread is still alive instead of blocking on an unbounded join. The happy path (all workers killed within the deadline) is unchanged. Addresses the clickhouse-gh[bot] review on this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pre-PR validation gateRe-posted for HEAD
Session id: cron:clickhouse-worker-slot-8:20260611-004300 |
LLVM Coverage ReportChanged lines: No C/C++ source files changed — skipping uncovered code analysis. Newly covered by added/modified tests: 330 line(s), 18 function(s) across 124 file(s) · Details Top files
|

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
Fixes a recurring 900s pytest timeout in
test_profile_max_sessions_for_user/test.py::test_profile_max_sessions_for_user_postgres, almost exclusively on theamd_msanintegration shard. Reported by @alexey-milovidov on #106175 (the failure is unrelated to that docs PR).threaded_run_teststarts the session threads, waits for theoverflown session countlog (emitted at session acquisition, not at query execution), then issues a singleKILL QUERY WHERE user='test_user' SYNCand joins the threads with no timeout. The single KILL snapshotssystem.processesonce; an accepted session whose infiniteSELECT ... COUNT(*) FROM system.numbershas not registered yet is not killed and runs forever, so its thread blocks until the pytest limit. The postgres worker uses psycopg2 with no client-side query timeout (native variants have the 600s timeout fromhelpers/client.pythat masks the same race), and MSan widens the connect-to-execute window. CIDB over 30 days: 14 timeouts on the postgres variant across 13 distinct PRs and master (11 of 14 on amd_msan) vs 1 each for tcp/http/mysql/grpc, all with the identicalthread.join()stack.Fix: re-issue the KILL in a bounded loop until every worker thread has exited, so a late-registering query is caught by a later iteration. The loop is driven by thread liveness rather than
system.processesbeing empty, because a process-count loop can break early on a transient gap and reintroduces the same hang. The overflow assertion is unchanged.Reproduced locally against a debug server: the current code hangs deterministically when an accepted session's query registers after the KILL; the patched code joins all threads within seconds.
Version info
26.6.1.661