Fix flaky timeout in test_profile_max_sessions_for_user (postgres) by groeneai · Pull Request #107086 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix flaky timeout in test_profile_max_sessions_for_user (postgres)#107086

Merged
nikitamikhaylov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-flaky-max-sessions-postgres-timeout
Jun 11, 2026
Merged

Fix flaky timeout in test_profile_max_sessions_for_user (postgres)#107086
nikitamikhaylov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-flaky-max-sessions-postgres-timeout

Conversation

@groeneai

@groeneai groeneai commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Fixes a recurring 900s pytest timeout in test_profile_max_sessions_for_user/test.py::test_profile_max_sessions_for_user_postgres, almost exclusively on the amd_msan integration shard. Reported by @alexey-milovidov on #106175 (the failure is unrelated to that docs PR).

threaded_run_test starts the session threads, waits for the overflown session count log (emitted at session acquisition, not at query execution), then issues a single KILL QUERY WHERE user='test_user' SYNC and joins the threads with no timeout. The single KILL snapshots system.processes once; an accepted session whose infinite SELECT ... COUNT(*) FROM system.numbers has not registered yet is not killed and runs forever, so its thread blocks until the pytest limit. The postgres worker uses psycopg2 with no client-side query timeout (native variants have the 600s timeout from helpers/client.py that masks the same race), and MSan widens the connect-to-execute window. CIDB over 30 days: 14 timeouts on the postgres variant across 13 distinct PRs and master (11 of 14 on amd_msan) vs 1 each for tcp/http/mysql/grpc, all with the identical thread.join() stack.

Fix: re-issue the KILL in a bounded loop until every worker thread has exited, so a late-registering query is caught by a later iteration. The loop is driven by thread liveness rather than system.processes being empty, because a process-count loop can break early on a transient gap and reintroduces the same hang. The overflow assertion is unchanged.

Reproduced locally against a debug server: the current code hangs deterministically when an accepted session's query registers after the KILL; the patched code joins all threads within seconds.

Version info

  • Merged into: 26.6.1.661

threaded_run_test starts N session threads, each running an infinite
SELECT ..., COUNT(*) FROM system.numbers, then relies on a single
KILL QUERY WHERE user='test_user' SYNC to terminate them so the
following thread.join() (no timeout) can return.

That single KILL snapshots system.processes once. The test only waits
for the "overflown session count" log, which is emitted at session
acquisition (Session::trackSession), not at query execution. An accepted
session whose SELECT has not yet registered in system.processes when the
KILL runs is therefore not killed and runs forever; its thread blocks in
the driver until pytest's 900s test-level timeout fires.

This is timing sensitive and shows up almost exclusively on the postgres
variant under MSan: psycopg2 has no client-side query timeout (the native
clickhouse-client variants have the 600s timeout from helpers/client.py
that masks the same race), and MSan widens the connect-to-execute window.
Over 30 days CIDB shows 14 timeouts on the _postgres variant across 13
distinct PRs and master (11 of 14 on amd_msan) versus 1 each for the
tcp/http/mysql/grpc variants, all with the identical stack: hang in
thread.join() at test.py:98.

Re-issue KILL QUERY ... SYNC in a loop until every worker thread has
exited (bounded by a 60s deadline) so a late-registering query is caught
by a subsequent iteration. The loop is driven by thread liveness, not by
system.processes being empty: a process-count loop can break early on a
transient gap (the fast query already killed, the slow query not yet
registered) and reintroduces the same hang. The overflow assertion is
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @vitlibar — could you review this? Test-only fix for the flaky 900s timeout in test_profile_max_sessions_for_user_postgres (amd_msan): the single KILL QUERY ... SYNC in threaded_run_test could miss an accepted session whose query had not registered in system.processes yet, leaving its thread blocked forever. The fix re-issues the KILL in a bounded loop until all worker threads exit. Validated locally both directions; the overflown session count assertion is unchanged.

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 10, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [5cd884a]

Summary:


AI Review

Summary

This PR makes threaded_run_test retry KILL QUERY until the worker threads exit, then fail explicitly if the cleanup deadline expires. I did not find remaining correctness issues in the changed test helper; the earlier unbounded thread.join finding is fixed in the current PR head.

Final Verdict

Status: ✅ Approve

No further action required.

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 10, 2026
Comment thread tests/integration/test_profile_max_sessions_for_user/test.py Outdated
The bounded KILL loop in threaded_run_test was followed by an
unconditional thread.join(). If the 60s deadline expired with a worker
still alive (a late-registering query that KILL never caught), control
fell through to that join and blocked until pytest's 900s test-level
timeout, reintroducing the exact hang this PR removes. The deadline only
bounded the retry loop, not the final join.

After the loop, fail fast with pytest.fail when any worker thread is
still alive instead of blocking on an unbounded join. The happy path
(all workers killed within the deadline) is unchanged. Addresses the
clickhouse-gh[bot] review on this PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate

Re-posted for HEAD 5cd884aa9735 (fixup addressing the clickhouse-gh[bot] review on the cleanup path).

# Question Answer
a Deterministic repro? Yes. The concern is a Python control-flow defect, reproduced deterministically with a standalone harness (.scratch/verify_cleanup.py, not committed) that runs the byte-identical bounded loop + terminal logic against mock threads: a stubborn un-killable thread makes the OLD unconditional thread.join() block past the deadline (would hang to pytest's 900s timeout).
b Root cause explained? The bounded loop's 60s deadline bounded only the KILL retry loop. On deadline expiry with a worker still alive (a late-registering query KILL QUERY never caught), control fell through to a trailing unconditional for thread in thread_list: thread.join(), which blocks until pytest's 900s test-level timeout, reintroducing the exact hang this PR removes.
c Fix matches root cause? Yes. Replaces the unbounded trailing join with if any(t.is_alive() for t in thread_list): pytest.fail(...), so the cleanup path is fully bounded by the deadline and fails fast (explicitly) instead of hanging. Directly targets the fall-through identified in (b), not a symptom.
d Test intent preserved / new tests added? Yes. The overflow assertion and the happy path (all workers killed within the deadline → normal return) are unchanged. The test still verifies max_sessions_for_user enforcement; it now additionally surfaces a stuck-session as an explicit fast failure rather than a 900s timeout. No assertion was weakened.
e Both directions demonstrated? Yes. Harness output: A-OLD — old unconditional join blocks past the deadline (would hang); A-NEW — new code raises pytest.fail and returns in ~deadline seconds (no unbounded join); B-NEW — normal threads that exit during the loop return normally with no spurious pytest.fail. All checks pass.
f Fix is general, not a narrow patch? The fall-through join was the single terminal cleanup path shared by all six threaded_run_test variants (tcp/http/mysql/postgres/grpc/http_named), so bounding it covers every variant. It is not a defensive guard masking an upstream bug — pytest.fail is the correct, explicit terminal behavior when a session genuinely cannot be killed within the deadline.

Session id: cron:clickhouse-worker-slot-8:20260611-004300

@clickhouse-gh

clickhouse-gh Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.60% 84.60% +0.00%
Functions 92.30% 92.30% +0.00%
Branches 77.20% 77.20% +0.00%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 330 line(s), 18 function(s) across 124 file(s) · Details

Top files
  • src/Backups/BackupEntriesCollector.cpp: 31 line(s), 2 function(s)
  • src/Functions/multiIf.cpp: 9 line(s)
  • src/Common/ColumnsHashing.h: 8 line(s), 2 function(s)
  • src/Functions/array/arrayNorm.cpp: 7 line(s)
  • src/IO/S3/deleteFileFromS3.cpp: 7 line(s)

Full report

@nikitamikhaylov nikitamikhaylov self-assigned this Jun 11, 2026
@nikitamikhaylov nikitamikhaylov added this pull request to the merge queue Jun 11, 2026
Merged via the queue into ClickHouse:master with commit 6e9047c Jun 11, 2026
166 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants