iframe-proxy

groeneai · 2026-06-10T21:42:58Z

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Fixes a recurring 900s pytest timeout in test_profile_max_sessions_for_user/test.py::test_profile_max_sessions_for_user_postgres, almost exclusively on the amd_msan integration shard. Reported by @alexey-milovidov on #106175 (the failure is unrelated to that docs PR).

threaded_run_test starts the session threads, waits for the overflown session count log (emitted at session acquisition, not at query execution), then issues a single KILL QUERY WHERE user='test_user' SYNC and joins the threads with no timeout. The single KILL snapshots system.processes once; an accepted session whose infinite SELECT ... COUNT(*) FROM system.numbers has not registered yet is not killed and runs forever, so its thread blocks until the pytest limit. The postgres worker uses psycopg2 with no client-side query timeout (native variants have the 600s timeout from helpers/client.py that masks the same race), and MSan widens the connect-to-execute window. CIDB over 30 days: 14 timeouts on the postgres variant across 13 distinct PRs and master (11 of 14 on amd_msan) vs 1 each for tcp/http/mysql/grpc, all with the identical thread.join() stack.

Fix: re-issue the KILL in a bounded loop until every worker thread has exited, so a late-registering query is caught by a later iteration. The loop is driven by thread liveness rather than system.processes being empty, because a process-count loop can break early on a transient gap and reintroduces the same hang. The overflow assertion is unchanged.

Reproduced locally against a debug server: the current code hangs deterministically when an accepted session's query registers after the KILL; the patched code joins all threads within seconds.

Version info

Merged into: 26.6.1.661

threaded_run_test starts N session threads, each running an infinite SELECT ..., COUNT(*) FROM system.numbers, then relies on a single KILL QUERY WHERE user='test_user' SYNC to terminate them so the following thread.join() (no timeout) can return. That single KILL snapshots system.processes once. The test only waits for the "overflown session count" log, which is emitted at session acquisition (Session::trackSession), not at query execution. An accepted session whose SELECT has not yet registered in system.processes when the KILL runs is therefore not killed and runs forever; its thread blocks in the driver until pytest's 900s test-level timeout fires. This is timing sensitive and shows up almost exclusively on the postgres variant under MSan: psycopg2 has no client-side query timeout (the native clickhouse-client variants have the 600s timeout from helpers/client.py that masks the same race), and MSan widens the connect-to-execute window. Over 30 days CIDB shows 14 timeouts on the _postgres variant across 13 distinct PRs and master (11 of 14 on amd_msan) versus 1 each for the tcp/http/mysql/grpc variants, all with the identical stack: hang in thread.join() at test.py:98. Re-issue KILL QUERY ... SYNC in a loop until every worker thread has exited (bounded by a 60s deadline) so a late-registering query is caught by a subsequent iteration. The loop is driven by thread liveness, not by system.processes being empty: a process-count loop can break early on a transient gap (the fast query already killed, the slow query not yet registered) and reintroduces the same hang. The overflow assertion is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-10T21:43:21Z

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. Against a local debug server (postgresql_port enabled, `max_sessions_for_user=2`), with one accepted postgres session whose `SELECT ..., COUNT(*) FROM system.numbers` registers in `system.processes` after the single `KILL QUERY ... SYNC` (modeling MSan slowness via a fixed delay), the current `threaded_run_test` hangs every time and had to be SIGTERM-killed by an outer watchdog.
b	Root cause explained?	The test waits for the `overflown session count` log, which is emitted at session acquisition (`Session::trackSession`), not at query execution. The single `KILL QUERY ... SYNC` snapshots `system.processes` once; an accepted session whose infinite SELECT has not registered yet is not killed and runs forever, so `thread.join()` (no timeout) blocks until the pytest 900s limit. psycopg2 has no client-side query timeout (native variants are masked by the 600s timeout in `helpers/client.py`); MSan widens the connect-to-execute window. CIDB 30d: 14 postgres-variant timeouts across 13 distinct PRs + master (11/14 amd_msan) vs 1 each for the other protocols, all with the identical `thread.join()` stack.
c	Fix matches root cause?	Yes. Re-issue `KILL QUERY ... SYNC` in a bounded loop until every worker thread has exited, so a late-registering query is killed by a later iteration. This directly targets the snapshot-once race rather than masking it.
d	Test intent preserved / new tests added?	Preserved. The `overflown session count` assertion is unchanged; the change only affects post-assertion cleanup. No assertion is weakened. No new test needed (this is a test-synchronization fix, not a product-code change).
e	Both directions demonstrated?	Yes. Verbatim old code path: hangs (watchdog-terminated). Verbatim patched code path: all threads join within ~2.4s, 3/3 runs.
f	Fix is general, not a narrow patch?	The loop is driven by thread liveness, not by `system.processes` being empty. A process-count loop was tested and ALSO hangs: it breaks early on a transient gap (fast query already killed, slow query not yet registered). Thread-liveness avoids that. The same cleanup serves all protocol variants that route through `threaded_run_test`, so the fix is general, not postgres-specific.

Session id: cron:clickhouse-worker-slot-19:20260610-211000

groeneai · 2026-06-10T21:44:00Z

cc @vitlibar — could you review this? Test-only fix for the flaky 900s timeout in test_profile_max_sessions_for_user_postgres (amd_msan): the single KILL QUERY ... SYNC in threaded_run_test could miss an accepted session whose query had not registered in system.processes yet, leaving its thread blocked forever. The fix re-issues the KILL in a bounded loop until all worker threads exit. Validated locally both directions; the overflown session count assertion is unchanged.

clickhouse-gh · 2026-06-10T21:52:56Z

Workflow [PR], commit [5cd884a]

Summary: ✅

AI Review

Summary

This PR makes threaded_run_test retry KILL QUERY until the worker threads exit, then fail explicitly if the cleanup deadline expires. I did not find remaining correctness issues in the changed test helper; the earlier unbounded thread.join finding is fixed in the current PR head.

Final Verdict

Status: ✅ Approve

No further action required.

The bounded KILL loop in threaded_run_test was followed by an unconditional thread.join(). If the 60s deadline expired with a worker still alive (a late-registering query that KILL never caught), control fell through to that join and blocked until pytest's 900s test-level timeout, reintroducing the exact hang this PR removes. The deadline only bounded the retry loop, not the final join. After the loop, fail fast with pytest.fail when any worker thread is still alive instead of blocking on an unbounded join. The happy path (all workers killed within the deadline) is unchanged. Addresses the clickhouse-gh[bot] review on this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-11T00:54:34Z

Pre-PR validation gate

Re-posted for HEAD 5cd884aa9735 (fixup addressing the clickhouse-gh[bot] review on the cleanup path).

#	Question	Answer
a	Deterministic repro?	Yes. The concern is a Python control-flow defect, reproduced deterministically with a standalone harness (`.scratch/verify_cleanup.py`, not committed) that runs the byte-identical bounded loop + terminal logic against mock threads: a stubborn un-killable thread makes the OLD unconditional `thread.join()` block past the deadline (would hang to pytest's 900s timeout).
b	Root cause explained?	The bounded loop's 60s deadline bounded only the KILL retry loop. On deadline expiry with a worker still alive (a late-registering query `KILL QUERY` never caught), control fell through to a trailing unconditional `for thread in thread_list: thread.join()`, which blocks until pytest's 900s test-level timeout, reintroducing the exact hang this PR removes.
c	Fix matches root cause?	Yes. Replaces the unbounded trailing join with `if any(t.is_alive() for t in thread_list): pytest.fail(...)`, so the cleanup path is fully bounded by the deadline and fails fast (explicitly) instead of hanging. Directly targets the fall-through identified in (b), not a symptom.
d	Test intent preserved / new tests added?	Yes. The overflow assertion and the happy path (all workers killed within the deadline → normal return) are unchanged. The test still verifies `max_sessions_for_user` enforcement; it now additionally surfaces a stuck-session as an explicit fast failure rather than a 900s timeout. No assertion was weakened.
e	Both directions demonstrated?	Yes. Harness output: A-OLD — old unconditional join blocks past the deadline (would hang); A-NEW — new code raises `pytest.fail` and returns in ~deadline seconds (no unbounded join); B-NEW — normal threads that exit during the loop return normally with no spurious `pytest.fail`. All checks pass.
f	Fix is general, not a narrow patch?	The fall-through join was the single terminal cleanup path shared by all six `threaded_run_test` variants (tcp/http/mysql/postgres/grpc/http_named), so bounding it covers every variant. It is not a defensive guard masking an upstream bug — `pytest.fail` is the correct, explicit terminal behavior when a session genuinely cannot be killed within the deadline.

Session id: cron:clickhouse-worker-slot-8:20260611-004300

clickhouse-gh · 2026-06-11T03:05:40Z

LLVM Coverage Report

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 330 line(s), 18 function(s) across 124 file(s) · Details

Top files

src/Backups/BackupEntriesCollector.cpp: 31 line(s), 2 function(s)
src/Functions/multiIf.cpp: 9 line(s)
src/Common/ColumnsHashing.h: 8 line(s), 2 function(s)
src/Functions/array/arrayNorm.cpp: 7 line(s)
src/IO/S3/deleteFileFromS3.cpp: 7 line(s)

Full report

groeneai mentioned this pull request Jun 10, 2026

fix(docs): correct h3Distance example output value #106175

Merged

nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 10, 2026

clickhouse-gh Bot added the pr-ci label Jun 10, 2026

clickhouse-gh Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread tests/integration/test_profile_max_sessions_for_user/test.py Outdated

groeneai mentioned this pull request Jun 10, 2026

Fix crash moving an empty part to a plain_rewritable disk #107040

Open

This was referenced Jun 11, 2026

Fix Bad cast in StatisticsPartPruner with LowCardinality column #106793

Merged

Fix duplicate ALIAS columns causing column count mismatch in Distributed #101789

Open

Support group_by_use_nulls in WITH TOTALS #107106

Open

nikitamikhaylov approved these changes Jun 11, 2026

View reviewed changes

nikitamikhaylov self-assigned this Jun 11, 2026

nikitamikhaylov added this pull request to the merge queue Jun 11, 2026

Merged via the queue into ClickHouse:master with commit 6e9047c Jun 11, 2026
166 checks passed

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 11, 2026

alexey-milovidov mentioned this pull request Jun 12, 2026

Restore subquery plans after failed in-place set build #102308

Draft

1 task

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix flaky timeout in test_profile_max_sessions_for_user (postgres)#107086

Fix flaky timeout in test_profile_max_sessions_for_user (postgres)#107086
nikitamikhaylov merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-flaky-max-sessions-postgres-timeout

groeneai commented Jun 10, 2026 •

edited by robot-clickhouse

Loading

Uh oh!

groeneai commented Jun 10, 2026

Uh oh!

groeneai commented Jun 10, 2026

Uh oh!

clickhouse-gh Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

groeneai commented Jun 11, 2026

Uh oh!

clickhouse-gh Bot commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented Jun 10, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Version info

Uh oh!

groeneai commented Jun 10, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 10, 2026

Uh oh!

clickhouse-gh Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

groeneai commented Jun 11, 2026

Pre-PR validation gate

Uh oh!

clickhouse-gh Bot commented Jun 11, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

groeneai commented Jun 10, 2026 •

edited by robot-clickhouse

Loading

clickhouse-gh Bot commented Jun 10, 2026 •

edited

Loading