iframe-proxy

groeneai · 2026-06-14T16:44:00Z

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Related: #107425 (where this failure was triaged and the fix was requested).

Fixes a use-after-dtor of the HandledSignals singleton on shutdown, originally seen in Stress test (arm_msan) / Stress test (amd_msan) as an infinitely recursive MSan report. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107425&sha=b2f68b66f7510cdaaec64fb58e7c5601b58180c0&name_0=PR&name_1=Stress%20test%20%28arm_msan%29

Root cause: HandledSignals::instance() was a function-local static, so the singleton was destroyed at process exit. But the object is reachable from the signal handlers and from the registered sanitizer death callback (sanitizerDeathCallback() -> instance().reset(false)), both of which can run after static destructors. After destruction MSan poisons the members ("Member fields were destroyed" is MSan's use-after-dtor origin), so reset() iterating the destroyed handled_signals vector reads poisoned memory, raises an MSan warning, which calls Die() -> the death callback -> reset() again, recursing until the original report is lost.

Fix: intentionally leak the singleton (never destroy it), the same idiom already used for DateLUT::getInstance(). Its members then outlive every signal handler and death callback, so the post-exit read is on live, valid memory and no warning (and therefore no recursion) is produced. The now-redundant destructor (which only called reset() during static teardown, the exact window this removes) is dropped; normal shutdown still resets handlers and closes the pipe via the explicit instance().reset() calls in the BaseDaemon and ClientApplicationBase destructors.

No behavior change in non-sanitizer builds.

groeneai · 2026-06-14T16:44:18Z

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	The full-server MSan repro is non-deterministic (the seeding warning is chronic exit-time noise), so I reproduced the fix mechanism deterministically at codegen level instead: compiling `reset()`'s exact body with the CI MSan flags (`clang++-21 -fsanitize=memory -fsanitize-memory-use-after-dtor`) emits 4 `__msan_warning` calls without the attribute and 0 with it. That is the precise instrumentation whose firing inside `Die()`'s death callback drives the recursion.
b	Root cause explained?	At process exit the static `HandledSignals` singleton is destroyed; MSan poisons its members (report ends "Member fields were destroyed" = use-after-dtor origin). A later MSan warning calls `__sanitizer::Die()`, which runs `sanitizerDeathCallback()`. The callback is `DISABLE_SANITIZER_INSTRUMENTATION` but that attribute does not propagate to callees; it calls `HandledSignals::reset(false)`, which iterates the poisoned `handled_signals` while still instrumented -> fires another `__msan_warning` -> `Die()` -> callback -> `reset()` -> unbounded recursion (frames #16-#32 repeat at one PC), destroying the original report and turning the job into ERROR.
c	Fix matches root cause?	Yes. `reset()` is the only callee reachable from the death callback; marking it `DISABLE_SANITIZER_INSTRUMENTATION` removes exactly the instrumentation that re-enters the sanitizer. No behavior change (still resets handlers + closes pipe); macro is a no-op in non-sanitizer builds.
d	Test intent preserved / new tests added?	No existing test is modified or weakened. No new functional test is added: the failure is a sanitizer-build-only shutdown artifact with a nondeterministic seed and no SQL-level trigger, so a stable regression test is not feasible; the codegen IR check in (a) is the verification.
e	Both directions demonstrated?	Yes, at codegen level: `reset` body -> 4 `__msan_warning` calls without the attribute (recursion seed present), 0 with it (seed removed). The `-O2` inlined form (matching the single-PC stack) goes 14 -> 0 `__msan_` calls.
f	Fix is general, not a narrow patch?	The death callback (`sanitizerDeathCallback`) has exactly one callee, `HandledSignals::reset`; there is no sibling path with the same pattern. Prior art #78178 already hardened the other reentrancy hazard in the same callback (writing to a closed pipe). This addresses the root cause (instrumentation reentrancy from the death callback), not a symptom.

Session id: cron:clickhouse-worker-slot-2:20260614-161500

groeneai · 2026-06-14T16:45:07Z

cc @azat @Algunenano — could you review this? It marks HandledSignals::reset with DISABLE_SANITIZER_INSTRUMENTATION to stop an infinite MSan recursion on shutdown: the sanitizer death callback (already non-instrumented) calls reset(), which iterates the poisoned handled_signals while still instrumented, re-triggering Die(). Same callback hardened earlier in #78178.

clickhouse-gh · 2026-06-14T17:57:57Z

Workflow [PR], commit [fe7e8f8]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
Performance Comparison (arm_release, master_head, 3/6)		FAIL	Performance dashboard
	group_by_fixed_keys #5::old	FAIL	query history
	group_by_fixed_keys #5::new	FAIL	query history
	group_by_fixed_keys #6::old	FAIL	query history
	group_by_fixed_keys #6::new	FAIL	query history
	group_by_fixed_keys #9::old	FAIL	query history
	group_by_fixed_keys #9::new	FAIL	query history
	group_by_fixed_keys #10::old	FAIL	query history
	group_by_fixed_keys #10::new	FAIL	query history
	group_by_fixed_keys #11::old	FAIL	query history
	group_by_fixed_keys #11::new	FAIL	query history
	10 more test cases not shown

AI Review

Summary

The PR changes HandledSignals::instance from a destructed Meyers singleton to an intentionally leaked singleton and removes the destructor that reset handlers during static teardown. That matches the shutdown invariant the PR is trying to restore: signal handlers and the sanitizer death callback can still reach HandledSignals after static destructors, so the storage they read must not be destroyed. I did not find any remaining code issue that warrants an inline review comment.

Final Verdict

Status: ✅ Approve

groeneai · 2026-06-14T19:01:28Z

Pre-PR validation gate (revised fix)

#	Question	Answer
a	Deterministic repro?	The failure is a sanitizer-only, nondeterministic shutdown artifact (exit-time death-callback reentrancy, no SQL trigger), so a stable on-demand functional repro is not feasible. Instead the exact mechanism is reproduced deterministically at the IR level: compiling `instance()` with the CI MSan flags (`-fsanitize=memory -fsanitize-memory-use-after-dtor`) shows the Meyers-singleton form schedules a destructor (`__cxa_atexit`) and poisons members on destruction (`__sanitizer_dtor_callback`), which is the use-after-dtor seed read by `reset()`.
b	Root cause explained?	`HandledSignals::instance()` was a function-local static, destroyed at process exit. The object is reachable from signal handlers and the registered sanitizer death callback (`sanitizerDeathCallback -> instance().reset(false)`), both of which can run after static destructors. After destruction MSan poisons the members; `reset()` iterating the destroyed `handled_signals` vector reads poisoned memory, fires an MSan warning, which calls `Die()` -> the death callback -> `reset()` again, recursing until the original report is lost and the job becomes ERROR.
c	Fix matches root cause?	Yes. The bad state is a destroyed-then-read singleton. The fix makes the singleton never-destroyed (intentionally leaked), so the read is on live, valid memory rather than freed/poisoned storage. This addresses the source (invalid access), not just the symptom (the MSan report).
d	Test intent preserved / new tests added?	No existing test is modified or weakened. No new functional test is added: a sanitizer-only, nondeterministic, no-SQL-trigger shutdown artifact has no stable functional regression test; the IR-level structural check (no `__cxa_atexit` / no dtor poisoning for the leaked instance) is the verification.
e	Both directions demonstrated?	Yes, at the IR level with the CI MSan flags: OLD (Meyers) emits `__cxa_atexit=2` and `__sanitizer_dtor_callback=2` (the seed present); NEW (leaked) emits `__cxa_atexit=0` and `__sanitizer_dtor_callback=0` (seed removed). The modified TU also compiles clean in the debug build (RC=0).
f	Fix is general, not a narrow patch?	Yes. The leak also closes the related window where signal handlers (which write to `instance().signal_pipe` at lines 75/101/126/213) could touch a destroyed `PipeFDs` if a signal arrived during/after static teardown, not only the `reset()` path the bot flagged. All other `instance().reset()` callers (`BaseDaemon`, `ClientApplicationBase`, disks, gtest) are unaffected and still perform explicit cleanup.

Session id: cron:clickhouse-worker-slot-3:20260614-184900

The HandledSignals singleton is reachable from signal handlers and the sanitizer death callback, both of which can run after static destructors. With the previous Meyers singleton it was destroyed at process exit, so HandledSignals::reset() iterated the destroyed handled_signals vector. Under MSan that read fired a use-of-uninitialized-value warning, whose death callback called reset() again, recursing until the original report was lost. Leak the singleton (same idiom as DateLUT::getInstance) so its members outlive all teardown and reset() always reads valid memory. The explicit ~HandledSignals (which only called reset() during static teardown, the exact window removed) is dropped; normal shutdown still resets via the explicit instance().reset() calls in ~BaseDaemon and ~ClientApplicationBase. Do not disable instrumentation on reset(): an uninstrumented reset() does not write defined shadow for its loop-iterator locals, so they inherit stale poison from just-destroyed locals in the preceding writeSignalIDtoSignalPipe call; the still-instrumented operator== then reports a false positive on every client/local exit. The leak alone removes the recursion seed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-14T23:56:27Z

Pre-PR validation gate (updated for head `fe7e8f8`)

#	Question	Answer
a	Deterministic repro?	Yes. `clickhouse local -q "SELECT 1"` under MSan aborts on every exit (exit 134). Confirmed on the `amd_msan` binary for the prior head `f057d41` and reproduced locally.
b	Root cause explained?	Two layers. (1) Original: the Meyers-singleton `HandledSignals` is destroyed at process exit, but it is reachable from signal handlers + the sanitizer death callback (which run after static dtors); `reset()` then iterates the destroyed `handled_signals` vector -> MSan warning -> `Die()` -> death callback -> `reset()` -> recursion. (2) The first revision kept `DISABLE_SANITIZER_INSTRUMENTATION` on `reset()`; an uninstrumented `reset()` never writes "defined" shadow for its range-loop iterator locals, so they inherit stale poison from the just-destroyed locals in the preceding `writeSignalIDtoSignalPipe()` call, and the instrumented `operator==` callee reports it -> deterministic abort on the normal shutdown path (`~ClientApplicationBase:67`).
c	Fix matches root cause?	Yes. Leak the singleton so members are never destroyed/poisoned (fixes both the recursion and the destroyed-state access at the source), and remove `DISABLE_SANITIZER_INSTRUMENTATION` (the cause of layer 2; unnecessary once nothing is poisoned). Not a suppression.
d	Test intent preserved / new tests added?	No assertions weakened. No dedicated regression test added: the artifact is a sanitizer-only, nondeterministic shutdown report with no SQL trigger, so a stable functional test is not feasible; the `amd_msan` job itself (every `client`/`local` invocation exits cleanly) is the regression guard, and the integration suite exercises it thousands of times per run.
e	Both directions demonstrated?	Yes, with `amd_msan` binaries built from source. OLD (`f057d41`): `clickhouse local -q "SELECT 1"` -> exit 134, 36 use-of-uninit warnings. NEW (`fe7e8f8`): exit 0, 0 warnings, `HandledSignals::reset` absent from all stacks. Also clean: `numbers(1000)`, `version()`, `client --help`. `MSAN_OPTIONS=poison_in_dtor=0` makes OLD pass, confirming the use-after-dtor origin.
f	Fix is general, not a narrow patch?	Yes. The fix is at the source (object lifetime), not a guard at the crash site. The only other callers of `HandledSignals::instance().reset()` (`~BaseDaemon`, `~ClientApplicationBase`) now also operate on the live, never-destroyed object, so the same class of post-teardown read is fixed for all of them rather than one path. No sibling implementations exist.

Session id: cron:clickhouse-worker-slot-0:20260614-225000

clickhouse-gh · 2026-06-15T03:11:31Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 3/3 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 107 line(s) · Uncovered code

Full report · Diff report

`test_implicit_index_upgrade_alter_replay` creates a second replica `node2` and asserts `SELECT count() FROM test_alter_replay` equals `10001` immediately after `wait_for_active_replica`. But `wait_for_active_replica` only waits for `is_readonly = 0`, not for the freshly-joined replica to finish fetching parts. So `node2` can observe only the small `VALUES` part (1 row, `key = 99999`) before the 10000-row part is fetched, and the assertion fails with `assert '1' == '10001'`. This raced once each on several unrelated PRs over the last 30 days (e.g. ClickHouse#108084, ClickHouse#99280, ClickHouse#107566, ClickHouse#107442). Add `SYSTEM SYNC REPLICA test_alter_replay` after the replica becomes active, which blocks until all parts are fetched, before checking the row count. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108770&sha=bb760616d54e3d8e5d20968022085506c31b5f37&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%201%2F4%29 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

groeneai mentioned this pull request Jun 14, 2026

Honor materialized_postgresql_schema in MaterializedPostgreSQL table engine #107425

Open

1 task

alexey-milovidov added the can be tested Allows running workflows for external contributors label Jun 14, 2026

clickhouse-gh Bot added the pr-ci label Jun 14, 2026

clickhouse-gh Bot reviewed Jun 14, 2026

View reviewed changes

Comment thread src/Common/SignalHandlers.cpp Outdated

groeneai force-pushed the fix-msan-recursion-handledsignals-reset branch from 36f9f92 to f057d41 Compare June 14, 2026 19:00

groeneai force-pushed the fix-msan-recursion-handledsignals-reset branch from f057d41 to fe7e8f8 Compare June 14, 2026 23:54

This was referenced Jun 15, 2026

Feature: Paimon minmax index #100160

Open

Randomize tests with DETACH/ATTACH table before query execution #96130

Open

This was referenced Jun 24, 2026

Fix data-type names containing the substring INT being parsed as MySQL integers #108354

Merged

Fix flaky test_dns_cache: use default start timeout in cluster_ready #108498

Open

alexey-milovidov mentioned this pull request Jun 29, 2026

Add as-you-type autocompletion (inline hints) to the CLI #108070

Merged

1 task

This was referenced Jun 30, 2026

Reduce flakiness of test_distributed_ddl_parallel and test_restore_replica #108933

Merged

Backport #108493 to 25.8: Fix crash on deeply nested expressions with raised max_parser_depth #108896

Merged

This was referenced Jun 30, 2026

Implement CREATE HANDLER: SQL-defined HTTP handlers #106231

Open

Fix flaky test_implicit_index_upgrade_alter_replay (replica fetch race) #108961

Merged

Fix hang in SYSTEM SYNC MERGES with the Manual merge selector #108770

Open

groeneai mentioned this pull request Jun 30, 2026

Clear ColumnReplicated insertion cache on rollback #108935

Open

alexey-milovidov mentioned this pull request Jun 30, 2026

Parallelize read-in-order from a single part with PrefetchingConcatProcessor #100391

Open

groeneai mentioned this pull request Jun 30, 2026

Fix flaky test_backup_restore_on_cluster restore timeout #108993

Merged

This was referenced Jul 1, 2026

Switch the default compression to ZSTD(3) for table data and network #108786

Open

Do not leave an orphan table when CREATE ... AS SELECT is denied #108048

Open

Restore subquery plans after failed in-place set build #102308

Draft

alexey-milovidov mentioned this pull request Jul 2, 2026

Make to_throw settings cheangeable without restart #106821

Merged

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix recursive MSan report in HandledSignals::reset on shutdown#107442

Fix recursive MSan report in HandledSignals::reset on shutdown#107442
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:fix-msan-recursion-handledsignals-reset

groeneai commented Jun 14, 2026 •

edited

Loading

Uh oh!

groeneai commented Jun 14, 2026

Uh oh!

groeneai commented Jun 14, 2026

Uh oh!

clickhouse-gh Bot commented Jun 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

groeneai commented Jun 14, 2026

Uh oh!

groeneai commented Jun 14, 2026

Uh oh!

clickhouse-gh Bot commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Uh oh!

groeneai commented Jun 14, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 14, 2026

Uh oh!

clickhouse-gh Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

groeneai commented Jun 14, 2026

Pre-PR validation gate (revised fix)

Uh oh!

groeneai commented Jun 14, 2026

Pre-PR validation gate (updated for head fe7e8f8)

Uh oh!

clickhouse-gh Bot commented Jun 15, 2026

LLVM Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

groeneai commented Jun 14, 2026 •

edited

Loading

clickhouse-gh Bot commented Jun 14, 2026 •

edited

Loading

Pre-PR validation gate (updated for head `fe7e8f8`)