iframe-proxy

groeneai · 2026-07-02T14:43:21Z

Related: #95472

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a false LOGICAL_ERROR ("There was an error on : Cannot obtain error message (probably it's a bug)") that could abort the server in debug and sanitizer builds when running a DDL query on a Replicated database and its per-host finished status node in Keeper had already been cleaned up.

Description

Found on public master CI (chronic, not tied to a single PR). Public CIDB (play.clickhouse.com), last 60 days: test_context_raw LIKE '%handleNonZeroStatusCode%' -> 75 hits (2 on master); LIKE '%There was an error on %' -> 81 hits (3 on master). Report example: https://s3.amazonaws.com/clickhouse-test-reports/json.html . Prior auto-generated issues for the same signature (both closed, but the crash still recurs): #95472, #95539.

Root cause. In debug/sanitizer builds only, ReplicatedDatabaseQueryStatusSource::checkStatus re-reads the per-host finished/<host_id> ZooKeeper node to double-check the remote status. For a Replicated database that node is always written with status 0 in the same transaction that applies the query (DDLWorker::processTask, finished_node_path created with ExecutionStatus(0)), and on error the executing replica throws UNFINISHED and retries without ever persisting a non-zero status (the is_replicated_database_task branch never reaches the makeSetRequest(finished_node_path, ...) line). So a present finished/<host_id> node always holds status 0, and an absent one is the benign cleaner/retry race. The sibling DistributedQueryStatusSource::generate() already handles the same race for the parent node via its UNFINISHED "deleted by the cleaner" branch.

When the node was absent, DistributedQueryStatusSource::getExecutionStatus returned its sentinel ExecutionStatus(-1, "Cannot obtain error message") unchanged (the message is never overwritten, which is exactly the string seen in the crash). Back in generate(), status.code != 0 (it is -1), so handleNonZeroStatusCode threw LOGICAL_ERROR, which aborts the server in debug/sanitizer builds. The debug cross-check could therefore only ever produce a false positive for a Replicated database, never catch a real remote error.

Fix. In ReplicatedDatabaseQueryStatusSource::checkStatus, treat a negative sentinel code (absent/unreadable finished/<host_id> node) as success, mirroring the non-debug branch, so the benign race is no longer misreported. A genuinely persisted non-zero status (positive code) still throws, preserving the cross-check's real purpose. The change is confined to the Replicated-database path; the ON CLUSTER path (DDLOnClusterQueryStatusSource), which reads real persisted error statuses, is untouched.

Testing. Added 04401_replicated_database_ddl_status_finished_node_missing, which uses a new failpoint (replicated_database_status_finished_node_missing) to force the "finished node missing" read deterministically. Verified both directions on a debug build: without the fix the test aborts the server with the exact signature above (stack: handleNonZeroStatusCode <- DistributedQueryStatusSource::generate); with the fix the DDL succeeds and the server stays alive.

groeneai · 2026-07-02T14:43:47Z

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. New REGULAR failpoint `replicated_database_status_finished_node_missing` forces the "finished node absent" read; with it enabled, a `CREATE TABLE` on a `Replicated` database aborts the server every time (no timing luck needed).
b	Root cause explained?	Yes. For a `Replicated` DB, `DDLWorker::processTask` writes `finished/<host_id>` with status 0 and on error throws `UNFINISHED` and retries without persisting a non-zero status; so an absent node is the benign cleaner/retry race. The debug-only `checkStatus` returned `getExecutionStatus`'s sentinel `ExecutionStatus(-1, ...)` for the absent node, `generate()` saw `code != 0` and `handleNonZeroStatusCode` threw `LOGICAL_ERROR` -> debug/sanitizer abort.
c	Fix matches root cause?	Yes. `checkStatus` now treats a negative sentinel code (absent/unreadable node) as success `{0}`, exactly the benign-race case, at the source. This protects both consumers of the status (`handleNonZeroStatusCode` and `fillHostStatus`, which both throw on `code != 0`).
d	Test intent preserved / new tests added?	New regression test `04401_replicated_database_ddl_status_finished_node_missing`; no existing test weakened. A genuinely persisted non-zero (positive) status still throws, so the cross-check's real purpose is preserved.
e	Both directions demonstrated?	Yes. Debug build: without the fix the test aborts the server with the exact signature (`There was an error on s1\|r1: Cannot obtain error message (probably it's a bug)`, stack `handleNonZeroStatusCode` <- `DistributedQueryStatusSource::generate`); with the fix it passes 9/9 and the server stays alive.
f	Fix is general across code paths?	Yes. The fix is at the source in `checkStatus`, so both downstream throwers are covered. The sibling `ON CLUSTER` path (`DDLOnClusterQueryStatusSource`) is intentionally untouched: for that path `DDLWorker` does persist real error statuses, so it must keep reporting them.
g	Fix generalizes across inputs (params/datatypes/wrappers)?	Yes. `getExecutionStatus` only ever produces `code == -1` as its sentinel; `code < 0` covers it. Any real deserialized non-zero remote status has a positive code and still throws. Independent of query/table type.
h	Backward compatible? (maintainer-approved exception only)	Yes. Change is confined to a `#ifdef DEBUG_OR_SANITIZER_BUILD` cross-check; no setting default, on-disk/wire/replication format, or user-visible behavior changes.
i	Invariants and contracts preserved?	Yes. The debug cross-check's contract (catch a genuinely persisted non-zero remote status, which "should not happen" for a Replicated DB) is preserved; only the sentinel false positive is suppressed, matching the invariant the non-debug branch already relies on.

Session id: cron:clickhouse-worker-slot-4:20260702-134800

groeneai · 2026-07-02T14:44:14Z

cc @tavplubix @tuanpach — could you review this? In debug/sanitizer builds ReplicatedDatabaseQueryStatusSource::checkStatus re-reads finished/<host_id> from Keeper; for a Replicated DB that node holds status 0 when present and is absent only in the benign cleaner/retry race, but an absent node returned the getExecutionStatus sentinel (code -1) which generate() reported as a remote error and threw LOGICAL_ERROR ("... probably it's a bug"), aborting the server. The fix treats the negative sentinel as success like the non-debug branch; a real persisted non-zero status still throws. Regression test uses a new failpoint to force the missing-node read.

clickhouse-gh · 2026-07-02T16:12:48Z

Workflow [PR], commit [7334ddf]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
Stateless tests (amd_tsan, parallel, 2/2)		FAIL
	02898_parallel_replicas_progress_bar	FAIL	cidb
Performance Comparison (arm_release, master_head, 2/6)		FAIL	Performance dashboard
	codecs_int_insert #6::old	FAIL	query history
	codecs_int_insert #6::new	FAIL	query history
	codecs_int_insert #16::old	FAIL	query history
	codecs_int_insert #16::new	FAIL	query history
	file_table_function #11::old	FAIL	query history
	file_table_function #11::new	FAIL	query history
	file_table_function #23::old	FAIL	query history
	file_table_function #23::new	FAIL	query history
	number_formatting_formats #1::old	FAIL	query history
	number_formatting_formats #1::new	FAIL	query history
	2 more test cases not shown
Finish Workflow		FAIL
	python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py	FAIL

AI Review

Summary

This PR fixes the debug/sanitizer-only false LOGICAL_ERROR in replicated-database DDL status handling by making ExecutionStatus::tryDeserializeText atomic and by treating only the truly absent finished/<host_id> node as the benign race. The current code preserves the corrupt-payload cross-check, the new unit and stateless regressions cover both sides of the fix, and I did not find any remaining code-level blockers.

PR Metadata

💡 The selected Changelog category is not semantically correct for this change. The bug only affects debug/sanitizer builds on master; release behavior is unchanged, so this should not go into the user-facing Bug Fix changelog section.
Exact replacement: ### Changelog category (leave one): then - Not for changelog (changelog entry is not required)
Changelog entry is then not required.

Final Verdict

✅ No remaining code findings. After changing the PR metadata to Not for changelog, this looks ready.

…issing In debug/sanitizer builds ReplicatedDatabaseQueryStatusSource::checkStatus re-reads the per-host finished/<host_id> ZooKeeper node to double-check the remote status. For a Replicated database this node is always written with status 0 in the transaction that applies the query (DDLWorker::processTask), and on error the executing replica throws UNFINISHED and retries without ever persisting a non-zero status. So a present finished/<host_id> node always holds status 0, and an absent one is the benign cleaner/retry race (the sibling generate() already handles the same race for the parent node via its UNFINISHED deleted-by-the-cleaner branch). When the node was absent, DistributedQueryStatusSource::getExecutionStatus returned its sentinel ExecutionStatus(-1, 'Cannot obtain error message') unchanged. generate() then saw code != 0 and called handleNonZeroStatusCode, which threw LOGICAL_ERROR 'There was an error on <host>: Cannot obtain error message (probably its a bug)', aborting the server in debug/sanitizer builds. getExecutionStatus returns that same sentinel in two cases: the node was absent (tryGet == false, the benign race), and the node was present but its payload could not be deserialized (a corrupt finished/<host_id>). Distinguishing them by the sentinel code alone would silently turn a corrupt payload into success and defeat that arm of the cross-check. getExecutionStatus now reports node presence via an optional out-parameter, and checkStatus maps only the truly absent-node case to ExecutionStatus{0}. A present-but-unreadable node keeps the sentinel so the cross-check still fires. The ON CLUSTER consumer keeps the default nullptr and is unaffected. Adds a regression test using a new failpoint (replicated_database_status_finished_node_missing) that forces the missing-node read deterministically. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-07-02T17:55:07Z

Updated pre-PR validation gate after the review refinement (5c370b9):

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. Failpoint `replicated_database_status_finished_node_missing` forces the missing-node read; a Replicated-DB DDL under it deterministically aborts a debug/sanitizer server without the fix.
b	Root cause explained?	For a Replicated DB, `finished/<host_id>` is written with status 0 and never with an error (the executing replica throws UNFINISHED and retries), so it is either present-with-status-0 or absent (benign cleaner/retry race). When absent, `getExecutionStatus` returns its `(-1, "Cannot obtain error message")` sentinel unchanged; `generate()` sees `code != 0` and `handleNonZeroStatusCode` throws LOGICAL_ERROR, aborting in debug/sanitizer. The debug cross-check could only ever produce a false positive.
c	Fix matches root cause?	Yes, and at the source. `checkStatus` treats only the absent-node case as the benign race. The refinement (this review) distinguishes absent (`tryGet == false`) from present-but-corrupt: only the absent case maps to `ExecutionStatus{0}`; a present-but-unreadable payload keeps the sentinel so the cross-check still fires.
d	Test intent preserved / new tests added?	New regression test `04401_replicated_database_ddl_status_finished_node_missing` (Tags: zookeeper, no-parallel, no-fasttest). The cross-check's purpose (catch a genuine non-zero remote status, or a corrupt payload) is preserved: only the truly-absent race is swallowed.
e	Both directions demonstrated?	Yes. Missing-node failpoint: WITHOUT fix -> server aborts with the exact task signature (`handleNonZeroStatusCode` <- `DistributedQueryStatusSource::generate()`); WITH fix -> server survives, DDL succeeds (11/11 iterations). Present-but-corrupt (throwaway failpoint, not committed): WITH fix the cross-check still aborts, proving arm B is not silenced.
f	Fix is general across code paths?	Yes. Fixing at the `checkStatus` source protects both consumers of a non-zero status (`handleNonZeroStatusCode` and `fillHostStatus`, both throw on `code != 0`). The ON CLUSTER path (`DDLOnClusterQueryStatusSource`) is intentionally untouched: it reads real persisted error statuses and passes the default `nullptr`.
g	Fix generalizes across inputs?	Yes. The absent vs present-but-corrupt distinction covers both sentinel-producing inputs of the shared `getExecutionStatus`; a genuinely persisted positive status still throws.
h	Backward compatible?	Yes. No setting, format, wire, or on-disk change. `getExecutionStatus` gains an optional out-parameter defaulting to `nullptr`; its return value is byte-for-byte unchanged for existing callers. Behavior differs only inside a `DEBUG_OR_SANITIZER_BUILD` cross-check.
i	Invariants and contracts preserved?	Yes. `getExecutionStatus` still returns the sentinel on failure; `node_exists` merely surfaces the `tryGet` result it already computed. `checkStatus` still returns a valid `ExecutionStatus` on all paths. No lock/ownership/lifetime changes.

Session id: cron:clickhouse-worker-slot-11:20260702-170800

…ross-check arm survives The previous refinement distinguished an absent finished/<host_id> node (benign cleaner/retry race) from a present one via a node_exists out-parameter, but a present-but-corrupt node was still collapsed to success. The reason is in tryDeserializeText itself. deserializeText does "rb >> code >> \n >> escape >> message", i.e. it reads code before it can fail on the rest of the payload. tryDeserializeText caught the exception but left the target object partially overwritten: a payload beginning with a parsable integer, for example "0garbage" or "0", set code=0 before the failure was reported. Because getExecutionStatus deserializes directly into its (-1, "Cannot obtain error message") sentinel, such a corrupt present node became code=0 and ReplicatedDatabaseQueryStatusSource::checkStatus reported it as success, defeating that arm of the debug/sanitizer cross-check. Fix the try-pattern at its source: tryDeserializeText now parses into a temporary ExecutionStatus and assigns *this only on success, so a failed parse leaves the target untouched (the general contract every caller expects). getExecutionStatus therefore keeps its (-1) sentinel intact for a present-but-corrupt node, and checkStatus already maps only the truly absent node (node_exists false) to success while the sentinel (negative code) still trips the cross-check. The ON CLUSTER consumer, whose finished/<host_id> holds a real persisted status, deserializes fully and is unaffected. Add gtest_execution_status.cpp asserting tryDeserializeText leaves the target unchanged on malformed input ("0garbage", "0", "garbage", "") and round-trips a valid payload. The corrupt-node arm cannot be a stateless "server stays alive" test because tripping the cross-check throws LOGICAL_ERROR, which aborts the server in debug/sanitizer builds; the existing failpoint test still covers the absent-node arm. Verified both directions on a debug build: the malformed-input assertions fail against the old in-place implementation and pass with this change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-07-02T19:04:46Z

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. Unit test `ExecutionStatus.TryDeserializeTextKeepsTargetOnMalformedInput` fails deterministically on the old in-place `tryDeserializeText` for payloads `"0garbage"`, `"0"`, `"garbage"` (`status.code` clobbered from -1 to 0).
b	Root cause explained?	`ExecutionStatus::deserializeText` reads `code` first (`rb >> code >> "\\n" >> escape >> message`), then can fail on the rest. `tryDeserializeText` deserialized in place, so a payload starting with a parsable int left `code` overwritten on failure. `getExecutionStatus` deserializes into its `(-1, "Cannot obtain error message")` sentinel, so a corrupt present node became `code=0` and `checkStatus` reported success, defeating the corrupt-node arm of the cross-check.
c	Fix matches root cause?	Yes. `tryDeserializeText` now parses into a temporary and assigns `*this` only on success, so a failed parse leaves the target untouched. This is the try-pattern contract the callers rely on; `getExecutionStatus` then keeps the `(-1)` sentinel for a corrupt present node and `checkStatus` surfaces it.
d	Test intent preserved / new tests added?	Added `src/Common/tests/gtest_execution_status.cpp` (malformed input leaves target unchanged; valid payload round-trips). The absent-node arm stays covered by the existing failpoint stateless test. No test weakened.
e	Both directions demonstrated?	Yes. Malformed-input assertions FAIL against the old implementation, PASS with the fix (both rebuilt and run on a debug build, Build ID verified). Absent-node stateless steps: server stays alive, DDL succeeds.
f	Fix is general across code paths?	Yes. The only two `ExecutionStatus` deserialize callers are `getExecutionStatus` (fixed here) and `fromText` (uses `deserializeText` on a fresh object, unaffected). Fixing the shared `tryDeserializeText` covers every present and future caller rather than guarding one call site.
g	Fix generalizes across inputs?	Yes. Test covers a parsable-prefix payload (`"0garbage"`), a bare code (`"0"`), a non-numeric payload (`"garbage"`), and empty (`""`); a valid `serializeText()` payload still parses. Behavior is independent of the concrete code/message values.
h	Backward compatible?	Yes. No setting, on-disk, wire, or protocol format change. `serializeText`/`deserializeText` output and the `getExecutionStatus` return value are byte-for-byte unchanged; only the failure-path mutation is removed.
i	Invariants and contracts preserved?	Yes. `tryDeserializeText` now upholds its documented try-pattern contract (target unchanged on failure). `getExecutionStatus` sentinel/`node_exists` pairing is preserved; `checkStatus` maps absent -> success, corrupt-present (sentinel, negative code) -> throw, positive code -> throw (unchanged). ON CLUSTER path deserializes real statuses and is unaffected.

Session id: cron:clickhouse-worker-slot-3:20260702-183000

clickhouse-gh · 2026-07-02T22:37:20Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	85.50%	85.50%	+0.00%
Functions	92.70%	92.60%	-0.10%
Branches	77.70%	77.70%	+0.00%

Changed lines: Changed C/C++ lines covered: 32/33 (96.97%) · Uncovered code

Full report · Diff report

groeneai · 2026-07-03T00:42:50Z

CI finish ledger — `7334ddf`

CI is complete on this head. No failure below is caused by this PR's diff (a Replicated-DB DDL-status correctness fix + atomic ExecutionStatus::tryDeserializeText + unit test). Only the private CH Inc sync is exempt.

Check / test	Reason	Owner / fixing PR
Stateless tests (amd_tsan, parallel, 2/2) / `02898_parallel_replicas_progress_bar`	flaky (3 hits / 2 unrelated PRs / 0 master in 30d) — a Replicated-DB DDL-status fix cannot affect parallel-replicas progress-bar timing	a fix task is created (investigating at full effort; fixing-PR link to follow here)
Performance Comparison (arm_release, master_head, 2/6)	perf noise (report-side variance, unrelated to a DDL-status/Common change)	a fix task is created (investigating at full effort; fixing-PR link to follow here)
Finish Workflow / Mergeable Check / PR	rollup of the above	resolves once the above clear
CH Inc sync	-	CH Inc sync (private, not actionable)

Session id: cron:our-pr-ci-monitor:20260703-000000

groeneai · 2026-07-03T01:27:38Z

alexey-milovidov added the can be tested Allows running workflows for external contributors label Jul 2, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jul 2, 2026

clickhouse-gh Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread src/Interpreters/ReplicatedDatabaseQueryStatusSource.cpp Outdated

groeneai force-pushed the groeneai/fix-replicated-db-ddl-status-finished-node-missing branch from 75a8045 to 5c370b9 Compare July 2, 2026 17:54

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix false LOGICAL_ERROR when Replicated database DDL status node is missing#109192

Fix false LOGICAL_ERROR when Replicated database DDL status node is missing#109192
groeneai wants to merge 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-replicated-db-ddl-status-finished-node-missing

groeneai commented Jul 2, 2026

Uh oh!

groeneai commented Jul 2, 2026

Uh oh!

groeneai commented Jul 2, 2026

Uh oh!

clickhouse-gh Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

groeneai commented Jul 2, 2026

Uh oh!

groeneai commented Jul 2, 2026

Uh oh!

clickhouse-gh Bot commented Jul 2, 2026

Uh oh!

groeneai commented Jul 3, 2026

Uh oh!

groeneai commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented Jul 2, 2026

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Uh oh!

groeneai commented Jul 2, 2026

Uh oh!

groeneai commented Jul 2, 2026

Uh oh!

clickhouse-gh Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

PR Metadata

Final Verdict

Uh oh!

Uh oh!

groeneai commented Jul 2, 2026

Uh oh!

groeneai commented Jul 2, 2026

Uh oh!

clickhouse-gh Bot commented Jul 2, 2026

LLVM Coverage Report

Uh oh!

groeneai commented Jul 3, 2026

CI finish ledger — 7334ddf

Uh oh!

groeneai commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

clickhouse-gh Bot commented Jul 2, 2026 •

edited

Loading

CI finish ledger — `7334ddf`