Fix false LOGICAL_ERROR when Replicated database DDL status node is missing#109192
Fix false LOGICAL_ERROR when Replicated database DDL status node is missing#109192groeneai wants to merge 2 commits into
Conversation
|
cc @tavplubix @tuanpach — could you review this? In debug/sanitizer builds |
|
Workflow [PR], commit [7334ddf] Summary: ❌
AI ReviewSummaryThis PR fixes the debug/sanitizer-only false PR Metadata
Final Verdict✅ No remaining code findings. After changing the PR metadata to |
…issing
In debug/sanitizer builds ReplicatedDatabaseQueryStatusSource::checkStatus
re-reads the per-host finished/<host_id> ZooKeeper node to double-check the
remote status. For a Replicated database this node is always written with
status 0 in the transaction that applies the query (DDLWorker::processTask),
and on error the executing replica throws UNFINISHED and retries without ever
persisting a non-zero status. So a present finished/<host_id> node always holds
status 0, and an absent one is the benign cleaner/retry race (the sibling
generate() already handles the same race for the parent node via its UNFINISHED
deleted-by-the-cleaner branch).
When the node was absent, DistributedQueryStatusSource::getExecutionStatus
returned its sentinel ExecutionStatus(-1, 'Cannot obtain error message')
unchanged. generate() then saw code != 0 and called handleNonZeroStatusCode,
which threw LOGICAL_ERROR 'There was an error on <host>: Cannot obtain error
message (probably its a bug)', aborting the server in debug/sanitizer builds.
getExecutionStatus returns that same sentinel in two cases: the node was absent
(tryGet == false, the benign race), and the node was present but its payload
could not be deserialized (a corrupt finished/<host_id>). Distinguishing them
by the sentinel code alone would silently turn a corrupt payload into success
and defeat that arm of the cross-check. getExecutionStatus now reports node
presence via an optional out-parameter, and checkStatus maps only the truly
absent-node case to ExecutionStatus{0}. A present-but-unreadable node keeps the
sentinel so the cross-check still fires. The ON CLUSTER consumer keeps the
default nullptr and is unaffected.
Adds a regression test using a new failpoint
(replicated_database_status_finished_node_missing) that forces the missing-node
read deterministically.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
75a8045 to
5c370b9
Compare
|
Updated pre-PR validation gate after the review refinement (5c370b9): Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-11:20260702-170800 |
…ross-check arm survives
The previous refinement distinguished an absent finished/<host_id> node
(benign cleaner/retry race) from a present one via a node_exists out-parameter,
but a present-but-corrupt node was still collapsed to success.
The reason is in tryDeserializeText itself. deserializeText does
"rb >> code >> \n >> escape >> message", i.e. it reads code before it can fail
on the rest of the payload. tryDeserializeText caught the exception but left the
target object partially overwritten: a payload beginning with a parsable integer,
for example "0garbage" or "0", set code=0 before the failure was reported. Because
getExecutionStatus deserializes directly into its (-1, "Cannot obtain error
message") sentinel, such a corrupt present node became code=0 and
ReplicatedDatabaseQueryStatusSource::checkStatus reported it as success, defeating
that arm of the debug/sanitizer cross-check.
Fix the try-pattern at its source: tryDeserializeText now parses into a temporary
ExecutionStatus and assigns *this only on success, so a failed parse leaves the
target untouched (the general contract every caller expects). getExecutionStatus
therefore keeps its (-1) sentinel intact for a present-but-corrupt node, and
checkStatus already maps only the truly absent node (node_exists false) to success
while the sentinel (negative code) still trips the cross-check. The ON CLUSTER
consumer, whose finished/<host_id> holds a real persisted status, deserializes
fully and is unaffected.
Add gtest_execution_status.cpp asserting tryDeserializeText leaves the target
unchanged on malformed input ("0garbage", "0", "garbage", "") and round-trips a
valid payload. The corrupt-node arm cannot be a stateless "server stays alive"
test because tripping the cross-check throws LOGICAL_ERROR, which aborts the
server in debug/sanitizer builds; the existing failpoint test still covers the
absent-node arm. Verified both directions on a debug build: the malformed-input
assertions fail against the old in-place implementation and pass with this change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-3:20260702-183000 |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 32/33 (96.97%) · Uncovered code |
CI finish ledger — 7334ddfCI is complete on this head. No failure below is caused by this PR's diff (a Replicated-DB DDL-status correctness fix + atomic
Session id: cron:our-pr-ci-monitor:20260703-000000 |

Related: #95472
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a false
LOGICAL_ERROR("There was an error on : Cannot obtain error message (probably it's a bug)") that could abort the server in debug and sanitizer builds when running a DDL query on aReplicateddatabase and its per-hostfinishedstatus node in Keeper had already been cleaned up.Description
Found on public master CI (chronic, not tied to a single PR). Public CIDB (play.clickhouse.com), last 60 days:
test_context_raw LIKE '%handleNonZeroStatusCode%'-> 75 hits (2 on master);LIKE '%There was an error on %'-> 81 hits (3 on master). Report example: https://s3.amazonaws.com/clickhouse-test-reports/json.html . Prior auto-generated issues for the same signature (both closed, but the crash still recurs): #95472, #95539.Root cause. In debug/sanitizer builds only,
ReplicatedDatabaseQueryStatusSource::checkStatusre-reads the per-hostfinished/<host_id>ZooKeeper node to double-check the remote status. For aReplicateddatabase that node is always written with status 0 in the same transaction that applies the query (DDLWorker::processTask,finished_node_pathcreated withExecutionStatus(0)), and on error the executing replica throwsUNFINISHEDand retries without ever persisting a non-zero status (theis_replicated_database_taskbranch never reaches themakeSetRequest(finished_node_path, ...)line). So a presentfinished/<host_id>node always holds status 0, and an absent one is the benign cleaner/retry race. The siblingDistributedQueryStatusSource::generate()already handles the same race for the parent node via itsUNFINISHED"deleted by the cleaner" branch.When the node was absent,
DistributedQueryStatusSource::getExecutionStatusreturned its sentinelExecutionStatus(-1, "Cannot obtain error message")unchanged (the message is never overwritten, which is exactly the string seen in the crash). Back ingenerate(),status.code != 0(it is -1), sohandleNonZeroStatusCodethrewLOGICAL_ERROR, which aborts the server in debug/sanitizer builds. The debug cross-check could therefore only ever produce a false positive for aReplicateddatabase, never catch a real remote error.Fix. In
ReplicatedDatabaseQueryStatusSource::checkStatus, treat a negative sentinel code (absent/unreadablefinished/<host_id>node) as success, mirroring the non-debug branch, so the benign race is no longer misreported. A genuinely persisted non-zero status (positive code) still throws, preserving the cross-check's real purpose. The change is confined to the Replicated-database path; theON CLUSTERpath (DDLOnClusterQueryStatusSource), which reads real persisted error statuses, is untouched.Testing. Added
04401_replicated_database_ddl_status_finished_node_missing, which uses a new failpoint (replicated_database_status_finished_node_missing) to force the "finished node missing" read deterministically. Verified both directions on a debug build: without the fix the test aborts the server with the exact signature above (stack:handleNonZeroStatusCode<-DistributedQueryStatusSource::generate); with the fix the DDL succeeds and the server stays alive.