Fix false LOGICAL_ERROR when Replicated database DDL status node is missing by groeneai · Pull Request #109192 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix false LOGICAL_ERROR when Replicated database DDL status node is missing#109192

Open
groeneai wants to merge 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-replicated-db-ddl-status-finished-node-missing
Open

Fix false LOGICAL_ERROR when Replicated database DDL status node is missing#109192
groeneai wants to merge 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-replicated-db-ddl-status-finished-node-missing

Conversation

@groeneai

@groeneai groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Related: #95472

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a false LOGICAL_ERROR ("There was an error on : Cannot obtain error message (probably it's a bug)") that could abort the server in debug and sanitizer builds when running a DDL query on a Replicated database and its per-host finished status node in Keeper had already been cleaned up.

Description

Found on public master CI (chronic, not tied to a single PR). Public CIDB (play.clickhouse.com), last 60 days: test_context_raw LIKE '%handleNonZeroStatusCode%' -> 75 hits (2 on master); LIKE '%There was an error on %' -> 81 hits (3 on master). Report example: https://s3.amazonaws.com/clickhouse-test-reports/json.html . Prior auto-generated issues for the same signature (both closed, but the crash still recurs): #95472, #95539.

Root cause. In debug/sanitizer builds only, ReplicatedDatabaseQueryStatusSource::checkStatus re-reads the per-host finished/<host_id> ZooKeeper node to double-check the remote status. For a Replicated database that node is always written with status 0 in the same transaction that applies the query (DDLWorker::processTask, finished_node_path created with ExecutionStatus(0)), and on error the executing replica throws UNFINISHED and retries without ever persisting a non-zero status (the is_replicated_database_task branch never reaches the makeSetRequest(finished_node_path, ...) line). So a present finished/<host_id> node always holds status 0, and an absent one is the benign cleaner/retry race. The sibling DistributedQueryStatusSource::generate() already handles the same race for the parent node via its UNFINISHED "deleted by the cleaner" branch.

When the node was absent, DistributedQueryStatusSource::getExecutionStatus returned its sentinel ExecutionStatus(-1, "Cannot obtain error message") unchanged (the message is never overwritten, which is exactly the string seen in the crash). Back in generate(), status.code != 0 (it is -1), so handleNonZeroStatusCode threw LOGICAL_ERROR, which aborts the server in debug/sanitizer builds. The debug cross-check could therefore only ever produce a false positive for a Replicated database, never catch a real remote error.

Fix. In ReplicatedDatabaseQueryStatusSource::checkStatus, treat a negative sentinel code (absent/unreadable finished/<host_id> node) as success, mirroring the non-debug branch, so the benign race is no longer misreported. A genuinely persisted non-zero status (positive code) still throws, preserving the cross-check's real purpose. The change is confined to the Replicated-database path; the ON CLUSTER path (DDLOnClusterQueryStatusSource), which reads real persisted error statuses, is untouched.

Testing. Added 04401_replicated_database_ddl_status_finished_node_missing, which uses a new failpoint (replicated_database_status_finished_node_missing) to force the "finished node missing" read deterministically. Verified both directions on a debug build: without the fix the test aborts the server with the exact signature above (stack: handleNonZeroStatusCode <- DistributedQueryStatusSource::generate); with the fix the DDL succeeds and the server stays alive.

@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

cc @tavplubix @tuanpach — could you review this? In debug/sanitizer builds ReplicatedDatabaseQueryStatusSource::checkStatus re-reads finished/<host_id> from Keeper; for a Replicated DB that node holds status 0 when present and is absent only in the benign cleaner/retry race, but an absent node returned the getExecutionStatus sentinel (code -1) which generate() reported as a remote error and threw LOGICAL_ERROR ("... probably it's a bug"), aborting the server. The fix treats the negative sentinel as success like the non-debug branch; a real persisted non-zero status still throws. Regression test uses a new failpoint to force the missing-node read.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jul 2, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [7334ddf]

Summary:

job_name test_name status info comment
Stateless tests (amd_tsan, parallel, 2/2) FAIL
02898_parallel_replicas_progress_bar FAIL cidb
Performance Comparison (arm_release, master_head, 2/6) FAIL Performance dashboard
codecs_int_insert #6::old FAIL query history
codecs_int_insert #6::new FAIL query history
codecs_int_insert #16::old FAIL query history
codecs_int_insert #16::new FAIL query history
file_table_function #11::old FAIL query history
file_table_function #11::new FAIL query history
file_table_function #23::old FAIL query history
file_table_function #23::new FAIL query history
number_formatting_formats #1::old FAIL query history
number_formatting_formats #1::new FAIL query history
2 more test cases not shown
Finish Workflow FAIL
python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py FAIL

AI Review

Summary

This PR fixes the debug/sanitizer-only false LOGICAL_ERROR in replicated-database DDL status handling by making ExecutionStatus::tryDeserializeText atomic and by treating only the truly absent finished/<host_id> node as the benign race. The current code preserves the corrupt-payload cross-check, the new unit and stateless regressions cover both sides of the fix, and I did not find any remaining code-level blockers.

PR Metadata
  • 💡 The selected Changelog category is not semantically correct for this change. The bug only affects debug/sanitizer builds on master; release behavior is unchanged, so this should not go into the user-facing Bug Fix changelog section.
    Exact replacement: ### Changelog category (leave one): then - Not for changelog (changelog entry is not required)
    Changelog entry is then not required.
Final Verdict

✅ No remaining code findings. After changing the PR metadata to Not for changelog, this looks ready.

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jul 2, 2026
Comment thread src/Interpreters/ReplicatedDatabaseQueryStatusSource.cpp Outdated
…issing

In debug/sanitizer builds ReplicatedDatabaseQueryStatusSource::checkStatus
re-reads the per-host finished/<host_id> ZooKeeper node to double-check the
remote status. For a Replicated database this node is always written with
status 0 in the transaction that applies the query (DDLWorker::processTask),
and on error the executing replica throws UNFINISHED and retries without ever
persisting a non-zero status. So a present finished/<host_id> node always holds
status 0, and an absent one is the benign cleaner/retry race (the sibling
generate() already handles the same race for the parent node via its UNFINISHED
deleted-by-the-cleaner branch).

When the node was absent, DistributedQueryStatusSource::getExecutionStatus
returned its sentinel ExecutionStatus(-1, 'Cannot obtain error message')
unchanged. generate() then saw code != 0 and called handleNonZeroStatusCode,
which threw LOGICAL_ERROR 'There was an error on <host>: Cannot obtain error
message (probably its a bug)', aborting the server in debug/sanitizer builds.

getExecutionStatus returns that same sentinel in two cases: the node was absent
(tryGet == false, the benign race), and the node was present but its payload
could not be deserialized (a corrupt finished/<host_id>). Distinguishing them
by the sentinel code alone would silently turn a corrupt payload into success
and defeat that arm of the cross-check. getExecutionStatus now reports node
presence via an optional out-parameter, and checkStatus maps only the truly
absent-node case to ExecutionStatus{0}. A present-but-unreadable node keeps the
sentinel so the cross-check still fires. The ON CLUSTER consumer keeps the
default nullptr and is unaffected.

Adds a regression test using a new failpoint
(replicated_database_status_finished_node_missing) that forces the missing-node
read deterministically.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the groeneai/fix-replicated-db-ddl-status-finished-node-missing branch from 75a8045 to 5c370b9 Compare July 2, 2026 17:54
@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Updated pre-PR validation gate after the review refinement (5c370b9):

Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. Failpoint replicated_database_status_finished_node_missing forces the missing-node read; a Replicated-DB DDL under it deterministically aborts a debug/sanitizer server without the fix.
b Root cause explained? For a Replicated DB, finished/<host_id> is written with status 0 and never with an error (the executing replica throws UNFINISHED and retries), so it is either present-with-status-0 or absent (benign cleaner/retry race). When absent, getExecutionStatus returns its (-1, "Cannot obtain error message") sentinel unchanged; generate() sees code != 0 and handleNonZeroStatusCode throws LOGICAL_ERROR, aborting in debug/sanitizer. The debug cross-check could only ever produce a false positive.
c Fix matches root cause? Yes, and at the source. checkStatus treats only the absent-node case as the benign race. The refinement (this review) distinguishes absent (tryGet == false) from present-but-corrupt: only the absent case maps to ExecutionStatus{0}; a present-but-unreadable payload keeps the sentinel so the cross-check still fires.
d Test intent preserved / new tests added? New regression test 04401_replicated_database_ddl_status_finished_node_missing (Tags: zookeeper, no-parallel, no-fasttest). The cross-check's purpose (catch a genuine non-zero remote status, or a corrupt payload) is preserved: only the truly-absent race is swallowed.
e Both directions demonstrated? Yes. Missing-node failpoint: WITHOUT fix -> server aborts with the exact task signature (handleNonZeroStatusCode <- DistributedQueryStatusSource::generate()); WITH fix -> server survives, DDL succeeds (11/11 iterations). Present-but-corrupt (throwaway failpoint, not committed): WITH fix the cross-check still aborts, proving arm B is not silenced.
f Fix is general across code paths? Yes. Fixing at the checkStatus source protects both consumers of a non-zero status (handleNonZeroStatusCode and fillHostStatus, both throw on code != 0). The ON CLUSTER path (DDLOnClusterQueryStatusSource) is intentionally untouched: it reads real persisted error statuses and passes the default nullptr.
g Fix generalizes across inputs? Yes. The absent vs present-but-corrupt distinction covers both sentinel-producing inputs of the shared getExecutionStatus; a genuinely persisted positive status still throws.
h Backward compatible? Yes. No setting, format, wire, or on-disk change. getExecutionStatus gains an optional out-parameter defaulting to nullptr; its return value is byte-for-byte unchanged for existing callers. Behavior differs only inside a DEBUG_OR_SANITIZER_BUILD cross-check.
i Invariants and contracts preserved? Yes. getExecutionStatus still returns the sentinel on failure; node_exists merely surfaces the tryGet result it already computed. checkStatus still returns a valid ExecutionStatus on all paths. No lock/ownership/lifetime changes.

Session id: cron:clickhouse-worker-slot-11:20260702-170800

…ross-check arm survives

The previous refinement distinguished an absent finished/<host_id> node
(benign cleaner/retry race) from a present one via a node_exists out-parameter,
but a present-but-corrupt node was still collapsed to success.

The reason is in tryDeserializeText itself. deserializeText does
"rb >> code >> \n >> escape >> message", i.e. it reads code before it can fail
on the rest of the payload. tryDeserializeText caught the exception but left the
target object partially overwritten: a payload beginning with a parsable integer,
for example "0garbage" or "0", set code=0 before the failure was reported. Because
getExecutionStatus deserializes directly into its (-1, "Cannot obtain error
message") sentinel, such a corrupt present node became code=0 and
ReplicatedDatabaseQueryStatusSource::checkStatus reported it as success, defeating
that arm of the debug/sanitizer cross-check.

Fix the try-pattern at its source: tryDeserializeText now parses into a temporary
ExecutionStatus and assigns *this only on success, so a failed parse leaves the
target untouched (the general contract every caller expects). getExecutionStatus
therefore keeps its (-1) sentinel intact for a present-but-corrupt node, and
checkStatus already maps only the truly absent node (node_exists false) to success
while the sentinel (negative code) still trips the cross-check. The ON CLUSTER
consumer, whose finished/<host_id> holds a real persisted status, deserializes
fully and is unaffected.

Add gtest_execution_status.cpp asserting tryDeserializeText leaves the target
unchanged on malformed input ("0garbage", "0", "garbage", "") and round-trips a
valid payload. The corrupt-node arm cannot be a stateless "server stays alive"
test because tripping the cross-check throws LOGICAL_ERROR, which aborts the
server in debug/sanitizer builds; the existing failpoint test still covers the
absent-node arm. Verified both directions on a debug build: the malformed-input
assertions fail against the old in-place implementation and pass with this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author
Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. Unit test ExecutionStatus.TryDeserializeTextKeepsTargetOnMalformedInput fails deterministically on the old in-place tryDeserializeText for payloads "0garbage", "0", "garbage" (status.code clobbered from -1 to 0).
b Root cause explained? ExecutionStatus::deserializeText reads code first (rb >> code >> "\\n" >> escape >> message), then can fail on the rest. tryDeserializeText deserialized in place, so a payload starting with a parsable int left code overwritten on failure. getExecutionStatus deserializes into its (-1, "Cannot obtain error message") sentinel, so a corrupt present node became code=0 and checkStatus reported success, defeating the corrupt-node arm of the cross-check.
c Fix matches root cause? Yes. tryDeserializeText now parses into a temporary and assigns *this only on success, so a failed parse leaves the target untouched. This is the try-pattern contract the callers rely on; getExecutionStatus then keeps the (-1) sentinel for a corrupt present node and checkStatus surfaces it.
d Test intent preserved / new tests added? Added src/Common/tests/gtest_execution_status.cpp (malformed input leaves target unchanged; valid payload round-trips). The absent-node arm stays covered by the existing failpoint stateless test. No test weakened.
e Both directions demonstrated? Yes. Malformed-input assertions FAIL against the old implementation, PASS with the fix (both rebuilt and run on a debug build, Build ID verified). Absent-node stateless steps: server stays alive, DDL succeeds.
f Fix is general across code paths? Yes. The only two ExecutionStatus deserialize callers are getExecutionStatus (fixed here) and fromText (uses deserializeText on a fresh object, unaffected). Fixing the shared tryDeserializeText covers every present and future caller rather than guarding one call site.
g Fix generalizes across inputs? Yes. Test covers a parsable-prefix payload ("0garbage"), a bare code ("0"), a non-numeric payload ("garbage"), and empty (""); a valid serializeText() payload still parses. Behavior is independent of the concrete code/message values.
h Backward compatible? Yes. No setting, on-disk, wire, or protocol format change. serializeText/deserializeText output and the getExecutionStatus return value are byte-for-byte unchanged; only the failure-path mutation is removed.
i Invariants and contracts preserved? Yes. tryDeserializeText now upholds its documented try-pattern contract (target unchanged on failure). getExecutionStatus sentinel/node_exists pairing is preserved; checkStatus maps absent -> success, corrupt-present (sentinel, negative code) -> throw, positive code -> throw (unchanged). ON CLUSTER path deserializes real statuses and is unaffected.

Session id: cron:clickhouse-worker-slot-3:20260702-183000

@clickhouse-gh

clickhouse-gh Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.50% +0.00%
Functions 92.70% 92.60% -0.10%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 32/33 (96.97%) · Uncovered code

Full report · Diff report

@groeneai

groeneai commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 7334ddf

CI is complete on this head. No failure below is caused by this PR's diff (a Replicated-DB DDL-status correctness fix + atomic ExecutionStatus::tryDeserializeText + unit test). Only the private CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
Stateless tests (amd_tsan, parallel, 2/2) / 02898_parallel_replicas_progress_bar flaky (3 hits / 2 unrelated PRs / 0 master in 30d) — a Replicated-DB DDL-status fix cannot affect parallel-replicas progress-bar timing a fix task is created (investigating at full effort; fixing-PR link to follow here)
Performance Comparison (arm_release, master_head, 2/6) perf noise (report-side variance, unrelated to a DDL-status/Common change) a fix task is created (investigating at full effort; fixing-PR link to follow here)
Finish Workflow / Mergeable Check / PR rollup of the above resolves once the above clear
CH Inc sync - CH Inc sync (private, not actionable)

Session id: cron:our-pr-ci-monitor:20260703-000000

@groeneai

groeneai commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants