{{ message }}
Fix leaving connection in a broken state after preliminary cancellation distributed queries#93029
Merged
Merged
Conversation
…on distributed queries The problem is that some places in the code (i.e. RemoteSource::onUpdatePorts()), can call finish() even before sendQueryAsync() finishes sending the query, and so if after it will try to call finish() again (after sendQueryAsync() finishes) it will be no-op. The problem pops up after ClickHouse#92807, since RemoteSource has these (anti-)pattern.
Contributor
f7243d1 to
9ac4c8b
Compare
This was referenced Dec 24, 2025
rienath
approved these changes
Dec 24, 2025
rienath
left a comment
Member
There was a problem hiding this comment.
Makes sense, thanks for the quick fix and great explanation!
Merged
via the queue into
ClickHouse:master
with commit Dec 25, 2025
076f883
126 of 131 checks passed
robot-clickhouse
added a commit
that referenced
this pull request
Dec 25, 2025
…er preliminary cancellation distributed queries
robot-clickhouse
added a commit
that referenced
this pull request
Dec 25, 2025
…ter preliminary cancellation distributed queries
robot-clickhouse
added a commit
that referenced
this pull request
Dec 25, 2025
…ter preliminary cancellation distributed queries
robot-clickhouse
added a commit
that referenced
this pull request
Dec 25, 2025
…ter preliminary cancellation distributed queries
clickhouse-gh Bot
added a commit
that referenced
this pull request
Dec 25, 2025
Backport #93029 to 25.11: Fix leaving connection in a broken state after preliminary cancellation distributed queries
clickhouse-gh Bot
added a commit
that referenced
this pull request
Dec 25, 2025
Backport #93029 to 25.12: Fix leaving connection in a broken state after preliminary cancellation distributed queries
azat
added a commit
that referenced
this pull request
Dec 25, 2025
Backport #93029 to 25.10: Fix leaving connection in a broken state after preliminary cancellation distributed queries
azat
added a commit
that referenced
this pull request
Dec 25, 2025
Backport #93029 to 25.8: Fix leaving connection in a broken state after preliminary cancellation distributed queries
AVMusorin
pushed a commit
to AVMusorin/ClickHouse
that referenced
this pull request
Jan 28, 2026
The race condition causes a crash when reading from a disconnected
connection in distributed queries.
The `in` buffer:
```
(gdb) frame 1
(gdb) print this->in
$19 = {__ptr_ = 0x0, __cntrl_ = 0x0}
(gdb) print this->connected
$18 = false
```
But the same time:
```
(gdb) frame 2
async_callback=...) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
(gdb) print this->sent_query
$15 = true
(gdb) print this->cancelled
$16 = false
(gdb) print this->active_connection_count
$17 = 1
```
Current solution will not fix the issue, it just apply a check like in
`sendCancel()`.
Related PRs:
- ClickHouse#89740
- ClickHouse#92807
- ClickHouse#93029
1 task
This was referenced Jun 30, 2026
pull Bot
pushed a commit
to Lobinson1/ClickHouse
that referenced
this pull request
Jul 1, 2026
…TablesStatusResponse, got ProfileInfo)"
Distributed queries over hedged connections intermittently fail during connection
establishment with:
Code: 102. UNEXPECTED_PACKET_FROM_SERVER:
Unexpected packet from server 127.0.0.2:9000 (expected TablesStatusResponse, got ProfileInfo)
thrown from `Connection::getTablesStatus` via `ConnectionEstablisher::run`. A pooled
connection (initiator to replica) is reused while it still holds undelivered packets
(`ProfileInfo`/`EndOfStream`) that a previous, cancelled distributed query left behind, so
the table-status request reads a stale packet instead of the expected response.
This is a recurrence of the issue fixed in ClickHouse#93029 (originally reported as ClickHouse#93018). It came
back after ClickHouse#95466 added `|| was_cancelled` to the early-return guard in
`RemoteQueryExecutor::finish`, which makes `finish` skip the drain after a cancellation.
Two changes:
1. `RemoteQueryExecutor::finish`: when the query was already cancelled and its connections may
still hold undelivered packets, disconnect them instead of leaving them to be returned to
the pool in an out-of-sync state. We do not drain them here (the read side may already be
torn down, which previously crashed - see ClickHouse#95466), but they must not be reused. This mirrors
the cleanup the destructor already does, performed eagerly so it cannot be skipped if
`finished` later becomes true through another path.
2. `ConnectionEstablisher::run`: treat `UNEXPECTED_PACKET_FROM_SERVER` like the other "the
pooled connection turned out to be unusable" errors (NETWORK_ERROR, ATTEMPT_TO_READ_AFTER_EOF,
...): disconnect the entry and report a soft failure so connection establishment retries on a
freshly established connection, instead of failing the whole distributed query. This is the
documented design (the optimistic path does not ping a pooled connection before use and relies
on retrying), it just did not cover a connection left out of sync.
Adds a fail point (`unexpected_packet_in_table_status_response`) and a deterministic test
(`04489_distributed_table_status_desync_recovery`) that injects the desync and verifies the
query recovers.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=799ffffcce85c605e19d0116ccad68955ebd784a&name_0=MasterCI&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29
Related: ClickHouse#93018
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix leaving connection in a broken state after preliminary cancellation distributed queries
The problem is that some places in the code (i.e.
RemoteSource::onUpdatePorts()), can call finish() even before sendQueryAsync() finishes sending the query, and so if after it will try to call finish() again (after sendQueryAsync() finishes) it will be no-op.
The problem pops up after #92807, since
RemoteSourcehas these (anti-)pattern.Fixes: #92807
Fixes: #93018