Fix leaving connection in a broken state after preliminary cancellation distributed queries by azat · Pull Request #93029 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix leaving connection in a broken state after preliminary cancellation distributed queries#93029

Merged
azat merged 1 commit into
ClickHouse:masterfrom
azat:fix-preliminary-dist-query-finish
Dec 25, 2025
Merged

Fix leaving connection in a broken state after preliminary cancellation distributed queries#93029
azat merged 1 commit into
ClickHouse:masterfrom
azat:fix-preliminary-dist-query-finish

Conversation

@azat

@azat azat commented Dec 24, 2025

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix leaving connection in a broken state after preliminary cancellation distributed queries

The problem is that some places in the code (i.e.
RemoteSource::onUpdatePorts()), can call finish() even before sendQueryAsync() finishes sending the query, and so if after it will try to call finish() again (after sendQueryAsync() finishes) it will be no-op.

The problem pops up after #92807, since RemoteSource has these (anti-)pattern.

Fixes: #92807
Fixes: #93018

…on distributed queries

The problem is that some places in the code (i.e.
RemoteSource::onUpdatePorts()), can call finish() even before
sendQueryAsync() finishes sending the query, and so if after it will try
to call finish() again (after sendQueryAsync() finishes) it will be
no-op.

The problem pops up after ClickHouse#92807, since RemoteSource has these (anti-)pattern.
@clickhouse-gh

clickhouse-gh Bot commented Dec 24, 2025

Copy link
Copy Markdown
Contributor

@azat azat force-pushed the fix-preliminary-dist-query-finish branch from f7243d1 to 9ac4c8b Compare December 24, 2025 20:48
@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Dec 24, 2025

@rienath rienath left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for the quick fix and great explanation!

@rienath rienath self-assigned this Dec 24, 2025
@azat azat enabled auto-merge December 24, 2025 21:06
@azat azat added this pull request to the merge queue Dec 25, 2025
Merged via the queue into ClickHouse:master with commit 076f883 Dec 25, 2025
126 of 131 checks passed
@azat azat deleted the fix-preliminary-dist-query-finish branch December 25, 2025 08:12
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
…er preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
…ter preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
…ter preliminary cancellation distributed queries
robot-clickhouse added a commit that referenced this pull request Dec 25, 2025
…ter preliminary cancellation distributed queries
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Dec 25, 2025
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Dec 25, 2025
@robot-ch-test-poll robot-ch-test-poll added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Dec 25, 2025
clickhouse-gh Bot added a commit that referenced this pull request Dec 25, 2025
Backport #93029 to 25.11: Fix leaving connection in a broken state after preliminary cancellation distributed queries
clickhouse-gh Bot added a commit that referenced this pull request Dec 25, 2025
Backport #93029 to 25.12: Fix leaving connection in a broken state after preliminary cancellation distributed queries
azat added a commit that referenced this pull request Dec 25, 2025
Backport #93029 to 25.10: Fix leaving connection in a broken state after preliminary cancellation distributed queries
azat added a commit that referenced this pull request Dec 25, 2025
Backport #93029 to 25.8: Fix leaving connection in a broken state after preliminary cancellation distributed queries
AVMusorin pushed a commit to AVMusorin/ClickHouse that referenced this pull request Jan 28, 2026
The race condition causes a crash when reading from a disconnected
connection in distributed queries.

The `in` buffer:

```
(gdb) frame 1
(gdb) print this->in
$19 = {__ptr_ = 0x0, __cntrl_ = 0x0}
(gdb) print this->connected
$18 = false
```

But the same time:

```
(gdb) frame 2
    async_callback=...) at ./ci/tmp/build/./src/Client/MultiplexedConnections.cpp:398
(gdb) print this->sent_query
$15 = true
(gdb) print this->cancelled
$16 = false
(gdb) print this->active_connection_count
$17 = 1
```

Current solution will not fix the issue, it just apply a check like in
`sendCancel()`.

Related PRs:
- ClickHouse#89740
- ClickHouse#92807
- ClickHouse#93029
pull Bot pushed a commit to Lobinson1/ClickHouse that referenced this pull request Jul 1, 2026
…TablesStatusResponse, got ProfileInfo)"

Distributed queries over hedged connections intermittently fail during connection
establishment with:

    Code: 102. UNEXPECTED_PACKET_FROM_SERVER:
    Unexpected packet from server 127.0.0.2:9000 (expected TablesStatusResponse, got ProfileInfo)

thrown from `Connection::getTablesStatus` via `ConnectionEstablisher::run`. A pooled
connection (initiator to replica) is reused while it still holds undelivered packets
(`ProfileInfo`/`EndOfStream`) that a previous, cancelled distributed query left behind, so
the table-status request reads a stale packet instead of the expected response.

This is a recurrence of the issue fixed in ClickHouse#93029 (originally reported as ClickHouse#93018). It came
back after ClickHouse#95466 added `|| was_cancelled` to the early-return guard in
`RemoteQueryExecutor::finish`, which makes `finish` skip the drain after a cancellation.

Two changes:

1. `RemoteQueryExecutor::finish`: when the query was already cancelled and its connections may
   still hold undelivered packets, disconnect them instead of leaving them to be returned to
   the pool in an out-of-sync state. We do not drain them here (the read side may already be
   torn down, which previously crashed - see ClickHouse#95466), but they must not be reused. This mirrors
   the cleanup the destructor already does, performed eagerly so it cannot be skipped if
   `finished` later becomes true through another path.

2. `ConnectionEstablisher::run`: treat `UNEXPECTED_PACKET_FROM_SERVER` like the other "the
   pooled connection turned out to be unusable" errors (NETWORK_ERROR, ATTEMPT_TO_READ_AFTER_EOF,
   ...): disconnect the entry and report a soft failure so connection establishment retries on a
   freshly established connection, instead of failing the whole distributed query. This is the
   documented design (the optimistic path does not ping a pooled connection before use and relies
   on retrying), it just did not cover a connection left out of sync.

Adds a fail point (`unexpected_packet_in_table_status_response`) and a deterministic test
(`04489_distributed_table_status_desync_recovery`) that injects the desync and verifies the
query recovers.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=799ffffcce85c605e19d0116ccad68955ebd784a&name_0=MasterCI&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29
Related: ClickHouse#93018

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo v25.8-must-backport v25.10-must-backport v25.12-must-backport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Connection left in a broken state (was: flaky test 03259_grouping_sets_aliases)

5 participants