Reapply "Add silk fibers aware socket implementations" by mstetsyuk · Pull Request #107680 · ClickHouse/ClickHouse · GitHub
Skip to content

Reapply "Add silk fibers aware socket implementations"#107680

Merged
mstetsyuk merged 6 commits into
masterfrom
reapply-silk-sockets
Jun 23, 2026
Merged

Reapply "Add silk fibers aware socket implementations"#107680
mstetsyuk merged 6 commits into
masterfrom
reapply-silk-sockets

Conversation

@mstetsyuk

@mstetsyuk mstetsyuk commented Jun 16, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Reapply #106078 after #107539.

Add silk fiber aware secure and plain socket implementations.

Version info

  • Merged into: 26.6.1.1148

@clickhouse-gh

clickhouse-gh Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added pr-improvement Pull request with some product improvements submodule changed At least one submodule changed in this PR. labels Jun 16, 2026
Comment thread src/IO/SilkFiberStreamSocketImpl.cpp Outdated
{
auto * socket_impl = static_cast<Poco::Net::SocketImpl *>(BIO_get_data(bio));
const int fd = socket_impl->sockfd();
const uint64_t timeout_ns = timeoutNs(socket_impl->getReceiveTimeout());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SecureSocketImpl bounds TLS handshakes and SSL I/O by getMaxTimeoutOrLimit, so either a send or receive timeout should be enough to stop the operation. This BIO read waits only on getReceiveTimeout; if only setSendTimeout is configured and the peer never sends TLS data, SSL_do_handshake never returns to the outer RemainingTimeCounter / mustRetry loop, so the fiber can hang indefinitely. The write side has the symmetric problem with only setReceiveTimeout. Please use the same max(send, receive) timeout semantics here, with non-positive values meaning no timeout, or make the BIO operation return WANT_READ / WANT_WRITE so SecureSocketImpl owns the remaining-time accounting.

@nikitamikhaylov

Copy link
Copy Markdown
Member

@groeneai Help sort out the CI failures.

@groeneai

Copy link
Copy Markdown
Contributor

@nikitamikhaylov Triaged the 3 failing checks on 3f23dfc. None are caused by this PR: no silk-fiber/socket frames, and contrib/silk is not in any failing path. All three are pre-existing trunk flakes, each hitting master and many unrelated PRs in the last 30d.

Stress test (amd_tsan) - LOGICAL_ERROR: Unexpected return type from if. Expected UInt64. Got String (STID 1611-483a), in ExpressionActions::execute -> MergeTreeRangeReader::executePrewhere. Triggered by a fuzzer-mutated if(0, CAST(toString(...) AS UInt64), v). This is a real trunk bug in the if short-circuit / PREWHERE return-type path, not socket-related. CIDB 30d: 19 distinct refs incl. 4 on master (this branch is 1 of them, same as the rest). Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107680&sha=3f23dfc85cace53649bd8ac14135dc2429aa2cd1&name_0=PR&name_1=Stress%20test%20%28amd_tsan%29

Stress test (amd_asan_ubsan) - Hung check. The hung query is an AST-fuzzer runaway: SELECT arrayFold(acc, x -> arrayPushFront(acc, x), range(number), emptyArrayUInt64()) FROM system.numbers LIMIT 5 (already is_cancelled=1, elapsed 2709s, 1.6 GiB). Generic fuzzer/hung-check flake. CIDB 30d: 39 distinct PRs + 9 master. Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107680&sha=3f23dfc85cace53649bd8ac14135dc2429aa2cd1&name_0=PR&name_1=Stress%20test%20%28amd_asan_ubsan%29

Stateless tests (arm_binary, parallel) - 03518_alter_logical_race (showed up after this run). Chronic race, 17 PRs + 1 master in 30d.

vadimskipin already approved. None of these block the reapply. Suggest rebasing on master and re-running (the TSAN if/PREWHERE error is a separate trunk bug worth its own fix; it will recur on any PR until then).

@nikitamikhaylov

Copy link
Copy Markdown
Member

@groeneai Link the existing issues or fixing PRs for all the failures.

@mstetsyuk mstetsyuk added this pull request to the merge queue Jun 22, 2026
@groeneai

Copy link
Copy Markdown
Contributor

@nikitamikhaylov Links per failure:

  1. Stress test (amd_tsan), Unexpected return type from if (STID 1611-483a): fixing PR Fix LOGICAL_ERROR in on-fly mutation read when an UPDATE keep-old fallback is read under a non-constant condition #108128 (open, CI running), tracking issue ALTER MODIFY COLUMN after a lightweight DELETE fails when soft-deleted rows hold unconvertible values #105947. It is a post-Fix LOGICAL_ERROR in on-fly mutation prewhere when reading an overwritten LowCardinality source column #107475 on-fly-mutation if keep-old-fallback return-type bug, not socket-related.

  2. Stress test (amd_asan_ubsan), Hung check: auto-generated CI tracking issue Hung check failed, possible deadlock found #107941 ("Hung check failed, possible deadlock found", labels testing+fuzz). The arrayFold runaway here is one instance of the generic fuzzer hung-check (cancelled query not terminating); there is no dedicated fixing PR for that specific query.

  3. Stateless tests (arm_binary, parallel), 03518_alter_logical_race: fixing PRs Let 03518_alter_logical_race extend its budget when no progress is made #108003 (merged 2026-06-20) and Fix flaky 03518_alter_logical_race: unthrottle the MODIFY COLUMN mutation #108060 (open follow-up that unthrottles the MODIFY COLUMN mutation).

@groeneai

Copy link
Copy Markdown
Contributor

@nikitamikhaylov Links per failure:

Stress (amd_tsan) if return-type LOGICAL_ERROR: tracking issue #105947, fixing PR #108128 (mine, open, closes #105947). It is the if manifestation introduced by merged #107475, which fixed the earlier materialize manifestation of the same on-fly-mutation read bug.

Stress (amd_asan_ubsan) hung check (arrayFold runaway): umbrella issue #107941 ("Hung check failed, possible deadlock found"); diagnostic logging in open #105905. No dedicated fix for this specific fuzzer cancelled-query-not-terminating shape, it is the chronic fuzzer hung-check flake (39 PRs + 9 master in 30d).

Stateless (arm_binary, parallel) 03518_alter_logical_race: no tracking issue, two fixes by alexey-milovidov: merged #108003 (extend the test budget) and open #108060 (unthrottle the MODIFY COLUMN mutation).

@alexey-milovidov alexey-milovidov removed this pull request from the merge queue due to a manual request Jun 22, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member

@groeneai, provide a fix for runaway arrayFold. We can add limits for too complex operations or introduce cancellation checks inside.

@alexey-milovidov

Copy link
Copy Markdown
Member

It will be reasonable to update the branch and check for CI again.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov fix for the runaway arrayFold: #108192.

FunctionArrayFold::executeImpl folds an entire chunk in one call, and the only time/cancellation check is at the pipeline level between chunks, so a fold over one long array ignored KILL QUERY and max_execution_time (the hung-check query ran 2709s while already cancelled). The PR adds checkTimeLimit() inside the in-function loops (cancellation-checks approach), plus a regression test that times out promptly. Verified: without the fix the test query runs ~15-17s ignoring max_execution_time=1; with it, ~1.5s.

@clickhouse-gh

clickhouse-gh Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.30% -0.10%
Functions 92.50% 92.60% +0.10%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 375/441 (85.03%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 5 line(s) · Uncovered code

Full report · Diff report

@nikitamikhaylov

Copy link
Copy Markdown
Member

@groeneai what's up with the CI failures?

@groeneai

Copy link
Copy Markdown
Contributor

@nikitamikhaylov The only real failure on the current head (59ebbd08) is Stress test (arm_tsan). Everything else red is either a consequence of it or non-blocking infra, and none of it comes from this PR.

Re-running the arm_tsan stress job should clear it; the assertion is intermittent and not produced by this PR.

@mstetsyuk

mstetsyuk commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

@mstetsyuk mstetsyuk added this pull request to the merge queue Jun 23, 2026
Merged via the queue into master with commit 2d224b0 Jun 23, 2026
165 of 167 checks passed
@mstetsyuk mstetsyuk deleted the reapply-silk-sockets branch June 23, 2026 12:15
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants