iframe-proxy

CheSema · 2026-04-29T07:27:30Z

Builds on top of #103234 (ReadPipeline). Gated by SET use_reader_executor = 1 (experimental, default off).

This commit replaces the matryoshka of ReadBuffer wrappers with a ReaderExecutor that owns offset mapping, cache decisions, prefetch and decryption in one place. Caches plug in via a uniform ICacheProvider / ICacheHandle API.

What this brings

Zero-copy page-cache hits. Reads served from PageCache reach the caller as a shared_ptr into the mmap'd cell — no memcpy between cache and working buffer (compare to CachedInMemoryReadBufferFromFile's per-block copy).

Connection reuse across remote read calls. Sequential reads against the same S3/Azure/HTTP object reuse an open buffer instead of issuing a new HTTP request per window. SourceBufferLimit caps live connections globally (max_remote_read_connections, default 1000); over-limit reads fall back to stateless open-read-close. Visible via system.remote_read_connections.

Coordinated cache + prefetch decisions. The matryoshka design had each layer making independent choices (prefetch ahead in AsynchronousBoundedReadBuffer, decide what to cache in CachedOnDiskReadBufferFromFile, gather across objects in ReadBufferFromRemoteFSGather). The executor sees the whole window:

prefetches only what isn't already cache-hit
retains source over-read only when paired with a live connection (drops it otherwise)
knows which bytes are user-requested vs cache-fill, so it doesn't account cache-fill bytes against the user's read budget

Per-object cache identity, not per-pipeline. DiskCacheProvider derives FileCacheKey / FileCacheOriginInfo per StoredObject (etag-keyed caching for StorageObjectStorageSource, segment-key-type classification for Data vs System queues). The old single-key-per-executor approach silently shared cache identity across all objects in a gather-mode read.

Foundation for new cache backends. New caches (vector cache, distributed cache, etc.) implement one interface (ICacheProvider::lookup) and plug into the chain without touching read paths.

Observability

system.remote_read_connections — open connections (path, query id, position, elapsed)
system.reader_executor_log — per-executor stats (cache hit/miss/populate bytes, prefetch hits/cancellations, source request count, decrypt time, …)
ProfileEvents: ReaderExecutor* counters and microsecond timers; LiveSourceBuffer* counters
HistogramMetrics: cache get/populate/source-read/prefetch-wait latencies

Server settings

reader_executor_prefetch_pool_size (default 8) — shared prefetch thread pool size
reader_executor_prefetch_queue_size (default 0 = pool_size * 10) — pool queue depth
max_remote_read_connections (default 1000) — live-buffer slot count, reload-able

Internals

Rope / RopeNode / OwnedRopeBuffer — refcounted buffer chains with mixed provenance (owned, page-cache-pinned), built-in cursor (peek/advance/tryRewind), sorted-on-insert nodes, disjoint-interval coverage tracking.
ICacheProvider / ICacheHandle — uniform cache API (lookup → status/get/put). Two implementations: PageCacheProvider (file-level, zero-copy), DiskCacheProvider (per-object, FileCache-backed).
ISourceReader — stateless range-read interface. LocalSourceReader, ObjectStorageSourceReader, BufferSourceReader (adapter for backup / BufferCreator).
OffsetMap — replaces ReadBufferFromRemoteFSGather's gather logic with a logical-to-(object, object-offset) lookup; supports single-object unknown-size (S3 HEAD without Content-Length → streams to EOF).
ReaderExecutor — owns: position, offset map, cache chain, prefetch handle, live buffer, over-read tail, source-buffer slot.
PipelineReadBuffer — thin ReadBufferFromFileBase exposing the executor through BufferBase::set/next/seek (so legacy callers see no change).
PrefetchThreadPool — shared bounded pool, returns nullptr on overflow (sync fallback), task cancellation is race-free.

Testing

~3,700 lines of tests (≈46% of the PR), 130+ gtest cases:

gtest_rope — 31 cases (append / peek / advance / tryRewind / slice / copyTo / coverage queries / shift).
gtest_reader_executor — 41+ cases (cache-chain combinations, per-object lookup, prefetch races, EOF release, slot leak fixes, unknown-size streaming, …).
gtest_filecache — 22 cases including over-read / bypass-mode / partial coverage scenarios.
gtest_read_pipeline — known/unknown size selection through the executor.
Functional: 04262_reader_executor_observability plus opt-outs (SET use_reader_executor = 0) on tests that intentionally exercise the legacy path.

Load tesing

the simpliest cache API"

┌─────────┬────────────┬──────────┬───────┬────────────────────────┐
│ regime  │ legacy QPS │ exec QPS │ QPS Δ │ per-query-type latency │
├─────────┼────────────┼──────────┼───────┼────────────────────────┤
│ cold    │      0.252 │    0.320 │  +27% │       −12% (faster) ✅ │
├─────────┼────────────┼──────────┼───────┼────────────────────────┤
│ warm    │      0.407 │    0.252 │ −38%¹ │       +27% (slower) ❌ │
├─────────┼────────────┼──────────┼───────┼────────────────────────┤
│ partial │      0.247 │    0.251 │   +2% │       +37% (slower) ❌ │
└─────────┴────────────┴──────────┴───────┴────────────────────────┘

changed cache API (stream aware):

Full clean verdict — 9e9abe9d (26.6.1.1)

┌─────────────┬─────────────────────────┬───────────────────────────────────────────────────┐
│   Regime    │   Executor vs legacy    │                       Note                        │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ✅ cold     │ −15% (win)              │ 0 slot fail; coalesced reads                      │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ✅ warm     │ −8% (win)               │ 0 slot fail; cache reads healthy                  │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ⚠️ partial  │ +7%                     │ S3-source over-read / churn on uncached half      │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ❌ populate │ +17% (worst)            │ 50% connection reuse on cache-fill path — churn   │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ✅ stress   │ 27 ok / 1 OOM vs 4 / 11 │ controlled degradation = net win (bounded, alive) │
└─────────────┴─────────────────────────┴───────────────────────────────────────────────────┘


stress details:

┌───────────────────────┬───────────┬──────────┐
│                       │  legacy   │ executor │
├───────────────────────┼───────────┼──────────┤
│ queries finished (ok) │         4 │       27 │
├───────────────────────┼───────────┼──────────┤
│ failed                │        11 │        1 │
├───────────────────────┼───────────┼──────────┤
│ OOM                   │        11 │        1 │
├───────────────────────┼───────────┼──────────┤
│ connection reuse      │     86.8% │    98.8% │
├───────────────────────┼───────────┼──────────┤
│ connection resets     │   233,618 │   38,581 │
├───────────────────────┼───────────┼──────────┤
│ peak TCP recv-buffer  │ 4,152 MiB │  813 MiB │
├───────────────────────┼───────────┼──────────┤
│ TCP sockets           │    17,333 │    1,548 │
├───────────────────────┼───────────┼──────────┤
│ cgroup mem            │  24.4 GiB │ 24.6 GiB │
└───────────────────────┴───────────┴──────────┘

sha -- 494fcfe8825c

┌─────────┬──────────────┬──────────────┬──────────────┐
│ regime  │ baseline QPS │ executor QPS │  exec/base   │
├─────────┼──────────────┼──────────────┼──────────────┤
│ cold    │        0.207 │        0.292 │ 1.41× (+41%) │
├─────────┼──────────────┼──────────────┼──────────────┤
│ warm    │        0.330 │        0.307 │  0.93× (−7%) │
├─────────┼──────────────┼──────────────┼──────────────┤
│ partial │        0.375 │        0.420 │ 1.12× (+12%) │
└─────────┴──────────────┴──────────────┴──────────────┘

That is good outcome. I need to optimize the work with caches. Need to stream data from available cache segments without any additional costs.

Changelog category (leave one):

Experimental Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add experimental ReaderExecutor for pipeline-based read orchestration with unified cache API (PageCacheProvider, DiskCacheProvider), connection reuse for remote reads, shared prefetch pool, and system.remote_read_connections / system.reader_executor_log observability tables. Enable with SET use_reader_executor = 1.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

clickhouse-gh · 2026-04-30T03:09:11Z

Workflow [PR], commit [330ee26]

Summary: ❌

job_name	test_name	status	info
Build (arm_tidy)		FAIL
	Build ClickHouse	FAIL	cidb
Fast test (arm_darwin)		DROPPED
Build (amd_debug)		DROPPED
Build (amd_asan_ubsan)		DROPPED
Build (amd_tsan)		DROPPED
Build (amd_msan)		DROPPED
Build (amd_binary)		DROPPED
Build (arm_debug)		DROPPED
Build (arm_asan_ubsan)		DROPPED
Build (arm_tsan)		DROPPED

AI Review

Summary

This PR adds the experimental ReaderExecutor read pipeline, cache-provider integration, long remote connection reuse, memory-pressure sizing, and reader observability. The implementation is still not ready to merge: there are active correctness issues in cache scheduling, unknown-size source handling, long-connection coordinates, encrypted offset handling, cache dependency ordering, and several observability/test contracts that currently pass vacuously or point at missing surfaces.

PR Metadata

💡 Changelog category as Experimental Feature is appropriate.
💡 The Changelog entry is currently inaccurate because it promises system.remote_read_connections, but the current diff does not register that system table. Either implement the table or remove it from the entry. Suggested replacement until implemented: "Add experimental ReaderExecutor for pipeline-based read orchestration with unified cache API (PageCacheProvider, DiskCacheProvider), long-connection reuse for remote reads, shared prefetch pool, and system.reader_executor_log observability. Enable with SET use_reader_executor = 1."
💡 The documentation checkbox is unchecked even though this is a user-facing experimental feature with new settings and observability. Add user-facing documentation or update the PR metadata if the feature is intentionally undocumented while experimental.

Missing context / blind spots

⚠️ The checkout is GitHub's synthetic merge commit d1c410a7 for PR head 330ee263; .git was read-only, so I could not fetch/check out the exact PR head object or base commit locally. Review comments were posted against 330ee263, and the merge tree was used for code inspection.
⚠️ I did not run a build or tests during this review; the findings are from source and diff inspection. Build/style-only failures were intentionally not raised as inline findings.

Findings

❌ Blockers

[src/IO/ReaderExecutor.cpp:758] Planned cache residency promises [pos, serve_end) is pinned and readable, but serveHitStep treats an empty/short readHitFromView as normal EOF/progress. A cache detach or short committed prefix can silently truncate a known resident window instead of failing or refetching. Require full coverage for planned hits, or reclassify the missing suffix as a source miss.
[src/IO/ReaderExecutor.cpp] LongConnection offsets are object-local, but continuation checks pass file-global physical offsets for multi-object logical files. A connection opened inside object N can be considered unusable for the next range in the same object, causing avoidable reopen/abandon behavior and broken reuse. Convert to object-local offsets before canStartServing / canContinue.
[src/IO/ReaderExecutor.cpp:1732] For unknown-size objects with no finite extent, longConnectionBound can return SIZE_MAX as a concrete right bound and pass it to setReadUntilPosition. Unknown size is not a valid bounded-read end; avoid opening long bounded connections without a finite object end/extent, or cap to a finite reach.
[src/IO/ReaderExecutor.cpp:1819] Unknown-size reads can keep a long-connection slot after EOF because release only happens at numeric read_until, which is SIZE_MAX in this path. Drop/release the long connection when EOF is learned from a short unknown-size read.
[src/IO/ReaderExecutor.cpp] The encrypted path can widen the first miss before data_start_offset, then use those physical bytes as if they were logical payload. This can expose encrypted header bytes or trip missing-prefix checks. Clamp fetch/serve windows to the logical payload start before decrypt-shifted assembly.
[src/IO/PlanSchedule.cpp:320] Same-cell dependencies are derived from retrieve vector order, but UpperCacheRead / HandedChain retrieves are appended after remote retrieves. Append-only cache cells need byte-order dependencies, otherwise a later offset can write before an earlier one. Build deps by target cell and range offset, independent of vector position.
[src/IO/PlanSchedule.cpp:299] Small faster-tier resident runs are skipped for UpperCacheRead because the remote connection will bridge them, but bridged bytes are discarded by LongConnection::skipForward and never written into the lower cell. Keep an UpperCacheRead fill for lower-cache cells even when the source GET skips the resident bytes.
[src/IO/ReaderExecutor.cpp] mergeRanges can stitch led ranges across sibling-led holes after led_disjoint excluded those holes. That makes this reader fetch/write bytes owned by another downloader. Do not merge across excluded sibling-led intervals.
[src/IO/ReaderExecutor.cpp:2288] A collected cache-fill retrieve can remain Ready forever because reapPutMachine checks the old st.fetched before this line advances it, then this assignment overwrites the phase. Update the fetched frontier before reap, or mark Done after the frontier update.
[src/IO/ReaderExecutor.cpp:2403] Foreground synchronous fallback can complete a cache write but leave the retrieve in Ready; depsSatisfied requires Done, so later same-cell retrieves can be blocked from read-ahead forever. Mark the retrieve Done when the inline write completed the scheduled range.
[src/IO/DiskCacheProvider.h] fetchHeadAlignment / fetchTailAlignment always use cache->getMaxFileSegmentSize, while planning and getOrSet use cache_settings.boundary_alignment.value_or(cache->getBoundaryAlignment()). The executor can fetch/write ranges on a different grid than the file cache allocated. Use the same resolved boundary alignment everywhere.
[src/Interpreters/FileCache/FileCache.cpp:1447] Per-query filesystem-cache reservation accounting can stay at zero after query_context->add, because query_priority_iterator is not refreshed. This bypasses filesystem_cache_max_download_size for the query. Re-read or return the per-query iterator after adding it and increment both queues.
[programs/server/Server.cpp:2426] Invalid memory-pressure thresholds are validated after config().replace, so a failed SYSTEM RELOAD CONFIG can still leave the rejected config in the global application config. Validate against the loaded config before replacing the live config, or roll the config replacement back on failure.

⚠️ Majors

[src/Core/Settings.cpp:760] reader_executor_window_size and reader_executor_block_size accept 0, but ReaderExecutor assumes positive sizes. Use NonZeroUInt64 or reject zero before constructing the executor.
[src/IO/ReadPipeline.cpp:350] The local executor branch drops local_filesystem_read_prefetch; callers can request local prefetch but get a PipelineReadBuffer with no prefetch implementation. Preserve the legacy async/local prefetch behavior or reject it explicitly.
[src/IO/PageCacheProvider.cpp] Page-cache hits are touched during lookup before miss blocks from the same handle are inserted. A small page cache can evict the just-hit block before the handle serves it. Defer hit promotion until after put/fill, as the filesystem-cache path does.
[src/Interpreters/FileCache/FileCache.cpp:1437] The create-directory rollback invalidates only the main queue entry when the per-query entry was just added but query_priority_iterator stayed null. Roll back the per-query entry too.
[src/Core/ServerSettings.cpp:1424] reader_executor_prefetch_queue_size defaults to 80 while documenting 0 as the derived pool_size * 10 default. Make the configured default 0 or update the text.
[src/Core/Settings.cpp:6615] enable_reader_executor_log promises system.reader_executor_log, but the default server config lacks <reader_executor_log>, so enabling the setting can still emit no rows. Add default log config.
[src/Common/MemoryPressureMonitor.cpp:103] setThresholds resets sticky cooldown on every config reload, even when thresholds are unchanged. Preserve current state when the packed thresholds do not change.
[src/Common/CurrentMetrics.cpp] The PR/changelog promise system.remote_read_connections, but the current diff only exposes LongConnections/metrics. Add the table or change the user-facing contract.
[src/IO/ReadPipeline.cpp:435] The executor filesystem-cache path drops dc.cache_log, so enable_filesystem_cache_log no longer emits system.filesystem_cache_log rows under use_reader_executor. Plumb the log sink or document/reject unsupported logging.
[src/IO/ReadPipeline.cpp:435] The executor skips page cache for StoredObject::UnknownSize but still installs DiskCacheProvider; the legacy path uses canUseFilesystemCache to avoid file-cache segment math without a real object size. Skip filesystem cache for unknown-size objects or fall back to the legacy no-cache behavior.
[src/IO/ReadPipeline.cpp:450] The executor path forwards only the prefetch pool and drops FilesystemReadPrefetchesLog, so enable_filesystem_read_prefetches_log loses rows. Thread the log sink into executor prefetch logging.
[src/Core/SettingsChangesHistory.cpp:52] reader_executor_use_long_connections is default-true but records true -> true, so compatibility cannot restore the old disabled behavior. Record the compatibility value as false.
[src/Interpreters/ReaderExecutorLog.cpp] cache_populate_requests includes promotion writes, but system.reader_executor_log has no bytes_promoted column, so the log cannot reconcile populate requests with bytes. Add a promoted-bytes column or exclude promotions from that counter.
[src/IO/ReaderExecutor.cpp] Promotion stops at the first provider with the same coarse CacheTier; multiple filesystem-cache layers all report FilesystemCache, so a slow-cache hit does not backfill a faster filesystem cache. Stop at the actual serving provider/entry, not just tier.
[src/IO/PipelineReadBuffer.cpp] nextImpl no longer invokes the inherited profile_callback, so MergeTreeReadPool loses slow-read backoff feedback when the executor is enabled. Call the callback with the fetched window metrics.
[src/IO/ReaderExecutor.cpp:1811] Long-connection skip bytes in prefetch are counted as source/over-read but not as PrefetchIssuedSourceBytes, under-reporting prefetch bandwidth. Account skipped bytes in the prefetch-issued counter too.
[src/IO/ReaderExecutor.cpp:1947] Collected prefetch stats are merged in tryCollectMachine and then merged again when the same machine is reaped for its put step. Split fetch and put stats or clear fetch stats before reuse.
[src/Processors/QueryPlan/ReadFromMergeTree.cpp:612] MergeTreePrefetchedReadPool is disabled before knowing whether ReaderExecutor will actually be used. In fallback cases such as distributed cache, this silently disables legacy prefetch. Delay disabling until executor construction succeeds.
[src/Common/CurrentMetrics.cpp:61] ReaderExecutorPrefetchInFlight is documented as queued/running tasks but tracks FetchMachine lifetime, including already finished AwaitCollect machines. Tie the metric to the actual job handle lifetime or change the description.

⚠️ Test and evidence issues

[tests/queries/0_stateless/04336_reader_executor_object_storage_live_connections.sh:2] Missing no-distributed-cache makes executor profile-event assertions run against the legacy fallback path.
[tests/queries/0_stateless/04330_reader_executor_honors_read_prefetch_setting.sh:42] The assertions are non-vacuous only when ReaderExecutor is used; add no-distributed-cache or assert executor activation first.
[tests/queries/0_stateless/04337_reader_executor_warm_slot_reservation.sh:42] The test uses removed event names, so map lookups return 0 and the slot-churn guard can pass vacuously. Use current LongConnection* events and add an activation guard.
[tests/queries/0_stateless/04328_dynamically_change_filesystem_cache_size_reader_executor.sh] This executor-specific s3_cache test needs no-distributed-cache.
[tests/queries/0_stateless/04327_reader_executor_metrics.sql:55] Zero-cache assertions are invalid under randomized page-cache settings. Add no-random-settings or explicitly disable page-cache settings.
[tests/queries/0_stateless/04333_reader_executor_s3_connection_pool_reuse.sh:2] The test mutates global filesystem-cache state while running in parallel. Add strict isolation or avoid global SYSTEM DROP FILESYSTEM CACHE.
[src/IO/tests/gtest_reader_executor_cache_chain.cpp:1214] The test locks in the same-tier promotion bug as expected behavior. Update it to expect the faster filesystem cache to be populated from the slower hit.
[tests/queries/0_stateless/04333_reader_executor_s3_connection_pool_reuse.sh:70] The retry loop falls through after 20 failed attempts and exits with cleanup status, so it passes when the condition is never observed. Fail explicitly after the loop.
[tests/integration/test_reader_executor_metric/test.py:407] test_page_cache_path only checks reduced source requests, not page-cache bytes/events, so it would pass without actually serving warm reads from PageCacheProvider.
⚠️ Add a focused unknown-size object + filesystem-cache + use_reader_executor test proving the executor skips filesystem cache or handles EOF-streaming without creating UInt64::max-sized cache segments.

Performance & Safety

⚠️ The remaining long-connection issues affect global remote connection capacity and can strand slots until reader destruction.
⚠️ The remaining cache scheduling issues can produce duplicate remote reads, wrong append order for cache cells, missing cache fills, or cache hits that are treated as EOF.
⚠️ The fallback interaction with distributed cache can disable legacy prefetch while not enabling the executor, causing a performance regression in a supported fallback mode.

User-Lens

⚠️ Observability is inconsistent: system.remote_read_connections is promised but absent, system.reader_executor_log may not be configured by default, filesystem-cache logs are dropped, prefetch logs are dropped, and several counters under/over-count important paths.
⚠️ compatibility cannot currently restore the old behavior for the default-true reader_executor_use_long_connections setting.

Final Verdict

❌ Blocked. The feature is still experimental, but the current PR changes core read/cache behavior and has unresolved correctness, resource, compatibility, observability, and test-validity issues that should be fixed before merge.

…ions field `reader_executor_decrypt_ahead` is experimental, defaults off, and has no production reach: only one gtest exercises the worker-decrypt path, and it sets `ReaderExecutor::Options::decrypt_ahead` directly. Drop the user-facing setting and its plumbing (Settings.cpp, the SettingsChangesHistory 26.7 block, Context.cpp, ReadSettings.h, ReadPipeline.cpp); keep the `Options` field and the gated worker-decrypt path, so the behavior stays available to tests and is unchanged in production (decryption still happens at the serve boundary). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- `ChainedBuffers::extract` was `slice()` plus a `chassert(covers())` that can never fire (all 5 callers guard `covers()` first); delete it and call `slice()`. - `coveredBytes` had a single production use - the overlap guard in `copyTo`; inline that debug-gated unique-byte sum and drop the public method. - `planResidencyView` aligned its miss ranges in three passes (collect, sort, merge), but the forward segment walk already yields them ascending, so the sort was redundant; fuse the alignment into the single merge pass. - Extract `segmentCommittedEnd` for the committed-end ternary shared by `readOverlappingSegments` and `DiskCacheReader::readable`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Behavior-neutral simplifications, verified against the full executor gtest suite: - `RetrieveStatus::machine` duplicated the single global in-flight `machine`; replace it with `machineFor(ri)` (`machine && machine->retrieve_index == ri`), which also removes the stale-pointer clear in `cancelMachine`. - `FetchMachine::requested_range` was always `physical_window` shifted by `data_start_offset`; drop the field and derive it at the read sites. - `FetchMachine::sibling_led` was a vector the worker filled but collect only tested for emptiness; replace it with a `bool contended` set by the worker. - `effectivePrefetchWindowSize` was only ever compared `== 0`; replace it with a `prefetchEnabled(level)` predicate. - Inline the single-caller helpers `LongConnection::isComplete` and `maybeDrainLongTail`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- `ContinuityTracker::Options::ewma_alpha` was never configured (always the 0.5 default); inline 0.5 in `closeRun` and drop the field. - Comment-only sweep of references to removed/renamed entities: "live connection" -> "long connection", the removed `requested_range` / `coverWindow` / `PrefetchJob` / `serveCacheTiersCollectingMisses` / `fill_ahead_lead_ram` asides, `job->pressure_level` -> `pressure_snapshot`, the retired `read_extent_end` plan-sizing claim, and the `reader_executor_block_size` doc (it sizes ChainedBuffers nodes, not the read window). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`reader_executor_decrypt_ahead` (already demoted from a setting) decrypted prefetched bytes on the worker thread ahead of serving - experimental, never enabled, never measured. Drop it entirely so decryption always happens at the serve boundary (`decryptWindow`), the only behavior that ever shipped. Removes the `Options`/member toggle, `decryptFetchedAhead`, `FetchMachine::plaintext_fetched`, `RetrieveStatus::ready_bytes_is_plaintext`, `served_window_is_plaintext`, the `is_plaintext` plumbing through `tryCollectMachine`/collect/serve, and the gtest that exercised it. Decrypt-on-serve coverage is unchanged (`DecryptsMultiNodeWindow` etc.). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ptively The readNextWindow boundary block bridged two concerns in one tangle. Separate them: - collect the in-flight machine at the boundary (its own focused comment), then - replan only once the plan is fully consumed - cursor fell before plan_start, or reached plan_end and the plan doesn't already run to EOF. Drop the pre-emptive `position_phys + window_size > plan_end` look-ahead: the plan is now used to its end before a rebuild. Safe because the boundary collect runs before the serve in the same call, so the cursor never serves a consumed step at plan_end (no premature interior EOF). The replan + next prefetch launch now happen at the boundary rather than one window early. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`to_read` (the per-call read ceiling) was threaded through interpretStep and the six serve functions. Replace the parameter with a private `readCeiling()` helper that recomputes the same value on demand - `position` is stable during a serve, so the value is identical. Behavior-neutral; cleans up seven signatures. The helper deliberately omits the `reached_eof` check, so an in-flight machine's final bytes are still collected after EOF latches (the `atEnd() && !machine` guard owns the no-machine EOF). `serveCacheBlock`'s own byte-budget parameter is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…Cursor/finishWindow readNextWindow gated EOF as `atEnd() && !machine`, then fell through to a `readCeiling() > 0`-gated replan block and a post-serve tail - so the in-flight machine's final-window drain rode the general path and the gate had to ignore `reached_eof`. Hoist EOF to an explicit `if (atEnd())` branch: drain an in-flight machine's final window, else report EOF. The rest is then strictly `!atEnd()`, so the replan gate simplifies to `!atExtent()` (new helper) - the read extent, not the file end. Factor the collect-then-replan block into `prepareCursor` and the post-serve tail into `finishWindow`, shared by the drain and normal paths. Behavior-neutral: for a reachable `atEnd() && machine` (unknown-size short read) `readCeiling() > 0` always held, so the drain's unconditional `prepareCursor` matches the old gated block; and under `!atEnd()`, `!atExtent()` equals the old `readCeiling() > 0`. Covered by `UnknownSizePrefetchedFinalBytesAreServed`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…handleExtentOrReplan The serve path had three functions making the same "plan stale -> reschedule" decision: prepareCursor (proactive, before every serve), interpretStep's guard (detect + route), and handleExtentOrReplan (reactive replan + retry). Since readNextWindow's EOF restructure runs prepareCursor before every interpretStep, the reactive replan/retry is redundant - the plan is already fresh when interpretStep runs, so handleExtentOrReplan only ever hits its early empty return. Verified the replan/retry path is unreachable: a chassert probe stayed clean across the unit grid AND the multi-thread integration that caught the original premature-EOF bug, and a 5-way adversarial reachability analysis agreed - whenever interpretStep's guard fires the executor is at atEnd() or readCeiling()==0 (the cursor is structurally bounded by findStepContaining's clamp, and empty steps == empty plan span == atEnd/atExtent). - Delete handleExtentOrReplan; interpretStep returns {} directly when there is nothing to serve. - prepareCursor no-ops at the extent (atExtent), so readNextWindow calls it unconditionally - one scheduler call site, no !atExtent gate. Behavior-neutral; grid 145 + integration 4/4 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

serveCacheBlock had a single caller - serveHitStep, a one-line wrapper that computed the byte budget (min of the read ceiling and the step's remaining span) and passed it down. Fold the resident-run streaming body up into serveHitStep (computing the budget from the step internally) and drop serveCacheBlock and its generic to_read parameter. interpretStep still dispatches the symmetric pair serveHitStep(step) / serveRetrieveStep(step). Behavior-neutral; grid 145 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tchMachineRunner The runner that drives FetchMachine machines was a single concrete class bound to the prefetch pool. Extract an abstract `IFetchMachineRunner` (the four-verb protocol: schedule / tryCancelQueued / requestInterrupt / waitReleased) and rename the pool-backed implementation to `PoolFetchMachineRunner`; the executor's `runner` member is now `unique_ptr<IFetchMachineRunner>`. This is the seam for a future inline LocalFetchMachineRunner that runs the machine synchronously on the serve thread, so foreground and background can share one FetchMachine flow. Pure type-level refactor, behavior-neutral; grid 166 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The `run_step` closure was inlined in `launchRetrieve`. Factor it into a named `makeFetchStep(FetchMachine&)` helper so the step body is independent of the scheduling decision - a future inline runner can reuse the identical closure. Behavior-identical extraction; grid 166 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add `LocalFetchMachineRunner`: an `IFetchMachineRunner` whose `schedule` runs the machine's step synchronously on the calling thread (mirroring the pool wrapper's state writes + exception-into-`failure` capture) and leaves `current_step` null. Not constructed by the executor yet - this is the inline driver the foreground will use so it can share the FetchMachine flow without the pool. Because an inline machine never carries a `JobHandle`, `waitReleased` / `tryCancelQueued` and the abandoned-machine drain all no-op on it - structurally avoiding the double-join a pool future would risk. Unit tests cover inline-run + terminal state + the step-throw-lands-in-`failure` hazard + the no-handle verbs. grid 170 green (4 new tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…chMachine (flag-gated) Add `reader_executor_unified_foreground` (experimental, default OFF). When ON, the synchronous foreground populatable serve runs the SAME FetchMachine flow as the prefetch - built by `launchMachineForWindow` and run INLINE on the serve thread by a `LocalFetchMachineRunner` - then collected immediately, so the serve reads the just-committed cells (cache-as-buffer) instead of the bespoke sync read+bank. The legacy sync path stays as the fallback for sibling-led contention and bypass gaps. - New `launchMachineForWindow(ri, window, runner)` factored from `launchRetrieve`, shared by the pool (read-ahead) and the inline (foreground) runners. - New `local_runner` member; `collectRunner()` returns `runner ? *runner : *local_runner`, so the collect verbs work with or without a pool (a settled inline machine no-ops through either - it carries no JobHandle). - The inline launch is guarded by `!machine` (slot free), never clobbering an in-flight read-ahead machine, and is collected within the same serve call. Default OFF -> behavior-neutral. Validated: gtest 172 (flag OFF identical; flag ON no-pool serves+populates from cells with no sync fallback; flag ON + pool coexists with read-ahead); a 5-way adversarial reachability pass (long-conn, machine-slot, contended, teardown, collect-runner - all clean); and integration on real S3 - the flag-OFF matrix unchanged + a new multi-thread `executor==legacy` flag-ON repro. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The unified-foreground inline FetchMachine (a `LocalFetchMachineRunner`, no prefetch pool) now fetches only the contiguous LED PREFIX up to the first sibling-led cache segment, serving it as a short window. The caller's next read resolves the sibling boundary through the existing sync fallback (elect/wait), instead of this thread blocking to fetch a led run past a sibling-led hole. `coordinatedPrefetch` bounds the fetch at the first sibling offset when `inline_serve` is set; `serveRetrievePopulatable` then narrows the serve window to the committed prefix via the new `committedCellPrefixEnd`, so the existing cell-serve covers it fully. The prefix is this executor's own led work, so it is in this writer's per-writer `committed()` set - there is no dead work: a byte a sibling downloaded is never in `committed()`, which is exactly what bounds the prefix. The pool read-ahead path is unchanged (`inline_serve` is false, so the fetch bound stays at the window end). Still gated behind `reader_executor_unified_foreground` (default off). Adds `UnifiedForegroundStopsAtFirstSiblingLedSegment`, a real `DiskCacheWriter` contention test - the downloader-blind `MockCacheWriter` masked an earlier dead-work defect in this area. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…g-gated) The unified-foreground path now runs the SCHEDULED fills inline on the serve thread (the same `FetchMachine` flow as the background prefetch, driven by a `LocalFetchMachineRunner`) instead of inventing a per-cursor fetch: - `serveRetrievePopulatable` drains the scheduled fill from the cache cell's append-only frontier (`committedCellPrefixEnd(fetchWindowAt(window))`) until the cell covers the cursor, then serves from the committed cells. Fetching from the cell frontier (not the cursor) fills a mid-cell read from the cell floor, so the append-only cell write commits instead of being skipped and re-fetched; the frontier also advances past a down-filled embedded hit, so a gap after the hit resumes there rather than re-fetching it from source. - `serveHitStep` down-promotes a served upper-tier hit into the lower cells the schedule marks for cross-cache fill (a `Fill(UpperCacheRead)`), using the bytes already in hand - so a lower segment completes across an embedded faster-tier hit without a remote over-read. - The cell-alignment head fetched below the cursor is recorded as over-read. - A `readBigAt` transient takes no fill pin (mirrors `finalizeAssembledWindow`'s `is_transient` guard), matching the legacy non-executor path: populate, never pin. Gated behind `reader_executor_unified_foreground` (default off). The remaining unified-path work - the inline fill fetching the whole cell-aligned gap run in one source read while skipping upper-resident parts (the legacy `readPhysicalWindow` fetch folded in) - is tracked for a follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The unified-foreground inline fill now fetches like the legacy `readPhysicalWindow` did - the whole cell-aligned miss run in one source read, skipping embedded faster-tier hits - and attributes the served bytes the same way. Four coupled changes (all behind `reader_executor_unified_foreground`, still default off): - `serveRetrievePopulatable` launches the inline machine over the whole cell- aligned run `[base, gapEndWithin(base, aligned.end()))` instead of just `[base, cursor.end())`. One cold cell is now ONE source read and the cursor's later windows in the cell are served from it - matching the legacy `fetchWindowAt`. The cell-alignment head and the run tail past the cursor window are recorded as `overread_pending` (the inline collect writes via `pushChainToWriters`, which - unlike the sync `assembleAndWriteBack` - does not account over-read), keyed off the committed frontier so a short fill records only what landed. - `CoverageMap::gapEndWithin(from, limit)` caps that launch at the first embedded resident byte of any tier, so an embedded faster-tier hit fills the lower cell DOWN (`downFillScheduledLower`) instead of being re-fetched from the source. `gapEnd` itself is unsuitable here: it is bounded by `plan_end`, which the read-extent advances one window at a time, so it would collapse the run to the cursor window. - `coordinatedPrefetch` fetches each led run in ONE `fetchGapsFromSource` when the machine serves inline (no concurrent reader to hand a growing prefix to) - one GET on the stateless arm instead of one per window. The background run-ahead keeps its per-tile progressive commit. - The cell-serve credits a cache hit only for bytes already committed BEFORE this serve started filling (a prior window's prefill or a worker's run-ahead). Bytes this serve fetches from the source transit the cell but are counted as `BytesFromSource`, not a cache hit - otherwise a cold all-miss reported itself as served entirely from cache (`recreditCommittedPrefixes` / `serveWindowFromCells` gained a `cache_credit` mask; the legacy callers pass none and are unchanged). `EvictionInChainRefetchesEvictedCells` now exercises the unified path explicitly and asserts its re-fetch count (9, vs the legacy path's 7): under the eviction flood the held long connection streams the first cold run but re-opens a GET per window for the later flood-evicted segments. Correctness (no missing bytes) is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts: # src/Disks/DiskObjectStorage/ObjectStorages/HDFS/HDFSObjectStorage.cpp # src/Disks/DiskObjectStorage/ObjectStorages/S3/S3ObjectStorage.cpp # src/Disks/DiskObjectStorage/ObjectStorages/Web/WebObjectStorage.cpp # src/Disks/IO/ReadBufferFromWebServer.cpp # src/IO/ReadPipeline.cpp # src/IO/ReadPipeline.h # src/Storages/ObjectStorage/StorageObjectStorageSource.cpp

clickhouse-gh · 2026-06-30T21:25:57Z

+            st.ready_bytes.append(std::move(collected));
+        }
+        st.fetched += window;
+        st.phase = RetrievePhase::Ready;


This collect path can leave a fully-fetched cache-fill retrieve in Ready forever. tryCollectMachine runs schedulePutStep/reapPutMachine before st.fetched is advanced here, so the reapPutMachine check sees the old frontier and does not mark the retrieve Done; then this assignment overwrites the phase with Ready. depsSatisfied only accepts Done, so a later retrieve that writes the same append-only cache cell can remain blocked even though this prefetch has already fetched and committed the predecessor range.

For example, with retrieve 0 filling [0, 128) and retrieve 1 depending on it for [128, 256), collecting retrieve 0's final window writes the cell, but the dependency still sees retrieve 0 as Ready and never launches retrieve 1. Please advance the fetched frontier before reaping, or set Done after this line whenever st.fetched >= read_plan.schedule.retrieves[ri].range.size.

Flip `reader_executor_unified_foreground` on by default: the `ReaderExecutor`'s synchronous foreground serve now runs the scheduled fills inline through the same `FetchMachine` flow as the background prefetch (a `LocalFetchMachineRunner`), instead of the bespoke synchronous read-and-assemble path. Set the setting to 0 to fall back to the legacy synchronous path (kept for one release as a safety valve before it is removed). The inline path reached parity with the legacy `readPhysicalWindow` in the preceding change (whole cell-aligned fetch, embedded-hit skip, over-read and cache-hit attribution), so this flip is behaviour-equivalent on the metric grid and the schedule KPIs. Flipped in the four plumbing sites (Settings declaration + history, the `ReadSettings` member, and the `ReaderExecutor::Options` default the unit grid uses). Validated: the no-pool gtest grid (`PlanSchedule*:*ReaderExecutor*`) green with the default on, and the real-`FileCache`/S3 `test_reader_executor_metric` integration (the main matrix now exercises the unified path; the legacy arm of `test_repro_unified_foreground_matches_legacy` sets the setting off explicitly). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

With the unified inline foreground the default, the bespoke synchronous read-and-assemble path is dead weight: delete `serveRetrieveForeground`, `syncGapRead`, `readWindowLogical`, `readPhysicalWindow`, `fetchAndBackfillGaps`, and `serveResidentFromPlan`. The three call sites that used it are rerouted: - The two `serveRetrieveForeground` uses (a not-prefetched bypass gap, and the populatable contended fallback where the cursor sits on a segment a sibling executor leads) collapse into one `serveWindowInline`. It fills the cell-aligned window from the source (`fetchWindowAt`, opening a long connection per object piece so the reads coalesce across windows), but first WAITS on a disk cell a sibling is downloading and reads the sibling's committed bytes - deduping the concurrent cold populate instead of re-fetching it - and pins the filled segment via `finalizeAssembledWindow`, exactly as the deleted path did. A bypass gap has no cell, so it is a plain source read banked in `ready_bytes`. - `initDecryption`'s header read goes through a direct one-shot `fetchGapsFromSource` (no plan, no long connection) instead of `readPhysicalWindow` with an empty geometry. `assembleAndWriteBack`, `finalizeAssembledWindow`, `recreditCommittedPrefixes`, `serveLateHits`, `readHitFromView`, and `pushAssembledToWriteBuffers` stay - they are shared with the worker collect path (`collectInFlightInto`). `UnifiedForegroundStopsAtFirstSiblingLedSegment` now asserts the sibling-led segment at the cursor is resolved from the sibling's committed cell (a cache read), not a synchronous source read (`syncReadMicros == 0`). Validated: the no-pool gtest grid `PlanSchedule*:*ReaderExecutor*` green (173), and the real-`FileCache`/S3 `test_reader_executor_metric` integration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The unified inline foreground is now the ReaderExecutor's default and, since the legacy synchronous stack was deleted, its ONLY foreground path - so the `reader_executor_unified_foreground` setting is vestigial and misleading: setting it to 0 no longer restores a legacy path (there is none), it only disabled the inline drain-loop and the cross-cache down-fill, leaving a degraded unified path. Remove it: the SQL setting (`Settings.cpp` DECLARE + `SettingsChangesHistory` - added and removed within the same unreleased version - + `Context.cpp` + the `ReadSettings` member + the `ReadPipeline` propagation) and the `ReaderExecutor::Options::unified_foreground` field. The two `if (unified_foreground)` sites - the inline drain-loop in `serveRetrievePopulatable` and the cross-cache down-fill in `serveHitStep` - become unconditional. Tests: drop the now-redundant `opts.unified_foreground = true` from the gtests (the path is always on); remove `test_repro_unified_foreground_matches_legacy` (its "flag on vs off" premise is moot - `test_repro_legacy_vs_executor` already covers executor-vs-non-executor equivalence). Validated: gtest grid `PlanSchedule*:*ReaderExecutor*` green (173), real `FileCache`/S3 `test_reader_executor_metric` integration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

clickhouse-gh · 2026-07-01T08:35:05Z

+        const auto & dc = *it;
+        if (dc.cache)
+        {
+            executor_caches.push_back(std::make_shared<DiskCacheProvider>(


tryBuildReaderExecutor skips the page cache when any StoredObject has UnknownSize, but it still installs every filesystem-cache provider here. The legacy path gates filesystem cache with canUseFilesystemCache(object), because FileCache needs a real object size for segment bounds and tail completion. With an S3/Web object that has no Content-Length, this executor path passes StoredObject::UnknownSize (UInt64::max) through DiskCacheProvider::openWriteBuffers into getOrSet, so the cache treats an EOF-streaming object as an effectively infinite file and can leave/download segments past the real EOF instead of taking the plain stream-until-EOF path. Please skip DiskCacheProvider when any mapped object is unknown-size (or fall back to the legacy no-cache behavior for those objects), matching the existing canUseFilesystemCache guard.

…t credit mask) The unified serve reads every delivered byte OUT of a committed cell - that is a real cache read (`w.writer->read()`), the cache being the buffer. The consolidated fetch had added a `cache_credit` mask that suppressed counting a cell-read as a cache hit when this same serve had just fetched the byte from the source, so a cold miss reported 0 bytes from cache - the legacy DISJOINT attribution, inconsistent with the cache-as-buffer model (the legacy path served source bytes straight from the assembled result and never re-read the cell; the unified path always routes through the cell). Drop the mask. Now: - `BytesFromSource` = bytes fetched from the origin into cells. - `BytesFrom{Fs,Page}Cache` = bytes read out of cells to serve the client. A cold miss shows BOTH (source filled the cell, the serve read it back out); the two count distinct operations and legitimately overlap. Removes `cacheCredit` + the `cache_credit` parameter on `recreditCommittedPrefixes` / `serveWindowFromCells` + the `pre_committed` snapshot in `serveRetrievePopulatable`. Tests: the schedule's KPI oracle `predictKpi` now predicts `served_from_cache` as ALL delivered User bytes (resident hits AND filled misses), not just pre-resident ones - so `PlanScheduleValidation` (ColdAllMiss / TwoTierDisjointHits / ResidentIsland) and `SchedulePredictsByteKpis` compare actual-vs-oracle on the new semantics with no literal changes. `PartialFsHitTailFromSource` re-baselined: `hit_delta` is the whole file (prefix hit + tail served from the filled cell), the tail still counted once as source. The cost formula and the integration invariants are unaffected (they use `BytesFromSource` and the cache-GET COUNT, not the cache byte counters). Validated: gtest grid `PlanSchedule*:*ReaderExecutor*` green (173), real `FileCache`/S3 `test_reader_executor_metric` integration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…odel M1) First stage of the Plan/Schedule/Execute/Display model (`tmp/reader-executor-model/DESIGN.md`): the schedule carries everything needed to RUN a fetch job, so the inline foreground fill consumes the schedule verbatim instead of re-deriving the fetch geometry per serve call. - `PlanSchedule::Retrieve` gains `fetch_runs` - the sub-ranges of `range` to read from the source, split at EVERY embedded resident region (served / filled down from its tier, never scheduled as a source read). `buildSchedule` computes them with the walk `fillRegion` was already doing before `mergeSorted` discarded the per-gap structure. Whether the executor reads THROUGH a resident hole at run time (its down-fill was skipped by the append-only cell) is a display-state decision - the drain-loop's frontier-in-hole rule - not a schedule property. - `Retrieve` also gains `fetch_head_grid` / `fetch_tail_grid`: the coarsest cache-cell alignment across the plan's tiers (every Remote byte is a miss on every tier, so the coarsest extension applies). The `into` cell edges cannot serve this purpose - the providers merge adjacent aligned misses, so those "cells" are multi-segment runs. - The inline fill's drain-loop becomes a pure schedule consumer: grid-floor the window's first missing display byte (walking ACROSS runs, so a before-slack run a seek jumped over is still fetched and the append-only cell fills whole from its floor), take the display frontier from there, and fetch to the first run end past it - reading through an inter-run resident hole exactly when the frontier sits in one, as the legacy assembler did - tail-ceiled to the grid. `fetchWindowAt` / `gapEndWithin` no longer appear on the serve path. Behavior-preserving: the pieces the loop launches match the previous geometry re-derivation (metric grid R/I/O unchanged, no re-baselining). An adversarial review pass killed the first cut (which baked `min_bytes_for_seek` bridging into the runs - reading through hits the old code clipped at); the landed rule - static resident-split jobs + dynamic frontier read-through - matches the old `gapEndWithin` semantics. Known accepted divergence: with a refused-write / contended cell state the new frontier walk starts at the merged-range grid floor (the old per-tier miss-cell clamp could start later), so it can RESUME a refused cell from its committed end - same data, different piece shape, only reachable under write refusal. The background launch path is untouched (its frontier unification is the next stage). New `PlanScheduleRetrieves` tests: the runs split at an embedded resident region regardless of `min_bytes_for_seek`; a cold gap's after-slack past `plan_end` extends one maximal run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Second stage of the Plan/Schedule/Execute/Display model (`tmp/reader-executor-model/DESIGN.md`): the DISPLAY - the committed cell state - becomes the data-progress surface for populatable jobs, and the two wait sites collapse into one. - New `jobFrontier(ri)`: the first byte of a job's `fetch_runs` not yet committed to the cells - the job's DATA progress, derived from the same display state the serve reads, so a piece stopped anywhere (background or inline) is continued by whoever runs next with no handover. - New `launchProgress(ri)` = `jobFrontier` advanced past bytes already ATTEMPTED (the `RetrieveStatus::fetched` high-water, now END-based and cursor-snapped): the background launch POLICY. The two must differ: a refused cell write (cache full / download budget) or a segment a SIBLING executor downloaded never enters this executor's per-writer committed set, so a raw display frontier pins there and the launcher would re-GET the same lead every serve window, never retire the job, and starve later jobs of read-ahead (an adversarial review pass constructed both loops - the old counter's drift was quietly load-bearing as this heal). `launchRetrieve`, the launch scan, and the Ready->Done transition use `launchProgress`; the serve never reads it. The old counter's accumulation drift is gone: the high-water advances by attempted-window END and cursor position, both well-defined for inline (cell-floor-based) machines. - New `waitOnDisplay(window, own_worker_only, ...)`: the one "a live writer is filling this cell - wait on its download frontier and read it" primitive. `serveWindowFromCells`'s read-behind wait (own worker's targets) and `serveWindowInline`'s contended sibling wait (any disk writer) were the same loop split by who-writes; they now differ only by the policy argument. Page cells are never waited on (filled by promotion, no downloader; the old worker-path iteration over them could never produce coverage). Also corrects a wrong claim from M1's design notes: within one plan the per-writer committed set is MONOTONE (the writer remembers evicted bytes), so eviction is healed at the next re-plan, not by the frontier moving back. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…l M3) Third stage of the Plan/Schedule/Execute/Display model (`tmp/reader-executor-model/DESIGN.md`): every fill the executor performs is now a SCHEDULED job, and the fill kinds whose input is the just-served bytes run through one source-branching executor. - `buildSchedule` now emits a `HandedChain` promote job for every User GAP range (previously only for resident ranges): `into` = the user-overlapping miss cells `writeTargetsFor` deliberately left out of the Remote fetch (the faster tiers - the fetch fills only the bottom tier and same-tier slower layers). The promote-of-fetched work the serve was doing ad hoc is now visible schedule data. The resident promote gains the `[CF-promote]` guard as data: a same-tier faster LAYER is never a promote target (it is a fetch target) - previously that rule lived only in `maybePromote`'s tier-equality break. - New `Retrieve::ahead_eligible`: the fg/bg partition as schedule data. `Remote` fills depend on nothing but the source and may run ahead; the handed kinds (`UpperCacheRead`, `HandedChain`) take the serve's output as input, so they are serve-front only. `maybeLaunchAhead` filters on it. - New `runHandedFills(served_range, bytes)`: the ONE executor of the handed jobs, replacing three ad-hoc functions - `downFillScheduledLower` (its schedule-driven body, generalized to both sources), `promoteFetchedToUpper` (which re-READ the bottom cell to find the bytes the serve already had in hand), and `maybePromote` (whose tier-walk the scheduled `into` cells now encode). Every serve site hands the served bytes to the same call: the hit serve (up-promote + cross-cache down-fill in one pass), the populatable cell-serve, the inline fallback, and the banked bypass serve. The last one matters: the schedule can emit a promote over an `into`-empty gap (a whole-cell run the fetch parts do not fully span is not a FETCH target, yet the served bytes may complete its blocks), and a planned job must run regardless of which path served the bytes - an adversarial review pass caught the bank path skipping it, which would have let prefetch timing decide whether a planned job ran. Handing the served bytes also promotes where the old committed-only walk could not: when the bottom-cell write was refused (cache full) or the segment is sibling-owned, the faster tiers still fill from the bytes in hand - the stated `HandedChain` semantics, and strictly less re-fetching. - `isScheduledFillTarget` (the MACHINE's fill-target predicate) is constrained to `Remote` jobs' cells: without this the new promote jobs would have made the fetch write the faster-tier cells directly - exactly what `writeTargetsFor` forbids (the pc fill trails the serve cursor; it does not ride the fetch lead). New `PlanScheduleRetrieves.PromoteNeverCrossesSameTier` pins `[CF-promote]` as schedule data (the schedule twin of `SlowFsHitIsNotPromotedToFastFs`); `DesignWorkedExample` / `SlackNotPromotedToFasterTier` now assert the visible promote job and `ahead_eligible`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… (model M4) Final stage of the Plan/Schedule/Execute/Display model (`tmp/reader-executor-model/DESIGN.md`): the serve paths share one window and consume only schedule data; the per-job sidecar keeps only what cannot be derived. - ONE step-bounded serve window, computed once in `serveRetrieveStep` for every path (the populatable cell-serve, the bypass bank, and the inline fallback); `serveRetrievePopulatable` / `serveWindowInline` take it as a parameter. The fallback receives the ORIGINAL window (not the narrowed committed prefix), as before - the raced/short path re-serves it whole. - The LAST geometry call leaves the serve path: `serveWindowInline`'s `fetchWindowAt` is replaced by the job's fetch grids clamped into its range - the same schedule data the drain-loop uses (byte-equal for gap windows, identity for a bypass job with grids of 1). Geometry is now consulted only at plan/schedule build, as the model requires. - `RetrievePhase` is gone. The enum was the third progress tracker: "in flight" is `machineFor(ri)`, the serve gates on display coverage, and only `Done` was ever read (the `depsSatisfied` append-only ordering gate). `RetrieveStatus` now holds exactly the underivable rest: `done`, the launch high-water, and the bypass bank. The test inspector exposes `retrieveDone` and a job-relative `retrieveLaunchProgress` instead of raw phases. The fetch grids are tightened to the plan's POPULATING tiers (an adversarial review pass): a bypass-mode cache (`read_from_*_cache_if_exists_otherwise_bypass_cache`) reports its coarse alignment but schedules NO fill cells, so it must not shape the fetch - nothing could hold the grid extension, and the schedule-derived fetch window would have fetched-and-discarded it EVERY window (quadratic source over-read on partially-cached bypass reads; the old geometry query was identity there because extension required a containing miss run). An `into`-empty job can only exist in an all-bypass plan - any populating tier missing a gap gives it a cell - so bypass jobs now truly carry grids of 1. This also stops a bypass tier's alignment from inflating the populatable drain-loop pieces (latent since the grids were introduced). Pinned by `PlanScheduleRetrieves.BypassTierDoesNotShapeFetchGrids`. Known accepted bounded divergence (documented in-code): the schedule-derived head clamp can reach across a same-tier resident run the old per-tier clamp stopped at - at most one grid cell, served from cache when resident, once per plan. Explicitly not folded in (with rationale in the design doc): routing the bypass serve through inline machines would lose the per-object-piece long-connection opens (re-breaking the multi-object bounding), and the bypass bank stays a bank rather than a virtual display cell - `serveWindowInline` remains the one bespoke assembler (sibling dedup + source remainder + pin + bank), now fully schedule-driven in its inputs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CheSema force-pushed the reader-executor branch 4 times, most recently from 77b9b0a to fb8ddfb Compare April 30, 2026 01:32

CheSema changed the base branch from read-pipeline-redesign to master April 30, 2026 02:57

clickhouse-gh Bot added the pr-feature Pull request with new product feature label Apr 30, 2026