ReaderExecutor: pipeline-based read orchestration by CheSema · Pull Request #103706 · ClickHouse/ClickHouse · GitHub
Skip to content

ReaderExecutor: pipeline-based read orchestration#103706

Draft
CheSema wants to merge 519 commits into
masterfrom
reader-executor
Draft

ReaderExecutor: pipeline-based read orchestration#103706
CheSema wants to merge 519 commits into
masterfrom
reader-executor

Conversation

@CheSema

@CheSema CheSema commented Apr 29, 2026

Copy link
Copy Markdown
Member

Builds on top of #103234 (ReadPipeline). Gated by SET use_reader_executor = 1 (experimental, default off).

This commit replaces the matryoshka of ReadBuffer wrappers with a ReaderExecutor that owns offset mapping, cache decisions, prefetch and decryption in one place. Caches plug in via a uniform ICacheProvider / ICacheHandle API.

What this brings

Zero-copy page-cache hits. Reads served from PageCache reach the caller as a shared_ptr into the mmap'd cell — no memcpy between cache and working buffer (compare to CachedInMemoryReadBufferFromFile's per-block copy).

Connection reuse across remote read calls. Sequential reads against the same S3/Azure/HTTP object reuse an open buffer instead of issuing a new HTTP request per window. SourceBufferLimit caps live connections globally (max_remote_read_connections, default 1000); over-limit reads fall back to stateless open-read-close. Visible via system.remote_read_connections.

Coordinated cache + prefetch decisions. The matryoshka design had each layer making independent choices (prefetch ahead in AsynchronousBoundedReadBuffer, decide what to cache in CachedOnDiskReadBufferFromFile, gather across objects in ReadBufferFromRemoteFSGather). The executor sees the whole window:

  • prefetches only what isn't already cache-hit
  • retains source over-read only when paired with a live connection (drops it otherwise)
  • knows which bytes are user-requested vs cache-fill, so it doesn't account cache-fill bytes against the user's read budget

Per-object cache identity, not per-pipeline. DiskCacheProvider derives FileCacheKey / FileCacheOriginInfo per StoredObject (etag-keyed caching for StorageObjectStorageSource, segment-key-type classification for Data vs System queues). The old single-key-per-executor approach silently shared cache identity across all objects in a gather-mode read.

Foundation for new cache backends. New caches (vector cache, distributed cache, etc.) implement one interface (ICacheProvider::lookup) and plug into the chain without touching read paths.

Observability

  • system.remote_read_connections — open connections (path, query id, position, elapsed)
  • system.reader_executor_log — per-executor stats (cache hit/miss/populate bytes, prefetch hits/cancellations, source request count, decrypt time, …)
  • ProfileEvents: ReaderExecutor* counters and microsecond timers; LiveSourceBuffer* counters
  • HistogramMetrics: cache get/populate/source-read/prefetch-wait latencies

Server settings

  • reader_executor_prefetch_pool_size (default 8) — shared prefetch thread pool size
  • reader_executor_prefetch_queue_size (default 0 = pool_size * 10) — pool queue depth
  • max_remote_read_connections (default 1000) — live-buffer slot count, reload-able

Internals

  • Rope / RopeNode / OwnedRopeBuffer — refcounted buffer chains with mixed provenance (owned, page-cache-pinned), built-in cursor (peek/advance/tryRewind), sorted-on-insert nodes, disjoint-interval coverage tracking.
  • ICacheProvider / ICacheHandle — uniform cache API (lookup → status/get/put). Two implementations: PageCacheProvider (file-level, zero-copy), DiskCacheProvider (per-object, FileCache-backed).
  • ISourceReader — stateless range-read interface. LocalSourceReader, ObjectStorageSourceReader, BufferSourceReader (adapter for backup / BufferCreator).
  • OffsetMap — replaces ReadBufferFromRemoteFSGather's gather logic with a logical-to-(object, object-offset) lookup; supports single-object unknown-size (S3 HEAD without Content-Length → streams to EOF).
  • ReaderExecutor — owns: position, offset map, cache chain, prefetch handle, live buffer, over-read tail, source-buffer slot.
  • PipelineReadBuffer — thin ReadBufferFromFileBase exposing the executor through BufferBase::set/next/seek (so legacy callers see no change).
  • PrefetchThreadPool — shared bounded pool, returns nullptr on overflow (sync fallback), task cancellation is race-free.

Testing

~3,700 lines of tests (≈46% of the PR), 130+ gtest cases:

  • gtest_rope — 31 cases (append / peek / advance / tryRewind / slice / copyTo / coverage queries / shift).
  • gtest_reader_executor — 41+ cases (cache-chain combinations, per-object lookup, prefetch races, EOF release, slot leak fixes, unknown-size streaming, …).
  • gtest_filecache — 22 cases including over-read / bypass-mode / partial coverage scenarios.
  • gtest_read_pipeline — known/unknown size selection through the executor.
  • Functional: 04262_reader_executor_observability plus opt-outs (SET use_reader_executor = 0) on tests that intentionally exercise the legacy path.

Load tesing

the simpliest cache API"

┌─────────┬────────────┬──────────┬───────┬────────────────────────┐
│ regime  │ legacy QPS │ exec QPS │ QPS Δ │ per-query-type latency │
├─────────┼────────────┼──────────┼───────┼────────────────────────┤
│ cold    │      0.252 │    0.320 │  +27% │       −12% (faster) ✅ │
├─────────┼────────────┼──────────┼───────┼────────────────────────┤
│ warm    │      0.407 │    0.252 │ −38%¹ │       +27% (slower) ❌ │
├─────────┼────────────┼──────────┼───────┼────────────────────────┤
│ partial │      0.247 │    0.251 │   +2% │       +37% (slower) ❌ │
└─────────┴────────────┴──────────┴───────┴────────────────────────┘

changed cache API (stream aware):

Full clean verdict — 9e9abe9d (26.6.1.1)

┌─────────────┬─────────────────────────┬───────────────────────────────────────────────────┐
│   Regime    │   Executor vs legacy    │                       Note                        │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ✅ cold     │ −15% (win)              │ 0 slot fail; coalesced reads                      │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ✅ warm     │ −8% (win)               │ 0 slot fail; cache reads healthy                  │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ⚠️ partial  │ +7%                     │ S3-source over-read / churn on uncached half      │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ❌ populate │ +17% (worst)            │ 50% connection reuse on cache-fill path — churn   │
├─────────────┼─────────────────────────┼───────────────────────────────────────────────────┤
│ ✅ stress   │ 27 ok / 1 OOM vs 4 / 11 │ controlled degradation = net win (bounded, alive) │
└─────────────┴─────────────────────────┴───────────────────────────────────────────────────┘


stress details:

┌───────────────────────┬───────────┬──────────┐
│                       │  legacy   │ executor │
├───────────────────────┼───────────┼──────────┤
│ queries finished (ok) │         4 │       27 │
├───────────────────────┼───────────┼──────────┤
│ failed                │        11 │        1 │
├───────────────────────┼───────────┼──────────┤
│ OOM                   │        11 │        1 │
├───────────────────────┼───────────┼──────────┤
│ connection reuse      │     86.8% │    98.8% │
├───────────────────────┼───────────┼──────────┤
│ connection resets     │   233,618 │   38,581 │
├───────────────────────┼───────────┼──────────┤
│ peak TCP recv-buffer  │ 4,152 MiB │  813 MiB │
├───────────────────────┼───────────┼──────────┤
│ TCP sockets           │    17,333 │    1,548 │
├───────────────────────┼───────────┼──────────┤
│ cgroup mem            │  24.4 GiB │ 24.6 GiB │
└───────────────────────┴───────────┴──────────┘

sha -- 494fcfe8825c

┌─────────┬──────────────┬──────────────┬──────────────┐
│ regime  │ baseline QPS │ executor QPS │  exec/base   │
├─────────┼──────────────┼──────────────┼──────────────┤
│ cold    │        0.207 │        0.292 │ 1.41× (+41%) │
├─────────┼──────────────┼──────────────┼──────────────┤
│ warm    │        0.330 │        0.307 │  0.93× (−7%) │
├─────────┼──────────────┼──────────────┼──────────────┤
│ partial │        0.375 │        0.420 │ 1.12× (+12%) │
└─────────┴──────────────┴──────────────┴──────────────┘

That is good outcome. I need to optimize the work with caches. Need to stream data from available cache segments without any additional costs.

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add experimental ReaderExecutor for pipeline-based read orchestration with unified cache API (PageCacheProvider, DiskCacheProvider), connection reuse for remote reads, shared prefetch pool, and system.remote_read_connections / system.reader_executor_log observability tables. Enable with SET use_reader_executor = 1.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@CheSema CheSema force-pushed the reader-executor branch 4 times, most recently from 77b9b0a to fb8ddfb Compare April 30, 2026 01:32
@CheSema CheSema changed the base branch from read-pipeline-redesign to master April 30, 2026 02:57
@clickhouse-gh

clickhouse-gh Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Apr 30, 2026
Comment thread src/IO/ReaderExecutor.cpp Outdated
Comment thread src/IO/ReaderExecutor.h Outdated
Comment thread src/IO/ReaderExecutor.cpp Outdated
Comment thread src/IO/Rope.cpp Outdated
@CheSema CheSema force-pushed the reader-executor branch 2 times, most recently from 8e5f067 to f9d35b8 Compare April 30, 2026 04:47
Comment thread src/IO/ReadPipeline.cpp Outdated
Comment thread src/IO/LocalSourceReader.cpp Outdated
Comment thread src/IO/ReadPipeline.cpp Outdated
Comment thread src/IO/PipelineReadBuffer.cpp Outdated
@CheSema CheSema changed the base branch from master to read-pipeline-redesign April 30, 2026 05:31
@CheSema CheSema changed the base branch from read-pipeline-redesign to master April 30, 2026 05:34
Comment thread src/IO/ReaderExecutor.cpp Outdated
@CheSema CheSema force-pushed the reader-executor branch 7 times, most recently from cd5c38a to fbb1388 Compare April 30, 2026 06:39
CheSema and others added 15 commits June 29, 2026 15:46
…ions field

`reader_executor_decrypt_ahead` is experimental, defaults off, and has no
production reach: only one gtest exercises the worker-decrypt path, and it sets
`ReaderExecutor::Options::decrypt_ahead` directly. Drop the user-facing setting and
its plumbing (Settings.cpp, the SettingsChangesHistory 26.7 block, Context.cpp,
ReadSettings.h, ReadPipeline.cpp); keep the `Options` field and the gated
worker-decrypt path, so the behavior stays available to tests and is unchanged in
production (decryption still happens at the serve boundary).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- `ChainedBuffers::extract` was `slice()` plus a `chassert(covers())` that can
  never fire (all 5 callers guard `covers()` first); delete it and call `slice()`.
- `coveredBytes` had a single production use - the overlap guard in `copyTo`;
  inline that debug-gated unique-byte sum and drop the public method.
- `planResidencyView` aligned its miss ranges in three passes (collect, sort,
  merge), but the forward segment walk already yields them ascending, so the sort
  was redundant; fuse the alignment into the single merge pass.
- Extract `segmentCommittedEnd` for the committed-end ternary shared by
  `readOverlappingSegments` and `DiskCacheReader::readable`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Behavior-neutral simplifications, verified against the full executor gtest suite:
- `RetrieveStatus::machine` duplicated the single global in-flight `machine`;
  replace it with `machineFor(ri)` (`machine && machine->retrieve_index == ri`),
  which also removes the stale-pointer clear in `cancelMachine`.
- `FetchMachine::requested_range` was always `physical_window` shifted by
  `data_start_offset`; drop the field and derive it at the read sites.
- `FetchMachine::sibling_led` was a vector the worker filled but collect only
  tested for emptiness; replace it with a `bool contended` set by the worker.
- `effectivePrefetchWindowSize` was only ever compared `== 0`; replace it with a
  `prefetchEnabled(level)` predicate.
- Inline the single-caller helpers `LongConnection::isComplete` and
  `maybeDrainLongTail`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- `ContinuityTracker::Options::ewma_alpha` was never configured (always the 0.5
  default); inline 0.5 in `closeRun` and drop the field.
- Comment-only sweep of references to removed/renamed entities: "live connection"
  -> "long connection", the removed `requested_range` / `coverWindow` / `PrefetchJob` /
  `serveCacheTiersCollectingMisses` / `fill_ahead_lead_ram` asides, `job->pressure_level`
  -> `pressure_snapshot`, the retired `read_extent_end` plan-sizing claim, and the
  `reader_executor_block_size` doc (it sizes ChainedBuffers nodes, not the read window).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`reader_executor_decrypt_ahead` (already demoted from a setting) decrypted
prefetched bytes on the worker thread ahead of serving - experimental, never
enabled, never measured. Drop it entirely so decryption always happens at the
serve boundary (`decryptWindow`), the only behavior that ever shipped. Removes the
`Options`/member toggle, `decryptFetchedAhead`, `FetchMachine::plaintext_fetched`,
`RetrieveStatus::ready_bytes_is_plaintext`, `served_window_is_plaintext`, the
`is_plaintext` plumbing through `tryCollectMachine`/collect/serve, and the gtest
that exercised it. Decrypt-on-serve coverage is unchanged (`DecryptsMultiNodeWindow`
etc.).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ptively

The readNextWindow boundary block bridged two concerns in one tangle. Separate them:
- collect the in-flight machine at the boundary (its own focused comment), then
- replan only once the plan is fully consumed - cursor fell before plan_start, or
  reached plan_end and the plan doesn't already run to EOF.

Drop the pre-emptive `position_phys + window_size > plan_end` look-ahead: the plan
is now used to its end before a rebuild. Safe because the boundary collect runs
before the serve in the same call, so the cursor never serves a consumed step at
plan_end (no premature interior EOF). The replan + next prefetch launch now happen
at the boundary rather than one window early.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`to_read` (the per-call read ceiling) was threaded through interpretStep and the
six serve functions. Replace the parameter with a private `readCeiling()` helper
that recomputes the same value on demand - `position` is stable during a serve, so
the value is identical. Behavior-neutral; cleans up seven signatures. The helper
deliberately omits the `reached_eof` check, so an in-flight machine's final bytes
are still collected after EOF latches (the `atEnd() && !machine` guard owns the
no-machine EOF). `serveCacheBlock`'s own byte-budget parameter is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Cursor/finishWindow

readNextWindow gated EOF as `atEnd() && !machine`, then fell through to a
`readCeiling() > 0`-gated replan block and a post-serve tail - so the in-flight
machine's final-window drain rode the general path and the gate had to ignore
`reached_eof`.

Hoist EOF to an explicit `if (atEnd())` branch: drain an in-flight machine's
final window, else report EOF. The rest is then strictly `!atEnd()`, so the
replan gate simplifies to `!atExtent()` (new helper) - the read extent, not the
file end. Factor the collect-then-replan block into `prepareCursor` and the
post-serve tail into `finishWindow`, shared by the drain and normal paths.

Behavior-neutral: for a reachable `atEnd() && machine` (unknown-size short read)
`readCeiling() > 0` always held, so the drain's unconditional `prepareCursor`
matches the old gated block; and under `!atEnd()`, `!atExtent()` equals the old
`readCeiling() > 0`. Covered by `UnknownSizePrefetchedFinalBytesAreServed`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…handleExtentOrReplan

The serve path had three functions making the same "plan stale -> reschedule"
decision: prepareCursor (proactive, before every serve), interpretStep's guard
(detect + route), and handleExtentOrReplan (reactive replan + retry). Since
readNextWindow's EOF restructure runs prepareCursor before every interpretStep,
the reactive replan/retry is redundant - the plan is already fresh when
interpretStep runs, so handleExtentOrReplan only ever hits its early empty return.

Verified the replan/retry path is unreachable: a chassert probe stayed clean across
the unit grid AND the multi-thread integration that caught the original
premature-EOF bug, and a 5-way adversarial reachability analysis agreed - whenever
interpretStep's guard fires the executor is at atEnd() or readCeiling()==0 (the
cursor is structurally bounded by findStepContaining's clamp, and empty steps ==
empty plan span == atEnd/atExtent).

- Delete handleExtentOrReplan; interpretStep returns {} directly when there is
  nothing to serve.
- prepareCursor no-ops at the extent (atExtent), so readNextWindow calls it
  unconditionally - one scheduler call site, no !atExtent gate.

Behavior-neutral; grid 145 + integration 4/4 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
serveCacheBlock had a single caller - serveHitStep, a one-line wrapper that
computed the byte budget (min of the read ceiling and the step's remaining span)
and passed it down. Fold the resident-run streaming body up into serveHitStep
(computing the budget from the step internally) and drop serveCacheBlock and its
generic to_read parameter. interpretStep still dispatches the symmetric pair
serveHitStep(step) / serveRetrieveStep(step).

Behavior-neutral; grid 145 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tchMachineRunner

The runner that drives FetchMachine machines was a single concrete class bound to
the prefetch pool. Extract an abstract `IFetchMachineRunner` (the four-verb
protocol: schedule / tryCancelQueued / requestInterrupt / waitReleased) and rename
the pool-backed implementation to `PoolFetchMachineRunner`; the executor's `runner`
member is now `unique_ptr<IFetchMachineRunner>`. This is the seam for a future
inline LocalFetchMachineRunner that runs the machine synchronously on the serve
thread, so foreground and background can share one FetchMachine flow.

Pure type-level refactor, behavior-neutral; grid 166 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The `run_step` closure was inlined in `launchRetrieve`. Factor it into a named
`makeFetchStep(FetchMachine&)` helper so the step body is independent of the
scheduling decision - a future inline runner can reuse the identical closure.

Behavior-identical extraction; grid 166 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `LocalFetchMachineRunner`: an `IFetchMachineRunner` whose `schedule` runs the
machine's step synchronously on the calling thread (mirroring the pool wrapper's
state writes + exception-into-`failure` capture) and leaves `current_step` null.
Not constructed by the executor yet - this is the inline driver the foreground
will use so it can share the FetchMachine flow without the pool.

Because an inline machine never carries a `JobHandle`, `waitReleased` /
`tryCancelQueued` and the abandoned-machine drain all no-op on it - structurally
avoiding the double-join a pool future would risk. Unit tests cover inline-run +
terminal state + the step-throw-lands-in-`failure` hazard + the no-handle verbs.

grid 170 green (4 new tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…chMachine (flag-gated)

Add `reader_executor_unified_foreground` (experimental, default OFF). When ON, the
synchronous foreground populatable serve runs the SAME FetchMachine flow as the
prefetch - built by `launchMachineForWindow` and run INLINE on the serve thread by a
`LocalFetchMachineRunner` - then collected immediately, so the serve reads the
just-committed cells (cache-as-buffer) instead of the bespoke sync read+bank. The
legacy sync path stays as the fallback for sibling-led contention and bypass gaps.

- New `launchMachineForWindow(ri, window, runner)` factored from `launchRetrieve`,
  shared by the pool (read-ahead) and the inline (foreground) runners.
- New `local_runner` member; `collectRunner()` returns `runner ? *runner :
  *local_runner`, so the collect verbs work with or without a pool (a settled inline
  machine no-ops through either - it carries no JobHandle).
- The inline launch is guarded by `!machine` (slot free), never clobbering an
  in-flight read-ahead machine, and is collected within the same serve call.

Default OFF -> behavior-neutral. Validated: gtest 172 (flag OFF identical; flag ON
no-pool serves+populates from cells with no sync fallback; flag ON + pool coexists
with read-ahead); a 5-way adversarial reachability pass (long-conn, machine-slot,
contended, teardown, collect-runner - all clean); and integration on real S3 - the
flag-OFF matrix unchanged + a new multi-thread `executor==legacy` flag-ON repro.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The unified-foreground inline FetchMachine (a `LocalFetchMachineRunner`,
no prefetch pool) now fetches only the contiguous LED PREFIX up to the
first sibling-led cache segment, serving it as a short window. The caller's
next read resolves the sibling boundary through the existing sync fallback
(elect/wait), instead of this thread blocking to fetch a led run past a
sibling-led hole.

`coordinatedPrefetch` bounds the fetch at the first sibling offset when
`inline_serve` is set; `serveRetrievePopulatable` then narrows the serve
window to the committed prefix via the new `committedCellPrefixEnd`, so the
existing cell-serve covers it fully. The prefix is this executor's own led
work, so it is in this writer's per-writer `committed()` set - there is no
dead work: a byte a sibling downloaded is never in `committed()`, which is
exactly what bounds the prefix.

The pool read-ahead path is unchanged (`inline_serve` is false, so the
fetch bound stays at the window end). Still gated behind
`reader_executor_unified_foreground` (default off).

Adds `UnifiedForegroundStopsAtFirstSiblingLedSegment`, a real
`DiskCacheWriter` contention test - the downloader-blind `MockCacheWriter`
masked an earlier dead-work defect in this area.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CheSema and others added 3 commits June 30, 2026 18:15
…g-gated)

The unified-foreground path now runs the SCHEDULED fills inline on the serve
thread (the same `FetchMachine` flow as the background prefetch, driven by a
`LocalFetchMachineRunner`) instead of inventing a per-cursor fetch:

- `serveRetrievePopulatable` drains the scheduled fill from the cache cell's
  append-only frontier (`committedCellPrefixEnd(fetchWindowAt(window))`) until
  the cell covers the cursor, then serves from the committed cells. Fetching
  from the cell frontier (not the cursor) fills a mid-cell read from the cell
  floor, so the append-only cell write commits instead of being skipped and
  re-fetched; the frontier also advances past a down-filled embedded hit, so a
  gap after the hit resumes there rather than re-fetching it from source.
- `serveHitStep` down-promotes a served upper-tier hit into the lower cells the
  schedule marks for cross-cache fill (a `Fill(UpperCacheRead)`), using the
  bytes already in hand - so a lower segment completes across an embedded
  faster-tier hit without a remote over-read.
- The cell-alignment head fetched below the cursor is recorded as over-read.
- A `readBigAt` transient takes no fill pin (mirrors `finalizeAssembledWindow`'s
  `is_transient` guard), matching the legacy non-executor path: populate, never
  pin.

Gated behind `reader_executor_unified_foreground` (default off). The remaining
unified-path work - the inline fill fetching the whole cell-aligned gap run in
one source read while skipping upper-resident parts (the legacy
`readPhysicalWindow` fetch folded in) - is tracked for a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The unified-foreground inline fill now fetches like the legacy `readPhysicalWindow`
did - the whole cell-aligned miss run in one source read, skipping embedded
faster-tier hits - and attributes the served bytes the same way. Four coupled
changes (all behind `reader_executor_unified_foreground`, still default off):

- `serveRetrievePopulatable` launches the inline machine over the whole cell-
  aligned run `[base, gapEndWithin(base, aligned.end()))` instead of just
  `[base, cursor.end())`. One cold cell is now ONE source read and the cursor's
  later windows in the cell are served from it - matching the legacy
  `fetchWindowAt`. The cell-alignment head and the run tail past the cursor
  window are recorded as `overread_pending` (the inline collect writes via
  `pushChainToWriters`, which - unlike the sync `assembleAndWriteBack` - does not
  account over-read), keyed off the committed frontier so a short fill records
  only what landed.
- `CoverageMap::gapEndWithin(from, limit)` caps that launch at the first embedded
  resident byte of any tier, so an embedded faster-tier hit fills the lower cell
  DOWN (`downFillScheduledLower`) instead of being re-fetched from the source.
  `gapEnd` itself is unsuitable here: it is bounded by `plan_end`, which the
  read-extent advances one window at a time, so it would collapse the run to the
  cursor window.
- `coordinatedPrefetch` fetches each led run in ONE `fetchGapsFromSource` when the
  machine serves inline (no concurrent reader to hand a growing prefix to) - one
  GET on the stateless arm instead of one per window. The background run-ahead
  keeps its per-tile progressive commit.
- The cell-serve credits a cache hit only for bytes already committed BEFORE this
  serve started filling (a prior window's prefill or a worker's run-ahead). Bytes
  this serve fetches from the source transit the cell but are counted as
  `BytesFromSource`, not a cache hit - otherwise a cold all-miss reported itself
  as served entirely from cache (`recreditCommittedPrefixes` /
  `serveWindowFromCells` gained a `cache_credit` mask; the legacy callers pass
  none and are unchanged).

`EvictionInChainRefetchesEvictedCells` now exercises the unified path explicitly
and asserts its re-fetch count (9, vs the legacy path's 7): under the eviction
flood the held long connection streams the first cold run but re-opens a GET per
window for the later flood-evicted segments. Correctness (no missing bytes) is
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	src/Disks/DiskObjectStorage/ObjectStorages/HDFS/HDFSObjectStorage.cpp
#	src/Disks/DiskObjectStorage/ObjectStorages/S3/S3ObjectStorage.cpp
#	src/Disks/DiskObjectStorage/ObjectStorages/Web/WebObjectStorage.cpp
#	src/Disks/IO/ReadBufferFromWebServer.cpp
#	src/IO/ReadPipeline.cpp
#	src/IO/ReadPipeline.h
#	src/Storages/ObjectStorage/StorageObjectStorageSource.cpp
Comment thread src/IO/ReaderExecutor.cpp Outdated
st.ready_bytes.append(std::move(collected));
}
st.fetched += window;
st.phase = RetrievePhase::Ready;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This collect path can leave a fully-fetched cache-fill retrieve in Ready forever. tryCollectMachine runs schedulePutStep/reapPutMachine before st.fetched is advanced here, so the reapPutMachine check sees the old frontier and does not mark the retrieve Done; then this assignment overwrites the phase with Ready. depsSatisfied only accepts Done, so a later retrieve that writes the same append-only cache cell can remain blocked even though this prefetch has already fetched and committed the predecessor range.

For example, with retrieve 0 filling [0, 128) and retrieve 1 depending on it for [128, 256), collecting retrieve 0's final window writes the cell, but the dependency still sees retrieve 0 as Ready and never launches retrieve 1. Please advance the fetched frontier before reaping, or set Done after this line whenever st.fetched >= read_plan.schedule.retrieves[ri].range.size.

CheSema and others added 3 commits July 1, 2026 09:21
Flip `reader_executor_unified_foreground` on by default: the `ReaderExecutor`'s
synchronous foreground serve now runs the scheduled fills inline through the same
`FetchMachine` flow as the background prefetch (a `LocalFetchMachineRunner`),
instead of the bespoke synchronous read-and-assemble path. Set the setting to 0
to fall back to the legacy synchronous path (kept for one release as a safety
valve before it is removed).

The inline path reached parity with the legacy `readPhysicalWindow` in the
preceding change (whole cell-aligned fetch, embedded-hit skip, over-read and
cache-hit attribution), so this flip is behaviour-equivalent on the metric grid
and the schedule KPIs.

Flipped in the four plumbing sites (Settings declaration + history, the
`ReadSettings` member, and the `ReaderExecutor::Options` default the unit grid
uses). Validated: the no-pool gtest grid (`PlanSchedule*:*ReaderExecutor*`) green
with the default on, and the real-`FileCache`/S3 `test_reader_executor_metric`
integration (the main matrix now exercises the unified path; the legacy arm of
`test_repro_unified_foreground_matches_legacy` sets the setting off explicitly).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With the unified inline foreground the default, the bespoke synchronous
read-and-assemble path is dead weight: delete `serveRetrieveForeground`,
`syncGapRead`, `readWindowLogical`, `readPhysicalWindow`, `fetchAndBackfillGaps`,
and `serveResidentFromPlan`. The three call sites that used it are rerouted:

- The two `serveRetrieveForeground` uses (a not-prefetched bypass gap, and the
  populatable contended fallback where the cursor sits on a segment a sibling
  executor leads) collapse into one `serveWindowInline`. It fills the cell-aligned
  window from the source (`fetchWindowAt`, opening a long connection per object
  piece so the reads coalesce across windows), but first WAITS on a disk cell a
  sibling is downloading and reads the sibling's committed bytes - deduping the
  concurrent cold populate instead of re-fetching it - and pins the filled segment
  via `finalizeAssembledWindow`, exactly as the deleted path did. A bypass gap has
  no cell, so it is a plain source read banked in `ready_bytes`.
- `initDecryption`'s header read goes through a direct one-shot `fetchGapsFromSource`
  (no plan, no long connection) instead of `readPhysicalWindow` with an empty
  geometry.

`assembleAndWriteBack`, `finalizeAssembledWindow`, `recreditCommittedPrefixes`,
`serveLateHits`, `readHitFromView`, and `pushAssembledToWriteBuffers` stay - they
are shared with the worker collect path (`collectInFlightInto`).

`UnifiedForegroundStopsAtFirstSiblingLedSegment` now asserts the sibling-led
segment at the cursor is resolved from the sibling's committed cell (a cache read),
not a synchronous source read (`syncReadMicros == 0`).

Validated: the no-pool gtest grid `PlanSchedule*:*ReaderExecutor*` green (173),
and the real-`FileCache`/S3 `test_reader_executor_metric` integration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The unified inline foreground is now the ReaderExecutor's default and, since the
legacy synchronous stack was deleted, its ONLY foreground path - so the
`reader_executor_unified_foreground` setting is vestigial and misleading: setting
it to 0 no longer restores a legacy path (there is none), it only disabled the
inline drain-loop and the cross-cache down-fill, leaving a degraded unified path.

Remove it: the SQL setting (`Settings.cpp` DECLARE + `SettingsChangesHistory` -
added and removed within the same unreleased version - + `Context.cpp` + the
`ReadSettings` member + the `ReadPipeline` propagation) and the
`ReaderExecutor::Options::unified_foreground` field. The two `if (unified_foreground)`
sites - the inline drain-loop in `serveRetrievePopulatable` and the cross-cache
down-fill in `serveHitStep` - become unconditional.

Tests: drop the now-redundant `opts.unified_foreground = true` from the gtests
(the path is always on); remove `test_repro_unified_foreground_matches_legacy`
(its "flag on vs off" premise is moot - `test_repro_legacy_vs_executor` already
covers executor-vs-non-executor equivalence).

Validated: gtest grid `PlanSchedule*:*ReaderExecutor*` green (173), real
`FileCache`/S3 `test_reader_executor_metric` integration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread src/IO/ReadPipeline.cpp

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tryBuildReaderExecutor skips the page cache when any StoredObject has UnknownSize, but it still installs every filesystem-cache provider here. The legacy path gates filesystem cache with canUseFilesystemCache(object), because FileCache needs a real object size for segment bounds and tail completion. With an S3/Web object that has no Content-Length, this executor path passes StoredObject::UnknownSize (UInt64::max) through DiskCacheProvider::openWriteBuffers into getOrSet, so the cache treats an EOF-streaming object as an effectively infinite file and can leave/download segments past the real EOF instead of taking the plain stream-until-EOF path. Please skip DiskCacheProvider when any mapped object is unknown-size (or fall back to the legacy no-cache behavior for those objects), matching the existing canUseFilesystemCache guard.

CheSema and others added 5 commits July 3, 2026 15:19
…t credit mask)

The unified serve reads every delivered byte OUT of a committed cell - that is a
real cache read (`w.writer->read()`), the cache being the buffer. The consolidated
fetch had added a `cache_credit` mask that suppressed counting a cell-read as a
cache hit when this same serve had just fetched the byte from the source, so a cold
miss reported 0 bytes from cache - the legacy DISJOINT attribution, inconsistent
with the cache-as-buffer model (the legacy path served source bytes straight from
the assembled result and never re-read the cell; the unified path always routes
through the cell).

Drop the mask. Now:
- `BytesFromSource` = bytes fetched from the origin into cells.
- `BytesFrom{Fs,Page}Cache` = bytes read out of cells to serve the client.
A cold miss shows BOTH (source filled the cell, the serve read it back out); the
two count distinct operations and legitimately overlap.

Removes `cacheCredit` + the `cache_credit` parameter on `recreditCommittedPrefixes`
/ `serveWindowFromCells` + the `pre_committed` snapshot in `serveRetrievePopulatable`.

Tests: the schedule's KPI oracle `predictKpi` now predicts `served_from_cache` as
ALL delivered User bytes (resident hits AND filled misses), not just pre-resident
ones - so `PlanScheduleValidation` (ColdAllMiss / TwoTierDisjointHits /
ResidentIsland) and `SchedulePredictsByteKpis` compare actual-vs-oracle on the new
semantics with no literal changes. `PartialFsHitTailFromSource` re-baselined:
`hit_delta` is the whole file (prefix hit + tail served from the filled cell), the
tail still counted once as source.

The cost formula and the integration invariants are unaffected (they use
`BytesFromSource` and the cache-GET COUNT, not the cache byte counters).

Validated: gtest grid `PlanSchedule*:*ReaderExecutor*` green (173), real
`FileCache`/S3 `test_reader_executor_metric` integration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…odel M1)

First stage of the Plan/Schedule/Execute/Display model
(`tmp/reader-executor-model/DESIGN.md`): the schedule carries everything needed
to RUN a fetch job, so the inline foreground fill consumes the schedule verbatim
instead of re-deriving the fetch geometry per serve call.

- `PlanSchedule::Retrieve` gains `fetch_runs` - the sub-ranges of `range` to read
  from the source, split at EVERY embedded resident region (served / filled down
  from its tier, never scheduled as a source read). `buildSchedule` computes them
  with the walk `fillRegion` was already doing before `mergeSorted` discarded the
  per-gap structure. Whether the executor reads THROUGH a resident hole at run
  time (its down-fill was skipped by the append-only cell) is a display-state
  decision - the drain-loop's frontier-in-hole rule - not a schedule property.
- `Retrieve` also gains `fetch_head_grid` / `fetch_tail_grid`: the coarsest
  cache-cell alignment across the plan's tiers (every Remote byte is a miss on
  every tier, so the coarsest extension applies). The `into` cell edges cannot
  serve this purpose - the providers merge adjacent aligned misses, so those
  "cells" are multi-segment runs.
- The inline fill's drain-loop becomes a pure schedule consumer: grid-floor the
  window's first missing display byte (walking ACROSS runs, so a before-slack run
  a seek jumped over is still fetched and the append-only cell fills whole from
  its floor), take the display frontier from there, and fetch to the first run
  end past it - reading through an inter-run resident hole exactly when the
  frontier sits in one, as the legacy assembler did - tail-ceiled to the grid.
  `fetchWindowAt` / `gapEndWithin` no longer appear on the serve path.

Behavior-preserving: the pieces the loop launches match the previous geometry
re-derivation (metric grid R/I/O unchanged, no re-baselining). An adversarial
review pass killed the first cut (which baked `min_bytes_for_seek` bridging into
the runs - reading through hits the old code clipped at); the landed rule -
static resident-split jobs + dynamic frontier read-through - matches the old
`gapEndWithin` semantics. Known accepted divergence: with a refused-write /
contended cell state the new frontier walk starts at the merged-range grid floor
(the old per-tier miss-cell clamp could start later), so it can RESUME a refused
cell from its committed end - same data, different piece shape, only reachable
under write refusal. The background launch path is untouched (its frontier
unification is the next stage).

New `PlanScheduleRetrieves` tests: the runs split at an embedded resident region
regardless of `min_bytes_for_seek`; a cold gap's after-slack past `plan_end`
extends one maximal run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Second stage of the Plan/Schedule/Execute/Display model
(`tmp/reader-executor-model/DESIGN.md`): the DISPLAY - the committed cell state -
becomes the data-progress surface for populatable jobs, and the two wait sites
collapse into one.

- New `jobFrontier(ri)`: the first byte of a job's `fetch_runs` not yet committed
  to the cells - the job's DATA progress, derived from the same display state the
  serve reads, so a piece stopped anywhere (background or inline) is continued by
  whoever runs next with no handover.
- New `launchProgress(ri)` = `jobFrontier` advanced past bytes already ATTEMPTED
  (the `RetrieveStatus::fetched` high-water, now END-based and cursor-snapped):
  the background launch POLICY. The two must differ: a refused cell write (cache
  full / download budget) or a segment a SIBLING executor downloaded never enters
  this executor's per-writer committed set, so a raw display frontier pins there
  and the launcher would re-GET the same lead every serve window, never retire
  the job, and starve later jobs of read-ahead (an adversarial review pass
  constructed both loops - the old counter's drift was quietly load-bearing as
  this heal). `launchRetrieve`, the launch scan, and the Ready->Done transition
  use `launchProgress`; the serve never reads it. The old counter's accumulation
  drift is gone: the high-water advances by attempted-window END and cursor
  position, both well-defined for inline (cell-floor-based) machines.
- New `waitOnDisplay(window, own_worker_only, ...)`: the one "a live writer is
  filling this cell - wait on its download frontier and read it" primitive.
  `serveWindowFromCells`'s read-behind wait (own worker's targets) and
  `serveWindowInline`'s contended sibling wait (any disk writer) were the same
  loop split by who-writes; they now differ only by the policy argument. Page
  cells are never waited on (filled by promotion, no downloader; the old
  worker-path iteration over them could never produce coverage).

Also corrects a wrong claim from M1's design notes: within one plan the
per-writer committed set is MONOTONE (the writer remembers evicted bytes), so
eviction is healed at the next re-plan, not by the frontier moving back.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l M3)

Third stage of the Plan/Schedule/Execute/Display model
(`tmp/reader-executor-model/DESIGN.md`): every fill the executor performs is now
a SCHEDULED job, and the fill kinds whose input is the just-served bytes run
through one source-branching executor.

- `buildSchedule` now emits a `HandedChain` promote job for every User GAP range
  (previously only for resident ranges): `into` = the user-overlapping miss cells
  `writeTargetsFor` deliberately left out of the Remote fetch (the faster tiers -
  the fetch fills only the bottom tier and same-tier slower layers). The
  promote-of-fetched work the serve was doing ad hoc is now visible schedule
  data. The resident promote gains the `[CF-promote]` guard as data: a same-tier
  faster LAYER is never a promote target (it is a fetch target) - previously that
  rule lived only in `maybePromote`'s tier-equality break.
- New `Retrieve::ahead_eligible`: the fg/bg partition as schedule data. `Remote`
  fills depend on nothing but the source and may run ahead; the handed kinds
  (`UpperCacheRead`, `HandedChain`) take the serve's output as input, so they are
  serve-front only. `maybeLaunchAhead` filters on it.
- New `runHandedFills(served_range, bytes)`: the ONE executor of the handed jobs,
  replacing three ad-hoc functions - `downFillScheduledLower` (its schedule-driven
  body, generalized to both sources), `promoteFetchedToUpper` (which re-READ the
  bottom cell to find the bytes the serve already had in hand), and `maybePromote`
  (whose tier-walk the scheduled `into` cells now encode). Every serve site hands
  the served bytes to the same call: the hit serve (up-promote + cross-cache
  down-fill in one pass), the populatable cell-serve, the inline fallback, and
  the banked bypass serve. The last one matters: the schedule can emit a promote
  over an `into`-empty gap (a whole-cell run the fetch parts do not fully span is
  not a FETCH target, yet the served bytes may complete its blocks), and a
  planned job must run regardless of which path served the bytes - an adversarial
  review pass caught the bank path skipping it, which would have let prefetch
  timing decide whether a planned job ran.
  Handing the served bytes also promotes where the old committed-only walk could
  not: when the bottom-cell write was refused (cache full) or the segment is
  sibling-owned, the faster tiers still fill from the bytes in hand - the stated
  `HandedChain` semantics, and strictly less re-fetching.
- `isScheduledFillTarget` (the MACHINE's fill-target predicate) is constrained to
  `Remote` jobs' cells: without this the new promote jobs would have made the
  fetch write the faster-tier cells directly - exactly what `writeTargetsFor`
  forbids (the pc fill trails the serve cursor; it does not ride the fetch lead).

New `PlanScheduleRetrieves.PromoteNeverCrossesSameTier` pins `[CF-promote]` as
schedule data (the schedule twin of `SlowFsHitIsNotPromotedToFastFs`);
`DesignWorkedExample` / `SlackNotPromotedToFasterTier` now assert the visible
promote job and `ahead_eligible`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (model M4)

Final stage of the Plan/Schedule/Execute/Display model
(`tmp/reader-executor-model/DESIGN.md`): the serve paths share one window and
consume only schedule data; the per-job sidecar keeps only what cannot be
derived.

- ONE step-bounded serve window, computed once in `serveRetrieveStep` for every
  path (the populatable cell-serve, the bypass bank, and the inline fallback);
  `serveRetrievePopulatable` / `serveWindowInline` take it as a parameter. The
  fallback receives the ORIGINAL window (not the narrowed committed prefix), as
  before - the raced/short path re-serves it whole.
- The LAST geometry call leaves the serve path: `serveWindowInline`'s
  `fetchWindowAt` is replaced by the job's fetch grids clamped into its range -
  the same schedule data the drain-loop uses (byte-equal for gap windows,
  identity for a bypass job with grids of 1). Geometry is now consulted only at
  plan/schedule build, as the model requires.
- `RetrievePhase` is gone. The enum was the third progress tracker: "in flight"
  is `machineFor(ri)`, the serve gates on display coverage, and only `Done` was
  ever read (the `depsSatisfied` append-only ordering gate). `RetrieveStatus`
  now holds exactly the underivable rest: `done`, the launch high-water, and
  the bypass bank. The test inspector exposes `retrieveDone` and a job-relative
  `retrieveLaunchProgress` instead of raw phases.

The fetch grids are tightened to the plan's POPULATING tiers (an adversarial
review pass): a bypass-mode cache (`read_from_*_cache_if_exists_otherwise_bypass_cache`)
reports its coarse alignment but schedules NO fill cells, so it must not shape
the fetch - nothing could hold the grid extension, and the schedule-derived
fetch window would have fetched-and-discarded it EVERY window (quadratic source
over-read on partially-cached bypass reads; the old geometry query was identity
there because extension required a containing miss run). An `into`-empty job can
only exist in an all-bypass plan - any populating tier missing a gap gives it a
cell - so bypass jobs now truly carry grids of 1. This also stops a bypass
tier's alignment from inflating the populatable drain-loop pieces (latent since
the grids were introduced). Pinned by
`PlanScheduleRetrieves.BypassTierDoesNotShapeFetchGrids`. Known accepted bounded
divergence (documented in-code): the schedule-derived head clamp can reach
across a same-tier resident run the old per-tier clamp stopped at - at most one
grid cell, served from cache when resident, once per plan.

Explicitly not folded in (with rationale in the design doc): routing the bypass
serve through inline machines would lose the per-object-piece long-connection
opens (re-breaking the multi-object bounding), and the bypass bank stays a bank
rather than a virtual display cell - `serveWindowInline` remains the one bespoke
assembler (sibling dedup + source remainder + pin + bank), now fully
schedule-driven in its inputs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-experimental Experimental Feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants