ReaderExecutor: pipeline-based read orchestration#103706
Conversation
77b9b0a to
fb8ddfb
Compare
8e5f067 to
f9d35b8
Compare
8130e60 to
45679f0
Compare
45679f0 to
0cbc4f0
Compare
0cbc4f0 to
22b6a38
Compare
cd5c38a to
fbb1388
Compare
…ions field `reader_executor_decrypt_ahead` is experimental, defaults off, and has no production reach: only one gtest exercises the worker-decrypt path, and it sets `ReaderExecutor::Options::decrypt_ahead` directly. Drop the user-facing setting and its plumbing (Settings.cpp, the SettingsChangesHistory 26.7 block, Context.cpp, ReadSettings.h, ReadPipeline.cpp); keep the `Options` field and the gated worker-decrypt path, so the behavior stays available to tests and is unchanged in production (decryption still happens at the serve boundary). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- `ChainedBuffers::extract` was `slice()` plus a `chassert(covers())` that can never fire (all 5 callers guard `covers()` first); delete it and call `slice()`. - `coveredBytes` had a single production use - the overlap guard in `copyTo`; inline that debug-gated unique-byte sum and drop the public method. - `planResidencyView` aligned its miss ranges in three passes (collect, sort, merge), but the forward segment walk already yields them ascending, so the sort was redundant; fuse the alignment into the single merge pass. - Extract `segmentCommittedEnd` for the committed-end ternary shared by `readOverlappingSegments` and `DiskCacheReader::readable`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Behavior-neutral simplifications, verified against the full executor gtest suite: - `RetrieveStatus::machine` duplicated the single global in-flight `machine`; replace it with `machineFor(ri)` (`machine && machine->retrieve_index == ri`), which also removes the stale-pointer clear in `cancelMachine`. - `FetchMachine::requested_range` was always `physical_window` shifted by `data_start_offset`; drop the field and derive it at the read sites. - `FetchMachine::sibling_led` was a vector the worker filled but collect only tested for emptiness; replace it with a `bool contended` set by the worker. - `effectivePrefetchWindowSize` was only ever compared `== 0`; replace it with a `prefetchEnabled(level)` predicate. - Inline the single-caller helpers `LongConnection::isComplete` and `maybeDrainLongTail`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- `ContinuityTracker::Options::ewma_alpha` was never configured (always the 0.5 default); inline 0.5 in `closeRun` and drop the field. - Comment-only sweep of references to removed/renamed entities: "live connection" -> "long connection", the removed `requested_range` / `coverWindow` / `PrefetchJob` / `serveCacheTiersCollectingMisses` / `fill_ahead_lead_ram` asides, `job->pressure_level` -> `pressure_snapshot`, the retired `read_extent_end` plan-sizing claim, and the `reader_executor_block_size` doc (it sizes ChainedBuffers nodes, not the read window). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`reader_executor_decrypt_ahead` (already demoted from a setting) decrypted prefetched bytes on the worker thread ahead of serving - experimental, never enabled, never measured. Drop it entirely so decryption always happens at the serve boundary (`decryptWindow`), the only behavior that ever shipped. Removes the `Options`/member toggle, `decryptFetchedAhead`, `FetchMachine::plaintext_fetched`, `RetrieveStatus::ready_bytes_is_plaintext`, `served_window_is_plaintext`, the `is_plaintext` plumbing through `tryCollectMachine`/collect/serve, and the gtest that exercised it. Decrypt-on-serve coverage is unchanged (`DecryptsMultiNodeWindow` etc.). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ptively The readNextWindow boundary block bridged two concerns in one tangle. Separate them: - collect the in-flight machine at the boundary (its own focused comment), then - replan only once the plan is fully consumed - cursor fell before plan_start, or reached plan_end and the plan doesn't already run to EOF. Drop the pre-emptive `position_phys + window_size > plan_end` look-ahead: the plan is now used to its end before a rebuild. Safe because the boundary collect runs before the serve in the same call, so the cursor never serves a consumed step at plan_end (no premature interior EOF). The replan + next prefetch launch now happen at the boundary rather than one window early. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`to_read` (the per-call read ceiling) was threaded through interpretStep and the six serve functions. Replace the parameter with a private `readCeiling()` helper that recomputes the same value on demand - `position` is stable during a serve, so the value is identical. Behavior-neutral; cleans up seven signatures. The helper deliberately omits the `reached_eof` check, so an in-flight machine's final bytes are still collected after EOF latches (the `atEnd() && !machine` guard owns the no-machine EOF). `serveCacheBlock`'s own byte-budget parameter is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Cursor/finishWindow readNextWindow gated EOF as `atEnd() && !machine`, then fell through to a `readCeiling() > 0`-gated replan block and a post-serve tail - so the in-flight machine's final-window drain rode the general path and the gate had to ignore `reached_eof`. Hoist EOF to an explicit `if (atEnd())` branch: drain an in-flight machine's final window, else report EOF. The rest is then strictly `!atEnd()`, so the replan gate simplifies to `!atExtent()` (new helper) - the read extent, not the file end. Factor the collect-then-replan block into `prepareCursor` and the post-serve tail into `finishWindow`, shared by the drain and normal paths. Behavior-neutral: for a reachable `atEnd() && machine` (unknown-size short read) `readCeiling() > 0` always held, so the drain's unconditional `prepareCursor` matches the old gated block; and under `!atEnd()`, `!atExtent()` equals the old `readCeiling() > 0`. Covered by `UnknownSizePrefetchedFinalBytesAreServed`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…handleExtentOrReplan
The serve path had three functions making the same "plan stale -> reschedule"
decision: prepareCursor (proactive, before every serve), interpretStep's guard
(detect + route), and handleExtentOrReplan (reactive replan + retry). Since
readNextWindow's EOF restructure runs prepareCursor before every interpretStep,
the reactive replan/retry is redundant - the plan is already fresh when
interpretStep runs, so handleExtentOrReplan only ever hits its early empty return.
Verified the replan/retry path is unreachable: a chassert probe stayed clean across
the unit grid AND the multi-thread integration that caught the original
premature-EOF bug, and a 5-way adversarial reachability analysis agreed - whenever
interpretStep's guard fires the executor is at atEnd() or readCeiling()==0 (the
cursor is structurally bounded by findStepContaining's clamp, and empty steps ==
empty plan span == atEnd/atExtent).
- Delete handleExtentOrReplan; interpretStep returns {} directly when there is
nothing to serve.
- prepareCursor no-ops at the extent (atExtent), so readNextWindow calls it
unconditionally - one scheduler call site, no !atExtent gate.
Behavior-neutral; grid 145 + integration 4/4 green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
serveCacheBlock had a single caller - serveHitStep, a one-line wrapper that computed the byte budget (min of the read ceiling and the step's remaining span) and passed it down. Fold the resident-run streaming body up into serveHitStep (computing the budget from the step internally) and drop serveCacheBlock and its generic to_read parameter. interpretStep still dispatches the symmetric pair serveHitStep(step) / serveRetrieveStep(step). Behavior-neutral; grid 145 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tchMachineRunner The runner that drives FetchMachine machines was a single concrete class bound to the prefetch pool. Extract an abstract `IFetchMachineRunner` (the four-verb protocol: schedule / tryCancelQueued / requestInterrupt / waitReleased) and rename the pool-backed implementation to `PoolFetchMachineRunner`; the executor's `runner` member is now `unique_ptr<IFetchMachineRunner>`. This is the seam for a future inline LocalFetchMachineRunner that runs the machine synchronously on the serve thread, so foreground and background can share one FetchMachine flow. Pure type-level refactor, behavior-neutral; grid 166 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The `run_step` closure was inlined in `launchRetrieve`. Factor it into a named `makeFetchStep(FetchMachine&)` helper so the step body is independent of the scheduling decision - a future inline runner can reuse the identical closure. Behavior-identical extraction; grid 166 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `LocalFetchMachineRunner`: an `IFetchMachineRunner` whose `schedule` runs the machine's step synchronously on the calling thread (mirroring the pool wrapper's state writes + exception-into-`failure` capture) and leaves `current_step` null. Not constructed by the executor yet - this is the inline driver the foreground will use so it can share the FetchMachine flow without the pool. Because an inline machine never carries a `JobHandle`, `waitReleased` / `tryCancelQueued` and the abandoned-machine drain all no-op on it - structurally avoiding the double-join a pool future would risk. Unit tests cover inline-run + terminal state + the step-throw-lands-in-`failure` hazard + the no-handle verbs. grid 170 green (4 new tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…chMachine (flag-gated) Add `reader_executor_unified_foreground` (experimental, default OFF). When ON, the synchronous foreground populatable serve runs the SAME FetchMachine flow as the prefetch - built by `launchMachineForWindow` and run INLINE on the serve thread by a `LocalFetchMachineRunner` - then collected immediately, so the serve reads the just-committed cells (cache-as-buffer) instead of the bespoke sync read+bank. The legacy sync path stays as the fallback for sibling-led contention and bypass gaps. - New `launchMachineForWindow(ri, window, runner)` factored from `launchRetrieve`, shared by the pool (read-ahead) and the inline (foreground) runners. - New `local_runner` member; `collectRunner()` returns `runner ? *runner : *local_runner`, so the collect verbs work with or without a pool (a settled inline machine no-ops through either - it carries no JobHandle). - The inline launch is guarded by `!machine` (slot free), never clobbering an in-flight read-ahead machine, and is collected within the same serve call. Default OFF -> behavior-neutral. Validated: gtest 172 (flag OFF identical; flag ON no-pool serves+populates from cells with no sync fallback; flag ON + pool coexists with read-ahead); a 5-way adversarial reachability pass (long-conn, machine-slot, contended, teardown, collect-runner - all clean); and integration on real S3 - the flag-OFF matrix unchanged + a new multi-thread `executor==legacy` flag-ON repro. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The unified-foreground inline FetchMachine (a `LocalFetchMachineRunner`, no prefetch pool) now fetches only the contiguous LED PREFIX up to the first sibling-led cache segment, serving it as a short window. The caller's next read resolves the sibling boundary through the existing sync fallback (elect/wait), instead of this thread blocking to fetch a led run past a sibling-led hole. `coordinatedPrefetch` bounds the fetch at the first sibling offset when `inline_serve` is set; `serveRetrievePopulatable` then narrows the serve window to the committed prefix via the new `committedCellPrefixEnd`, so the existing cell-serve covers it fully. The prefix is this executor's own led work, so it is in this writer's per-writer `committed()` set - there is no dead work: a byte a sibling downloaded is never in `committed()`, which is exactly what bounds the prefix. The pool read-ahead path is unchanged (`inline_serve` is false, so the fetch bound stays at the window end). Still gated behind `reader_executor_unified_foreground` (default off). Adds `UnifiedForegroundStopsAtFirstSiblingLedSegment`, a real `DiskCacheWriter` contention test - the downloader-blind `MockCacheWriter` masked an earlier dead-work defect in this area. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…g-gated) The unified-foreground path now runs the SCHEDULED fills inline on the serve thread (the same `FetchMachine` flow as the background prefetch, driven by a `LocalFetchMachineRunner`) instead of inventing a per-cursor fetch: - `serveRetrievePopulatable` drains the scheduled fill from the cache cell's append-only frontier (`committedCellPrefixEnd(fetchWindowAt(window))`) until the cell covers the cursor, then serves from the committed cells. Fetching from the cell frontier (not the cursor) fills a mid-cell read from the cell floor, so the append-only cell write commits instead of being skipped and re-fetched; the frontier also advances past a down-filled embedded hit, so a gap after the hit resumes there rather than re-fetching it from source. - `serveHitStep` down-promotes a served upper-tier hit into the lower cells the schedule marks for cross-cache fill (a `Fill(UpperCacheRead)`), using the bytes already in hand - so a lower segment completes across an embedded faster-tier hit without a remote over-read. - The cell-alignment head fetched below the cursor is recorded as over-read. - A `readBigAt` transient takes no fill pin (mirrors `finalizeAssembledWindow`'s `is_transient` guard), matching the legacy non-executor path: populate, never pin. Gated behind `reader_executor_unified_foreground` (default off). The remaining unified-path work - the inline fill fetching the whole cell-aligned gap run in one source read while skipping upper-resident parts (the legacy `readPhysicalWindow` fetch folded in) - is tracked for a follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The unified-foreground inline fill now fetches like the legacy `readPhysicalWindow` did - the whole cell-aligned miss run in one source read, skipping embedded faster-tier hits - and attributes the served bytes the same way. Four coupled changes (all behind `reader_executor_unified_foreground`, still default off): - `serveRetrievePopulatable` launches the inline machine over the whole cell- aligned run `[base, gapEndWithin(base, aligned.end()))` instead of just `[base, cursor.end())`. One cold cell is now ONE source read and the cursor's later windows in the cell are served from it - matching the legacy `fetchWindowAt`. The cell-alignment head and the run tail past the cursor window are recorded as `overread_pending` (the inline collect writes via `pushChainToWriters`, which - unlike the sync `assembleAndWriteBack` - does not account over-read), keyed off the committed frontier so a short fill records only what landed. - `CoverageMap::gapEndWithin(from, limit)` caps that launch at the first embedded resident byte of any tier, so an embedded faster-tier hit fills the lower cell DOWN (`downFillScheduledLower`) instead of being re-fetched from the source. `gapEnd` itself is unsuitable here: it is bounded by `plan_end`, which the read-extent advances one window at a time, so it would collapse the run to the cursor window. - `coordinatedPrefetch` fetches each led run in ONE `fetchGapsFromSource` when the machine serves inline (no concurrent reader to hand a growing prefix to) - one GET on the stateless arm instead of one per window. The background run-ahead keeps its per-tile progressive commit. - The cell-serve credits a cache hit only for bytes already committed BEFORE this serve started filling (a prior window's prefill or a worker's run-ahead). Bytes this serve fetches from the source transit the cell but are counted as `BytesFromSource`, not a cache hit - otherwise a cold all-miss reported itself as served entirely from cache (`recreditCommittedPrefixes` / `serveWindowFromCells` gained a `cache_credit` mask; the legacy callers pass none and are unchanged). `EvictionInChainRefetchesEvictedCells` now exercises the unified path explicitly and asserts its re-fetch count (9, vs the legacy path's 7): under the eviction flood the held long connection streams the first cold run but re-opens a GET per window for the later flood-evicted segments. Correctness (no missing bytes) is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts: # src/Disks/DiskObjectStorage/ObjectStorages/HDFS/HDFSObjectStorage.cpp # src/Disks/DiskObjectStorage/ObjectStorages/S3/S3ObjectStorage.cpp # src/Disks/DiskObjectStorage/ObjectStorages/Web/WebObjectStorage.cpp # src/Disks/IO/ReadBufferFromWebServer.cpp # src/IO/ReadPipeline.cpp # src/IO/ReadPipeline.h # src/Storages/ObjectStorage/StorageObjectStorageSource.cpp
| st.ready_bytes.append(std::move(collected)); | ||
| } | ||
| st.fetched += window; | ||
| st.phase = RetrievePhase::Ready; |
There was a problem hiding this comment.
This collect path can leave a fully-fetched cache-fill retrieve in Ready forever. tryCollectMachine runs schedulePutStep/reapPutMachine before st.fetched is advanced here, so the reapPutMachine check sees the old frontier and does not mark the retrieve Done; then this assignment overwrites the phase with Ready. depsSatisfied only accepts Done, so a later retrieve that writes the same append-only cache cell can remain blocked even though this prefetch has already fetched and committed the predecessor range.
For example, with retrieve 0 filling [0, 128) and retrieve 1 depending on it for [128, 256), collecting retrieve 0's final window writes the cell, but the dependency still sees retrieve 0 as Ready and never launches retrieve 1. Please advance the fetched frontier before reaping, or set Done after this line whenever st.fetched >= read_plan.schedule.retrieves[ri].range.size.
Flip `reader_executor_unified_foreground` on by default: the `ReaderExecutor`'s synchronous foreground serve now runs the scheduled fills inline through the same `FetchMachine` flow as the background prefetch (a `LocalFetchMachineRunner`), instead of the bespoke synchronous read-and-assemble path. Set the setting to 0 to fall back to the legacy synchronous path (kept for one release as a safety valve before it is removed). The inline path reached parity with the legacy `readPhysicalWindow` in the preceding change (whole cell-aligned fetch, embedded-hit skip, over-read and cache-hit attribution), so this flip is behaviour-equivalent on the metric grid and the schedule KPIs. Flipped in the four plumbing sites (Settings declaration + history, the `ReadSettings` member, and the `ReaderExecutor::Options` default the unit grid uses). Validated: the no-pool gtest grid (`PlanSchedule*:*ReaderExecutor*`) green with the default on, and the real-`FileCache`/S3 `test_reader_executor_metric` integration (the main matrix now exercises the unified path; the legacy arm of `test_repro_unified_foreground_matches_legacy` sets the setting off explicitly). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With the unified inline foreground the default, the bespoke synchronous read-and-assemble path is dead weight: delete `serveRetrieveForeground`, `syncGapRead`, `readWindowLogical`, `readPhysicalWindow`, `fetchAndBackfillGaps`, and `serveResidentFromPlan`. The three call sites that used it are rerouted: - The two `serveRetrieveForeground` uses (a not-prefetched bypass gap, and the populatable contended fallback where the cursor sits on a segment a sibling executor leads) collapse into one `serveWindowInline`. It fills the cell-aligned window from the source (`fetchWindowAt`, opening a long connection per object piece so the reads coalesce across windows), but first WAITS on a disk cell a sibling is downloading and reads the sibling's committed bytes - deduping the concurrent cold populate instead of re-fetching it - and pins the filled segment via `finalizeAssembledWindow`, exactly as the deleted path did. A bypass gap has no cell, so it is a plain source read banked in `ready_bytes`. - `initDecryption`'s header read goes through a direct one-shot `fetchGapsFromSource` (no plan, no long connection) instead of `readPhysicalWindow` with an empty geometry. `assembleAndWriteBack`, `finalizeAssembledWindow`, `recreditCommittedPrefixes`, `serveLateHits`, `readHitFromView`, and `pushAssembledToWriteBuffers` stay - they are shared with the worker collect path (`collectInFlightInto`). `UnifiedForegroundStopsAtFirstSiblingLedSegment` now asserts the sibling-led segment at the cursor is resolved from the sibling's committed cell (a cache read), not a synchronous source read (`syncReadMicros == 0`). Validated: the no-pool gtest grid `PlanSchedule*:*ReaderExecutor*` green (173), and the real-`FileCache`/S3 `test_reader_executor_metric` integration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The unified inline foreground is now the ReaderExecutor's default and, since the legacy synchronous stack was deleted, its ONLY foreground path - so the `reader_executor_unified_foreground` setting is vestigial and misleading: setting it to 0 no longer restores a legacy path (there is none), it only disabled the inline drain-loop and the cross-cache down-fill, leaving a degraded unified path. Remove it: the SQL setting (`Settings.cpp` DECLARE + `SettingsChangesHistory` - added and removed within the same unreleased version - + `Context.cpp` + the `ReadSettings` member + the `ReadPipeline` propagation) and the `ReaderExecutor::Options::unified_foreground` field. The two `if (unified_foreground)` sites - the inline drain-loop in `serveRetrievePopulatable` and the cross-cache down-fill in `serveHitStep` - become unconditional. Tests: drop the now-redundant `opts.unified_foreground = true` from the gtests (the path is always on); remove `test_repro_unified_foreground_matches_legacy` (its "flag on vs off" premise is moot - `test_repro_legacy_vs_executor` already covers executor-vs-non-executor equivalence). Validated: gtest grid `PlanSchedule*:*ReaderExecutor*` green (173), real `FileCache`/S3 `test_reader_executor_metric` integration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
tryBuildReaderExecutor skips the page cache when any StoredObject has UnknownSize, but it still installs every filesystem-cache provider here. The legacy path gates filesystem cache with canUseFilesystemCache(object), because FileCache needs a real object size for segment bounds and tail completion. With an S3/Web object that has no Content-Length, this executor path passes StoredObject::UnknownSize (UInt64::max) through DiskCacheProvider::openWriteBuffers into getOrSet, so the cache treats an EOF-streaming object as an effectively infinite file and can leave/download segments past the real EOF instead of taking the plain stream-until-EOF path. Please skip DiskCacheProvider when any mapped object is unknown-size (or fall back to the legacy no-cache behavior for those objects), matching the existing canUseFilesystemCache guard.
…t credit mask)
The unified serve reads every delivered byte OUT of a committed cell - that is a
real cache read (`w.writer->read()`), the cache being the buffer. The consolidated
fetch had added a `cache_credit` mask that suppressed counting a cell-read as a
cache hit when this same serve had just fetched the byte from the source, so a cold
miss reported 0 bytes from cache - the legacy DISJOINT attribution, inconsistent
with the cache-as-buffer model (the legacy path served source bytes straight from
the assembled result and never re-read the cell; the unified path always routes
through the cell).
Drop the mask. Now:
- `BytesFromSource` = bytes fetched from the origin into cells.
- `BytesFrom{Fs,Page}Cache` = bytes read out of cells to serve the client.
A cold miss shows BOTH (source filled the cell, the serve read it back out); the
two count distinct operations and legitimately overlap.
Removes `cacheCredit` + the `cache_credit` parameter on `recreditCommittedPrefixes`
/ `serveWindowFromCells` + the `pre_committed` snapshot in `serveRetrievePopulatable`.
Tests: the schedule's KPI oracle `predictKpi` now predicts `served_from_cache` as
ALL delivered User bytes (resident hits AND filled misses), not just pre-resident
ones - so `PlanScheduleValidation` (ColdAllMiss / TwoTierDisjointHits /
ResidentIsland) and `SchedulePredictsByteKpis` compare actual-vs-oracle on the new
semantics with no literal changes. `PartialFsHitTailFromSource` re-baselined:
`hit_delta` is the whole file (prefix hit + tail served from the filled cell), the
tail still counted once as source.
The cost formula and the integration invariants are unaffected (they use
`BytesFromSource` and the cache-GET COUNT, not the cache byte counters).
Validated: gtest grid `PlanSchedule*:*ReaderExecutor*` green (173), real
`FileCache`/S3 `test_reader_executor_metric` integration.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…odel M1) First stage of the Plan/Schedule/Execute/Display model (`tmp/reader-executor-model/DESIGN.md`): the schedule carries everything needed to RUN a fetch job, so the inline foreground fill consumes the schedule verbatim instead of re-deriving the fetch geometry per serve call. - `PlanSchedule::Retrieve` gains `fetch_runs` - the sub-ranges of `range` to read from the source, split at EVERY embedded resident region (served / filled down from its tier, never scheduled as a source read). `buildSchedule` computes them with the walk `fillRegion` was already doing before `mergeSorted` discarded the per-gap structure. Whether the executor reads THROUGH a resident hole at run time (its down-fill was skipped by the append-only cell) is a display-state decision - the drain-loop's frontier-in-hole rule - not a schedule property. - `Retrieve` also gains `fetch_head_grid` / `fetch_tail_grid`: the coarsest cache-cell alignment across the plan's tiers (every Remote byte is a miss on every tier, so the coarsest extension applies). The `into` cell edges cannot serve this purpose - the providers merge adjacent aligned misses, so those "cells" are multi-segment runs. - The inline fill's drain-loop becomes a pure schedule consumer: grid-floor the window's first missing display byte (walking ACROSS runs, so a before-slack run a seek jumped over is still fetched and the append-only cell fills whole from its floor), take the display frontier from there, and fetch to the first run end past it - reading through an inter-run resident hole exactly when the frontier sits in one, as the legacy assembler did - tail-ceiled to the grid. `fetchWindowAt` / `gapEndWithin` no longer appear on the serve path. Behavior-preserving: the pieces the loop launches match the previous geometry re-derivation (metric grid R/I/O unchanged, no re-baselining). An adversarial review pass killed the first cut (which baked `min_bytes_for_seek` bridging into the runs - reading through hits the old code clipped at); the landed rule - static resident-split jobs + dynamic frontier read-through - matches the old `gapEndWithin` semantics. Known accepted divergence: with a refused-write / contended cell state the new frontier walk starts at the merged-range grid floor (the old per-tier miss-cell clamp could start later), so it can RESUME a refused cell from its committed end - same data, different piece shape, only reachable under write refusal. The background launch path is untouched (its frontier unification is the next stage). New `PlanScheduleRetrieves` tests: the runs split at an embedded resident region regardless of `min_bytes_for_seek`; a cold gap's after-slack past `plan_end` extends one maximal run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Second stage of the Plan/Schedule/Execute/Display model (`tmp/reader-executor-model/DESIGN.md`): the DISPLAY - the committed cell state - becomes the data-progress surface for populatable jobs, and the two wait sites collapse into one. - New `jobFrontier(ri)`: the first byte of a job's `fetch_runs` not yet committed to the cells - the job's DATA progress, derived from the same display state the serve reads, so a piece stopped anywhere (background or inline) is continued by whoever runs next with no handover. - New `launchProgress(ri)` = `jobFrontier` advanced past bytes already ATTEMPTED (the `RetrieveStatus::fetched` high-water, now END-based and cursor-snapped): the background launch POLICY. The two must differ: a refused cell write (cache full / download budget) or a segment a SIBLING executor downloaded never enters this executor's per-writer committed set, so a raw display frontier pins there and the launcher would re-GET the same lead every serve window, never retire the job, and starve later jobs of read-ahead (an adversarial review pass constructed both loops - the old counter's drift was quietly load-bearing as this heal). `launchRetrieve`, the launch scan, and the Ready->Done transition use `launchProgress`; the serve never reads it. The old counter's accumulation drift is gone: the high-water advances by attempted-window END and cursor position, both well-defined for inline (cell-floor-based) machines. - New `waitOnDisplay(window, own_worker_only, ...)`: the one "a live writer is filling this cell - wait on its download frontier and read it" primitive. `serveWindowFromCells`'s read-behind wait (own worker's targets) and `serveWindowInline`'s contended sibling wait (any disk writer) were the same loop split by who-writes; they now differ only by the policy argument. Page cells are never waited on (filled by promotion, no downloader; the old worker-path iteration over them could never produce coverage). Also corrects a wrong claim from M1's design notes: within one plan the per-writer committed set is MONOTONE (the writer remembers evicted bytes), so eviction is healed at the next re-plan, not by the frontier moving back. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l M3) Third stage of the Plan/Schedule/Execute/Display model (`tmp/reader-executor-model/DESIGN.md`): every fill the executor performs is now a SCHEDULED job, and the fill kinds whose input is the just-served bytes run through one source-branching executor. - `buildSchedule` now emits a `HandedChain` promote job for every User GAP range (previously only for resident ranges): `into` = the user-overlapping miss cells `writeTargetsFor` deliberately left out of the Remote fetch (the faster tiers - the fetch fills only the bottom tier and same-tier slower layers). The promote-of-fetched work the serve was doing ad hoc is now visible schedule data. The resident promote gains the `[CF-promote]` guard as data: a same-tier faster LAYER is never a promote target (it is a fetch target) - previously that rule lived only in `maybePromote`'s tier-equality break. - New `Retrieve::ahead_eligible`: the fg/bg partition as schedule data. `Remote` fills depend on nothing but the source and may run ahead; the handed kinds (`UpperCacheRead`, `HandedChain`) take the serve's output as input, so they are serve-front only. `maybeLaunchAhead` filters on it. - New `runHandedFills(served_range, bytes)`: the ONE executor of the handed jobs, replacing three ad-hoc functions - `downFillScheduledLower` (its schedule-driven body, generalized to both sources), `promoteFetchedToUpper` (which re-READ the bottom cell to find the bytes the serve already had in hand), and `maybePromote` (whose tier-walk the scheduled `into` cells now encode). Every serve site hands the served bytes to the same call: the hit serve (up-promote + cross-cache down-fill in one pass), the populatable cell-serve, the inline fallback, and the banked bypass serve. The last one matters: the schedule can emit a promote over an `into`-empty gap (a whole-cell run the fetch parts do not fully span is not a FETCH target, yet the served bytes may complete its blocks), and a planned job must run regardless of which path served the bytes - an adversarial review pass caught the bank path skipping it, which would have let prefetch timing decide whether a planned job ran. Handing the served bytes also promotes where the old committed-only walk could not: when the bottom-cell write was refused (cache full) or the segment is sibling-owned, the faster tiers still fill from the bytes in hand - the stated `HandedChain` semantics, and strictly less re-fetching. - `isScheduledFillTarget` (the MACHINE's fill-target predicate) is constrained to `Remote` jobs' cells: without this the new promote jobs would have made the fetch write the faster-tier cells directly - exactly what `writeTargetsFor` forbids (the pc fill trails the serve cursor; it does not ride the fetch lead). New `PlanScheduleRetrieves.PromoteNeverCrossesSameTier` pins `[CF-promote]` as schedule data (the schedule twin of `SlowFsHitIsNotPromotedToFastFs`); `DesignWorkedExample` / `SlackNotPromotedToFasterTier` now assert the visible promote job and `ahead_eligible`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (model M4) Final stage of the Plan/Schedule/Execute/Display model (`tmp/reader-executor-model/DESIGN.md`): the serve paths share one window and consume only schedule data; the per-job sidecar keeps only what cannot be derived. - ONE step-bounded serve window, computed once in `serveRetrieveStep` for every path (the populatable cell-serve, the bypass bank, and the inline fallback); `serveRetrievePopulatable` / `serveWindowInline` take it as a parameter. The fallback receives the ORIGINAL window (not the narrowed committed prefix), as before - the raced/short path re-serves it whole. - The LAST geometry call leaves the serve path: `serveWindowInline`'s `fetchWindowAt` is replaced by the job's fetch grids clamped into its range - the same schedule data the drain-loop uses (byte-equal for gap windows, identity for a bypass job with grids of 1). Geometry is now consulted only at plan/schedule build, as the model requires. - `RetrievePhase` is gone. The enum was the third progress tracker: "in flight" is `machineFor(ri)`, the serve gates on display coverage, and only `Done` was ever read (the `depsSatisfied` append-only ordering gate). `RetrieveStatus` now holds exactly the underivable rest: `done`, the launch high-water, and the bypass bank. The test inspector exposes `retrieveDone` and a job-relative `retrieveLaunchProgress` instead of raw phases. The fetch grids are tightened to the plan's POPULATING tiers (an adversarial review pass): a bypass-mode cache (`read_from_*_cache_if_exists_otherwise_bypass_cache`) reports its coarse alignment but schedules NO fill cells, so it must not shape the fetch - nothing could hold the grid extension, and the schedule-derived fetch window would have fetched-and-discarded it EVERY window (quadratic source over-read on partially-cached bypass reads; the old geometry query was identity there because extension required a containing miss run). An `into`-empty job can only exist in an all-bypass plan - any populating tier missing a gap gives it a cell - so bypass jobs now truly carry grids of 1. This also stops a bypass tier's alignment from inflating the populatable drain-loop pieces (latent since the grids were introduced). Pinned by `PlanScheduleRetrieves.BypassTierDoesNotShapeFetchGrids`. Known accepted bounded divergence (documented in-code): the schedule-derived head clamp can reach across a same-tier resident run the old per-tier clamp stopped at - at most one grid cell, served from cache when resident, once per plan. Explicitly not folded in (with rationale in the design doc): routing the bypass serve through inline machines would lose the per-object-piece long-connection opens (re-breaking the multi-object bounding), and the bypass bank stays a bank rather than a virtual display cell - `serveWindowInline` remains the one bespoke assembler (sibling dedup + source remainder + pin + bank), now fully schedule-driven in its inputs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Builds on top of #103234 (ReadPipeline). Gated by
SET use_reader_executor = 1(experimental, default off).This commit replaces the matryoshka of
ReadBufferwrappers with aReaderExecutorthat owns offset mapping, cache decisions, prefetch and decryption in one place. Caches plug in via a uniformICacheProvider/ICacheHandleAPI.What this brings
Zero-copy page-cache hits. Reads served from
PageCachereach the caller as ashared_ptrinto the mmap'd cell — nomemcpybetween cache and working buffer (compare toCachedInMemoryReadBufferFromFile's per-block copy).Connection reuse across remote read calls. Sequential reads against the same S3/Azure/HTTP object reuse an open buffer instead of issuing a new HTTP request per window.
SourceBufferLimitcaps live connections globally (max_remote_read_connections, default 1000); over-limit reads fall back to stateless open-read-close. Visible viasystem.remote_read_connections.Coordinated cache + prefetch decisions. The matryoshka design had each layer making independent choices (prefetch ahead in
AsynchronousBoundedReadBuffer, decide what to cache inCachedOnDiskReadBufferFromFile, gather across objects inReadBufferFromRemoteFSGather). The executor sees the whole window:Per-object cache identity, not per-pipeline.
DiskCacheProviderderivesFileCacheKey/FileCacheOriginInfoperStoredObject(etag-keyed caching forStorageObjectStorageSource, segment-key-type classification forDatavsSystemqueues). The old single-key-per-executor approach silently shared cache identity across all objects in a gather-mode read.Foundation for new cache backends. New caches (vector cache, distributed cache, etc.) implement one interface (
ICacheProvider::lookup) and plug into the chain without touching read paths.Observability
system.remote_read_connections— open connections (path, query id, position, elapsed)system.reader_executor_log— per-executor stats (cache hit/miss/populate bytes, prefetch hits/cancellations, source request count, decrypt time, …)ProfileEvents:ReaderExecutor*counters and microsecond timers;LiveSourceBuffer*countersHistogramMetrics: cache get/populate/source-read/prefetch-wait latenciesServer settings
reader_executor_prefetch_pool_size(default 8) — shared prefetch thread pool sizereader_executor_prefetch_queue_size(default 0 =pool_size * 10) — pool queue depthmax_remote_read_connections(default 1000) — live-buffer slot count, reload-ableInternals
Rope/RopeNode/OwnedRopeBuffer— refcounted buffer chains with mixed provenance (owned, page-cache-pinned), built-in cursor (peek/advance/tryRewind), sorted-on-insert nodes, disjoint-interval coverage tracking.ICacheProvider/ICacheHandle— uniform cache API (lookup → status/get/put). Two implementations:PageCacheProvider(file-level, zero-copy),DiskCacheProvider(per-object, FileCache-backed).ISourceReader— stateless range-read interface.LocalSourceReader,ObjectStorageSourceReader,BufferSourceReader(adapter for backup /BufferCreator).OffsetMap— replacesReadBufferFromRemoteFSGather's gather logic with a logical-to-(object, object-offset) lookup; supports single-object unknown-size (S3HEADwithoutContent-Length→ streams to EOF).ReaderExecutor— owns: position, offset map, cache chain, prefetch handle, live buffer, over-read tail, source-buffer slot.PipelineReadBuffer— thinReadBufferFromFileBaseexposing the executor throughBufferBase::set/next/seek(so legacy callers see no change).PrefetchThreadPool— shared bounded pool, returnsnullptron overflow (sync fallback), task cancellation is race-free.Testing
~3,700 lines of tests (≈46% of the PR), 130+ gtest cases:
gtest_rope— 31 cases (append / peek / advance / tryRewind / slice / copyTo / coverage queries / shift).gtest_reader_executor— 41+ cases (cache-chain combinations, per-object lookup, prefetch races, EOF release, slot leak fixes, unknown-size streaming, …).gtest_filecache— 22 cases including over-read / bypass-mode / partial coverage scenarios.gtest_read_pipeline— known/unknown size selection through the executor.04262_reader_executor_observabilityplus opt-outs (SET use_reader_executor = 0) on tests that intentionally exercise the legacy path.Load tesing
the simpliest cache API"
changed cache API (stream aware):
That is good outcome. I need to optimize the work with caches. Need to stream data from available cache segments without any additional costs.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add experimental
ReaderExecutorfor pipeline-based read orchestration with unified cache API (PageCacheProvider,DiskCacheProvider), connection reuse for remote reads, shared prefetch pool, andsystem.remote_read_connections/system.reader_executor_logobservability tables. Enable withSET use_reader_executor = 1.Documentation entry for user-facing changes