Add `borrow_from_cache` object storage and `memory` metadata types by alexey-milovidov · Pull Request #100371 · ClickHouse/ClickHouse · GitHub
Skip to content

Add borrow_from_cache object storage and memory metadata types#100371

Open
alexey-milovidov wants to merge 90 commits into
masterfrom
borrow-from-cache-object-storage
Open

Add borrow_from_cache object storage and memory metadata types#100371
alexey-milovidov wants to merge 90 commits into
masterfrom
borrow-from-cache-object-storage

Conversation

@alexey-milovidov

Copy link
Copy Markdown
Member

Add a new object storage type borrow_from_cache that allocates space in a named filesystem cache using ephemeral FileSegments. Each stored object is backed by a cache segment held alive via FileSegmentsHolder; when released, the cache reclaims the space.

Add a new metadata type memory that keeps all file-to-blob mappings and directory structure entirely in memory with no persistence. On server restart all data is lost, which is acceptable for the intended use case of temporary tables.

Configuration example:

  disk(type=object_storage,
       object_storage_type='borrow_from_cache',
       metadata_type='memory',
       cache_name='some_cache')

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

A new object storage type and metadata type suitable for temporary tables. Add a new object storage type borrow_from_cache that allocates space in a named filesystem cache and holds it from eviction. Add a new metadata storage type memory that keeps the mapping in memory. For example, this could allow creating temporary tables with custom engines in the Cloud without using S3.

Add a new object storage type `borrow_from_cache` that allocates space
in a named filesystem cache using ephemeral `FileSegment`s. Each stored
object is backed by a cache segment held alive via `FileSegmentsHolder`;
when released, the cache reclaims the space.

Add a new metadata type `memory` that keeps all file-to-blob mappings
and directory structure entirely in memory with no persistence. On
server restart all data is lost, which is acceptable for the intended
use case of temporary tables.

Configuration example:
  disk(type=object_storage,
       object_storage_type='borrow_from_cache',
       metadata_type='memory',
       cache_name='some_cache')

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Mar 22, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Mar 22, 2026
Fix `unlinkFile` to decrement `ref_count` when positive instead of
unconditionally removing blobs. Objects are only scheduled for removal
when the last reference is unlinked (`ref_count == 0`).

Fix `createHardLink` to reject existing destination paths with
`FILE_ALREADY_EXISTS` instead of silently overwriting.

Remove consecutive blank lines flagged by style check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Disks/DiskObjectStorage/MetadataStorages/Memory/MetadataStorageInMemory.cpp Outdated
…dicate

- Remove unused `DIRECTORY_DOESNT_EXIST` error code from `MetadataStorageInMemory.cpp`
  (style check failure)
- Remove default argument values from override declarations in
  `BorrowFromCacheObjectStorage` to fix `google-default-arguments` clang-tidy error
- Guard `removeRecursive` against null `should_remove_objects` predicate and
  pass relative paths to match the predicate contract

#100371

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Disks/DiskObjectStorage/MetadataStorages/Memory/MetadataStorageInMemory.cpp Outdated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Disks/DiskObjectStorage/MetadataStorages/Memory/MetadataStorageInMemory.cpp Outdated
alexey-milovidov and others added 2 commits March 23, 2026 00:03
…` storage

Add a new section to the storing-data documentation describing
the `borrow_from_cache` object storage type and `memory` metadata type.
Update the metadata_type and object_storage_type lists in the introduction.
Rename test 04053 → 04054 to resolve numbering conflict with master.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…StorageInMemory`

Fix the test reference file that had the wrong cache name (04053 vs 04054).

Introduce `BlobGroup` with shared ownership via `shared_ptr` so that
hardlinked files correctly share blob reference counts. Previously,
`createHardLink` copied `FileEntry` by value, causing ref_counts to
diverge between linked files. Now both entries share the same `BlobGroup`,
and objects are only queued for removal when the last reference is gone.

Fix `unlinkFile` to always erase the file entry (previously it left
zombie entries when ref_count > 0).

Implement `getBlobsToRemove` and `recordAsRemoved` overrides so that
objects queued for removal are actually processed by the removal thread.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Disks/DiskObjectStorage/MetadataStorages/Memory/MetadataStorageInMemory.cpp Outdated
`moveFile` silently overwrote the destination entry without decrementing
the blob group's ref_count or scheduling orphaned blobs for removal.
Now properly handles destination cleanup, consistent with `replaceFile`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Disks/DiskObjectStorage/ObjectStorages/ObjectStorageFactory.cpp
alexey-milovidov and others added 2 commits March 28, 2026 21:34
…ging

Fix `moveDirectory` in `MetadataStorageInMemory` to properly handle
existing destination files: decrement `ref_count` on the old blob group
and queue blobs for removal when the last reference is gone, instead of
silently overwriting.

Add `BlobStorageLog` deletion events to `BorrowFromCacheObjectStorage`
in `removeObjectIfExists` and `removeObjectsIfExist`, matching the
logging behavior of other object storage backends (S3, Azure, HDFS).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Disks/DiskObjectStorage/MetadataStorages/Memory/MetadataStorageInMemory.cpp Outdated
Comment thread docs/en/operations/storing-data.md Outdated
Comment thread src/Disks/DiskObjectStorage/MetadataStorages/Memory/MetadataStorageInMemory.cpp Outdated
alexey-milovidov and others added 4 commits June 17, 2026 18:32
…eInMemory`

`MergeTree` reads a part directory's mtime as the part `modification_time`
(`DataPartStorageOnDiskBase::getLastModified`), and sets it on the temporary
part directory before renaming it into place via `setLastModified` followed by
`moveDirectory`. The in-memory metadata storage only tracked directory paths,
not their timestamps: `getLastModified` returned the epoch for every directory
and `setLastModified` silently ignored directories. As a result, parts stored
through `borrow_from_cache` reported a `1970-01-01` `modification_time`, which
feeds merge selection, TTL/age calculations and `system.parts`.

Change `directories` from a `std::set` to a `std::map` keyed by path and valued
by `Poco::Timestamp`. `createDirectory`/`createDirectoryRecursive` now stamp the
current time on creation (matching `mkdir`), `setLastModified` updates the stored
directory timestamp (and now throws when the path is neither a file nor a
directory, matching disk-backed metadata), `getLastModified` returns it, and
`moveDirectory` carries it across the rename. The transaction rollback journal
is reworked accordingly: `dirs_inserted`/`dirs_erased` are replaced by a single
`dirs_undo` map of pre-mutation timestamps, mirroring `files_undo`, so a partial
commit failure restores both directory existence and the original mtime.

Add `gtest_metadata_in_memory` cases covering directory mtime set/get,
preservation across `moveDirectory`, the throw on a non-existent path, and
rollback of a directory timestamp on a failed transaction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verify that a part stored on a `borrow_from_cache` object storage with `memory`
metadata reports a real `modification_time` (not the `1970` epoch) after an
insert and after a merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test `05017_borrow_from_cache_part_modification_time` had an
anomalously high number, while the highest stateless test number on
`master` is `04356`. The `check_gaps_in_tests_numbers` style check
(`ci/jobs/check_style.py`) flags gaps of `100` or more between
consecutive test numbers and failed with `Gap (4355, 5017) > 100`.

Renumber it to `04357` (the next available number) and update the
in-test `cache_name`/disk/path identifiers accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 4f931588c42:

  • Merged master (the branch was ~1134 commits behind). The merge was conflict-free (git merge-tree reported no conflicts) and also reconciled the contrib/avro submodule pointer that master had advanced to 7b7f6a5.
  • Fixed the Style check failure (test_numbers_check: Gap (4355, 5017) > 100). The regression test 05017_borrow_from_cache_part_modification_time had an anomalously high number, while the highest stateless test number on master is 04356. check_gaps_in_tests_numbers (ci/jobs/check_style.py) flags gaps of 100 or more between consecutive test numbers. Renumbered it to 04357 (the next available number) and updated the in-test cache_name/disk/path identifiers to match. The gap check now passes locally (no gaps >= 100).

Verified that the 1134-commit merge did not break the new code: BorrowFromCacheObjectStorage.cpp, MetadataStorageInMemory.cpp, MetadataStorageFactory.cpp, ObjectStorageFactory.cpp, DiskType.cpp, and gtest_metadata_in_memory.cpp all pass -fsyntax-only clean against current-master headers.

The only remaining unresolved review item (an allow_experimental_* gate for borrow_from_cache) reflects the deliberate design decision already stated on this PR — it is intended to be production, not experimental — and is left as-is. The remaining blocker is the human review (@kssenii).

Comment thread docs/en/operations/storing-data.md Outdated
The overview in `docs/en/operations/storing-data.md` stated unconditionally
that `metadata_type` defaults to `local`. This contradicts the new
`borrow_from_cache` object storage, whose default (from
`MetadataStorageFactory::getCompatibilityMetadataTypeHint`) is `memory`, and
which is what `04141_borrow_from_cache_metadata_check` relies on when
`metadata_type` is omitted.

Qualify the wording: the default is `local` for durable object storages
(`s3`, `azure_blob_storage`, `local_blob_storage`, `hdfs`), `web` for the
`web` object storage, and `memory` for `borrow_from_cache`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed bea54b1e8fd:

  • Addressed the AI Review "Request changes" (the sole remaining unresolved thread, docs/en/operations/storing-data.md:49): the overview stated unconditionally that metadata_type defaults to local, which contradicts the new borrow_from_cache backend. MetadataStorageFactory::getCompatibilityMetadataTypeHint returns memory for ObjectStorageType::BorrowFromCache (and local for the durable object storages, web for web), and 04141_borrow_from_cache_metadata_check relies on omitting metadata_type defaulting to memory.

    Qualified the wording so the default is stated per object storage type: local for durable object storages (s3, azure_blob_storage, local_blob_storage, hdfs), web for the web object storage, and memory for borrow_from_cache. The following sentence already documents that borrow_from_cache requires metadata_type='memory' and that any other combination is rejected with INVALID_CONFIG_PARAMETER.

CI for the previous commit 4dfb2e8e is green (150 passed, 0 failed). This is a docs-only wording change; the behavior is unchanged and already covered by 04141_borrow_from_cache_metadata_check.

`HoldingReadBuffer` (in `BorrowFromCacheObjectStorage`) keeps the borrowed
`FileSegmentsHolder` alive but, as a bare `ReadBufferFromFileDecorator`, only
forwarded `getFileOffsetOfBufferEnd`. The borrowed cache file is plugged into
`ReadBufferFromRemoteFSGather` exactly like the `local_blob_storage` read
wrapper `ReadBufferFromFileWithLogging`: the gather calls `setReadUntilPosition`
on the per-object buffer when a requested range ends inside an object, and uses
`supportsReadAt` / `readBigAt` for positional reads. Because the base classes
default `setReadUntilPosition` / `setReadUntilEnd` to no-ops and
`supportsRightBoundedReads` / `supportsReadAt` to `false`, the decorator was not
transparent: positional `readBigAt` reads of the borrowed object were disabled
(the inner `pread` buffer supports them), and the right-bound contract the
gather relies on was not forwarded.

Forward the same set of methods as `ReadBufferFromFileWithLogging`
(`setReadUntilPosition`, `setReadUntilEnd`, `supportsRightBoundedReads`,
`supportsReadAt`, `readBigAt`, `isSeekCheap`, `isContentCached`) so the wrapper
is transparent, while still holding the segment alive.

Add `04489_borrow_from_cache_bounded_read` covering a bounded sub-range read of
a borrowed object with the synchronous read path forced
(`remote_filesystem_read_method = 'read'`), so no `AsynchronousBoundedReadBuffer`
trims an over-read above the gather.

Addresses the AI Review "Request changes" on
#100371

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 5afd10a60b4:

  • Addressed the AI Review "Request changes" (the remaining unresolved thread on BorrowFromCacheObjectStorage.cpp). HoldingReadBuffer kept the borrowed FileSegmentsHolder alive but, as a bare ReadBufferFromFileDecorator, forwarded only getFileOffsetOfBufferEnd. The borrowed cache file is plugged into ReadBufferFromRemoteFSGather (and, on the async path, AsynchronousBoundedReadBuffer) exactly like the local_blob_storage read wrapper ReadBufferFromFileWithLogging. Made the decorator transparent by forwarding the same set of methods — setReadUntilPosition, setReadUntilEnd, supportsRightBoundedReads, supportsReadAt, readBigAt, isSeekCheap, isContentCached — so the gather's right bound (set when a requested range ends inside an object) is honored and positional readBigAt reads of the borrowed object are enabled (the inner pread buffer reports supportsReadAt() == true), while still holding the segment alive.

  • Added 04489_borrow_from_cache_bounded_read: a bounded sub-range read of a borrowed object with the synchronous read path forced (remote_filesystem_read_method = 'read') so the gather buffer is used directly and an over-read cannot be trimmed by AsynchronousBoundedReadBuffer.

The changed TU compiles cleanly with -Werror; the test reference was validated with clickhouse-local (the borrowed-disk backend does not change SELECT results). The disk DDL mirrors 04054_borrow_from_cache_object_storage.

alexey-milovidov and others added 2 commits June 30, 2026 04:44
`DiskObjectStorage::supportZeroCopyReplication` previously returned true for
every metadata type except `Keeper`, which advertised zero-copy support for the
new `memory` metadata storage used by `borrow_from_cache`.

That metadata is in-memory and ephemeral: `MetadataStorageInMemory` inherits
`IMetadataStorage::getSerializedMetadata`, which throws `NOT_IMPLEMENTED`, and
for `borrow_from_cache` the bytes live in node-local cache segments that other
replicas cannot read. With `allow_remote_fs_zero_copy_replication=1`, the
replicated code path could therefore be selected for such a disk even though it
cannot satisfy the zero-copy contract.

Fail closed by also excluding `MetadataStorageType::Memory`. The zero-copy
selection in `StorageReplicatedMergeTree::getZookeeperZeroCopyLockPaths` simply
skips disks that do not support zero-copy, so this falls back to ordinary
replication rather than erroring late.

This hardens the latent contract inconsistency raised in review for
#100371.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`04490_borrow_from_cache_replicated_zero_copy` covers the contract hardened in
the previous commit: a `ReplicatedMergeTree` on a `borrow_from_cache` disk with
`memory` metadata and `allow_remote_fs_zero_copy_replication = 1` must fall back
to ordinary replication rather than selecting the zero-copy path, since
`DiskObjectStorage::supportZeroCopyReplication` now returns false for `memory`
metadata. The table must work end-to-end (insert, merge, read) instead of
erroring late.

Validated against a local server with embedded Keeper: the single active part
after `OPTIMIZE ... FINAL` lands on the borrowed disk and no `zero_copy`
ZooKeeper root is created. The absence of zero-copy lock nodes is not asserted
in the test itself because that ZooKeeper root is shared server-wide and other
tests running in parallel can create it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 271d32a965d (two commits):

  • Fail-close: disable zero-copy replication for memory metadata storage (2431d73da95). DiskObjectStorage::supportZeroCopyReplication previously returned true for every metadata type except Keeper, so it advertised zero-copy support for the new memory metadata. That metadata is in-memory and ephemeral — MetadataStorageInMemory inherits IMetadataStorage::getSerializedMetadata (throws NOT_IMPLEMENTED), and for borrow_from_cache the bytes live in node-local cache segments other replicas cannot read. Now it also excludes MetadataStorageType::Memory. All zero-copy call sites (getZookeeperZeroCopyLockPaths, getZeroCopyPartPath, tryCreateZeroCopyExclusiveLock, …) gate on supportZeroCopyReplication and simply skip disks that do not support it, so this falls back to ordinary replication rather than erroring late.

  • Test 04490_borrow_from_cache_replicated_zero_copy (271d32a965d). A ReplicatedMergeTree on a borrow_from_cache/memory disk with allow_remote_fs_zero_copy_replication = 1 must work end-to-end via ordinary replication. Validated against a local server with an embedded Keeper: insert + OPTIMIZE … FINAL + read succeed, the single active part lands on the borrowed disk, and no zero_copy ZooKeeper root is created. The existing 04054/04141/04357/04489 tests still pass with the change.

This addresses the zero-copy item of the AI Review "Request changes". The other item — gating borrow_from_cache behind an allow_experimental_* setting — is intentionally not done: this backend is meant to ship as production (e.g. temporary tables with custom engines in the Cloud without S3), not as an experimental feature, as already noted on the corresponding thread. The feature already fails closed on misconfiguration (borrow_from_cache requires metadata_type = 'memory', and memory metadata is rejected for durable object storages, both with INVALID_CONFIG_PARAMETER).

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Root cause of the Stress test (arm_msan) failure (report, commit 271d32a9).

The sole CI failure (Cannot start clickhouse-server) is a real, PR-introduced restart-robustness gap — not a flaky-unrelated failure. It hit only 1 of 10 stress configs (arm_msan; the other nine are green), which points to a load-order race, but the underlying defect is genuine.

What happens. The stress harness hard-restarts the server while stateless tests are mid-flight. If the server is killed after 04141_borrow_from_cache_metadata_check runs CREATE TABLE tmp_borrowed ... disk = disk(type = object_storage, object_storage_type = 'borrow_from_cache', cache_name = '04141_cache_creator', ...) but before its DROP, the table's .sql metadata persists. On the next startup the server re-attaches it, which reconstructs the borrow_from_cache disk, which calls FileCacheFactory::get('04141_cache_creator') and throws There is no cache by name ... (code 36). The metadata-load job fails fatally, so the server cannot start.

The named cache is registered by a separate custom disk (tmp_cache_creator, type = cache), and custom DDL disks are materialized lazily per table — so the borrow disk can be reconstructed before the cache-creating disk has registered the cache. That ordering nondeterminism is why only one shard tripped.

This is inherent to a persistent table on a borrow_from_cache disk: its data is ephemeral by design (the in-memory entries map and memory metadata are empty after restart), so the reloaded table is necessarily empty — but today the disk reconstruction aborts the whole server instead of coming up empty.

no-stress is not an escape hatch: tests/clickhouse-test rejects that tag ("stress tests must be able to run every test"), so the server itself must tolerate this on restart.

Three points that require the cache on reattach (all currently throw when it is absent):

  1. object-storage construction — registerBorrowFromCacheObjectStorageFileCacheFactory::get (the throw seen in the log);
  2. the disk access check — IDisk::checkAccess writes a probe blob → BorrowFromCacheObjectStorage::writeObject → needs the cache;
  3. MergeTreeData's format_version.txt write on attach: with memory metadata the file is always absent after restart, so !read_format_version is true and it writes it — again writeObject → needs the cache.

Suggested fix. DiskFromAST/DiskFactory already thread an attach flag down into the object_storage disk creator in RegisterDiskObjectStorage.cpp (currently received as the ignored first trailing bool). Use it:

  • Make BorrowFromCacheObjectStorage resolve the cache lazily — it is only ever used in writeObject — so construction no longer requires the cache and point (1) stops throwing on reattach. A fresh CREATE with a bad cache_name still fails fast, just at the access check instead of at construction.
  • When attach == true and the named cache is absent (!FileCacheFactory::instance().getAll().contains(cache_name)), bring the disk up degraded so the empty table loads and the server starts: skip the access check and skip the format_version.txt write. Both already key off isReadOnly, so a read-only disk is the simplest lever — but ReadOnlyDiskWrapper blocks removals, so a leftover table could not then be DROPped; a write-once/broken-style flag that still permits removal is preferable. That is the one design call here.

I did not push a patch: this worker's checkout is in an unrelated dirty state and a full rebuild is not feasible in a single session, so the read-only-vs-degraded choice and a local restart check (create a borrow table, kill -9, restart) are best done where it can be built and tested.

Everything else is green and the AI review is ✅ Approve with no unresolved threads, so the remaining blockers are this restart fix and @kssenii's review.

alexey-milovidov and others added 2 commits June 30, 2026 22:33
A persisted table on a `borrow_from_cache` disk is reattached on restart like
any other table, but its data lives only in node-local ephemeral cache segments
and does not survive a restart, so the reattached table is necessarily empty.
The named cache is registered by a *separate* disk; inline (DDL) disks are
materialized lazily and in an unspecified order, so the borrow disk could be
reconstructed before the cache-creating disk had registered the cache (or the
cache may have been dropped). The object storage resolved the cache eagerly at
construction via `FileCacheFactory::get`, which throws `There is no cache by
name ...`, turning that ordering into a fatal metadata-load failure: the server
could not start. The stress harness, which hard-restarts the server mid-test,
hit this on `04141_borrow_from_cache_metadata_check` (one `arm_msan` shard).

`borrow_from_cache` now resolves its cache lazily, by name, on first use, and
reports `isReadOnly` while the cache is not registered. Read-only disks already
skip every attach-time write that would need the cache -- the access check in
`IDisk::startup`, directory creation and the `format_version.txt` write in
`MergeTreeData`, and cleanup in `MergeTreeData::dropAllData` are all gated on
`isReadOnly` -- so the empty table loads cleanly, the server starts, and the
leftover table can be queried (empty) and dropped. The disk becomes writable
again as soon as the cache is registered.

A fresh `CREATE` must still fail loudly when the `cache_name` does not exist, so
the `attach` flag (already threaded through `DiskFactory`) is now forwarded to
`ObjectStorageFactory::create` and the object storage creators, and the missing
cache is rejected with `BAD_ARGUMENTS` only when `attach` is false. Added
`FileCacheFactory::tryGet`, a non-throwing lookup.

CI report (Stress test, arm_msan): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100371&sha=271d32a965db39cdb9238ce1503a28619e90f47d&name_0=PR&name_1=Stress%20test%20%28arm_msan%29
PR: #100371

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ache

`test_borrow_from_cache_restart` creates a table on a `borrow_from_cache` disk
that references a cache defined by a config disk, removes that config disk, and
restarts the server so the cache is genuinely absent. It asserts the server
starts (the regression aborted startup here), the reattached table is empty,
the disk reports `is_read_only`, writes are rejected with a clear error, and the
leftover table can be dropped. It also checks that a fresh `CREATE` referencing a
non-existent cache still fails with `BAD_ARGUMENTS`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 47b883e8915 + d41b102f2a8, implementing the fix for the restart-robustness gap analyzed above (the Stress test (arm_msan)Cannot start clickhouse-server).

Fix. The borrow_from_cache object storage no longer resolves its cache eagerly at construction (which called FileCacheFactory::get and threw There is no cache by name ..., aborting metadata loading). It now resolves the cache lazily, by name, on first use, and reports isReadOnly while the cache is not registered. Read-only disks already skip every attach-time write that would otherwise need the cache — the access check in IDisk::startup, directory creation and the format_version.txt write in MergeTreeData, and cleanup in MergeTreeData::dropAllData are all gated on isReadOnly — so the (necessarily empty) reattached table loads cleanly, the server starts, and the leftover table can be queried and dropped. The disk becomes writable again once the cache is registered.

A fresh CREATE still fails loudly when cache_name does not exist: the attach flag (already threaded through DiskFactory) is now forwarded to ObjectStorageFactory::create and the object-storage creators, and the missing cache is rejected with BAD_ARGUMENTS only when attach is false. Added a non-throwing FileCacheFactory::tryGet.

Verified locally against a freshly-built server with a real stop/restart cycle:

  • With the referenced cache genuinely absent on restart (table .sql rewritten to a never-created cache_name, and separately via a removed config cache disk): the server starts, system.disks.is_read_only = 1 for the borrowed disk, SELECT count() returns 0 (empty), INSERT is rejected with TABLE_IS_PERMANENTLY_READ_ONLY, and DROP TABLE succeeds.
  • A fresh CREATE with a missing cache_name is rejected with Filesystem cache '...' not found for borrow_from_cache.
  • Existing stateless tests 04054, 04357, 04489 still pass, and the 04141 rejections (metadata_type='local', durable + memory) still fire with INVALID_CONFIG_PARAMETER.

Regression test. Added tests/integration/test_borrow_from_cache_restart, which reproduces the scenario deterministically (config-defined cache disk removed before restart_clickhouse) and asserts the server starts, the table is empty and droppable, the disk is read-only, writes are rejected, and a fresh CREATE against a missing cache fails.

…bject-storage

# Conflicts:
#	src/Disks/DiskObjectStorage/MetadataStorages/MetadataStorageFactory.cpp
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Merged master and pushed 7b0dff7c177.

The branch was ~1070 commits behind and had gone CONFLICTING, which is why the restart-fix head d41b102f2a8 never got a CI run — GitHub could not build the merge ref. The push flips the PR back to MERGEABLE and a fresh PR workflow is now running on the merged head, so the arm_msanCannot start clickhouse-server restart gap fixed by 47b883e8915/d41b102f2a8 finally gets exercised in CI.

Conflict (one file, MetadataStorageFactory.cpp): master added registerMetadataStorageFromIndexPages (metadata type web_index) while this PR added registerMetadataStorageInMemory (metadata type memory) in the same two spots — the function definitions and the registerMetadataStorages body. Resolved by keeping both function definitions and both registration calls; the includes for both MetadataStorageInMemory.h and MetadataStorageFromIndexPages.h were already reconciled by the textual merge.

Verified locally: the 9 translation units touched by the merge — BorrowFromCacheObjectStorage.cpp, MetadataStorageInMemory.cpp, MetadataStorageFactory.cpp, ObjectStorageFactory.cpp, RegisterDiskObjectStorage.cpp, DiskType.cpp, FileCacheFactory.cpp, DistributedPlanExecutor.cpp, and gtest_metadata_in_memory.cpp — all compile clean with zero errors and zero warnings under -Weverything against current-master headers (which reorganized the metadata storages into Web//Memory/ subdirectories). std::make_shared<MetadataStorageInMemory> compiling confirms the class still fully implements the current IMetadataStorage interface after the merge.

AI Review is ✅ Approve with no unresolved threads; the remaining blocker is the human review (@kssenii).

…adataStorageInMemory

`MetadataStorageInMemoryTransaction::truncateFile(path, 0)` on a missing file
threw `FILE_DOESNT_EXIST` before checking `target_size`, so the public
`IDisk::truncateFile` operation failed only for `metadata_type='memory'` /
`borrow_from_cache`. The disk-backed `MetadataStorageFromDisk::TruncateMetadataFileOperation`
treats zero-truncation of a missing file as a no-op and only errors for a
non-zero target, and the public `DiskObjectStorageTest.TruncateFileToZero`
relies on that contract.

Mirror the disk-backed path: return when the file is missing and
`target_size == 0`, and throw only when a missing file is truncated to a
non-zero size. Add `MetadataInMemoryTest.TestTruncateMissingFileToZero`
covering the no-op, the non-zero error, and zero-truncation of an existing file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 37da0b847b2:

Addressed the AI Review "Request changes" — the sole remaining unresolved thread, MetadataStorageInMemory.cpp:901 (truncateFile missing-file contract).

MetadataStorageInMemoryTransaction::truncateFile(path, 0) threw FILE_DOESNT_EXIST before checking target_size, so the public IDisk::truncateFile operation failed only for metadata_type='memory' / borrow_from_cache. The disk-backed MetadataStorageFromDisk::TruncateMetadataFileOperation treats zero-truncation of a missing file as a no-op and errors only for a non-zero target (MetadataStorageFromDiskTransactionOperations.cpp:568-574), and the public DiskObjectStorageTest.TruncateFileToZero (gtest_disk_object_storage.cpp:715) relies on truncateFile(missing, 0) succeeding.

Fix mirrors the disk-backed path: return when the file is missing and target_size == 0; throw only when a missing file is truncated to a non-zero size.

Added the focused regression test the review asked for — MetadataInMemoryTest.TestTruncateMissingFileToZero (gtest_metadata_in_memory.cpp) — covering the missing-file zero-truncation no-op, the missing-file non-zero-truncation error, and zero-truncation of an existing file (file kept, blobs dropped).

Both changed translation units compile clean with -fsyntax-only under -Weverything against current-master headers; the gtest is exercised end-to-end by the Unit tests job.

The other AI Review item — gating borrow_from_cache behind an allow_experimental_* setting — remains intentionally not done (recorded as author-dismissed on the corresponding thread); this backend is meant to ship as production and already fails closed on misconfiguration.

@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.50% +0.10%
Functions 92.60% 92.70% +0.10%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 633/878 (72.10%) · Uncovered code

Full report · Diff report

@Michicosun Michicosun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed? Temporary tables can be created with plain-rewritable pointing to the /tmp folder with local object storage.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants