Add borrow_from_cache object storage and memory metadata types#100371
Add borrow_from_cache object storage and memory metadata types#100371alexey-milovidov wants to merge 90 commits into
borrow_from_cache object storage and memory metadata types#100371Conversation
Add a new object storage type `borrow_from_cache` that allocates space
in a named filesystem cache using ephemeral `FileSegment`s. Each stored
object is backed by a cache segment held alive via `FileSegmentsHolder`;
when released, the cache reclaims the space.
Add a new metadata type `memory` that keeps all file-to-blob mappings
and directory structure entirely in memory with no persistence. On
server restart all data is lost, which is acceptable for the intended
use case of temporary tables.
Configuration example:
disk(type=object_storage,
object_storage_type='borrow_from_cache',
metadata_type='memory',
cache_name='some_cache')
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix `unlinkFile` to decrement `ref_count` when positive instead of unconditionally removing blobs. Objects are only scheduled for removal when the last reference is unlinked (`ref_count == 0`). Fix `createHardLink` to reject existing destination paths with `FILE_ALREADY_EXISTS` instead of silently overwriting. Remove consecutive blank lines flagged by style check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dicate - Remove unused `DIRECTORY_DOESNT_EXIST` error code from `MetadataStorageInMemory.cpp` (style check failure) - Remove default argument values from override declarations in `BorrowFromCacheObjectStorage` to fix `google-default-arguments` clang-tidy error - Guard `removeRecursive` against null `should_remove_objects` predicate and pass relative paths to match the predicate contract #100371 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…` storage Add a new section to the storing-data documentation describing the `borrow_from_cache` object storage type and `memory` metadata type. Update the metadata_type and object_storage_type lists in the introduction. Rename test 04053 → 04054 to resolve numbering conflict with master. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…StorageInMemory` Fix the test reference file that had the wrong cache name (04053 vs 04054). Introduce `BlobGroup` with shared ownership via `shared_ptr` so that hardlinked files correctly share blob reference counts. Previously, `createHardLink` copied `FileEntry` by value, causing ref_counts to diverge between linked files. Now both entries share the same `BlobGroup`, and objects are only queued for removal when the last reference is gone. Fix `unlinkFile` to always erase the file entry (previously it left zombie entries when ref_count > 0). Implement `getBlobsToRemove` and `recordAsRemoved` overrides so that objects queued for removal are actually processed by the removal thread. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`moveFile` silently overwrote the destination entry without decrementing the blob group's ref_count or scheduling orphaned blobs for removal. Now properly handles destination cleanup, consistent with `replaceFile`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ging Fix `moveDirectory` in `MetadataStorageInMemory` to properly handle existing destination files: decrement `ref_count` on the old blob group and queue blobs for removal when the last reference is gone, instead of silently overwriting. Add `BlobStorageLog` deletion events to `BorrowFromCacheObjectStorage` in `removeObjectIfExists` and `removeObjectsIfExist`, matching the logging behavior of other object storage backends (S3, Azure, HDFS). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eInMemory` `MergeTree` reads a part directory's mtime as the part `modification_time` (`DataPartStorageOnDiskBase::getLastModified`), and sets it on the temporary part directory before renaming it into place via `setLastModified` followed by `moveDirectory`. The in-memory metadata storage only tracked directory paths, not their timestamps: `getLastModified` returned the epoch for every directory and `setLastModified` silently ignored directories. As a result, parts stored through `borrow_from_cache` reported a `1970-01-01` `modification_time`, which feeds merge selection, TTL/age calculations and `system.parts`. Change `directories` from a `std::set` to a `std::map` keyed by path and valued by `Poco::Timestamp`. `createDirectory`/`createDirectoryRecursive` now stamp the current time on creation (matching `mkdir`), `setLastModified` updates the stored directory timestamp (and now throws when the path is neither a file nor a directory, matching disk-backed metadata), `getLastModified` returns it, and `moveDirectory` carries it across the rename. The transaction rollback journal is reworked accordingly: `dirs_inserted`/`dirs_erased` are replaced by a single `dirs_undo` map of pre-mutation timestamps, mirroring `files_undo`, so a partial commit failure restores both directory existence and the original mtime. Add `gtest_metadata_in_memory` cases covering directory mtime set/get, preservation across `moveDirectory`, the throw on a non-existent path, and rollback of a directory timestamp on a failed transaction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verify that a part stored on a `borrow_from_cache` object storage with `memory` metadata reports a real `modification_time` (not the `1970` epoch) after an insert and after a merge. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test `05017_borrow_from_cache_part_modification_time` had an anomalously high number, while the highest stateless test number on `master` is `04356`. The `check_gaps_in_tests_numbers` style check (`ci/jobs/check_style.py`) flags gaps of `100` or more between consecutive test numbers and failed with `Gap (4355, 5017) > 100`. Renumber it to `04357` (the next available number) and update the in-test `cache_name`/disk/path identifiers accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed
Verified that the 1134-commit merge did not break the new code: The only remaining unresolved review item (an |
The overview in `docs/en/operations/storing-data.md` stated unconditionally that `metadata_type` defaults to `local`. This contradicts the new `borrow_from_cache` object storage, whose default (from `MetadataStorageFactory::getCompatibilityMetadataTypeHint`) is `memory`, and which is what `04141_borrow_from_cache_metadata_check` relies on when `metadata_type` is omitted. Qualify the wording: the default is `local` for durable object storages (`s3`, `azure_blob_storage`, `local_blob_storage`, `hdfs`), `web` for the `web` object storage, and `memory` for `borrow_from_cache`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed
CI for the previous commit |
`HoldingReadBuffer` (in `BorrowFromCacheObjectStorage`) keeps the borrowed `FileSegmentsHolder` alive but, as a bare `ReadBufferFromFileDecorator`, only forwarded `getFileOffsetOfBufferEnd`. The borrowed cache file is plugged into `ReadBufferFromRemoteFSGather` exactly like the `local_blob_storage` read wrapper `ReadBufferFromFileWithLogging`: the gather calls `setReadUntilPosition` on the per-object buffer when a requested range ends inside an object, and uses `supportsReadAt` / `readBigAt` for positional reads. Because the base classes default `setReadUntilPosition` / `setReadUntilEnd` to no-ops and `supportsRightBoundedReads` / `supportsReadAt` to `false`, the decorator was not transparent: positional `readBigAt` reads of the borrowed object were disabled (the inner `pread` buffer supports them), and the right-bound contract the gather relies on was not forwarded. Forward the same set of methods as `ReadBufferFromFileWithLogging` (`setReadUntilPosition`, `setReadUntilEnd`, `supportsRightBoundedReads`, `supportsReadAt`, `readBigAt`, `isSeekCheap`, `isContentCached`) so the wrapper is transparent, while still holding the segment alive. Add `04489_borrow_from_cache_bounded_read` covering a bounded sub-range read of a borrowed object with the synchronous read path forced (`remote_filesystem_read_method = 'read'`), so no `AsynchronousBoundedReadBuffer` trims an over-read above the gather. Addresses the AI Review "Request changes" on #100371 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed
The changed TU compiles cleanly with |
`DiskObjectStorage::supportZeroCopyReplication` previously returned true for every metadata type except `Keeper`, which advertised zero-copy support for the new `memory` metadata storage used by `borrow_from_cache`. That metadata is in-memory and ephemeral: `MetadataStorageInMemory` inherits `IMetadataStorage::getSerializedMetadata`, which throws `NOT_IMPLEMENTED`, and for `borrow_from_cache` the bytes live in node-local cache segments that other replicas cannot read. With `allow_remote_fs_zero_copy_replication=1`, the replicated code path could therefore be selected for such a disk even though it cannot satisfy the zero-copy contract. Fail closed by also excluding `MetadataStorageType::Memory`. The zero-copy selection in `StorageReplicatedMergeTree::getZookeeperZeroCopyLockPaths` simply skips disks that do not support zero-copy, so this falls back to ordinary replication rather than erroring late. This hardens the latent contract inconsistency raised in review for #100371. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`04490_borrow_from_cache_replicated_zero_copy` covers the contract hardened in the previous commit: a `ReplicatedMergeTree` on a `borrow_from_cache` disk with `memory` metadata and `allow_remote_fs_zero_copy_replication = 1` must fall back to ordinary replication rather than selecting the zero-copy path, since `DiskObjectStorage::supportZeroCopyReplication` now returns false for `memory` metadata. The table must work end-to-end (insert, merge, read) instead of erroring late. Validated against a local server with embedded Keeper: the single active part after `OPTIMIZE ... FINAL` lands on the borrowed disk and no `zero_copy` ZooKeeper root is created. The absence of zero-copy lock nodes is not asserted in the test itself because that ZooKeeper root is shared server-wide and other tests running in parallel can create it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed
This addresses the zero-copy item of the AI Review "Request changes". The other item — gating |
|
Root cause of the The sole CI failure ( What happens. The stress harness hard-restarts the server while stateless tests are mid-flight. If the server is killed after The named cache is registered by a separate custom disk ( This is inherent to a persistent table on a
Three points that require the cache on reattach (all currently throw when it is absent):
Suggested fix.
I did not push a patch: this worker's checkout is in an unrelated dirty state and a full rebuild is not feasible in a single session, so the read-only-vs-degraded choice and a local restart check (create a borrow table, Everything else is green and the AI review is ✅ Approve with no unresolved threads, so the remaining blockers are this restart fix and @kssenii's review. |
A persisted table on a `borrow_from_cache` disk is reattached on restart like any other table, but its data lives only in node-local ephemeral cache segments and does not survive a restart, so the reattached table is necessarily empty. The named cache is registered by a *separate* disk; inline (DDL) disks are materialized lazily and in an unspecified order, so the borrow disk could be reconstructed before the cache-creating disk had registered the cache (or the cache may have been dropped). The object storage resolved the cache eagerly at construction via `FileCacheFactory::get`, which throws `There is no cache by name ...`, turning that ordering into a fatal metadata-load failure: the server could not start. The stress harness, which hard-restarts the server mid-test, hit this on `04141_borrow_from_cache_metadata_check` (one `arm_msan` shard). `borrow_from_cache` now resolves its cache lazily, by name, on first use, and reports `isReadOnly` while the cache is not registered. Read-only disks already skip every attach-time write that would need the cache -- the access check in `IDisk::startup`, directory creation and the `format_version.txt` write in `MergeTreeData`, and cleanup in `MergeTreeData::dropAllData` are all gated on `isReadOnly` -- so the empty table loads cleanly, the server starts, and the leftover table can be queried (empty) and dropped. The disk becomes writable again as soon as the cache is registered. A fresh `CREATE` must still fail loudly when the `cache_name` does not exist, so the `attach` flag (already threaded through `DiskFactory`) is now forwarded to `ObjectStorageFactory::create` and the object storage creators, and the missing cache is rejected with `BAD_ARGUMENTS` only when `attach` is false. Added `FileCacheFactory::tryGet`, a non-throwing lookup. CI report (Stress test, arm_msan): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100371&sha=271d32a965db39cdb9238ce1503a28619e90f47d&name_0=PR&name_1=Stress%20test%20%28arm_msan%29 PR: #100371 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ache `test_borrow_from_cache_restart` creates a table on a `borrow_from_cache` disk that references a cache defined by a config disk, removes that config disk, and restarts the server so the cache is genuinely absent. It asserts the server starts (the regression aborted startup here), the reattached table is empty, the disk reports `is_read_only`, writes are rejected with a clear error, and the leftover table can be dropped. It also checks that a fresh `CREATE` referencing a non-existent cache still fails with `BAD_ARGUMENTS`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Fix. The A fresh Verified locally against a freshly-built server with a real stop/restart cycle:
Regression test. Added |
…bject-storage # Conflicts: # src/Disks/DiskObjectStorage/MetadataStorages/MetadataStorageFactory.cpp
|
Merged The branch was ~1070 commits behind and had gone Conflict (one file, Verified locally: the 9 translation units touched by the merge — AI Review is ✅ Approve with no unresolved threads; the remaining blocker is the human review (@kssenii). |
…adataStorageInMemory `MetadataStorageInMemoryTransaction::truncateFile(path, 0)` on a missing file threw `FILE_DOESNT_EXIST` before checking `target_size`, so the public `IDisk::truncateFile` operation failed only for `metadata_type='memory'` / `borrow_from_cache`. The disk-backed `MetadataStorageFromDisk::TruncateMetadataFileOperation` treats zero-truncation of a missing file as a no-op and only errors for a non-zero target, and the public `DiskObjectStorageTest.TruncateFileToZero` relies on that contract. Mirror the disk-backed path: return when the file is missing and `target_size == 0`, and throw only when a missing file is truncated to a non-zero size. Add `MetadataInMemoryTest.TestTruncateMissingFileToZero` covering the no-op, the non-zero error, and zero-truncation of an existing file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Addressed the AI Review "Request changes" — the sole remaining unresolved thread,
Fix mirrors the disk-backed path: Added the focused regression test the review asked for — Both changed translation units compile clean with The other AI Review item — gating |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 633/878 (72.10%) · Uncovered code |
Michicosun
left a comment
There was a problem hiding this comment.
Why is it needed? Temporary tables can be created with plain-rewritable pointing to the /tmp folder with local object storage.

Add a new object storage type
borrow_from_cachethat allocates space in a named filesystem cache using ephemeralFileSegments. Each stored object is backed by a cache segment held alive viaFileSegmentsHolder; when released, the cache reclaims the space.Add a new metadata type
memorythat keeps all file-to-blob mappings and directory structure entirely in memory with no persistence. On server restart all data is lost, which is acceptable for the intended use case of temporary tables.Configuration example:
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
A new object storage type and metadata type suitable for temporary tables. Add a new object storage type
borrow_from_cachethat allocates space in a named filesystem cache and holds it from eviction. Add a new metadata storage typememorythat keeps the mapping in memory. For example, this could allow creating temporary tables with custom engines in the Cloud without using S3.