Open-source packed part storage#108118
Conversation
Introduce `DataPartStorageOnDiskPacked`, a data part storage that keeps all of a part's files inside a single archive instead of a file per stream. Add the supporting `IDataPartStorage` hooks (`isPackedPartStorage`, `setPreferredFileOrder`, `getProjectionNoInitialize`) and `PackedFilesReader::getFileOffsetAndSize`. Co-authored-by: Raúl Marín <raul.marin@clickhouse.com> Co-authored-by: Azat Khuzhin <a3at.mail@gmail.com> Co-authored-by: Antonio Andelic <antonio2368@users.noreply.github.com> Co-authored-by: Konstantin Bogdanov <thevar1able@users.noreply.github.com> Co-authored-by: alesapin <alesapin@clickhouse.com> Co-authored-by: Anton Popov <anton@clickhouse.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Select packed storage in `MergeTreeData::choosePartFormat` based on the `min_bytes_for_full_part_storage`, `min_rows_for_full_part_storage` and `min_level_for_full_part_storage` settings, build it through `MergeTreeDataPartBuilder`, and compute the preferred on-disk file order (`getPreferredFileOrder`) so the archive is laid out for efficient reads. Thread it through merges (`MergeTask`), inserts (`MergeTreeDataWriter`) and mutations (`MutateTask`). Co-authored-by: Raúl Marín <raul.marin@clickhouse.com> Co-authored-by: Anton Popov <anton@clickhouse.com> Co-authored-by: Alexey Milovidov <milovidov@clickhouse.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a `part_storage_type` column reporting whether a part uses `Full` or `Packed` data part storage. Co-authored-by: Raúl Marín <raul.marin@clickhouse.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `min_bytes_for_full_part_storage`, `min_level_for_full_part_storage` and `min_rows_for_full_part_storage` settings now take effect outside ClickHouse Cloud, so drop the "Only available in ClickHouse Cloud" note from their descriptions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stateless tests covering packed selection, data correctness, refcounts and column modification time, plus an integration test exercising packed parts on local and S3 disks. Co-authored-by: Alexander Gololobov <davenger@clickhouse.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tidy the constructor parameter lists and the `build` / `getPartStorageByType` call sites in `MergeTreeDataPartBuilder` after enabling packed storage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cached `skp_idx.packed` reader was reset on rename/changeRootPath while a concurrent query could be reading its index, causing a data race and use-after-free. Make `PackedFilesReader` a path-independent, shared, immutable index: `readFile` now takes the archive's current disk and path from the caller, so relocation no longer needs to free the reader. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bring the open-sourcing branch up to current master, including the packed skip-indices data-race fix and related follow-ups already merged upstream.
`cloneAndLoadDataPart` writes `metadata_version.txt` after freeze when `metadata_version_to_write` is set (e.g. `REPLACE`/`MOVE`/`ATTACH PARTITION` into a `ReplicatedMergeTree` table). Packed part storage needs an active transaction for `writeFile`, so the bare write threw `NOT_INITIALIZED`. Wrap the write in a packed transaction and commit it before loading the part. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop `filecmp` and `helpers.test_tools.TSV`, which were imported but unused (flagged by the `ruff` style check). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| auto projection_storage = getProjection(file); | ||
| auto params_copy = params; | ||
| params_copy.external_transaction = dest_storage->transaction; | ||
| projection_storage->freeze(dest_storage->getRelativePath(), file, read_settings, write_settings, save_metadata_callback, params); |
There was a problem hiding this comment.
params_copy is the one that carries dest_storage->transaction, but this call passes the original params. In the branch above where packed freeze rewrites data.packed, the parent storage starts a transaction and commits it only at the end; the projections should join that same transaction. As written, projection freeze runs outside the parent transaction (and may commit independently), so a failure before the parent commitTransaction can leave projection files committed without the parent archive. Please pass params_copy here.
| else | ||
| { | ||
| /// Transactions doesn't support copy between different metadata storages, so doing it manually | ||
| auto write_buf = transaction->writeFile(dest_storage->getRelativeDataPath(), DBMS_DEFAULT_BUFFER_SIZE, WriteMode::Rewrite, write_settings); |
There was a problem hiding this comment.
This is the source storage's member transaction, not the destination/external transaction. In the normal freezeRemote path the source storage has no active transaction, so this branch dereferences a null transaction whenever params.external_transaction is set and the source/destination metadata storages differ. Use the destination transaction (params.external_transaction / dest_storage->transaction) if the operation can be represented transactionally, or reject this cross-metadata-storage case instead of applying a non-transactional fallback.
| auto projection_storage = getProjection(file); | ||
| auto params_copy = params; | ||
| params_copy.external_transaction = dest_storage->transaction; | ||
| projection_storage->freeze(dest_storage->getRelativePath(), file, read_settings, write_settings, save_metadata_callback, params); |
There was a problem hiding this comment.
This recurses with local freeze on the source projection storage while freezeRemote is cloning the parent part to dst_disk. That writes the projection under dest_storage->getRelativePath() on the source disk, not on the destination disk, and the prepared params_copy is ignored as well. A packed part with projections cloned to another disk can end up with data.packed on dst_disk but missing projection data there. This should recurse through freezeRemote(..., dst_disk, ..., params_copy).
`system.parts` gained the `part_storage_type` column, so the `SHOW CREATE` schema snapshot in 02117_show_create_table_system needs the new column. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| throw Exception(ErrorCodes::NOT_INITIALIZED, "Cannot write file {} because writer is not initialized", name); | ||
| } | ||
|
|
||
| void DataPartStorageOnDiskPacked::createFile(const String &) |
There was a problem hiding this comment.
VersionMetadataOnDisk::storeInfoToDataPartStorage calls createFile for txn_version.txt.tmp specifically to fail if a stale tmp file already exists before writing transaction metadata. This override silently succeeds even for separately written side files such as txn_version.txt.tmp, so packed parts overwrite the stale tmp file and lose that rollback/stale-write guard. Please delegate createFile for separately written files, and either create an empty packed entry through the writer or throw for non-separate names instead of treating the call as success.
|
📊 Cloud Performance Report ✅ AI verdict: This PR is a storage-mechanics refactor: it makes packed data-part storage available outside cloud builds, adjusts on-disk file ordering for parts, and tweaks metadata reads during part load/fetch. None of that touches the query execution path that ClickBench SELECTs run, so the three flagged improvements — Q4 (-18%), Q15 (-17%), and Q28 (-5.78%) — cannot plausibly be caused by this change. They move together and are consistent with run-to-run/instance variance, so all three were downgraded to not-sure. Net read: no real per-query performance impact from this PR. clickbenchFlagged queries (3 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🟢 No significant changes Debug info
|
02497 measured S3PutObject from query_log, but the part upload runs on the object-storage writer pool, so the event is not attributed to the INSERT query in the open-source build (it stays 0). Count the part's physical objects via system.remote_data_paths instead: a packed part is a single archive object, far fewer than full storage's one-object-per-file. Rename the test accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the S3PutObject query_log check (the upload runs on the object-storage writer pool, so the event is never attributed to the INSERT query in the open-source build) with a count of the part's distinct physical objects: a packed part is a single archive object, far fewer than full storage's one-object-per-file. Use the s3_cache policy, which exists in both repos. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| chassert(writer); | ||
|
|
||
| /// Data is unchanged. | ||
| if (!writer->hasModifiedFiles()) |
There was a problem hiding this comment.
finalizeWriter leaves writer engaged when the transaction only touched separately-written files, or otherwise made no packed-archive changes. commitTransaction then commits and clears transaction, but the still-engaged writer makes the storage look permanently mid-write: the next beginTransaction throws Uncommitted transaction already exists, and later writeFile calls can buffer into a writer that no longer has a disk transaction to flush through. Please reset the writer on this no-op path as well, while leaving the existing reader unchanged.
| if (isWrittenSeparately(name)) | ||
| actual_file_names.insert(name); | ||
| else | ||
| actual_file_names.insert(DATA_FILE_NAME); |
There was a problem hiding this comment.
Collapsing all checksummed logical files to data.packed also removes the replication fetch checksum check for packed parts. DataPartsExchange::sendPartFromDisk and DataPartsExchange::downloadPartToDisk only compare part->checksums with the bytes sent over the wire for full storage, so for packed storage we only prove that the receiver got the same data.packed bytes the source served. We no longer prove those bytes match the source part's loaded logical checksums.txt/part->checksums, so a source whose archive changed after load can serve a self-consistent but wrong packed archive under this part name. Please add a packed-aware validation before serving or finalizing a fetched packed part, or carry the logical checksums through the exchange so packed fetches keep the same corruption check as full parts.
Bring the packed archive tooling to public builds: the `packed-io` clickhouse-disks subcommand and the standalone `clickhouse-packed-io` program, which list, extract and create ClickHouse packed-format archives (`data.packed`). The implementation (`PackedFilesOperations`, `CommandPackedIO`, `programs/packed-io`) only depends on the already public `PackedFilesReader`/`PackedFilesWriter`, so un-gate it from `#if CLICKHOUSE_CLOUD`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| ("extract,x", "extract packed archive") | ||
| ("create,c", "create packed archive") | ||
| ("input,i", po::value<String>()->required(), "input file/directory") | ||
| ("output,o", po::value<String>()->required(), "output file/directory") |
There was a problem hiding this comment.
The standalone clickhouse packed-io command advertises list as taking only --input, but po::notify runs before the command dispatch and this option is marked required, so clickhouse packed-io --list -i data.packed fails with missing --output. The same path also reads options.at("file-order") unconditionally for create, so create without the documented optional --file-order throws. Please make --output optional at the parser level and validate it only for extract/create, and read file-order with a default like the clickhouse disks packed-io command does.
| if (has_shared_transaction) | ||
| throw Exception(ErrorCodes::LOGICAL_ERROR, "Cannot commit shared transaction"); | ||
|
|
||
| if (!is_precommitted) |
There was a problem hiding this comment.
commitTransaction stops being retryable if the disk commit throws after finalizeWriter. In that path finalizeWriter has already reset writer, but is_precommitted is still false, so the storage keeps transaction set while later commitTransaction (and Cloud undoTransaction) report "There is no uncommitted transaction" and cannot commit or roll the disk transaction back. Please mirror the explicit precommitTransaction state here: after a successful finalizeWriter, mark the transaction precommitted before calling transaction->commit, and keep the reader/transaction state consistent for the exception path.
| for (const auto & file_name : files) | ||
| { | ||
| auto in = reader.readFile(disk_in, input_file, file_name, {}, {}); | ||
| auto out = disk_out->writeFile(output_dir / fs::path(file_name)); |
There was a problem hiding this comment.
extractPacked promises to extract into output_dir, but file_name comes from the archive index and is used as a path without validation. A crafted archive entry such as ../metadata.xml or an absolute path can escape output_dir and overwrite files reachable by this process. Please reject absolute paths and parent-directory components (similar to the fetch-side pathStartsWith guard) before opening the output file.
These tests assert behavior that is specific to full (non-packed) part storage. With packed storage now active in public builds, the CI random MergeTree settings can pick a large `min_bytes_for_full_part_storage`, switch the part to packed and break them. Pin the setting to 0 in the table definitions, matching the existing fix in the private repository. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # programs/disks/CMakeLists.txt
Revert 02497 to the original private form that counts S3 PutObject requests (a packed part is a single archive object, so writing it issues one PUT), undoing the earlier object-count rewrite. Adapt it for the open-source build: add `storage_policy = 's3_cache'` because the public default disk is local (in private the default is object storage), and pin `async_insert = 0` so the part is written synchronously and its S3PutObject events are attributed to the INSERT instead of a background flush (async_insert now defaults to true). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test asserts CreatedReadBufferMMap=2, but the randomized min_bytes_for_full_part_storage can switch the part to packed storage, whose single data.packed archive is read through one buffer (no per-file mmap), so the count drops to 0. Pin the setting to 0 to keep full storage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Same as 01343: randomized min_bytes_for_full_part_storage can pack the part into a single data.packed archive, so the mmap buffer count drops to 0 instead of the asserted 4. Pin the setting to 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test deletes the part directory on local fs and runs CHECK TABLE. With a randomized large min_bytes_for_full_part_storage the part is packed into a single data.packed archive, so the check fails to open it (FILE_DOESNT_EXIST on stderr). Pin the setting to 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test checks the vertical merge of a 300-column part stays under 100MB. A randomized large min_bytes_for_full_part_storage packs the part into a single data.packed archive, inflating merge memory above the threshold. Pin it to 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test checks per-column column_modification_time in system.parts_columns. A randomized large min_bytes_for_full_part_storage packs all columns into one data.packed archive, so per-column file timestamps no longer differ as expected. Pin it to 0 on both table variants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test verifies disk reads go through ThreadPoolRead with O_DIRECT. A randomized large min_bytes_for_full_part_storage packs the part into a single data.packed archive, read via a different path, so the expected thread rows don't appear. Pin it to 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test cat's columns_substreams.txt from the part directory. A randomized large min_bytes_for_full_part_storage packs the part into a single data.packed archive, so that file does not exist on disk and cat fails. Pin it to 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test edits and removes raw part files (skp_idx_idx.mrk4, checksums.txt). A randomized large min_bytes_for_full_part_storage packs the part into a single data.packed archive, so those files are not on disk and the rm/printf fail. Pin it to 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous commit restored the bare private test but did not include the public adaptation. Add storage_policy='s3_cache' (the public default disk is local, so the part needs an S3-backed disk to issue PUTs) and async_insert=0 on the INSERTs (async_insert now defaults to true; without this the part write runs in a background flush and S3PutObject is not attributed to the query). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Removing a part from system.parts and removing its zero-copy refcount lock from Keeper are separate async steps of part cleanup. wait_for_delete_inactive_parts only waits for the former, so the system.zookeeper SELECT could observe the old parts' zero-copy lock nodes (confirmed in the server log: the SELECT ran ~0.4s before the locks were removed). Wait until the zero-copy lock count matches the number of active parts before asserting. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 565/864 (65.39%) · Uncovered code |

This open-sources packed part storage, a well established feature in our private fork, as part of our ongoing commitment to open source.
Packed storage keeps all of a data part's files inside a single archive (
data.packed) instead of one file per stream. For small parts this significantly reduces the number of object-storage and Keeper requests, both on write and on parts load.Notes on this first step:
min_bytes_for_full_part_storage,min_rows_for_full_part_storageandmin_level_for_full_part_storagethresholds default to0, which means "always use Full storage". This matches the private default, where packed storage is turned on via configuration rather than via the setting defaults. It keeps the change backward compatible (forcompatibilitypurposes there is no default behavior change). Enabling it by default in the settings, for both public and private, can come in a future release, or after merging support.allow_experimental_*gate because this is not an experimental feature: it is a well established storage format in the private fork. The only user-visible consequence of turning it on is a compatibility one, called out in the changelog entry below: once a table writesdata.packedparts, older server versions cannot read them, so downgrading is no longer possible.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Added "packed" data part storage for
MergeTreetables, which stores all of a part's files in a single archive (data.packed) instead of a file per stream. It is controlled by themin_bytes_for_full_part_storage,min_rows_for_full_part_storageandmin_level_for_full_part_storagesettings and is disabled by default. Compatibility note: once a table writes packed parts, older server versions cannot read them, so enabling this format prevents downgrading the server.