Open-source packed part storage by Algunenano · Pull Request #108118 · ClickHouse/ClickHouse · GitHub
Skip to content

Open-source packed part storage#108118

Open
Algunenano wants to merge 27 commits into
ClickHouse:masterfrom
Algunenano:packed-parts-oss
Open

Open-source packed part storage#108118
Algunenano wants to merge 27 commits into
ClickHouse:masterfrom
Algunenano:packed-parts-oss

Conversation

@Algunenano

@Algunenano Algunenano commented Jun 22, 2026

Copy link
Copy Markdown
Member

This open-sources packed part storage, a well established feature in our private fork, as part of our ongoing commitment to open source.

Packed storage keeps all of a data part's files inside a single archive (data.packed) instead of one file per stream. For small parts this significantly reduces the number of object-storage and Keeper requests, both on write and on parts load.

Notes on this first step:

  • The feature is disabled by default: the min_bytes_for_full_part_storage, min_rows_for_full_part_storage and min_level_for_full_part_storage thresholds default to 0, which means "always use Full storage". This matches the private default, where packed storage is turned on via configuration rather than via the setting defaults. It keeps the change backward compatible (for compatibility purposes there is no default behavior change). Enabling it by default in the settings, for both public and private, can come in a future release, or after merging support.
  • There is no allow_experimental_* gate because this is not an experimental feature: it is a well established storage format in the private fork. The only user-visible consequence of turning it on is a compatibility one, called out in the changelog entry below: once a table writes data.packed parts, older server versions cannot read them, so downgrading is no longer possible.
  • The settings are not randomized in CI yet. That will be done in follow-up pull requests.
  • The intention is to merge this as-is first, so the public and private trees match. Follow-ups will then deal with AI-review findings, bugs surfaced by stress tests, setting randomization, and enabling the feature by default.

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added "packed" data part storage for MergeTree tables, which stores all of a part's files in a single archive (data.packed) instead of a file per stream. It is controlled by the min_bytes_for_full_part_storage, min_rows_for_full_part_storage and min_level_for_full_part_storage settings and is disabled by default. Compatibility note: once a table writes packed parts, older server versions cannot read them, so enabling this format prevents downgrading the server.

CheSema and others added 9 commits June 19, 2026 10:44
Introduce `DataPartStorageOnDiskPacked`, a data part storage that keeps
all of a part's files inside a single archive instead of a file per
stream. Add the supporting `IDataPartStorage` hooks (`isPackedPartStorage`,
`setPreferredFileOrder`, `getProjectionNoInitialize`) and
`PackedFilesReader::getFileOffsetAndSize`.

Co-authored-by: Raúl Marín <raul.marin@clickhouse.com>
Co-authored-by: Azat Khuzhin <a3at.mail@gmail.com>
Co-authored-by: Antonio Andelic <antonio2368@users.noreply.github.com>
Co-authored-by: Konstantin Bogdanov <thevar1able@users.noreply.github.com>
Co-authored-by: alesapin <alesapin@clickhouse.com>
Co-authored-by: Anton Popov <anton@clickhouse.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Select packed storage in `MergeTreeData::choosePartFormat` based on the
`min_bytes_for_full_part_storage`, `min_rows_for_full_part_storage` and
`min_level_for_full_part_storage` settings, build it through
`MergeTreeDataPartBuilder`, and compute the preferred on-disk file order
(`getPreferredFileOrder`) so the archive is laid out for efficient reads.
Thread it through merges (`MergeTask`), inserts (`MergeTreeDataWriter`) and
mutations (`MutateTask`).

Co-authored-by: Raúl Marín <raul.marin@clickhouse.com>
Co-authored-by: Anton Popov <anton@clickhouse.com>
Co-authored-by: Alexey Milovidov <milovidov@clickhouse.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a `part_storage_type` column reporting whether a part uses `Full` or
`Packed` data part storage.

Co-authored-by: Raúl Marín <raul.marin@clickhouse.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `min_bytes_for_full_part_storage`, `min_level_for_full_part_storage`
and `min_rows_for_full_part_storage` settings now take effect outside
ClickHouse Cloud, so drop the "Only available in ClickHouse Cloud" note
from their descriptions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stateless tests covering packed selection, data correctness, refcounts and
column modification time, plus an integration test exercising packed parts
on local and S3 disks.

Co-authored-by: Alexander Gololobov <davenger@clickhouse.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tidy the constructor parameter lists and the `build` / `getPartStorageByType`
call sites in `MergeTreeDataPartBuilder` after enabling packed storage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cached `skp_idx.packed` reader was reset on rename/changeRootPath while a
concurrent query could be reading its index, causing a data race and
use-after-free. Make `PackedFilesReader` a path-independent, shared, immutable
index: `readFile` now takes the archive's current disk and path from the
caller, so relocation no longer needs to free the reader.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bring the open-sourcing branch up to current master, including the packed
skip-indices data-race fix and related follow-ups already merged upstream.
`cloneAndLoadDataPart` writes `metadata_version.txt` after freeze when
`metadata_version_to_write` is set (e.g. `REPLACE`/`MOVE`/`ATTACH PARTITION`
into a `ReplicatedMergeTree` table). Packed part storage needs an active
transaction for `writeFile`, so the bare write threw `NOT_INITIALIZED`. Wrap
the write in a packed transaction and commit it before loading the part.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Jun 22, 2026
Drop `filecmp` and `helpers.test_tools.TSV`, which were imported but unused
(flagged by the `ruff` style check).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
auto projection_storage = getProjection(file);
auto params_copy = params;
params_copy.external_transaction = dest_storage->transaction;
projection_storage->freeze(dest_storage->getRelativePath(), file, read_settings, write_settings, save_metadata_callback, params);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

params_copy is the one that carries dest_storage->transaction, but this call passes the original params. In the branch above where packed freeze rewrites data.packed, the parent storage starts a transaction and commits it only at the end; the projections should join that same transaction. As written, projection freeze runs outside the parent transaction (and may commit independently), so a failure before the parent commitTransaction can leave projection files committed without the parent archive. Please pass params_copy here.

else
{
/// Transactions doesn't support copy between different metadata storages, so doing it manually
auto write_buf = transaction->writeFile(dest_storage->getRelativeDataPath(), DBMS_DEFAULT_BUFFER_SIZE, WriteMode::Rewrite, write_settings);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the source storage's member transaction, not the destination/external transaction. In the normal freezeRemote path the source storage has no active transaction, so this branch dereferences a null transaction whenever params.external_transaction is set and the source/destination metadata storages differ. Use the destination transaction (params.external_transaction / dest_storage->transaction) if the operation can be represented transactionally, or reject this cross-metadata-storage case instead of applying a non-transactional fallback.

auto projection_storage = getProjection(file);
auto params_copy = params;
params_copy.external_transaction = dest_storage->transaction;
projection_storage->freeze(dest_storage->getRelativePath(), file, read_settings, write_settings, save_metadata_callback, params);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recurses with local freeze on the source projection storage while freezeRemote is cloning the parent part to dst_disk. That writes the projection under dest_storage->getRelativePath() on the source disk, not on the destination disk, and the prepared params_copy is ignored as well. A packed part with projections cloned to another disk can end up with data.packed on dst_disk but missing projection data there. This should recurse through freezeRemote(..., dst_disk, ..., params_copy).

`system.parts` gained the `part_storage_type` column, so the `SHOW CREATE`
schema snapshot in 02117_show_create_table_system needs the new column.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
throw Exception(ErrorCodes::NOT_INITIALIZED, "Cannot write file {} because writer is not initialized", name);
}

void DataPartStorageOnDiskPacked::createFile(const String &)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VersionMetadataOnDisk::storeInfoToDataPartStorage calls createFile for txn_version.txt.tmp specifically to fail if a stale tmp file already exists before writing transaction metadata. This override silently succeeds even for separately written side files such as txn_version.txt.tmp, so packed parts overwrite the stale tmp file and lose that rollback/stale-write guard. Please delegate createFile for separately written files, and either create an empty packed entry through the writer or throw for non-separate names instead of treating the call as success.

@clickhouse-gh

clickhouse-gh Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR is a storage-mechanics refactor: it makes packed data-part storage available outside cloud builds, adjusts on-disk file ordering for parts, and tweaks metadata reads during part load/fetch. None of that touches the query execution path that ClickBench SELECTs run, so the three flagged improvements — Q4 (-18%), Q15 (-17%), and Q28 (-5.78%) — cannot plausibly be caused by this change. They move together and are consistent with run-to-run/instance variance, so all three were downgraded to not-sure. Net read: no real per-query performance impact from this PR.

clickbench

⚠️ 3 inconclusive

Flagged queries (3 of 43)
Query Verdict Baseline median (ms) PR median (ms) Change q-value Hint
⚠️ 4 not_sure 262 215 -17.9% <0.0001 io: PR only touches packed part storage layout/loading, not the SELECT exec path; -18% is run-to-run variance
⚠️ 15 not_sure 245 203 -17.1% <0.0001 io: Same off-path storage change; query exec untouched and -17% mirrors Q4 — instance-level variance
⚠️ 28 not_sure 6671 6286 -5.8% 0.0003 io: 6s query; PR's part-storage/loading changes don't reach the exec path — -5.78% is noise

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: d2abb806-40dd-43fc-83cc-d195ea456aa0
  • MIRAI run: 705bcfab-d972-4755-9d8a-d729146f5c51
  • PR check IDs:
    • clickbench_495800_1782826966
    • clickbench_495806_1782826966
    • clickbench_495817_1782826966
    • tpch_adapted_1_official_495824_1782826966
    • tpch_adapted_1_official_495835_1782826966
    • tpch_adapted_1_official_495848_1782826966

Algunenano and others added 2 commits June 22, 2026 17:58
02497 measured S3PutObject from query_log, but the part upload runs on the
object-storage writer pool, so the event is not attributed to the INSERT query
in the open-source build (it stays 0). Count the part's physical objects via
system.remote_data_paths instead: a packed part is a single archive object,
far fewer than full storage's one-object-per-file. Rename the test accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the S3PutObject query_log check (the upload runs on the object-storage
writer pool, so the event is never attributed to the INSERT query in the
open-source build) with a count of the part's distinct physical objects: a
packed part is a single archive object, far fewer than full storage's
one-object-per-file. Use the s3_cache policy, which exists in both repos.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
chassert(writer);

/// Data is unchanged.
if (!writer->hasModifiedFiles())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finalizeWriter leaves writer engaged when the transaction only touched separately-written files, or otherwise made no packed-archive changes. commitTransaction then commits and clears transaction, but the still-engaged writer makes the storage look permanently mid-write: the next beginTransaction throws Uncommitted transaction already exists, and later writeFile calls can buffer into a writer that no longer has a disk transaction to flush through. Please reset the writer on this no-op path as well, while leaving the existing reader unchanged.

if (isWrittenSeparately(name))
actual_file_names.insert(name);
else
actual_file_names.insert(DATA_FILE_NAME);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collapsing all checksummed logical files to data.packed also removes the replication fetch checksum check for packed parts. DataPartsExchange::sendPartFromDisk and DataPartsExchange::downloadPartToDisk only compare part->checksums with the bytes sent over the wire for full storage, so for packed storage we only prove that the receiver got the same data.packed bytes the source served. We no longer prove those bytes match the source part's loaded logical checksums.txt/part->checksums, so a source whose archive changed after load can serve a self-consistent but wrong packed archive under this part name. Please add a packed-aware validation before serving or finalizing a fetched packed part, or carry the logical checksums through the exchange so packed fetches keep the same corruption check as full parts.

Bring the packed archive tooling to public builds: the `packed-io`
clickhouse-disks subcommand and the standalone `clickhouse-packed-io`
program, which list, extract and create ClickHouse packed-format
archives (`data.packed`). The implementation (`PackedFilesOperations`,
`CommandPackedIO`, `programs/packed-io`) only depends on the already
public `PackedFilesReader`/`PackedFilesWriter`, so un-gate it from
`#if CLICKHOUSE_CLOUD`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
("extract,x", "extract packed archive")
("create,c", "create packed archive")
("input,i", po::value<String>()->required(), "input file/directory")
("output,o", po::value<String>()->required(), "output file/directory")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standalone clickhouse packed-io command advertises list as taking only --input, but po::notify runs before the command dispatch and this option is marked required, so clickhouse packed-io --list -i data.packed fails with missing --output. The same path also reads options.at("file-order") unconditionally for create, so create without the documented optional --file-order throws. Please make --output optional at the parser level and validate it only for extract/create, and read file-order with a default like the clickhouse disks packed-io command does.

if (has_shared_transaction)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Cannot commit shared transaction");

if (!is_precommitted)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commitTransaction stops being retryable if the disk commit throws after finalizeWriter. In that path finalizeWriter has already reset writer, but is_precommitted is still false, so the storage keeps transaction set while later commitTransaction (and Cloud undoTransaction) report "There is no uncommitted transaction" and cannot commit or roll the disk transaction back. Please mirror the explicit precommitTransaction state here: after a successful finalizeWriter, mark the transaction precommitted before calling transaction->commit, and keep the reader/transaction state consistent for the exception path.

for (const auto & file_name : files)
{
auto in = reader.readFile(disk_in, input_file, file_name, {}, {});
auto out = disk_out->writeFile(output_dir / fs::path(file_name));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractPacked promises to extract into output_dir, but file_name comes from the archive index and is used as a path without validation. A crafted archive entry such as ../metadata.xml or an absolute path can escape output_dir and overwrite files reachable by this process. Please reject absolute paths and parent-directory components (similar to the fetch-side pathStartsWith guard) before opening the output file.

Comment thread src/Storages/MergeTree/MergeTreeData.cpp
These tests assert behavior that is specific to full (non-packed) part
storage. With packed storage now active in public builds, the CI random
MergeTree settings can pick a large `min_bytes_for_full_part_storage`,
switch the part to packed and break them. Pin the setting to 0 in the
table definitions, matching the existing fix in the private repository.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Algunenano and others added 11 commits June 29, 2026 17:21
# Conflicts:
#	programs/disks/CMakeLists.txt
Revert 02497 to the original private form that counts S3 PutObject requests
(a packed part is a single archive object, so writing it issues one PUT),
undoing the earlier object-count rewrite. Adapt it for the open-source build:
add `storage_policy = 's3_cache'` because the public default disk is local
(in private the default is object storage), and pin `async_insert = 0` so the
part is written synchronously and its S3PutObject events are attributed to the
INSERT instead of a background flush (async_insert now defaults to true).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test asserts CreatedReadBufferMMap=2, but the randomized
min_bytes_for_full_part_storage can switch the part to packed storage, whose
single data.packed archive is read through one buffer (no per-file mmap), so
the count drops to 0. Pin the setting to 0 to keep full storage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Same as 01343: randomized min_bytes_for_full_part_storage can pack the part
into a single data.packed archive, so the mmap buffer count drops to 0 instead
of the asserted 4. Pin the setting to 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test deletes the part directory on local fs and runs CHECK TABLE. With a
randomized large min_bytes_for_full_part_storage the part is packed into a
single data.packed archive, so the check fails to open it (FILE_DOESNT_EXIST on
stderr). Pin the setting to 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test checks the vertical merge of a 300-column part stays under 100MB. A
randomized large min_bytes_for_full_part_storage packs the part into a single
data.packed archive, inflating merge memory above the threshold. Pin it to 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test checks per-column column_modification_time in system.parts_columns.
A randomized large min_bytes_for_full_part_storage packs all columns into one
data.packed archive, so per-column file timestamps no longer differ as expected.
Pin it to 0 on both table variants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test verifies disk reads go through ThreadPoolRead with O_DIRECT. A
randomized large min_bytes_for_full_part_storage packs the part into a single
data.packed archive, read via a different path, so the expected thread rows
don't appear. Pin it to 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test cat's columns_substreams.txt from the part directory. A randomized
large min_bytes_for_full_part_storage packs the part into a single data.packed
archive, so that file does not exist on disk and cat fails. Pin it to 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test edits and removes raw part files (skp_idx_idx.mrk4, checksums.txt). A
randomized large min_bytes_for_full_part_storage packs the part into a single
data.packed archive, so those files are not on disk and the rm/printf fail.
Pin it to 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous commit restored the bare private test but did not include the
public adaptation. Add storage_policy='s3_cache' (the public default disk is
local, so the part needs an S3-backed disk to issue PUTs) and async_insert=0
on the INSERTs (async_insert now defaults to true; without this the part write
runs in a background flush and S3PutObject is not attributed to the query).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Storages/MergeTree/DataPartStorageOnDiskPacked.cpp
Removing a part from system.parts and removing its zero-copy refcount lock from
Keeper are separate async steps of part cleanup. wait_for_delete_inactive_parts
only waits for the former, so the system.zookeeper SELECT could observe the old
parts' zero-copy lock nodes (confirmed in the server log: the SELECT ran ~0.4s
before the locks were removed). Wait until the zero-copy lock count matches the
number of active parts before asserting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.70% 92.60% -0.10%
Branches 77.60% 77.70% +0.10%

Changed lines: Changed C/C++ lines covered: 565/864 (65.39%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants