Switch the default compression to ZSTD(3) for table data and network#108786
Switch the default compression to ZSTD(3) for table data and network#108786alexey-milovidov wants to merge 38 commits into
Conversation
Change the built-in default compression codec from `LZ4` to `ZSTD(3)`, for both on-disk table data and client/server and server/server network communication. On-disk: `CompressionCodecFactory::getDefaultCodec` now returns `ZSTD(3)`. This is the codec applied to columns that do not specify a codec and when no matching `<compression>` config case is present. It flows through the `MergeTree` write path, the `Log` family, temporary data, and the HTTP `compress=1` output. Network: the defaults of `network_compression_method` (`LZ4` -> `ZSTD`) and `network_zstd_compression_level` (`1` -> `3`) are changed so that the native protocol uses `ZSTD(3)` by default. The changes are recorded in `SettingsChangesHistory.cpp` for the `compatibility` setting. Docs updated accordingly, including the `Native` format and protocol specifications and the server `compression` settings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After switching the default compression codec from `LZ4` to `ZSTD(3)`, several stateless tests that implicitly relied on the `LZ4` default started producing different output. Pin `LZ4` explicitly where the test output depends on the codec, so the references stay stable regardless of the server default: - `04267_estimate_compression_ratio_window_accumulation` and `03363_estimate_compression_bug91415`: pass `'LZ4'` as the codec parameter of `estimateCompressionRatio`. - `03634_autopr_input_bytes_estimation_compact`: set the `MergeTree` setting `default_compression_codec='LZ4'` so `ReadCompressedBytes` matches the statistics-based estimate. - `02346_text_index_posting_list_compression_data_size_validation`: set `default_compression_codec='LZ4'` so the text index file compression (which uses the part default codec) is pinned. - `01943_log_column_sizes`: add `CODEC(LZ4)` to the `Log`/`TinyLog` columns (the `Log` family has no `MergeTree` settings). `02967_http_compressed` cannot pin the codec: the HTTP `compress=1` output uses `CompressionCodecFactory::getDefaultCodec` directly with no per-query override. Its reference is updated to the new `ZSTD(3)` output (verified to match the bytes produced by `clickhouse compressor` with `ZSTD(3)`, which are identical to the CI result). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| )", 0) \ | ||
| \ | ||
| DECLARE(String, network_compression_method, "LZ4", R"( | ||
| DECLARE(String, network_compression_method, "ZSTD", R"( |
There was a problem hiding this comment.
This line changes the default codec for native client/server and server/server traffic, and the changelog promises lower network usage at only a modest CPU cost, but the PR currently only validates that the setting takes effect. For a global default change, reviewers need a before/after measurement, such as representative native traffic or distributed query exchange with compressed bytes and CPU/query-time impact, or a linked benchmark that justifies the tradeoff.
Please add or link that evidence, or narrow the performance claim to the behavior that was actually verified.
There was a problem hiding this comment.
This is the deliberate, well-understood ZSTD(3)-vs-LZ4 tradeoff (better ratio at a modestly higher CPU cost), and the change aligns the self-managed network default with what ClickHouse Cloud already uses. The changelog states only the qualitative tradeoff ("a modest increase in CPU usage"). Leaving this thread open for @alexey-milovidov to attach representative before/after network/CPU numbers if he wants them in the PR.
There was a problem hiding this comment.
The PR-body performance claim was narrowed to scope it to what is actually controlled (client/server and Distributed network compression) and to point at the CI performance comparison as the before/after measurement of the storage/network-vs-CPU tradeoff. Leaving this thread open for @alexey-milovidov to attach representative benchmark numbers, since the LZ4-vs-ZSTD(3) default is a deliberate product decision aligning self-managed with Cloud.
There was a problem hiding this comment.
The AI review pinpointed why the before/after numbers were missing: the `Performance Comparison` AMD shards were skipped because the PR was not labeled `pr-performance` (the gate in `ci/jobs/scripts/workflow_hooks/filter_job.py` — ARM perf shards run by default, AMD ones are label-gated). I added the `pr-performance` label, so the CI performance comparison the PR body relies on will now run its AMD shards as well; the ARM shards are already running on the latest CI for `4cd1fe1d`. This only enables the measurement the PR already committed to — attaching the resulting before/after compressed-bytes and CPU/query-time figures (or deciding to narrow the claim) is left to @alexey-milovidov, so this thread stays open.
Follow-up to the previous commit, after determining empirically how each code path selects its codec: - `03634_autopr_input_bytes_estimation_compact`: the compact-part input-bytes estimate serializes a sample with `getDefaultCodec` (now `ZSTD(3)`), so the actually-read compressed bytes must use the same codec to stay within the test's 2x tolerance. Pin `default_compression_codec='ZSTD(3)'` instead of `LZ4` (the previous commit pinned `LZ4`, which left the estimate and the reads mismatched). - `02346_text_index_posting_list_compression_data_size_validation`: the text index files are compressed with `getDefaultCodec` directly, which is not controlled by a column codec or the `default_compression_codec` setting. Drop the ineffective setting and update the reference to the `ZSTD(3)` sizes. - `03595_funcs_on_zero` (`no-fasttest`): pass `'LZ4'` to `estimateCompressionRatio`. - `00753_system_columns_and_system_tables_long` (`long`): add `CODEC(LZ4)` to the `TinyLog`/`Log` columns whose `total_bytes` depend on the default codec (`StripeLog` and `Memory` are codec-independent here; the `MergeTree` table is already pinned to `LZ4` by the test harness for `no-random-*` tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The msan flaky check confirmed the `TinyLog`/`Log` `CODEC(LZ4)` pins work (`total_bytes` = 27/43), but `StripeLog` reports `135` instead of `113`. Unlike `TinyLog`/`Log`, `StripeLog` compresses its combined data stream with `getDefaultCodec` (now `ZSTD(3)`) regardless of the column codec, so its size cannot be pinned via SQL — update the expected `total_bytes` to the `ZSTD(3)` value. The value is stable: it depends only on the default codec, not on the randomized query settings (`min`/`max_compress_block_size`, etc.). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With `ZSTD(3)` as the default codec, the autopr input-bytes estimate (which serializes a few sampled blocks via `getDefaultCodec`) lands at up to ~2x the actual full-column compressed size for highly-compressible columns: `ZSTD(3)` compresses the full column better than the small sample predicts, and the sampled-block set varies with thread scheduling, so the previous 2x tolerance flaked right at the boundary (`query_2`: 808201 vs 1617909 = 2.002x). Widen the tolerance to 4x, which still catches gross misestimation. The table stays pinned to `ZSTD(3)` so the estimate and the reads use the same codec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
estimateCompressionRatio('') uses the default codec; pass 'LZ4' explicitly so
the ratio does not depend on the server default. (no-fasttest test, skipped by
Fast test, surfaced in the full parallel stateless run.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`test_jbod_overflow` inserts random printable ASCII into a small (40MB) jbod disk and asserts the data overflows to the `external` disk. `randomPrintableASCII` data is entropy-light (95-symbol alphabet): `ZSTD(3)` (the new default codec) entropy-codes it ~18% smaller than `LZ4`, so the parts now fit in the jbod and no longer overflow, failing `assert ... == 'external'`. Pin the column to `CODEC(LZ4)` so the on-disk part size is independent of the server default codec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`03634_autopr_output_bytes_estimation` compares the autopr output-bytes estimate against hardcoded `expected_bytes` with a 2.5x tolerance. The estimate serializes the output (aggregation state) with `getDefaultCodec`; under the new `ZSTD(3)` default, query_12's output compresses to ~3.6M instead of the ~11.2M seen under `LZ4`, exceeding the 2.5x bound. Update query_12's expected value to 3620000. Across three CI runs the estimate was stable (3609402/3622834/3629968), and it is the only query whose LZ4-calibrated expectation falls outside 2.5x of the ZSTD estimate; the others remain within tolerance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_recompression_ttl asserts the pre-recompression part reports `default_compression_codec == 'LZ4'`. With the default codec switched to `ZSTD(3)` the initial part now reports `ZSTD(3)`, failing the assertion (same situation as the stateless `01465_ttl_recompression`). Pin `default_compression_codec = 'LZ4'` on all three tables so the initial parts stay `LZ4`; the `RECOMPRESS CODEC(ZSTD(...))` codecs are unaffected (recompression wins over the setting). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ault `test_backward_compatibility::test_aggregate_states` compares the serialized state of every aggregate function between an old and a new server. The old server defaults to `LZ4` and the new one to `ZSTD(3)`; `estimateCompressionRatio` is the only aggregate whose state encodes the argument compressed with the server's default codec, so its state (and final ratio) legitimately differs across the change and the cross-version comparison is meaningless. Skip it, matching how the test already skips functions it cannot meaningfully compare. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`04100_autopr_input_bytes_estimation_aggregation_in_order` checks the autopr input/output byte estimates. Both use `getDefaultCodec`, now `ZSTD(3)`: - Pin `default_compression_codec='ZSTD(3)'` so the read bytes match the input estimate (the `no-random-*` harness would otherwise inject `LZ4`). - The output estimate serializes the aggregation output with `getDefaultCodec`; under `ZSTD(3)` the two `max_threads=1` queries (`agg_in_order_single`, `agg_in_order_multi_agg`) compress ~5x smaller (4781834 / 6889765) than the original `LZ4`-era expectations (25519057 / 33649632), exceeding the 2x bound. Recalibrate those two expected values; the `max_threads=4` queries stay within tolerance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The flaky check showed `04100`'s autopr output-bytes estimate swings several-fold between runs under `ZSTD(3)` (e.g. agg_in_order_single 4.78M..20.4M, multi_agg 6.89M..45.2M) because the estimate serializes the aggregation output with the default codec and is very sensitive to randomized block-sizing query settings - a sensitivity that did not surface under the previous `LZ4` default (04100 is green on master). Add `no-random-settings` (the test already used `no-random-merge-tree-settings` "to stabilize the test") so the query settings are fixed and the estimate is deterministic. The expected output values may need a follow-up recalibration to the deterministic `ZSTD(3)` values. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With `no-random-settings` now fixing the query settings, `04100`'s autopr output estimates are deterministic under `ZSTD(3)`: agg_in_order_single=20406542 and agg_in_order_multi_agg=45288700 (identical across all flaky-check reruns). These are within 2x of the original `LZ4`-era expectations (25519057 / 33649632 — ratios 1.25x / 1.35x), so the original values hold. The earlier recalibration to 4781834 / 6889765 was based on an outlier randomized-settings run and is reverted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `Log`, `TinyLog`, and `StripeLog` engines write each insert with the current default codec and append it to the same compressed stream. Their columns store no per-stream codec on disk, and the codec is resolved fresh at write time from `CompressionCodecFactory::getDefaultCodec`. Switching that default from `LZ4` to `ZSTD(3)` therefore makes an upgraded table append `ZSTD(3)` blocks after the existing `LZ4` blocks in one `.bin`/`index` file. The readers constructed their `CompressedReadBuffer`/`CompressedReadBufferFromFile` with `allow_different_codecs = false`, and `CompressedReadBufferBase` throws `CANNOT_DECOMPRESS` when the method byte changes within a single stream. So a post-upgrade `SELECT` (or `StripeLog` index load / restore) over a range that spans the append boundary would fail to read existing data. Each compressed block is self-describing (its codec method byte is in the header), so a mixed-codec append-only stream is valid. Set `allow_different_codecs = true` on the Log-family read paths, exactly as the `MergeTree` reader already does. A new gtest `gtest_mixed_codec_append` writes an `LZ4` block followed by a `ZSTD(3)` block and asserts both buffer types the Log family uses read it back with the tolerance on, and still throw `CANNOT_DECOMPRESS` with it off. A SQL-level upgrade test is not possible in a single server: the Log family has no per-query or `ALTER MODIFY ... CODEC` override, so a mixed-codec file can only arise from a cross-version change of the global default codec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`KeeperSnapshotManager` writes snapshots with a no-argument `CompressedWriteBuffer` when `compress_snapshots_with_zstd_format` is `false`. That setting is documented as "Write compressed snapshots in ZSTD format (instead of custom LZ4)", i.e. the `false` branch is the legacy custom-frame `LZ4` format. With the default codec now `ZSTD(3)`, that branch would silently start emitting `ZSTD(3)`-framed snapshots while still being presented as the legacy `LZ4` format. Old snapshots remain readable either way (the custom-frame reader auto-detects the codec from each frame's method byte), so this is not a correctness bug, but pinning `LZ4` keeps the documented legacy format literally true and independent of the server's default codec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After switching the default codec to `ZSTD(3)`, more tests whose output depends on the on-disk/network byte volume need the codec pinned so they do not depend on the server default: - `02047_log_family_data_file_dumps` and `02047_log_family_complex_structs_data_file_dumps`: these dump the exact bytes of the `Log`/`TinyLog`/`StripeLog` `.bin` files. Pin the columns to `CODEC(LZ4)` so the references keep validating the on-disk layout regardless of the default codec (explicit `CODEC(LZ4)` produces byte-identical output to the previous implicit `LZ4` default). - `test_replicated_fetches_bandwidth`: the throttling assertions depend on the size of the fetched parts. `randomPrintableASCII` produces random printable ASCII, which `ZSTD` entropy-codes noticeably better than `LZ4`, shrinking the parts enough to change fetch timing. Pin the `data` column to `CODEC(LZ4)` in all of these tests, matching the existing fix in `test_jbod_overflow`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The sample `<compression>` block recommended uncommenting it "for better compression", but its case used `<method>zstd</method>` with no level, i.e. `ZSTD(1)`, which is weaker than the new `ZSTD(3)` default. Enabling it would have reduced compression for matching (large) parts. Give the example an explicit `<level>` above the default so the case genuinely improves compression for large parts, and reword the surrounding comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the AI review (the Block verdict's required actions) and the two real CI failures. Pushed Log-family mixed-codec upgrade path (the blocker). Global-default consumers, settled.
Docs / user-facing. Narrowed the rollback guidance (PR body + changelog): CI reds. Left open for you, @alexey-milovidov: the performance-evidence thread on Verification: the three changed |
`Log`/`TinyLog` store each column with its own codec, so pinning the columns to `CODEC(LZ4)` keeps their on-disk byte dumps independent of the server default codec. `StripeLog`, however, writes a single shared `data.bin` (and its `index.mrk`) through `CompressionCodecFactory::getDefaultCodec`, ignoring the per-column codec, so its dumps reflect the default — which this PR changes to `ZSTD(3)`. The previous column-codec pin therefore could not stabilize the `StripeLog` sections, and `02047_log_family_data_file_dumps` / `02047_log_family_complex_structs_data_file_dumps` still differed from their references. Regenerate the `StripeLog` sections of both references under the `ZSTD(3)` default (the `Log`/`TinyLog` sections are byte-identical) and update the explanatory comment to describe the per-engine behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three more stateless tests asserted sizes that depend on the default compression codec and broke under the `ZSTD(3)` default: - `02241_filesystem_cache_on_write_operations`: the cached on-disk segment byte totals shift (`ZSTD(3)` has slightly larger per-block overhead than `LZ4` on the tiny, highly-compressible blocks this test writes). Pin the columns to `CODEC(LZ4)` and update the three echoed `CREATE TABLE` lines in the reference. - `03546_add_distinct_to_in_clause`: compares compressed `NetworkReceiveBytes` between the with-`DISTINCT` and without-`DISTINCT` queries; under the `ZSTD(3)` network default the repetitive without-`DISTINCT` stream is entropy-coded almost as small as the tiny with-`DISTINCT` payload, so the strict comparison flaps. Pin `network_compression_method = 'LZ4'`. - `03783_autopr_dataflow_cache_reuse_qcc`: the existing `default_compression_codec = 'LZ4'` pin only covers wide parts; the compact-part byte estimation falls back to `getDefaultCodec` (now `ZSTD(3)`), and the merge-tree-settings randomizer can force compact parts. Add `min_bytes_for_wide_part = 0` so the estimate uses the on-disk `LZ4` ratio. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
My earlier 02241 fix was based on a wrong premise: pinning the default codec to `LZ4` (column or setting) does NOT make the aggregate `sum(size)` over `system.filesystem_cache` match the old `LZ4`-default values. Verified with the new binary against a local cache disk: even with `default_compression_codec='LZ4'`, the totals are `2139/84420/167743/251168` (vs the old `2131/84396/167711/251128`), because a few bytes per part come from auxiliary streams written with the server default codec (`getDefaultCodec`) that the table setting does not reach. The `data.bin` cached-segment sizes (the `Vertical` assertions) DO stay fixed under the `LZ4` pin, so keep the pin and update only the four aggregate `sum(size)` totals to the `ZSTD(3)`-default values. Correct the comment to describe this accurately. Locally reproduced the first two totals exactly; the other two are the deterministic CI-actual values. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
continue-pr status (session 3) Addressed the sole blocking AI Request changes finding — the rollback contract for default-only persistent Verified the finding is accurate. Took the documentation route the reviewer offered (a per-engine
Reading remains correct on upgrade: each CI: the public report is green (0 failed, 69 passed, 18 skipped; some jobs still running). The only remaining open review thread is the performance-evidence one on |
My previous 02241 reference update missed the `9\t1217` -> `9\t1225` measurements: the truncated CI report only surfaced the `2131`/`84396`/... hunk, so I updated those but not the earlier `1217` ones (also +8 under the ZSTD-default residual). Confirmed via the full diff in `system.checks` (`test_context_raw`) that `1217 -> 1225` is now the ONLY remaining difference, and `1225` matches what the new binary produces locally for the `numbers(100)` insert. Update all six occurrences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
continue-pr status (session 4) CI: green — Praktika report 0 failed / 170 jobs; rollup shows only one job still in progress (head `4cd1fe1d`), no failures. Branch is fully up to date with master (0 behind, merge-base today) so no re-merge or conflict resolution was needed. Review: the only unresolved thread is the perf-evidence one (`src/Core/Settings.cpp:2005`), and the latest AI verdict ( Action this session: the AI verdict diagnosed the missing evidence precisely — the `Performance Comparison` AMD shards were skipped because the PR lacked the `pr-performance` label (ARM perf shards run by default and are already producing numbers on the latest run). I added the `pr-performance` label, which enables the CI performance comparison the PR body already commits to. The `labeled` event does not re-trigger the PR workflow, so the AMD shards will run on the next CI trigger; the ARM before/after numbers are already being generated for `4cd1fe1d`. No code change and no re-merge were necessary. Attaching the resulting figures (or narrowing the performance claim) remains a product call for @alexey-milovidov, so the perf-evidence thread is left open. |
|
📊 Cloud Performance Report 🟢 AI verdict: This PR changes the default compression codec from LZ4 to ZSTD(3) for both on-disk columns and client/server network traffic. That is squarely on the read/decompression path, so the clickbench improvements are plausible: ZSTD(3) yields smaller parts, so I/O-bound scans read fewer bytes. Q4 (-15.6%) and Q15 (-11.9%) are clear improvements confirmed by both tests; Q28 shows a smaller borderline gain in the same direction. No regressions were flagged. clickbench🟢 2 improved · Flagged queries (3 of 43)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. tpch_adapted_1_official🟢 No significant changes Debug info
|
|
continue-pr status (session 5) Re-merged Why re-merge (and not just a job re-run). The The two real reds on
I checked whether this last test needs the same
The sole unresolved review thread (attach representative before/after network/CPU benchmark numbers, or narrow the claim) is left open as your reserved product decision — the claim is already narrowed to point at the CI performance comparison, and enabling the AMD shards above lets that comparison actually run. |
|
continue-pr status (session 6) Head unchanged ( Addressed the AI
I verified the precise split rather than generalizing to "all secondary indexes": only Perf-evidence thread ( The one CI red is an unrelated, already-tracked flake. |
|
continue-pr status (session 7) — head unchanged ( The perf evidence the AI verdict said was "pending" has now completed. The latest AI
The one remaining CI red is the known, unrelated flake. The only unresolved review thread is the perf-evidence one on |
|
continue-pr status (session 8) — head unchanged ( New this session:
No rerun triggered: the genuine (1) recurs deterministically, so rerunning the private jobs is pointless until the private test bound is adjusted. No re-merge: branch is 299 behind (merge-base Everything else is unchanged from session 7: the 11 BLOCKED on: the private |
|
continue-pr status (session 9) — head unchanged ( GitHub now marks this PR as conflicting, but that flag is stale. I fetched the current Nothing else has changed since the previous status. The two real blockers are unchanged and both are your calls:
The latest AI I did not re-merge (the merge is clean, and a fresh push would reset the completed perf shards back to pending) and did not touch the private repo (its working tree is occupied by another sync). |
|
To avoid degradations on microbenchmarks, let's also ship a configuration change, that will choose the default codec LZ4 for data parts less than 100 MB in size and ZSTD(3) for bigger data parts. |
… `ZSTD(3)` for large parts To avoid degradations on microbenchmarks after switching the default codec to `ZSTD(3)`, the built-in default of `CompressionCodecSelector::choose` is now size-aware: `MergeTree` parts smaller than 100 MB use the faster `LZ4`, while larger parts use the default codec (`ZSTD(3)`). Freshly inserted parts are passed a size of `0` (their final size is not yet known), so they start as `LZ4`; the bigger parts produced by background merges cross the threshold and switch to `ZSTD(3)`. This keeps compression cheap for small, hot, frequently-rewritten data while getting the better ratio for the bulk of the data, which lives in the large merged parts. This only affects the built-in default reached via `MergeTreeData::getCompressionCodecForPart` -> `Context::chooseCompressionCodec` when a table specifies neither an explicit column `CODEC` nor the `default_compression_codec` `MergeTree` setting. Paths that use `CompressionCodecFactory::getDefaultCodec` directly (the `Log` family, `StripeLog`, part checksums, text indexes, persistent `Set`/`Join` files and the HTTP `compress=1` output) keep the flat `ZSTD(3)` default. Explicit `<compression>` `<case>` rules still take precedence over the size-aware baseline. Added a gtest (`gtest_compression_codec_selector`) covering both sides of the threshold. Updated `config.xml`, `config.yaml.example` and the `create/table` documentation to describe the size-aware default. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
continue-pr status (session 10) — implemented the size-aware default codec you asked for; pushed Size-aware default codec (your directive)
Implemented in This is exactly the path reached by The direct- Two decisions for you
TestAdded a gtest VerificationThe header and every API call ( Merge / CIRe-merged The other reds are unchanged and unrelated: |
| /// Size-aware built-in default: use the faster `LZ4` for small parts (where the ratio matters | ||
| /// less and the data is frequently merged and read), and the default codec (`ZSTD(3)`) for larger | ||
| /// parts (where the better ratio pays off). Explicit `<case>` rules below still take precedence. | ||
| CompressionCodecPtr res = part_size < min_part_size_for_default_codec |
There was a problem hiding this comment.
Large MergeTree projection parts still bypass this new 100 MiB split. MergeTreeDataWriter::writeProjectionPartImpl always resolves the codec with data.getCompressionCodecForPart(0, {}, time(nullptr)) (src/Storages/MergeTree/MergeTreeDataWriter.cpp:1143), and MergeProjectionPartsTask::executeStep simply renames a single temporary projection part instead of re-merging it (src/Storages/MergeTree/MergeProjectionPartsTask.cpp:58-68). That means a large projection rebuilt during MERGE / MATERIALIZE PROJECTION can stay on LZ4 forever even when its parent part has crossed the threshold and the new MergeTree contract says large parts switch to ZSTD(3).
Please thread a real projection-part size into codec selection (or force the final projection part through a size-aware merge step) and add a regression that checks system.projection_parts.default_compression_codec for a large projection without an explicit default_compression_codec.
…llback note The `6566c957d71` commit made the built-in default codec size-aware (`LZ4` for parts smaller than 100 MB, `ZSTD(3)` for larger parts) and updated the `config.xml` `<compression>` example accordingly, but the parallel `_server_settings_outside_source.md` note still said unmatched parts uniformly use `ZSTD(3)`. Align the note with the code and the `config.xml` example so the docs no longer describe the opposite of the server behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
continue-pr status (session 11) — pushed Addressed — stale Left for you — the AI
Coupled — changelog/body framing. The changelog entry still describes on-disk table data as switching uniformly to Unchanged — perf-evidence thread ( CI: freshly running on the new head, no failures. Not re-merged (branch merged |
|
continue-pr status (session 16) — brought the stale PR metadata back in sync with the shipped size-aware default; the two design/product items remain your calls. Addressed — stale PR body / changelog / validation (the
The Left for you — the AI Unchanged — perf-evidence thread ( CI / merge. Head |
This change switches the default codec from `LZ4` to `ZSTD(3)`, which makes nearly every read- and insert-bound performance test slower and drowns the comparison in expected codec regressions, hiding any genuine algorithmic change. Load a `<compression>` config for the performance comparison that pins the built-in default to `LZ4` on both the reference and tested servers, so the report measures query logic rather than the codec switch. A `<case>` with no size conditions matches every part, so `CompressionCodecSelector::choose` returns `LZ4` for all `MergeTree` column data regardless of part size, overriding the size-aware built-in default. This also covers tables created via `AS SELECT`, which cannot carry a per-column `CODEC`. Related: #108786 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Merged The Merging Left untouched — both are your reserved calls:
|
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 107/110 (97.27%) · Uncovered code |

Switch the default compression in ClickHouse from
LZ4toZSTD(3), for both on-disk table data and network communication.Motivation
LZ4has been the default for a long time and optimizes for speed, butZSTD(3)gives substantially better compression ratios at a still-modest CPU cost, reducing both storage footprint and network traffic out of the box. ClickHouse Cloud already uses a stronger default; this aligns the self-managed defaults.What changes
CompressionCodecFactory::getDefaultCodecnow returnsZSTD(3)instead ofLZ4. ForMergeTreecolumn data the built-in default is size-aware: when a column specifies no codec and no<compression>server-config case matches,CompressionCodecSelector::chooseuses the fasterLZ4for parts smaller than 100 MB andZSTD(3)for larger parts, so freshly inserted data starts asLZ4and the bigger parts produced by background merges switch toZSTD(3). The directgetDefaultCodecusers — theLogfamily, part checksums, the HTTPcompress=1output, theStripeLogdata stream, the persistentSet/Joinfiles, and text-index files — are not size-aware and switch uniformly fromLZ4toZSTD(3). External temporary (spill) files are unaffected: they keep their owntemporary_files_codecdefault (LZ4).network_compression_method(LZ4→ZSTD) andnetwork_zstd_compression_level(1→3) are changed, so the native client/server and Distributed (server/server) protocol usesZSTD(3)by default. The changes are recorded inSettingsChangesHistory.cppso thecompatibilitysetting restores the previous behavior for these paths. The distributed-query streaming exchange (StreamingExchangeSink) does not readnetwork_compression_methodand always uses the server default codec; this is safe because each frame is self-describing (the receiver auto-detects the codec) and the exchange is a transient, same-version channel —StreamingExchangeProtocolrejects peers on a different protocol version, so a stream is never read back by a node expecting a different codec.Log/TinyLog/StripeLogengines resolve the default codec at write time, so a table written before the upgrade (LZ4) and appended to after it (ZSTD(3)) ends up with mixed-codec blocks in one file. Their readers now passallow_different_codecs = true(as theMergeTreereader already does) so such files still read back correctly. The legacy custom-frame Keeper snapshot format (compress_snapshots_with_zstd_format = false) is pinned toLZ4so it stays the documented format.Columns, tables, and connections that specify a codec or method explicitly are unaffected — with one exception: streams written through the built-in default codec directly, which ignore per-column codecs and the
<compression>config. TheStripeLogdata stream, the persistentSet/Joinbackup files (written bySetOrJoinSinkandStorageJoin::mutatethrough aCompressedWriteBufferwith no codec), andMergeTreeauxiliary streams such as part checksums (checksums.txt, viaMergeTreeDataPartChecksums::write) and text-index files always follow the new default; see the changelog entry below for the per-path rollback.Validation
Built locally and verified against the new binary:
MergeTreetable created with no explicit codec follows the size-aware default: small parts reportLZ4and parts larger than 100 MB reportZSTD(3)as theirdefault_compression_codecinsystem.parts. DirectgetDefaultCodecstreams (e.g.StripeLog) reportZSTD(3)regardless of size.system.settingsshowsnetwork_compression_method = ZSTDandnetwork_zstd_compression_level = 3.SELECT/INSERTover the native protocol with default (nowZSTD) network compression work; all ofLZ4/lz4hc/zstd/nonestill work and an invalid method still errors withBAD_ARGUMENTS.02995_new_settings_historyconsistency check passes (both setting changes are recorded).02047_log_family_*_data_file_dumps) were regenerated for the new default:Log/TinyLogkeepLZ4-pinned columns, whileStripeLog's shareddata.bin/index.mrkreflect theZSTD(3)default (it usesgetDefaultCodecand ignores the per-column codec). Size-sensitive stateless and integration tests that assert compressed byte sizes were pinned back toLZ4(cache segment sizes, distributed-batch corruption offsets, full-disk thresholds, frozen-part checksums).The CI performance comparison provides the before/after measurement of the storage/network-vs-CPU tradeoff for the changed default.
Documentation updated accordingly, including the
Nativeformat and protocol specifications and the<compression>config examples.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
The default compression method is changed from
LZ4toZSTD(3). For client/server and server/server network communication it switches uniformly toZSTD(3). For on-disk table data,MergeTreecolumn data now uses a size-aware built-in default — parts smaller than 100 MB useLZ4and larger parts useZSTD(3)— while the direct built-in-default streams (theStripeLogdata stream, the persistentSet/Joinfiles, the HTTPcompress=1framed output, part checksums, and text-index files) switch uniformly fromLZ4toZSTD(3). This improves compression ratios and reduces storage and network usage out of the box, at a modest increase in CPU usage. None of the on-disk or HTTP paths are controlled bycompatibility; the previous behavior can be restored per path as follows. For client/server and Distributed network compression, use thecompatibilitysetting (or setnetwork_compression_method = 'LZ4'). ForMergeTreecolumn data, set a column/tableCODEC(LZ4)or configure the server<compression>default toLZ4. For theLog/TinyLogengines, set a columnCODEC(LZ4)(their default does not consult the<compression>config). TheStripeLogdata stream, the persistentSet/Joinbackup files, the HTTPcompress=1framed output, andMergeTreeauxiliary streams such as part checksums (checksums.txt) and text-index files use the built-in default codec directly — they ignore per-column codecs and the<compression>config — so they always use the newZSTD(3)default and have no runtime rollback; reading stays correct because every compressed frame is self-describing.