Describe the unexpected behaviour
delayInsertOrThrowIfNeeded counts all active parts across all volumes, including those on prefer_not_to_merge volumes. Since those parts can never be reduced through merges, the back-pressure feedback loop is broken: delaying inserts does not lead to fewer parts, resulting in permanent throttling.
Which ClickHouse versions are affected?
All versions with tiered storage support. Confirmed on 26.1.2.
How to reproduce
Server version: 26.1.2, native client.
Storage policy (add to config.xml):
<storage_configuration>
    <disks>
        <cold_disk>
            <path>/var/lib/clickhouse/cold/</path>
        </cold_disk>
    </disks>
    <policies>
        <tiered>
            <volumes>
                <hot>
                    <disk>default</disk>
                </hot>
                <cold>
                    <disk>cold_disk</disk>
                    <prefer_not_to_merge>true</prefer_not_to_merge>
                </cold>
            </volumes>
        </tiered>
    </policies>
</storage_configuration>
Reproduce:
CREATE TABLE test_delay (
    date Date,
    id UInt64
) ENGINE = MergeTree()
PARTITION BY date
ORDER BY id
SETTINGS storage_policy = 'tiered',
         parts_to_delay_insert = 50,
         parts_to_throw_insert = 100;
-- Prevent merges so small parts accumulate
SYSTEM STOP MERGES test_delay;
-- Create 60 small parts in partition 2026-01-01 by running this loop from a shell:
-- for i in $(seq 1 60); do clickhouse-client -q "INSERT INTO test_delay SELECT '2026-01-01', number FROM numbers(100)"; done
-- Move all to cold volume (prefer_not_to_merge)
ALTER TABLE test_delay MOVE PARTITION '2026-01-01' TO VOLUME 'cold';
SYSTEM START MERGES test_delay;
-- Verify: 60 parts on cold volume, unmergeable
SELECT partition, disk_name, count() AS parts
FROM system.parts WHERE active AND table = 'test_delay'
GROUP BY partition, disk_name;
-- Insert into a different partition — this should NOT be delayed
INSERT INTO test_delay SELECT '2026-02-24', number FROM numbers(100);
-- But it IS delayed due to 60 unmergeable parts in 2026-01-01
Error message and/or stacktrace
Delaying inserting block by 126 ms. because there are 1251 parts and their average size is 1.00 KiB
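The part count comes from a partition on the cold volume, but the insert targets a different partition with only a few parts on the hot volume. This is consistent with the check using the largest per-partition part count in the table (as the name getMaxPartsCountAndSizeForPartitionWithState suggests), so a single unmergeable partition throttles inserts into every partition. The exact figures in the message vary by run; the ones shown do not correspond to the 60-part script above.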
Expected behavior
Parts on prefer_not_to_merge volumes should not contribute to the parts_to_delay_insert / parts_to_throw_insert threshold, since delaying inserts cannot reduce those parts.
Additional context
Root cause in code:
getMaxPartsCountAndSizeForPartitionWithState (MergeTreeData.cpp:5533) iterates all active parts without filtering by volume:
for (const auto & part : getDataPartsStateRange(state))
{
    ++cur_parts_count; // no volume/disk awareness
}
The merge selector correctly excludes prefer_not_to_merge parts via shallParticipateInMerges (IMergeTreeDataPart.cpp:2066), but the delay/throw logic does not.
Why this matters: In tiered storage setups (hot local disk + cold S3 with prefer_not_to_merge), if parts reach the cold volume before being fully merged (e.g., due to merge backlog when TTL triggers), they become permanently unmergeable. The insert back-pressure feedback loop breaks — delay cannot reduce parts that will never merge.
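The broken feedback loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not ClickHouse code: PartitionState and roundsUntilUnthrottled are hypothetical names, each "round" stands for one delayed insert during which merges get a chance to run, a merge is assumed to collapse all mergeable parts into one, and the throttle condition is assumed to be count >= parts_to_delay_insert.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of one partition; not a ClickHouse type.
struct PartitionState
{
    std::size_t mergeable_parts;    // parts on volumes where merges are allowed
    std::size_t unmergeable_parts;  // parts on a prefer_not_to_merge volume

    std::size_t total() const { return mergeable_parts + unmergeable_parts; }
};

// Returns how many delay rounds pass before the part count drops below the
// threshold, or max_rounds if it never does within that budget.
// Assumption: throttling applies while total() >= parts_to_delay_insert.
std::size_t roundsUntilUnthrottled(
    PartitionState state, std::size_t parts_to_delay_insert, std::size_t max_rounds)
{
    assert(parts_to_delay_insert > 0);

    std::size_t rounds = 0;
    while (state.total() >= parts_to_delay_insert && rounds < max_rounds)
    {
        // Simplification: one round of merging collapses all mergeable parts
        // into a single part. Unmergeable parts are never touched.
        if (state.mergeable_parts > 1)
            state.mergeable_parts = 1;
        ++rounds;
    }
    return rounds;
}
```

In this model, 60 mergeable parts with a threshold of 50 recover after a single round, while 60 unmergeable parts exhaust any round budget: the delay buys time for merges that never happen.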
Related issues:
prefer_not_to_merge=true #85636 / PR Unblock ttl part drops for cold volumes #90059: similar prefer_not_to_merge edge case (TTL drops not applied)
parts_to_delay_insert design flaw reported by @qoega
Possible fix: Filter by shallParticipateInMerges in getMaxPartsCountAndSizeForPartitionWithState, or introduce a separate counting path for delayInsertOrThrowIfNeeded that excludes unmergeable volumes.
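The filtering idea can be sketched in isolation as follows. This is a minimal, self-contained illustration rather than a patch against MergeTreeData.cpp: PartInfo, canBeMerged and maxPartsPerPartition are hypothetical stand-ins, and canBeMerged only mirrors the intent of shallParticipateInMerges as described in this report.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a data part; not the real IMergeTreeDataPart.
struct PartInfo
{
    std::string partition_id;
    bool on_prefer_not_to_merge_volume;

    // Assumption: a part is mergeable unless its volume sets prefer_not_to_merge.
    bool canBeMerged() const { return !on_prefer_not_to_merge_volume; }
};

// Largest per-partition count of active parts. With only_mergeable = true,
// parts that merges can never reduce are left out of the count.
std::size_t maxPartsPerPartition(const std::vector<PartInfo> & parts, bool only_mergeable)
{
    std::map<std::string, std::size_t> counts;
    for (const auto & part : parts)
    {
        if (only_mergeable && !part.canBeMerged())
            continue;
        ++counts[part.partition_id];
    }

    std::size_t max_count = 0;
    for (const auto & [partition_id, count] : counts)
        max_count = std::max(max_count, count);
    return max_count;
}
```

For the scenario in this report (60 cold parts in one partition, a few hot parts in another), the unfiltered count is 60 and exceeds parts_to_delay_insert = 50, while the filtered count is just the number of hot parts and stays below it.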