iframe-proxy

groeneai · 2026-05-10T21:22:32Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fixed missing rows when reading a MergeTree table whose primary key is a Map, contains a Map, or is derived from a Map (mapKeys(m), m.keys, etc.) and whose parts use bucketed Map serialization.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Context

This bug was discovered while investigating flaky 01550_create_map_type (#104515 was an earlier setting-pin attempt that was rejected by @Algunenano and @alexey-milovidov in favor of a real engine fix).

Minimal repro

CREATE TABLE tm (a Map(String, Array(UInt8))) ENGINE = MergeTree() ORDER BY a SETTINGS
    min_bytes_for_wide_part = 0,
    map_serialization_version_for_zero_level_parts = 'with_buckets',
    max_buckets_in_map = 11,
    map_buckets_strategy = 'constant',
    map_buckets_min_avg_size = 2;
INSERT INTO tm VALUES (map('k1', [1,2,3], 'k2', [4,5,6])), (map('k0', [], 'k1', [100,20,90]));
INSERT INTO tm SELECT map('k1', [number, number + 2, number * 2]) FROM numbers(6);
INSERT INTO tm SELECT map('k2', [number, number + 2, number * 2]) FROM numbers(6);

SELECT length(groupArray(a)) FROM tm    -- returns 14 (correct)
SETTINGS merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 0, max_threads = 4;

SELECT length(groupArray(a)) FROM tm    -- returns 13 (one row dropped) before fix
SETTINGS merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 1, max_threads = 4;

Root cause

For a 2-row part of Map(String, Array(UInt8)) with bucketed serialization (11 buckets), one row has the keys {k1, k2}:

Index serialization (SerializationMap::serializeBinary -> basic nested Array(Tuple(K, V))) writes the row as [("k1", [1,2,3]), ("k2", [4,5,6])] -- insertion order.
Bulk serialization (SerializationMap::serializeBinaryBulkWithMultipleStreams) splits keys into per-bucket streams; on read, collectMapFromBuckets reassembles them in ascending bucket-index order. With hash("k2") % 11 < hash("k1") % 11, the reassembled row is [("k2", [4,5,6]), ("k1", [1,2,3])] -- bucket order.

ColumnMap::compareAt is positional on the nested Array(Tuple(K, V)), so the two representations compare as different values. splitPartsWithRangesByPrimaryKey computes layer/border PK values from the index but the downstream FilterSortedStreamByRange evaluates the filter against actual data rows, so the row that should fall inside a layer's PK range is dropped.

A simple bonus repro of the same underlying behavior (separate from the splitter, but the same root cause):

SELECT id, a FROM (
    SELECT 1 AS id, map('k1', 1, 'k2', 2, 'k3', 3) AS a
) ORDER BY id;
-- in-memory result: {"k1":1,"k2":2,"k3":3}

-- but after MergeTree round-trip with map_serialization_version_for_zero_level_parts = 'with_buckets':
SELECT id, a FROM tm_buckets;
-- result: {"k3":3,"k2":2,"k1":1}

Fix

Two layers in PartsSplitter::isSafePrimaryDataKeyType / isSafePrimaryKey:

Mark Map as not a safe primary-key type. This also covers Tuple(Map, ...) and similar nested cases via the existing recursive walk.
Walk the PK expression's source storage columns and mark the PK unsafe when any source column is itself a Map (or contains a Map). This covers PKs like ORDER BY mapKeys(m) and ORDER BY m.keys, where the resolved expression type is Array(K) -- safe by itself -- but the underlying values still come from a with_buckets Map and so suffer the same index-vs-data mismatch. (Layer 2 was added in response to a regression confirmed by @tiandiwonder on a 4-part bucketed-Map table: ORDER BY mapKeys(m) and ORDER BY m.keys both went 20 -> 19 under injection=1.)

splitPartsWithRangesByPrimaryKey already has a fast path for unsafe primary keys that builds a single in-order merging pipe from all parts without doing the layer split or the intersecting / non-intersecting separation, both of which depend on the index-vs-data comparison being consistent.

This is a targeted fix: only Map-keyed and Map-derived reads lose the splitter optimization, and only when at least one source storage column is a Map (which is uncommon). Non-Map PKs and Map columns used outside the primary key are unaffected.

What I did NOT fix here

The deeper issue -- that bucketed Map serialization is non-idempotent under positional ColumnMap::compareAt semantics, which also breaks e.g. WHERE a = map('k1', 1, 'k2', 2, 'k3', 3) on a bucketed table -- is left for a separate change. It needs maintainer input on whether the right answer is canonical Map storage on write, an order-insensitive Map comparison, or extending the bucketed format to preserve original per-row key positions. This PR only fixes the splitter-level symptom; the broader Map semantics question is unchanged.

Verification

50/50 iterations of regression test 04217_split_intersecting_bucketed_map_row_drop pass with full randomization on debug build (excluding environment-only failures from missing parallel_replicas cluster and timezone DB).
30/30 iterations of 01550_create_map_type pass with full randomization.
Confirmed both directions with git stash / restore: without this commit the regression test fails with groupArray=13 / count=13 on the original Map PK shapes, and with groupArray=19 / count=20 on the new mapKeys(m) and m.keys 4-part shapes; with this commit all rows are returned.
Non-Map PK split path still exercised: merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=1 on a 200-row table with id UInt32 PK still returns 200 rows.
Bucketed Map as a non-PK column still works (read returns full row count).

@Algunenano

…ialization When a `Map` column is the primary key and the part uses `map_serialization_version_for_zero_level_parts = 'with_buckets'`, the primary key index and the data files end up storing the same logical row with different internal key orders: * The primary key index is written through `SerializationMap::serializeBinary`, which delegates to the basic nested `Array(Tuple(K, V))` serialization and preserves the in-memory positional order of keys. * The data files are written by `SerializationMap::serializeBinaryBulkWithMultipleStreams`, which hashes each key into a bucket and serializes each bucket independently. The read path (`deserializeBinaryBulkWithMultipleStreams` + `collectMapFromBuckets`) reassembles each row's keys in ascending bucket-index order, not insertion order. Under positional `ColumnMap::compareAt` these two representations are different values. `splitPartsWithRangesByPrimaryKey` uses the index for boundary computation but the splitter's downstream `FilterSortedStreamByRange` sees the bucket-order data, so rows that fall inside a layer's PK range under one representation can be dropped under the other. This is reproducible today via `merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability` (testing-only setting), and is also reachable from `FINAL` with `split_parts_ranges_into_intersecting_and_non_intersecting_final = 0`. Mark `Map` as not a safe primary-key type so the splitter falls into the single in-order merging-pipe path, which doesn't depend on this comparison. The fast path for non-`Map` primary keys is unchanged. Includes a stateless regression test covering both the `MergeTree`/non-`FINAL` injection path and the `ReplacingMergeTree` / `FINAL` path. Reported via flaky `01550_create_map_type` and confirmed during PR review of ClickHouse#104515 by @Algunenano and @alexey-milovidov.

groeneai · 2026-05-10T21:22:48Z

Pre-PR validation gate (TASK.md Phase 4 step 9)

Session id: cron:clickhouse-ci-task-worker:20260510-204500

#	Question	Answer
a	Deterministic repro?	Yes. `SELECT length(groupArray(a)) FROM tm SETTINGS merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 1, max_threads = 4` returns 13 every run (instead of 14) on the unfixed build. Also: same with `ReplacingMergeTree FINAL` + `split_parts_ranges_into_intersecting_and_non_intersecting_final = 0`.
b	Root cause explained?	Yes. Bucketed Map serialization permutes per-row keys by ascending bucket index on disk while `SerializationMap::serializeBinary` (used to write the primary key index) keeps insertion order. `ColumnMap::compareAt` is positional, so the two representations of the same logical row compare unequal. `splitPartsWithRangesByPrimaryKey` computes layer borders from the index but `FilterSortedStreamByRange` runs them against actual data rows → a row in the data is determined to fall outside the layer's PK range and is dropped.
c	Fix matches root cause?	Yes. The fix isn't a "widen bounds" / "pin setting" band-aid — it teaches the splitter that `Map` primary keys aren't `isSafePrimaryKey`, which makes `splitPartsWithRangesByPrimaryKey` take its existing single-merging-pipe fast path that doesn't depend on the comparison being consistent between index and data. The check sits next to the other unsafe types (`Float*`, `Nullable`, `Object`, `Variant`, `Dynamic`).
d	Test intent preserved? / New tests added?	A dedicated regression test `04217_split_intersecting_bucketed_map_row_drop` exercises both the testing-only injection path and the `FINAL split_final=0` production-reachable path on a `Map(String, Array(UInt8))` PK. `01550_create_map_type` keeps its current behavior.
e	Demonstrated in both directions?	Yes. `git stash` to revert to master, rebuild, run the new test → fails with `inj groupArray=13`, `final-split-0 count=13`, `final-split-0 groupArray=13`. `git stash pop` to bring back the fix, rebuild, run the new test → 50/50 iterations pass with full randomization. `01550_create_map_type` also passes 30/30 with the fix applied.
f	Fix is general, not a narrow patch?	Treating `Map` uniformly as unsafe covers all sites that route through `isSafePrimaryKey`: `spreadMarkRangesAmongStreams` (the injection / non-`FINAL` path) and `spreadMarkRangesAmongStreamsFinal` via `findPKRangesForFinalAfterSkipIndex` (the `FINAL` path). Both call sites use the same helper. The deeper issue — bucketed `Map` round-trip being non-idempotent under positional `compareAt` (also breaks `WHERE a = map_literal`) — is broader and depends on a Map-storage design decision; I called this out in the PR description so the right follow-up can be decided by the maintainers rather than band-aided here.

groeneai · 2026-05-10T21:23:00Z

cc @Algunenano @alexey-milovidov @CurtizJ @Avogar — engine fix for the parallel-reads-of-bucketed-Map-drop-rows issue we discussed during #104515.

The fix is in PartsSplitter::isSafePrimaryDataKeyType — Map now joins Float* / Nullable / Object / Variant / Dynamic as a type for which the splitter falls back to its single in-order merging-pipe path. Root cause: bucketed Map serialization writes per-row keys in ascending bucket-index order, but SerializationMap::serializeBinary (used for the primary key index) preserves in-memory insertion order, so under positional ColumnMap::compareAt the index value and the actual data row compare as different values and FilterSortedStreamByRange drops rows that should fall inside a layer's PK range.

The deeper issue — bucketed Map round-trip is non-idempotent under positional compareAt (also breaks WHERE a = map_literal even on a single bucketed part) — is broader than the splitter and depends on a Map-storage design call, so I deliberately did not change Map semantics here. Happy to follow up if you have a direction in mind: canonical Map storage on write, order-insensitive Map comparison, or extending the bucketed format to preserve original per-row key positions.

Regression test included; both directions verified via stash-revert; 50/50 iterations clean with full randomization. PR description has the full root-cause walkthrough.

clickhouse-gh · 2026-05-11T09:17:39Z

Workflow [PR], commit [ea2cc9c]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
Fast test (arm_darwin)		FAIL
	01951_distributed_push_down_limit	FAIL	cidb
	01952_optimize_distributed_group_by_sharding_key	FAIL	cidb
Unit tests (amd_llvm_coverage)		FAIL
	-FunctionsStress	FAIL	cidb, issue
LLVM Coverage		DROPPED

AI Review

Summary

This PR fixes row drops when bucketed Map data participates in splitter primary-key boundaries by marking Map, nested Map, and Map-derived primary-key expressions as unsafe for layer splitting. The current diff also covers the symmetric FINAL, lazy final, and query_plan_join_shard_by_pk_ranges paths, with focused stateless regressions for direct, nested, subcolumn/function-derived, and join cases. I found no remaining correctness issue in the current code; the earlier bot findings for Map-derived keys and optimizeJoinByShards are addressed.

Final Verdict

Status: ✅ Approve

tiandiwonder · 2026-05-11T09:18:27Z

What I did NOT fix here
The deeper issue — that bucketed Map serialization is non-idempotent under positional ColumnMap::compareAt semantics, which also breaks e.g. WHERE a = map('k1', 1, 'k2', 2, 'k3', 3) on a bucketed table — is left for a separate change. It needs maintainer input on whether the right answer is canonical Map storage on write, an order-insensitive Map comparison, or extending the bucketed format to preserve original per-row key positions. This PR only fixes the splitter-level symptom; the broader Map semantics question is unchanged.

@groeneai @alexey-milovidov it seems a more serious issue here.

tiandiwonder · 2026-05-11T09:25:05Z

@@ -0,0 +1,70 @@
+-- Regression test for parallel reads of `Map` primary key with `with_buckets` serialization.
+--


The test doesn't cover nested PK types — e.g. ORDER BY (id, m) where m is a Map, or ORDER BY tuple(m, ...). The code logic (recursive isSafePrimaryDataKeyType) does handle these cases, but they aren't explicitly exercised by the test.

Avogar · 2026-05-11T09:54:34Z

which also breaks e.g. WHERE a = map('k1', 1, 'k2', 2, 'k3', 3) on a bucketed table

It's a general issue with the Map type, not related to bucketed serialization. WHERE a = map('k1', 1, 'k2', 2) and WHERE a = map('k2', 2, 'k1', 1) will return different results, because we store keys in insertion order.

We can try to rewrite the code of Map comparison so it doesn't depend on keys order, but this comparison will be slower.

@tiandiwonder

Per PR ClickHouse#104540 review feedback from @tiandiwonder: extend the regression test to also exercise the recursive `isSafePrimaryDataKeyType` check. Two new sections added, both using the same 14-row, 3-part, multi-bucket Map shape as the existing top-level-Map cases: 1. `ORDER BY (id, m)` on `MergeTree` and `ReplacingMergeTree` — composite primary key with `Map` as one column. All rows use the same `id` so parts intersect on `id` and the splitter has to compare `m` values at layer boundaries. Exercises the loop in `isSafePrimaryKey` that iterates over every PK column. 2. `ORDER BY c` with `c Tuple(Map(...), UInt32)` on `MergeTree` and `ReplacingMergeTree` — primary key whose type is itself a `Tuple` containing a `Map`. Exercises the recursive `Tuple` branch of `isSafePrimaryDataKeyType`, which must descend into the tuple's elements and mark the whole tuple as unsafe when any element is `Map`. For both shapes, both the injection (`merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 1`) and the `FINAL` (`split_parts_ranges_into_intersecting_and_non_intersecting_final = 0`) paths are covered. Verification matrix on local debug build: Without fix (`case TypeIndex::Map: return false;` removed from `PartsSplitter::isSafePrimaryDataKeyType`): - composite-inj groupArray: 13 (FAIL, one row dropped) - composite-final-split-0 count: 13 (FAIL) - composite-final-split-0 groupArray: 13 (FAIL) - tuple-inj groupArray: 13 (FAIL) - tuple-final-split-0 count: 13 (FAIL) - tuple-final-split-0 groupArray: 13 (FAIL) With fix restored: - all four new sections return 14 across `count` and `length(groupArray(...))`. `tests/clickhouse-test 04217_split_intersecting_bucketed_map_row_drop --test-runs 20` is 20/20 green with the current build.

groeneai · 2026-05-11T13:39:53Z

Pushed 74741723f9f extending test 04217_split_intersecting_bucketed_map_row_drop.sql with nested PK coverage.

Two new sections (same 14-row / 3-part / multi-bucket shape as the existing cases):

Composite PK ORDER BY (id, m) with m Map(String, Array(UInt8)) — both MergeTree (injection path) and ReplacingMergeTree (FINAL path). All rows use the same id so parts intersect on id, which forces the splitter to compare m values at layer boundaries. Exercises the loop in isSafePrimaryKey that scans every PK column.
Tuple-typed PK ORDER BY c with c Tuple(Map(...), UInt32) — both MergeTree (injection path) and ReplacingMergeTree (FINAL path). Exercises the recursive Tuple branch of isSafePrimaryDataKeyType, which must descend into the tuple's elements and mark the whole tuple as unsafe when any element is a Map.

Verification matrix on local debug build:

Section	Without fix (`case TypeIndex::Map: return false;` removed)	With fix
`composite-inj groupArray`	13 (FAIL)	14
`composite-final-split-0 count`	13 (FAIL)	14
`composite-final-split-0 groupArray`	13 (FAIL)	14
`tuple-inj groupArray`	13 (FAIL)	14
`tuple-final-split-0 count`	13 (FAIL)	14
`tuple-final-split-0 groupArray`	13 (FAIL)	14

tests/clickhouse-test 04217_split_intersecting_bucketed_map_row_drop --test-runs 20 is 20/20 green with the current build.

@Avogar — thanks for the confirmation that Map comparison being positional is a general type-level issue, not specific to the bucketed serialization. That's exactly the framing I was hoping for: the splitter fix here is the right scope (don't change Map semantics, just teach the splitter not to assume safe positional comparison), and the broader Map-comparison redesign (order-insensitive compareAt with the perf cost vs canonicalising on write vs extending the bucketed format to preserve per-row key positions) is a separate decision for the Map-type owners.

@tiandiwonder — the recursive type-check coverage you asked for is in 74741723f9f. Let me know if you'd like the cases shaped differently or if there's another nested-PK shape worth adding (e.g. Array(Map(...)) — happy to extend if you want it).

cc @alexey-milovidov @Algunenano @CurtizJ

tiandiwonder

changlog is too long, suggest:
Fixed missing rows when reading a MergeTree table whose primary key is Map (or contains Map) and whose parts use bucketed Map serialization (map_serialization_version_for_zero_level_parts = 'with_buckets'). The problem affected parallel reads (visible via merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability) and FINAL queries with split_parts_ranges_into_intersecting_and_non_intersecting_final = 0

tiandiwonder · 2026-05-27T05:06:51Z

+        /// so `FilterSortedStreamByRange` drops rows that would otherwise fall inside a layer's
+        /// PK range. Falling back to the single in-order merging pipe (the `!isSafePrimaryKey`
+        /// branch in `splitPartsWithRangesByPrimaryKey`) bypasses the layer split and the
+        /// intersecting/non-intersecting separation, both of which depend on this comparison.


Optional: the comment explains the with_buckets reorder very clearly, but doesn't say why we also disable the splitter for parts that use the default basic Map serialization (where index and data agree on insertion order). and also the comment is too long:
```
/// Map is unsafe as a splitter primary key: the PK index uses insertion-order
/// `serializeBinary` while `with_buckets` parts write data in bucket-index order,
/// and positional `ColumnMap::compareAt` then disagrees on equality, dropping rows.
/// Disabled unconditionally because the per-part serialization version is not
/// visible here; Map PKs are rare, so the cost is acceptable

Pushed a38a518e5062 shrinking the comment to 5 lines: states the with_buckets reorder mechanism and the unconditional-disable rationale (per-part serialization version is not visible here; Map PKs are rare). Good catch on the basic-path coverage gap.

@tiandiwonder

Per @tiandiwonder review on PR ClickHouse#104540: the original 13-line comment was verbose and did not state why we also disable the splitter for parts that use `basic` (default) Map serialization, where index and data agree on insertion order. Shortened to 5 lines covering: the `with_buckets` reorder mechanism, and the unconditional disable rationale (per-part serialization version is not visible here; `Map` PKs are rare).

groeneai · 2026-05-27T06:20:22Z

@tiandiwonder good call on both reviews. Two updates pushed at a38a518e5062:

Changelog entry shortened — adopted your suggested wording (with backticks for the literal SQL identifiers) in the PR description above. Reads: "Fixed missing rows when reading a MergeTree table whose primary key is Map (or contains Map) and whose parts use bucketed Map serialization ..."
Inline PartsSplitter.cpp comment shrunk to 5 lines — covers the with_buckets reorder mechanism and the unconditional-disable rationale (per-part serialization version is not visible at this call site; Map PKs are rare). Replied on the inline thread.

Verified clean build locally; only the comment text changed. CI should be unaffected.

tiandiwonder · 2026-06-03T07:17:22Z

@groeneai please make change log entry short?

tiandiwonder · 2026-06-05T10:20:43Z

+        /// and positional `ColumnMap::compareAt` then disagrees on equality, dropping rows.
+        /// Disabled unconditionally because the per-part serialization version is not
+        /// visible here; `Map` PKs are rare, so the cost is acceptable.
+        case TypeIndex::Map:


empirically confirmed on a 4-part bucketed-Map shape: ORDER BY mapKeys(m) and ORDER BY m.keys both go 20→19 under injection=1. heads-up on the fix — a check on getRequiredColumnsWithTypes for Map won't catch them: optimize_functions_to_subcolumns (default-on at Core/Settings.cpp:4664) rewrites mapKeys(m) to m.keys (FunctionsMapMiscellaneous.cpp:515), and ORDER BY m.keys directly reads the Array(K) sub-column with no Map in the resolved expression. has to look at the original PK AST or sub-column origin.

The previous commit only checked the resolved PK expression type, so PKs that resolve to a non-Map type but read from a Map storage column under the hood still hit the same with_buckets index-vs-data mismatch. Cover the remaining shapes by walking the PK expression's source columns. Concretely: ORDER BY mapKeys(m) resolves to Array(K) and ORDER BY m.keys reads the Map sub-column, which also resolves to Array(K). isSafePrimaryDataKeyType on the resolved type accepts both as safe Arrays, so splitPartsWithRangesByPrimaryKey still ran the layer split and dropped one row from a 4-part bucketed-Map table (20 -> 19) when merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=1. Add a new isSafePrimaryKey overload that takes ColumnsDescription and walks primary_key.expression->getRequiredColumnsWithTypes(): for each source column name, look it up via tryGetColumnOrSubcolumn (which resolves m.keys to a Map sub-column with getTypeInStorage()=Map(K, V)). If any source storage column is or contains a Map, the PK is unsafe and the splitter falls back to the single-pipe path. The existing single-arg isSafePrimaryKey is kept as a conservative fallback for callers without storage-column metadata. Both PartsSplitter callers (splitPartsWithRangesByPrimaryKey, findPKRangesForFinalAfterSkipIndex) and optimizeLazyFinal.cpp now thread metadata_snapshot->getColumns() through. Empirical confirmation by tiandiwonder on PR ClickHouse#104540 review (2026-06-05): ORDER BY mapKeys(m) and ORDER BY m.keys both go 20 -> 19 under injection=1. The naive getRequiredColumnsWithTypes-only check does not work because optimize_functions_to_subcolumns rewrites mapKeys(m) -> m.keys, which is a direct Array(K) sub-column read with no Map in the resolved expression. The sub-column origin lookup against ColumnsDescription handles both shapes. Test 04217 extended with 4-part shapes for ORDER BY mapKeys(m), ORDER BY m.keys, and ReplacingMergeTree FINAL with split_parts_ranges_into_intersecting_and_non_intersecting_final=0. Closes: ClickHouse#104540 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-05T11:58:20Z

@tiandiwonder thanks for the empirical confirmation -- reproduced both shapes locally on a 4-part bucketed-Map table (20 -> 19 under injection=1 for ORDER BY mapKeys(m) and ORDER BY m.keys).

You were right that a getRequiredColumnsWithTypes check on the resolved expression types alone does not work: optimize_functions_to_subcolumns rewrites mapKeys(m) -> m.keys, which is a direct Array(K) sub-column read with no Map in the expression. The fix uses sub-column origin lookup against ColumnsDescription instead: tryGetColumnOrSubcolumn resolves m.keys to a sub-column whose getTypeInStorage() is Map(K, V). If any source column of the PK expression is or contains a Map in storage, the PK is marked unsafe.

Pushed fix-up commit 83bcbacc2fd2 -- splitter now rejects mapKeys(m), m.keys, and any other PK derived from a Map storage column. Test 04217 extended with the two new shapes plus a ReplacingMergeTree FINAL variant. Verified 20/20 with the fix; 20/19 without it.

Also shortened the changelog entry to a single sentence per your earlier ask.

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. 4-part bucketed-`Map` table, `ORDER BY mapKeys(m)` and `ORDER BY m.keys`, `merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=1, max_threads=4`: returns 20 rows (`count()`) but `length(groupArray(m))=19` deterministically before the fix.
b	Root cause explained?	Yes. The PK index stores `mapKeys(m)` / `m.keys` evaluated against the in-memory insertion-ordered `Map`. The data on disk uses `with_buckets`, which reorders per-row `Map` keys by ascending bucket index. The same expression evaluated against the disk-order `Map` yields a different `Array(K)` value than the index. `splitPartsWithRangesByPrimaryKey` builds layer/border PK values from the index but `FilterSortedStreamByRange` evaluates them against actual data rows, so the row whose PK falls inside a layer in the index does not in the data. The previous commit only marked the resolved expression type unsafe, but the resolved type here is `Array(K)`, which passes `isSafePrimaryDataKeyType`.
c	Fix matches root cause?	Yes. Adding a source-column check against the storage column metadata (`tryGetColumnOrSubcolumn`) marks the PK unsafe when ANY source column is or contains a `Map`, regardless of the resolved expression type. The splitter's existing fast path for unsafe primary keys then takes over.
d	Test intent preserved? / New tests added?	New tests: 04217 has three new sections covering `ORDER BY mapKeys(m)`, `ORDER BY m.keys`, and `ReplacingMergeTree FINAL` on `mapKeys(m)`. All three exercise the `with_buckets` + multi-bucket Map + `min_bytes_for_wide_part=0` + multiple parts shape that fails before the fix. Existing test sections are unchanged.
e	Both directions demonstrated?	Yes. Without commit `83bcbacc2fd2`: `mapKeys(m)` and `m.keys` both return `count()=20, length(groupArray(m))=19` under injection=1 (logged in `tmp/repro-4part.log` from the pre-fix run on the same source tree). With the commit: both return 20 / 20. The pre-existing 04217 sections (the original `Map` PK and `Tuple(Map,...)` shapes) also continue to pass 5/5 under no-random and 20/20 under randomization (excluding two environment-only failures unrelated to this PR: missing `parallel_replicas` cluster on the local server and `Invalid time zone: zoneinfo/UTC` from a missing TZ DB entry; both reproduce with the unchanged base test as well).
f	Fix is general, not a narrow patch?	Yes. The check uses a recursive `typeContainsMap` walk over the storage column type via `forEachChild`, so it catches arbitrarily-nested `Map` (e.g. `Tuple(UInt8, Map(...))`, `Array(Map(...))`, `LowCardinality(Map(...))`). All three known callers of `isSafePrimaryKey` (`splitPartsWithRangesByPrimaryKey`, `findPKRangesForFinalAfterSkipIndex`, `optimizeLazyFinal`) now pass `ColumnsDescription` so the new check fires uniformly. The single-arg `isSafePrimaryKey` overload is kept as a conservative fallback for any callers without storage metadata; behaviour for `Map`-typed PKs remains unchanged via the existing resolved-type check.

Session id: cron:clickhouse-ci-task-worker:20260605-112400

clickhouse-gh · 2026-06-05T15:31:39Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.40%	84.40%	+0.00%
Functions	92.30%	92.30%	+0.00%
Branches	77.00%	77.00%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 53/58 (91.38%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

tiandiwonder · 2026-06-15T10:01:01Z

@@ -0,0 +1,249 @@
+-- Regression test for parallel reads of `Map` primary key with `with_buckets` serialization.


optional: add -- Tags: no-object-storage at the top — this test times out >180s on the asan+s3+keeper sync lane (5/33 reruns).

the with_buckets map serialization writes many small per-bucket files per part, so on object storage each is an S3 round-trip under ASan; it runs in ~0.5s on local storage and the splitter bug is storage-agnostic, so the tag loses no coverage.

Done in 0743540 @tiandiwonder. Added -- Tags: no-object-storage with your S3-round-trip rationale; the splitter bug is storage-agnostic so local-only coverage is unchanged.

The with_buckets Map serialization writes many small per-bucket files per part, so on object storage every read is an S3 round-trip; under ASan the test times out (>180s, 5/33 reruns on the asan+s3+keeper sync lane). It runs in ~0.5s on local storage, and the PartsSplitter bug under test is storage-agnostic, so restricting to local storage loses no coverage. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

optimizeJoinByShards::apply calls splitIntersectingPartsRangesIntoLayers directly, bypassing the isSafePrimaryKey guard added for the splitPartsWithRangesByPrimaryKey callers. With query_plan_join_shard_by_pk_ranges = 1, a join whose common primary-key prefix is a Map (or Map-derived, e.g. mapKeys(m)/m.keys) on a bucketed-Map table (map_serialization_version_for_zero_level_parts = 'with_buckets') therefore still dropped rows: the PK index is built from the in-memory insertion-ordered Map while the data is read back in bucket order, so the layer border filters (FilterSortedStreamByRange) drop rows whose positional ColumnMap::compareAt disagrees. Guard apply() with the same isSafePrimaryKey predicate; if any source has an unsafe PK the sharding optimization is skipped (the join then reads through the normal path). Non-Map PKs are unaffected. Reproduced on the PR-head binary (which already fixes the non-join path): INNER JOIN on a Map PK returned 19/20 rows, mapKeys(m) PK 109/110, with query_plan_join_shard_by_pk_ranges = 1 and external join disabled; both return the full count after this change. Found by clickhouse-gh[bot] review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-15T14:19:21Z

Follow-up commit ea2cc9c085b closes the optimizeJoinByShards symmetric-path gap raised in the bot review. Pre-PR validation gate for this layer:

#	Question	Answer
a	Deterministic repro?	Yes. `INNER JOIN` on a bucketed-`Map` PK with `query_plan_join_shard_by_pk_ranges = 1`, `query_plan_join_swap_table = 0`, external join disabled, `max_threads = 4`: shard-off=20, shard-on=19. `mapKeys(m)` PK: 110 vs 109. Deterministic.
b	Root cause explained?	`optimizeJoinByShards::apply` calls `splitIntersectingPartsRangesIntoLayers` directly, bypassing the `isSafePrimaryKey` guard the `splitPartsWithRangesByPrimaryKey` callers use. The PK index is built from the insertion-ordered `Map` (`serializeBinary`) while `with_buckets` reads data back in bucket order, so the layer border filters (`FilterSortedStreamByRange`, positional `ColumnMap::compareAt`) drop rows. Same mechanism as the original fix, reached through a different splitter caller.
c	Fix matches root cause?	Yes. The guard is applied at the exact bypassing call site, reusing the same `isSafePrimaryKey(primary_key, columns)` predicate. No symptom masking.
d	Test intent preserved / new tests added?	Two regression sections added to `04217`: `Map` PK join and `mapKeys(m)` PK join, each asserting equal counts with sharding off vs on. No existing assertions weakened.
e	Both directions demonstrated?	Yes. Removed the guard, rebuilt (Build ID changed), re-ran: 20->19 and 110->109 (drop reproduced). Restored guard, rebuilt: 20/20 and 110/110. Full test 04217 output matches `.reference` exactly.
f	Fix is general, not a narrow patch?	Yes. This was itself the "symmetric path" widening: the original fix guarded `splitPartsWithRangesByPrimaryKey`; this guards the only other direct `splitIntersectingPartsRangesIntoLayers` caller (`optimizeJoinByShards`). Both now route through the single central `isSafePrimaryKey` predicate, so `Map`/`Map`-derived (`mapKeys`/`m.keys`) PKs are covered uniformly. Non-`Map` PKs are unaffected.

Session id: cron:clickhouse-worker-slot-1:20260615-135600

groeneai · 2026-06-15T20:14:06Z

CI finished on ea2cc9c085be. Both red checks are pre-existing infra flakes unrelated to this change (which touches PartsSplitter / optimizeJoinByShards + Map-PK tests):

Fast test (arm_darwin): 01951_distributed_push_down_limit and 01952_optimize_distributed_group_by_sharding_key. Known macOS-only EXPLAIN/loopback flake hitting many unrelated PRs on arm_darwin; the blocking Fast test passes.
Unit tests (amd_llvm_coverage): -FunctionsStress. Chronic coverage-build unit flake hitting 13+ unrelated PRs per day.

Neither is caused by this PR.

tiandiwonder · 2026-06-17T23:59:11Z

CI finished on ea2cc9c085be. Both red checks are pre-existing infra flakes unrelated to this change (which touches PartsSplitter / optimizeJoinByShards + Map-PK tests):

Fast test (arm_darwin): 01951_distributed_push_down_limit and 01952_optimize_distributed_group_by_sharding_key. Known macOS-only EXPLAIN/loopback flake hitting many unrelated PRs on arm_darwin; the blocking Fast test passes.

Unit tests (amd_llvm_coverage): -FunctionsStress. Chronic coverage-build unit flake hitting 13+ unrelated PRs per day.

Neither is caused by this PR.

@groeneai please fix the flaky tests.

tiandiwonder · 2026-06-18T00:00:40Z

+    for (const auto & source : data.sources)
+    {
+        const auto & metadata = source->getStorageMetadata();
+        if (!isSafePrimaryKey(metadata->getPrimaryKey(), metadata->getColumns()))


optional: this uses isSafePrimaryKey but the file doesn't #include <Processors/QueryPlan/PartsSplitter.h> — it compiles via a transitive include.

adding the direct include guards against the include graph changing under it later.

groeneai · 2026-06-18T00:32:22Z

@tiandiwonder thanks. Both flaky classes are being handled in separate PRs (not folded into this engine fix), and one is already done:

Fast test (arm_darwin) 01951_distributed_push_down_limit + 01952_optimize_distributed_group_by_sharding_key: root cause was a reused macOS runner leaking lo0 aliases 127.0.0.2..16, which made remote('127.{1,2}', ...) look local and changed the EXPLAIN plan vs the committed reference (0 master hits; 55 unrelated PRs in 14d, arm_darwin only). Fixed in Add loopback interface aliases on macOS to run parallel-replica remote() tests #107435 (merged 2026-06-16): fail-closed lo0 de-alias teardown on the macOS Fast test job plus darwin.skip for these two distributed-EXPLAIN tests (joining the already-skipped 01244/01247/01814). That class should stop appearing on fresh runs now.
Unit tests FunctionsStress (-FunctionsStress): a trunk-wide function-property fuzzer flake (100+ distinct PRs across the sanitizer/coverage lanes, ~38 master hits in 30d). Current shape is a re-execution LOGICAL_ERROR, e.g. Unsupported argument types for function geohashesInBox : Float32, Float32, Float32, Float32 raised after const-folding. The offending function rotates run to run, so each shape is a distinct latent argument-validation gap. Tracked separately; will be addressed in its own PR(s), not here.

Neither is caused by this PR, so it should not block merging the Map-PK splitter fix (which is green on its own changes). I will pursue the FunctionsStress fixes independently.

Avogar · 2026-07-02T12:52:46Z

Let's close it in favour of #109178?

groeneai · 2026-07-02T13:34:37Z

Agreed, closing this in favor of #109178.

#109178 fixes the root cause: the index stores Map values in insertion order while bucketed serialization reorders the keys by hash in the data, so ColumnMap::compareAt (positional) disagrees. Restoring the original key order on deserialization makes index and data compare consistently, which removes the PartsSplitter boundary mismatch this PR routed around, and also fixes the broader ORDER BY / DISTINCT / GROUP BY / min / max / equality-filter issues on Map columns. The serialization-layer fix is the correct and more general one; this PR only forced Map primary keys onto the single in-order merging-pipe path.

One thing worth carrying over: #109178 doesn't seem to exercise the parallel split path that dropped rows here. 04217_split_intersecting_bucketed_map_row_drop reproduces it deterministically with merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 1 and max_threads = 4 on a Map primary key (including the composite / Tuple-containing-Map cases and the FINAL layer-split path). Happy to port it into #109178 as a regression guard, or you can pick it up from this branch.

Closing now.

groeneai · 2026-07-02T13:34:38Z

tiandiwonder self-assigned this May 11, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 11, 2026

tiandiwonder reviewed May 11, 2026

View reviewed changes

alexey-milovidov added the can be tested Allows running workflows for external contributors label May 17, 2026

tiandiwonder reviewed May 27, 2026

View reviewed changes

Merge commit '3fa1ed2d9d9cc4f472cd43e2e4eaabac12ec36db' into HEAD

29be1e3

clickhouse-gh Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/Processors/QueryPlan/PartsSplitter.cpp

tiandiwonder requested changes Jun 5, 2026

View reviewed changes

tiandiwonder reviewed Jun 15, 2026

View reviewed changes

clickhouse-gh Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread src/Processors/QueryPlan/PartsSplitter.cpp

tiandiwonder reviewed Jun 18, 2026

View reviewed changes

alexey-milovidov mentioned this pull request Jun 30, 2026

Close MergeTree part-writer file descriptors before renaming the part directory #108089

Open

groeneai closed this Jul 2, 2026

groeneai mentioned this pull request Jul 3, 2026

Set Keeper component in StorageMergeTree mutation-file methods #109297

Open

		@@ -0,0 +1,70 @@
		-- Regression test for parallel reads of `Map` primary key with `with_buckets` serialization.
		--

		@@ -0,0 +1,249 @@
		-- Regression test for parallel reads of `Map` primary key with `with_buckets` serialization.

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Documentation entry for user-facing changes

Context

Minimal repro

Root cause

Fix

What I did NOT fix here

Verification

Uh oh!

groeneai commented May 10, 2026

Pre-PR validation gate (TASK.md Phase 4 step 9)

Uh oh!

groeneai commented May 10, 2026

Uh oh!

clickhouse-gh Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

tiandiwonder commented May 11, 2026

Uh oh!

tiandiwonder May 11, 2026

Choose a reason for hiding this comment

Uh oh!

Avogar commented May 11, 2026

Uh oh!

groeneai commented May 11, 2026

Uh oh!

tiandiwonder left a comment

Choose a reason for hiding this comment

Uh oh!

tiandiwonder May 27, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai May 27, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai commented May 27, 2026

Uh oh!

Uh oh!

tiandiwonder commented Jun 3, 2026

Uh oh!

tiandiwonder Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai commented Jun 5, 2026

Pre-PR validation gate

Uh oh!

clickhouse-gh Bot commented Jun 5, 2026

LLVM Coverage Report

Uh oh!

tiandiwonder Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

groeneai commented Jun 15, 2026

Uh oh!

groeneai commented Jun 15, 2026

Uh oh!

tiandiwonder commented Jun 17, 2026

Uh oh!

tiandiwonder Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai commented Jun 18, 2026

Uh oh!

Avogar commented Jul 2, 2026

Uh oh!

groeneai commented Jul 2, 2026

groeneai commented May 10, 2026 •

edited

Loading

clickhouse-gh Bot commented May 11, 2026 •

edited

Loading