iframe-proxy

groeneai · 2026-06-26T22:11:49Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a heap-buffer-overflow when building a text index with positions = 1 over many distinct short tokens.

Description

Related: #103172 (added the text index positions feature).

MergeTreeIndexTextGranuleBuilder::build() attached each token's positions with position_map->find(entry.token). entry.token is a string_view into the postings map's own StringHashMap cell storage, which is not padded to 8 bytes. StringHashTable lookup reads a full 8-byte word around the key (data() + size() - 8 for short keys), so a short token landing near a hash-buffer boundary read out of bounds.

The postings map and positions map are filled in lockstep in addDocument(), so they always hold the same token set. Positions are now attached by sorting the positions map and zipping with the already-sorted tokens by index, instead of re-looking-up the unpadded key. Size and per-index token-mismatch checks preserve the original invariant.

Found by Stress test (arm_asan_ubsan) on 26.6. The crashing path is on master too.
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=26.6&sha=3babd82fe36c262d5aa12a71cd4823d118f7e894&name_0=ReleaseBranchCI&name_1=Stress%20test%20%28arm_asan_ubsan%29

ASan (src/Common/HashTable/StringHashTable.h:366 in dispatch<...PositionListBuilder...>::FindCallable):

heap-buffer-overflow, READ of size 8 ... 5 bytes before an 81920-byte region
  #0 StringHashTable<...PositionListBuilder...>::dispatch  StringHashTable.h:366
  #1 ...::find(string_view)                                StringHashTable.h:467
  #2 MergeTreeIndexTextGranuleBuilder::build()             MergeTreeIndexText.cpp
  #3 MergeTreeIndexAggregatorText::getGranuleAndReset()
  #4 MergeTreeDataPartWriterOnDisk::fillSkipIndicesChecksums

Reproduced locally on an ASan build (crashes without the fix, passes with it); regression test added.

groeneai · 2026-06-26T22:12:14Z

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. ASan build, `INSERT` into a `positions = 1` text index over 3000 distinct short tokens x 20 rows in one granule aborts every run with heap-buffer-overflow.
b	Root cause explained?	`build()` did `position_map->find(entry.token)`; `entry.token` is a `string_view` into the postings map's own `StringHashMap` cell storage (not 8-byte padded). `StringHashTable::dispatch` reads `data()+size()-8` (8 bytes around the key) for short keys, so a short token at a hash-buffer boundary reads out of bounds.
c	Fix matches root cause?	Yes. Removes the contract-violating `find()` of an unpadded cell key; attaches positions by sorting the positions map and zipping with the already-sorted tokens by index.
d	Test intent preserved / new tests added?	New regression test `02346_text_index_positions_build_overflow` added. The size + per-index token-mismatch `LOGICAL_ERROR` checks keep the original "postings but no positions slot" invariant.
e	Both directions demonstrated?	Yes. Unfixed binary: abort + heap-buffer-overflow at `StringHashTable.h:366`. Fixed binary: exit 0, output `20/20/0`, 5/5 runs clean.
f	Fix is general across code paths?	Yes. This is the only `StringHashTable::find()` fed a cell-internal key view. The sibling `find()` calls in `TextIndexAnalyzer` use `absl::flat_hash_map<String, ...>` (standard hashing, no 8-byte over-read) — not the same bug class.
g	Fix generalizes across inputs (params/datatypes/wrappers)?	The token type is always `String` (tokenizer output); the bug is key-length/buffer-boundary dependent, not datatype dependent. The fix removes the over-read for all token lengths, and the repro exercises many short tokens (lengths 2-5).
h	Backward compatible?	Yes. No format, setting, or behavior change — only how positions are attached during granule build. Phrase-search results are unchanged (verified against the upstream `02346_text_index_function_hasPhrase_positions` expectations).
i	Invariants and contracts preserved?	Yes. `sorted_tokens` and the positions are sorted by the same comparator over identical key sets (filled in lockstep in `addDocument()`), so index `i` refers to the same token in both; mismatches throw `LOGICAL_ERROR`. No longer passes an unpadded key to `StringHashTable`, satisfying its 8-byte-padding precondition.

Session id: cron:clickhouse-worker-slot-2:20260626-211200

groeneai · 2026-06-26T22:12:35Z

cc @ahmadov @CurtizJ — could you review this? It fixes a heap-buffer-overflow in the text-index positions build path: build() looked tokens up in the positions map with a key view into the postings map's unpadded hash cells, and StringHashTable lookup over-reads 8 bytes around short keys. The two maps are filled in lockstep, so positions are now attached by sorted zip instead of an unsafe find().

clickhouse-gh · 2026-06-26T22:22:17Z

Workflow [PR], commit [cf29606]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
Stress test (amd_tsan)		FAIL
	Hung check failed, possible deadlock found	FAIL	cidb, issue
Stress test (arm_tsan)		FAIL
	Logical error: '(isConst() \|\| isSparse() \|\| isReplicated() \|\| rhs.isConst() \|\| rhs.isSparse() \|\| rhs.isReplicated()) ? getDataType() == rhs.getDataType() : typeid(*this) == typeid(rhs)' (STID: 2508-30f6)	FAIL	cidb

AI Review

Summary

This PR fixes the positions = 1 text-index build path by replacing an unsafe StringHashMap::find on an unpadded token view with byte-wise comparisons against the already sorted token list. I did not find any unresolved correctness, safety, or testing issue that needs an inline review comment.

Final Verdict

Status: ✅ Approve

Ergus

The change adds another call to std::sort, which could be very bad for performance.
If position_map then we can skip the first sort for sorted_tokens and initialize everything once.

Ergus · 2026-06-29T11:36:01Z

@groeneai

What about:

if (position_map)
{
    size_t attached = 0;
    position_map->forEachValue([&](const auto & key, auto & mapped)
    {
        const std::string_view token(key);
        auto it = std::ranges::lower_bound(sorted_tokens, token, std::less<>{},
            [](const SortedToken & e) { return e.token; });
        if (it == sorted_tokens.end() || it->token != token)
            throw Exception(ErrorCodes::LOGICAL_ERROR,
                "Text index: positions token '{}' has no postings slot", token);
        it->positions = &mapped;
        ++attached;
    });

    if (attached != sorted_tokens.size())
        throw Exception(ErrorCodes::LOGICAL_ERROR,
            "Text index: postings map has {} tokens but positions map attached {}",
            sorted_tokens.size(), attached);
}

MergeTreeIndexTextGranuleBuilder::build() looked up each token in the positions map with position_map->find(entry.token). entry.token is a string_view into the postings map's own StringHashMap cell storage, which is not padded to 8 bytes. StringHashTable lookup reads a full 8-byte word around the key (data()+size()-8 for short keys), so a short token landing near a hash-buffer boundary read out of bounds: ASan reported a heap-buffer-overflow at StringHashTable.h in build() during part write (fillSkipIndicesChecksums -> getGranuleAndReset). The postings map and positions map are filled in lockstep in addDocument(), so they always hold the same token set. Walk the positions map once and attach each builder by binary searching the already-sorted tokens (std::ranges::lower_bound, which only compares key bytes, no over-read); this avoids both the unpadded find() and a second sort of the positions map. The attached-count check keeps the original "postings but no positions slot" invariant. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-29T12:35:46Z

Adopted, thanks @Ergus. Dropped the extra std::sort: now one position_map->forEachValue pass that std::ranges::lower_bounds each key into the already-sorted sorted_tokens and attaches &mapped, with an attached == sorted_tokens.size() check for the invariant. Single O(log n) search per token, no second sort and no temporary vector.

OOB-safe: lower_bound compares only key bytes (string_view operator<), so the unpadded position_map->find() that read an 8-byte word past a short token is gone.

Re-validated on an ASan build with the regression test: crashes without the fix (heap-buffer-overflow, READ of size 8, identical signature), passes with it (5/5, output 20/20/0). Used a lambda comparator + &SortedToken::token projection instead of std::less<>{} (same semantics, matches the existing lower_bound style in this file).

groeneai · 2026-06-29T19:14:56Z

CI finish ledger — `988c551`

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test	Reason	Owner / fixing PR
Stateless tests (11 configs) / 02354_vector_search_rescoring	master regression (Code 44 ILLEGAL_COLUMN `_distance`, from #105591)	#108812 (ours, merged) — merged master @ `cf29606` to pick it up
CH Inc sync	-	CH Inc sync (private, not actionable)

The single failing test is the merged-#105591 master regression (deterministic, all configs since 2026-06-29 11:10 UTC), fixed by revert #108812 (merged 14:13 UTC). This branch predated the revert, so I merged current master to pick it up; CI re-runs on cf29606.

Session id: cron:our-pr-ci-monitor:20260629-183000

Ergus

A cleaner and faster fix is to store the positions builder alongside the postings builder in a single map value, eliminating the join (and the whole class of "look the key up in the other table" bugs) entirely.

@ahmadov WDYT? That's a bigger change than this bugfix should carry, but worth a follow-up

Ergus · 2026-06-29T11:26:31Z

+
+        for (size_t i = 0; i < sorted_tokens.size(); ++i)
+        {
+            if (sorted_tokens[i].token != sorted_positions[i].first)


Must use sorted_positions here

Ergus · 2026-06-29T20:00:55Z

+SELECT count() FROM tab;
+
+-- Positions are still attached correctly: phrase search over consecutive short tokens works.
+SELECT count() FROM tab WHERE hasPhrase(message, 't1 t2 t3');


IIUC this tests trioggers the OOB problem probabilistically.

It reliably crashes today, but nothing pins the boundary condition. I.e a future change to StringHashTable could stop the test from ever exercising the over-read, and it would still pass 20/20/0 silently becoming a non-regression-test.

You're right. The over-read itself is deterministic: for any sub-8-byte token StringHashTable::dispatch memcpy's a full 8-byte word around the key (the case 0 branch reads p[0..7] or p[size-8..] by bit 2048 of the address). What isn't pinned is the ASan trap, which fires only when a short token's cell sits at the end of a postings allocation so the read crosses a redzone, and that adjacency is a StringHashTable layout property. A layout change could stop the crossing and leave the test green on an unfixed binary.

A SQL test can't pin that layout-independently. The durable guard is the single-map-value refactor you raised in the approval: with the positions builder in the postings map value there's no second lookup, so no unpadded find() to over-read and no boundary to pin. The current fix already drops that unsafe find() (the new lower_bound compares key bytes only, no 8-byte hash read), so production is safe today; this is only about the test's strength against reintroduction. I'd land this as the minimal backportable crash fix and track the refactor plus a layout-independent test as the follow-up. Happy to widen the token set here if you want a higher-confidence trap in the meantime.

clickhouse-gh · 2026-06-29T22:51:57Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	85.40%	85.40%	+0.00%
Functions	92.60%	92.60%	+0.00%
Branches	77.60%	77.70%	+0.10%

Changed lines: Changed C/C++ lines covered by tests: 15/18 (83.33%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

groeneai · 2026-06-29T23:45:37Z

CI finish ledger — `cf29606`

Every failure below has an owner: a fixing PR (ours), or a full-effort fix task. Only CH Inc sync is exempt. None are caused by this PR (a text-index granule-builder over-read fix in MergeTreeIndexText.cpp; neither code path is touched by the failures below).

Session id: cron:our-pr-ci-monitor:20260629-233000

CurtizJ added the can be tested Allows running workflows for external contributors label Jun 26, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 26, 2026

Ergus self-assigned this Jun 29, 2026

Ergus requested changes Jun 29, 2026

View reviewed changes

groeneai force-pushed the groeneai/fix-text-index-positions-build-overflow branch from 0af5124 to 988c551 Compare June 29, 2026 12:35

Merge remote-tracking branch 'origin/master' into HEAD

cf29606

Ergus approved these changes Jun 29, 2026

View reviewed changes

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented Jun 26, 2026

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Uh oh!

groeneai commented Jun 26, 2026

Uh oh!

groeneai commented Jun 26, 2026

Uh oh!

clickhouse-gh Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Ergus left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ergus commented Jun 29, 2026

Uh oh!

groeneai commented Jun 29, 2026

Uh oh!

groeneai commented Jun 29, 2026

CI finish ledger — 988c551

Uh oh!

Ergus left a comment

Choose a reason for hiding this comment

Uh oh!

Ergus Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Ergus Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot commented Jun 29, 2026

LLVM Coverage Report

Uh oh!

groeneai commented Jun 29, 2026

CI finish ledger — cf29606

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clickhouse-gh Bot commented Jun 26, 2026 •

edited

Loading

Ergus left a comment •

edited

Loading

CI finish ledger — `988c551`

CI finish ledger — `cf29606`