Fix heap-buffer-overflow building text index with positions#108659
Fix heap-buffer-overflow building text index with positions#108659groeneai wants to merge 2 commits into
Conversation
|
cc @ahmadov @CurtizJ — could you review this? It fixes a heap-buffer-overflow in the text-index |
|
Workflow [PR], commit [cf29606] Summary: ❌
AI ReviewSummaryThis PR fixes the Final VerdictStatus: ✅ Approve |
|
What about: if (position_map)
{
size_t attached = 0;
position_map->forEachValue([&](const auto & key, auto & mapped)
{
const std::string_view token(key);
auto it = std::ranges::lower_bound(sorted_tokens, token, std::less<>{},
[](const SortedToken & e) { return e.token; });
if (it == sorted_tokens.end() || it->token != token)
throw Exception(ErrorCodes::LOGICAL_ERROR,
"Text index: positions token '{}' has no postings slot", token);
it->positions = &mapped;
++attached;
});
if (attached != sorted_tokens.size())
throw Exception(ErrorCodes::LOGICAL_ERROR,
"Text index: postings map has {} tokens but positions map attached {}",
sorted_tokens.size(), attached);
} |
MergeTreeIndexTextGranuleBuilder::build() looked up each token in the positions map with position_map->find(entry.token). entry.token is a string_view into the postings map's own StringHashMap cell storage, which is not padded to 8 bytes. StringHashTable lookup reads a full 8-byte word around the key (data()+size()-8 for short keys), so a short token landing near a hash-buffer boundary read out of bounds: ASan reported a heap-buffer-overflow at StringHashTable.h in build() during part write (fillSkipIndicesChecksums -> getGranuleAndReset). The postings map and positions map are filled in lockstep in addDocument(), so they always hold the same token set. Walk the positions map once and attach each builder by binary searching the already-sorted tokens (std::ranges::lower_bound, which only compares key bytes, no over-read); this avoids both the unpadded find() and a second sort of the positions map. The attached-count check keeps the original "postings but no positions slot" invariant. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
0af5124 to
988c551
Compare
|
Adopted, thanks @Ergus. Dropped the extra OOB-safe: Re-validated on an ASan build with the regression test: crashes without the fix (heap-buffer-overflow, READ of size 8, identical signature), passes with it (5/5, output 20/20/0). Used a lambda comparator + |
CI finish ledger — 988c551Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only
The single failing test is the merged-#105591 master regression (deterministic, all configs since 2026-06-29 11:10 UTC), fixed by revert #108812 (merged 14:13 UTC). This branch predated the revert, so I merged current master to pick it up; CI re-runs on cf29606. Session id: cron:our-pr-ci-monitor:20260629-183000 |
Ergus
left a comment
There was a problem hiding this comment.
A cleaner and faster fix is to store the positions builder alongside the postings builder in a single map value, eliminating the join (and the whole class of "look the key up in the other table" bugs) entirely.
@ahmadov WDYT? That's a bigger change than this bugfix should carry, but worth a follow-up
|
|
||
| for (size_t i = 0; i < sorted_tokens.size(); ++i) | ||
| { | ||
| if (sorted_tokens[i].token != sorted_positions[i].first) |
| SELECT count() FROM tab; | ||
|
|
||
| -- Positions are still attached correctly: phrase search over consecutive short tokens works. | ||
| SELECT count() FROM tab WHERE hasPhrase(message, 't1 t2 t3'); |
There was a problem hiding this comment.
IIUC this tests trioggers the OOB problem probabilistically.
It reliably crashes today, but nothing pins the boundary condition. I.e a future change to StringHashTable could stop the test from ever exercising the over-read, and it would still pass 20/20/0 silently becoming a non-regression-test.
There was a problem hiding this comment.
You're right. The over-read itself is deterministic: for any sub-8-byte token StringHashTable::dispatch memcpy's a full 8-byte word around the key (the case 0 branch reads p[0..7] or p[size-8..] by bit 2048 of the address). What isn't pinned is the ASan trap, which fires only when a short token's cell sits at the end of a postings allocation so the read crosses a redzone, and that adjacency is a StringHashTable layout property. A layout change could stop the crossing and leave the test green on an unfixed binary.
A SQL test can't pin that layout-independently. The durable guard is the single-map-value refactor you raised in the approval: with the positions builder in the postings map value there's no second lookup, so no unpadded find() to over-read and no boundary to pin. The current fix already drops that unsafe find() (the new lower_bound compares key bytes only, no 8-byte hash read), so production is safe today; this is only about the test's strength against reintroduction. I'd land this as the minimal backportable crash fix and track the refactor plus a layout-independent test as the follow-up. Happy to widen the token set here if you want a higher-confidence trap in the meantime.
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 15/18 (83.33%) | Lost baseline coverage: none · Uncovered code |
CI finish ledger — cf29606Every failure below has an owner: a fixing PR (ours), or a full-effort fix task. Only Session id: cron:our-pr-ci-monitor:20260629-233000 |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a heap-buffer-overflow when building a text index with
positions = 1over many distinct short tokens.Description
Related: #103172 (added the text index
positionsfeature).MergeTreeIndexTextGranuleBuilder::build()attached each token's positions withposition_map->find(entry.token).entry.tokenis astring_viewinto the postings map's ownStringHashMapcell storage, which is not padded to 8 bytes.StringHashTablelookup reads a full 8-byte word around the key (data() + size() - 8for short keys), so a short token landing near a hash-buffer boundary read out of bounds.The postings map and positions map are filled in lockstep in
addDocument(), so they always hold the same token set. Positions are now attached by sorting the positions map and zipping with the already-sorted tokens by index, instead of re-looking-up the unpadded key. Size and per-index token-mismatch checks preserve the original invariant.Found by
Stress test (arm_asan_ubsan)on 26.6. The crashing path is on master too.Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=26.6&sha=3babd82fe36c262d5aa12a71cd4823d118f7e894&name_0=ReleaseBranchCI&name_1=Stress%20test%20%28arm_asan_ubsan%29
ASan (
src/Common/HashTable/StringHashTable.h:366indispatch<...PositionListBuilder...>::FindCallable):Reproduced locally on an ASan build (crashes without the fix, passes with it); regression test added.