Fix heap-buffer-overflow building text index with positions by groeneai · Pull Request #108659 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix heap-buffer-overflow building text index with positions#108659

Open
groeneai wants to merge 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-text-index-positions-build-overflow
Open

Fix heap-buffer-overflow building text index with positions#108659
groeneai wants to merge 2 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-text-index-positions-build-overflow

Conversation

@groeneai

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a heap-buffer-overflow when building a text index with positions = 1 over many distinct short tokens.

Description

Related: #103172 (added the text index positions feature).

MergeTreeIndexTextGranuleBuilder::build() attached each token's positions with position_map->find(entry.token). entry.token is a string_view into the postings map's own StringHashMap cell storage, which is not padded to 8 bytes. StringHashTable lookup reads a full 8-byte word around the key (data() + size() - 8 for short keys), so a short token landing near a hash-buffer boundary read out of bounds.

The postings map and positions map are filled in lockstep in addDocument(), so they always hold the same token set. Positions are now attached by sorting the positions map and zipping with the already-sorted tokens by index, instead of re-looking-up the unpadded key. Size and per-index token-mismatch checks preserve the original invariant.

Found by Stress test (arm_asan_ubsan) on 26.6. The crashing path is on master too.
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=26.6&sha=3babd82fe36c262d5aa12a71cd4823d118f7e894&name_0=ReleaseBranchCI&name_1=Stress%20test%20%28arm_asan_ubsan%29

ASan (src/Common/HashTable/StringHashTable.h:366 in dispatch<...PositionListBuilder...>::FindCallable):

heap-buffer-overflow, READ of size 8 ... 5 bytes before an 81920-byte region
  #0 StringHashTable<...PositionListBuilder...>::dispatch  StringHashTable.h:366
  #1 ...::find(string_view)                                StringHashTable.h:467
  #2 MergeTreeIndexTextGranuleBuilder::build()             MergeTreeIndexText.cpp
  #3 MergeTreeIndexAggregatorText::getGranuleAndReset()
  #4 MergeTreeDataPartWriterOnDisk::fillSkipIndicesChecksums

Reproduced locally on an ASan build (crashes without the fix, passes with it); regression test added.

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @ahmadov @CurtizJ — could you review this? It fixes a heap-buffer-overflow in the text-index positions build path: build() looked tokens up in the positions map with a key view into the postings map's unpadded hash cells, and StringHashTable lookup over-reads 8 bytes around short keys. The two maps are filled in lockstep, so positions are now attached by sorted zip instead of an unsafe find().

@CurtizJ CurtizJ added the can be tested Allows running workflows for external contributors label Jun 26, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [cf29606]

Summary:

job_name test_name status info comment
Stress test (amd_tsan) FAIL
Hung check failed, possible deadlock found FAIL cidb, issue
Stress test (arm_tsan) FAIL
Logical error: '(isConst() || isSparse() || isReplicated() || rhs.isConst() || rhs.isSparse() || rhs.isReplicated()) ? getDataType() == rhs.getDataType() : typeid(*this) == typeid(rhs)' (STID: 2508-30f6) FAIL cidb

AI Review

Summary

This PR fixes the positions = 1 text-index build path by replacing an unsafe StringHashMap::find on an unpadded token view with byte-wise comparisons against the already sorted token list. I did not find any unresolved correctness, safety, or testing issue that needs an inline review comment.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 26, 2026
@Ergus Ergus self-assigned this Jun 29, 2026

@Ergus Ergus left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change adds another call to std::sort, which could be very bad for performance.
If position_map then we can skip the first sort for sorted_tokens and initialize everything once.

@Ergus

Ergus commented Jun 29, 2026

Copy link
Copy Markdown
Member

@groeneai

What about:

if (position_map)
{
    size_t attached = 0;
    position_map->forEachValue([&](const auto & key, auto & mapped)
    {
        const std::string_view token(key);
        auto it = std::ranges::lower_bound(sorted_tokens, token, std::less<>{},
            [](const SortedToken & e) { return e.token; });
        if (it == sorted_tokens.end() || it->token != token)
            throw Exception(ErrorCodes::LOGICAL_ERROR,
                "Text index: positions token '{}' has no postings slot", token);
        it->positions = &mapped;
        ++attached;
    });

    if (attached != sorted_tokens.size())
        throw Exception(ErrorCodes::LOGICAL_ERROR,
            "Text index: postings map has {} tokens but positions map attached {}",
            sorted_tokens.size(), attached);
}

MergeTreeIndexTextGranuleBuilder::build() looked up each token in the
positions map with position_map->find(entry.token). entry.token is a
string_view into the postings map's own StringHashMap cell storage, which
is not padded to 8 bytes. StringHashTable lookup reads a full 8-byte word
around the key (data()+size()-8 for short keys), so a short token landing
near a hash-buffer boundary read out of bounds: ASan reported a
heap-buffer-overflow at StringHashTable.h in build() during part write
(fillSkipIndicesChecksums -> getGranuleAndReset).

The postings map and positions map are filled in lockstep in addDocument(),
so they always hold the same token set. Walk the positions map once and
attach each builder by binary searching the already-sorted tokens
(std::ranges::lower_bound, which only compares key bytes, no over-read);
this avoids both the unpadded find() and a second sort of the positions
map. The attached-count check keeps the original "postings but no positions
slot" invariant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the groeneai/fix-text-index-positions-build-overflow branch from 0af5124 to 988c551 Compare June 29, 2026 12:35
@groeneai

Copy link
Copy Markdown
Contributor Author

Adopted, thanks @Ergus. Dropped the extra std::sort: now one position_map->forEachValue pass that std::ranges::lower_bounds each key into the already-sorted sorted_tokens and attaches &mapped, with an attached == sorted_tokens.size() check for the invariant. Single O(log n) search per token, no second sort and no temporary vector.

OOB-safe: lower_bound compares only key bytes (string_view operator<), so the unpadded position_map->find() that read an 8-byte word past a short token is gone.

Re-validated on an ASan build with the regression test: crashes without the fix (heap-buffer-overflow, READ of size 8, identical signature), passes with it (5/5, output 20/20/0). Used a lambda comparator + &SortedToken::token projection instead of std::less<>{} (same semantics, matches the existing lower_bound style in this file).

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 988c551

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
Stateless tests (11 configs) / 02354_vector_search_rescoring master regression (Code 44 ILLEGAL_COLUMN _distance, from #105591) #108812 (ours, merged) — merged master @ cf29606 to pick it up
CH Inc sync - CH Inc sync (private, not actionable)

The single failing test is the merged-#105591 master regression (deterministic, all configs since 2026-06-29 11:10 UTC), fixed by revert #108812 (merged 14:13 UTC). This branch predated the revert, so I merged current master to pick it up; CI re-runs on cf29606.

Session id: cron:our-pr-ci-monitor:20260629-183000

@Ergus Ergus left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cleaner and faster fix is to store the positions builder alongside the postings builder in a single map value, eliminating the join (and the whole class of "look the key up in the other table" bugs) entirely.

@ahmadov WDYT? That's a bigger change than this bugfix should carry, but worth a follow-up


for (size_t i = 0; i < sorted_tokens.size(); ++i)
{
if (sorted_tokens[i].token != sorted_positions[i].first)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must use sorted_positions here

SELECT count() FROM tab;

-- Positions are still attached correctly: phrase search over consecutive short tokens works.
SELECT count() FROM tab WHERE hasPhrase(message, 't1 t2 t3');

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this tests trioggers the OOB problem probabilistically.

It reliably crashes today, but nothing pins the boundary condition. I.e a future change to StringHashTable could stop the test from ever exercising the over-read, and it would still pass 20/20/0 silently becoming a non-regression-test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. The over-read itself is deterministic: for any sub-8-byte token StringHashTable::dispatch memcpy's a full 8-byte word around the key (the case 0 branch reads p[0..7] or p[size-8..] by bit 2048 of the address). What isn't pinned is the ASan trap, which fires only when a short token's cell sits at the end of a postings allocation so the read crosses a redzone, and that adjacency is a StringHashTable layout property. A layout change could stop the crossing and leave the test green on an unfixed binary.

A SQL test can't pin that layout-independently. The durable guard is the single-map-value refactor you raised in the approval: with the positions builder in the postings map value there's no second lookup, so no unpadded find() to over-read and no boundary to pin. The current fix already drops that unsafe find() (the new lower_bound compares key bytes only, no 8-byte hash read), so production is safe today; this is only about the test's strength against reintroduction. I'd land this as the minimal backportable crash fix and track the refactor plus a layout-independent test as the follow-up. Happy to widen the token set here if you want a higher-confidence trap in the meantime.

@clickhouse-gh

clickhouse-gh Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.70% +0.10%

Changed lines: Changed C/C++ lines covered by tests: 15/18 (83.33%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — cf29606

Every failure below has an owner: a fixing PR (ours), or a full-effort fix task. Only CH Inc sync is exempt. None are caused by this PR (a text-index granule-builder over-read fix in MergeTreeIndexText.cpp; neither code path is touched by the failures below).

Check / test Reason Owner / fixing PR
Stress test (arm_tsan) / Logical error getDataType() == rhs.getDataType() (STID 2508-30f6) trunk bug (HashJoin buildAdditionalFilter Nullable/non-Nullable mismatch under join_use_nulls + non-equi ON; 4 unrelated PRs, 0 master, 30d) #107957 (ours, open)
Stress test (amd_tsan) / Hung check failed, possible deadlock found trunk shutdown-deadlock hung-check family #101680 / #105905 (ours, open)
Mergeable Check / PR rollup of the two Stress failures above (no independent failure) owned via the two rows above
CH Inc sync private sync CH Inc sync (private, not actionable)

Session id: cron:our-pr-ci-monitor:20260629-233000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants