Guard ColumnString::filter against offsets inconsistent with chars array by groeneai · Pull Request #109184 · ClickHouse/ClickHouse · GitHub
Skip to content

Guard ColumnString::filter against offsets inconsistent with chars array#109184

Open
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:fix-columnstring-filter-oob-inconsistent-offsets
Open

Guard ColumnString::filter against offsets inconsistent with chars array#109184
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:fix-columnstring-filter-oob-inconsistent-offsets

Conversation

@groeneai

@groeneai groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a possible out-of-bounds write in ColumnString::filter when a String column arrives with offsets inconsistent with its chars array (for example offsets.back() not equal to chars.size()). Such a column now raises an INCORRECT_DATA exception instead of reading and writing out of bounds.

Description

ColumnString::filter (both the const and in-place overloads) passed offsets and chars straight to filterArraysImpl / filterArraysImplInPlace without checking that they are consistent. The SIMD fast path in filterArraysImplGeneric (ColumnsCommon.cpp) trusts offsets to size its copies: for a full-passing 64-row chunk it computes chunk_size = offsets_pos[63] - offsets_pos[-1], then does resize + memcpy of that many bytes. If a corrupt column arrives with offsets.back() != chars.size(), chunk_size is bogus and the memcpy runs off the end of the buffer, an out-of-bounds write.

This was observed in CI as a SEGV ("address not mapped to object") with the stack:

__asan_memcpy
ColumnsCommon.cpp  ResultOffsetsBuilder::insertChunk<64> / filterArraysImpl<char8_t>
ColumnString.cpp   ColumnString::filter
ColumnNullable.cpp ColumnNullable::filter
FilterDescription.cpp FilterDescription::filter
FilterTransform.cpp   FilterTransform::doTransform

while filtering a nullable String column produced by the Parquet reader on a metadata-cache-hit path. The inconsistent column is produced upstream and the exact producer path is not deterministically reproducible, so this change does not touch the upstream producer. Instead it guards the point where the corruption turns into memory unsafety: ColumnString::filter now validates the same invariant the ColumnString copy constructor already enforces (offsets.back() == chars.size()), matching the guard recently added to the sibling ColumnString::insertRangeFrom path (#107196). A violated invariant becomes a clean INCORRECT_DATA exception, which is handled normally in release builds, rather than a silent out-of-bounds write.

The guard is O(1) (a single comparison on the already-loaded last offset), so there is no hot-path cost for well-formed columns.

Unit tests in gtest_column_string.cpp build a ColumnString whose offsets overshoot chars and assert that filter() (both overloads) throws INCORRECT_DATA instead of reading/writing out of bounds. Without the guard, the large-overshoot case is an AddressSanitizer heap-buffer-overflow. A positive test confirms a consistent column still filters correctly over both the SIMD and the tail paths.

No related open issue found.

ColumnString::filter (both the const and in-place overloads) handed
`offsets` and `chars` straight to filterArraysImpl / filterArraysImplInPlace
without checking that they are consistent. The SIMD fast path in
filterArraysImplGeneric (ColumnsCommon.cpp) trusts `offsets` to compute chunk
sizes: for a full-passing 64-row chunk it does
chunk_size = offsets_pos[63] - offsets_pos[-1], then resize + memcpy that many
bytes. If a corrupt column arrives with offsets.back() != chars.size() (offsets
overshoot chars, or non-monotonic), chunk_size is bogus and the memcpy runs off
the end of the destination/source buffer, an out-of-bounds write.

Observed in CI as a SEGV ("address not mapped") with the stack
  __asan_memcpy
  ColumnsCommon.cpp ResultOffsetsBuilder::insertChunk<64> / filterArraysImpl<char8_t>
  ColumnString::filter -> ColumnNullable::filter -> FilterDescription::filter
  -> FilterTransform::doTransform
when filtering a nullable String column produced by the parquet reader on a
metadata-cache-hit path. The corrupt column is produced upstream; this change
does not fix the upstream producer, it converts the silent out-of-bounds write
into a clean INCORRECT_DATA exception so the corruption fails safely and is
diagnosable instead of crashing (or silently corrupting memory) in release
builds.

This mirrors the invariant the ColumnString copy constructor already enforces
(offsets.back() == chars.size()) and the guard recently added to the sibling
ColumnString::insertRangeFrom path.

Adds unit tests that build a ColumnString whose offsets overshoot chars and
assert filter() throws INCORRECT_DATA instead of reading/writing out of bounds
(the large-overshoot case is an ASan heap-buffer-overflow without the guard),
plus a positive test that a consistent column still filters correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

cc @Avogar @CurtizJ @davenger could you review this? It adds an O(1) consistency guard to ColumnString::filter (both overloads) that rejects a column whose offsets.back() != chars.size() with INCORRECT_DATA, instead of letting the filterArraysImpl SIMD fast path do an out-of-bounds memcpy (seen in CI as an ASan SEGV under FilterTransform on a nullable String from the Parquet reader). It mirrors the invariant the copy constructor enforces and the sibling guard added to insertRangeFrom in #107196.

@PedroTadim PedroTadim added the can be tested Allows running workflows for external contributors label Jul 2, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [7ce9c3a]

Summary:

job_name test_name status info comment
Integration tests (amd_msan, 8/8) FAIL
test_keeper_session_refuse_stale_server/test.py::test_session_refused_on_stale_keeper FAIL cidb

AI Review

Summary

This PR adds invariant checks in both ColumnString::filter overloads and regression tests for the offsets.back() != chars.size() corruption case. The direction is correct, but the guard is incomplete: filterArraysImpl also assumes the offsets are monotonic, so a corrupted ColumnString with a valid final offset can still underflow arr_size and hit the same out-of-bounds resize/memcpy path.

Findings

❌ Blockers

  • [src/Columns/ColumnString.cpp:208] filterArraysImpl and filterArraysImplInPlace subtract each offset from the previous one, so offsets.back() == chars.size() is not enough to make the data safe. A column like offsets = [4, 2, 12] passes the new guard, but filtering the second row computes arr_size = 2 - 4, underflows, and drives a huge resize/memcpy in ColumnsCommon.cpp. Because the upstream corruption source is still unknown, ColumnString::filter needs to reject non-monotonic offsets here as well, and the tests should cover the decreasing/dipping cases the same way insertRangeFrom does.
Final Verdict

❌ Changes requested.

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jul 2, 2026
/// Enforce the ColumnString invariant (offsets.back() == chars.size(), same as the copy
/// constructor) before filterArraysImpl trusts `offsets` to size its memcpys; a violated
/// invariant would otherwise read/write out of bounds instead of failing cleanly.
if (offsets.back() != chars.size())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filterArraysImpl and filterArraysImplInPlace rely on more than offsets.back() == chars.size(): copy_array subtracts every offset from the previous one. A corrupt column like offsets = [4, 2, 12] still passes this new check, but filtering the second row makes arr_size = 2 - 4, which underflows and drives a huge resize/memcpy in both paths. Since the upstream producer is still unknown, the guard here needs to reject non-monotonic offsets too, and the new tests should cover the decreasing/dipping cases the same way insertRangeFrom already does.

@clickhouse-gh

clickhouse-gh Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.50% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 52/53 (98.11%) · Uncovered code

Full report · Diff report

@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 7ce9c3a

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
Integration tests (amd_msan, 8/8) / test_keeper_session_refuse_stale_server::test_session_refused_on_stale_keeper flaky (wait_zookeeper_to_start ZK-restart race; 30d: 6 hits / 6 unrelated PRs / 0 master; unrelated to this PR's ColumnString::filter guard) #109037 (ours, open)

Note: Mergeable Check / PR are rollups of the single flaky Integration line above, owned by #109037.

Session id: cron:our-pr-ci-monitor:20260702-173000

@groeneai

groeneai commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 7ce9c3a

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
Integration tests (amd_msan, 8/8) / test_keeper_session_refuse_stale_server::test_session_refused_on_stale_keeper chronic flaky (14d: 5 unrelated PRs, 1 hit each; Keeper session test unrelated to this PR's ColumnArray::getExtremes fix) #109037 (ours, open)

Session id: cron:our-pr-ci-monitor:20260703-043000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants