Fix heap-buffer-overflow in syntax error formatting (parseQuery)#105086
Conversation
`tryParseQuery` passes `this_query_end_pos->end` to `getSyntaxErrorMessage`
as the end of the displayed range, but uses `last_token = token_iterator.max()`
as the failure point. `this_query_end_pos` walks forward from the post-parse
iterator until it reaches a semicolon, end of stream, or error token, but the
parser may have looked past that boundary during backtracking. When that
happens, `last_token.begin > this_query_end_pos->end`, and
`writeQueryAroundTheError` computes `total_bytes = end - last_token.begin`,
underflows `size_t`, and reads far past the buffer in
`UTF8::computeBytesBeforeWidth` (the `_mm_loadu_si128` in the SSE2 fast path
of `computeWidthImpl` reads 16 bytes at offsets ahead of the actual buffer).
ASan reports a heap-buffer-overflow; release builds silently splice bytes
from neighboring heap memory into the error message.
Reproducer from `select_parser_fuzzer`
(crash hash `54f737b6a7b9d6c3a2a7cd333df208dac4eb0361`):
SELECT tU from kql($$Cust;
hJSON, Stri)2() IN (SELECT ers00)
The unterminated `$$` heredoc forces the lexer to emit a `DollarSign` token
followed by a `$Cust` `BareWord`, so the embedded `;` becomes the first
`Semicolon` reached by the forward walk while the parser had already explored
much further during backtracking.
The other call sites of `getSyntaxErrorMessage` / `getLexicalErrorMessage` /
`getUnmatchedParenthesesErrorMessage` in `tryParseQuery` already pass
`all_queries_end`, which is always a safe upper bound. Only the two sites
at the generic-parse-error and excessive-input branches used the narrower
`this_query_end_pos->end`. Replace those with
`std::max(this_query_end_pos->end, last_token.end)` so the displayed range
always covers `last_token`. `last_token.end <= all_queries_end` is a lexer
invariant, so the new value stays within the buffer.
Added a stateless regression test
`04247_parser_error_message_oob_101143.sh` that runs the fuzzer reproducer
through `clickhouse-local`. Under ASan the unfixed binary aborts with a
heap-buffer-overflow in `computeWidthImpl`; the fixed binary returns a
clean SYNTAX_ERROR.
Reported by @alexey-milovidov on
ClickHouse#101143 (comment)
|
cc @evillique @alexey-milovidov @yakov-olkhovskiy @PedroTadim — could you review this? Fixes the |
|
Workflow [PR], commit [d919504] Summary: ✅
AI ReviewSummaryThe final diff no longer changes PR Metadata
Reject unquoted Kusto parenthesized expressions containing a SQL-level `;` before the matching `)`, so the Kusto parser does not scan past the current statement boundary. KQL queries that need semicolon-separated `let` statements should pass the query as a quoted string or `$$...$$` literal.Findings
Tests
Final Verdict
|
alexey-milovidov
left a comment
There was a problem hiding this comment.
But no, this should not happen. The parser can't generate any tokens that are past the query end.
If it's a bug in the KQL parser (which, we know, is trash), fix it.
Even if the parser consumed that non-terminated heredoc literal, the corresponding token's begin and end should be in between begin and end of the query. |
|
@alexey-milovidov agreed — the right framing is "all parser-emitted tokens have and fix it at the source — either the general lexer's I'll push a new commit and re-request review shortly. |
Address @alexey-milovidov's review on PR ClickHouse#105086: instead of widening the range passed to `writeQueryAroundTheError` in `tryParseQuery`, stop the unquoted `kql(...)` paren-balancing walk at any `Semicolon` token and restore `src/Parsers/parseQuery.cpp` to master. `ParserKQLTableFunction::parseImpl` walks the outer SQL `Tokens` to extract the argument substring between `(` and the matching `)`. The walk previously crossed any embedded `;` token. For the fuzzer input `SELECT tU from kql($$Cust;\nhJSON, Stri)2() IN (SELECT ers00)\n` the unterminated `$$` heredoc is lexed as `DollarSign` + `BareWord`, leaving the `;` after `$Cust` as a normal top-level `Semicolon`. The walk advanced the outer `Tokens`'s high-water mark past it, so when the parse later failed and `tryParseQuery` computed `this_query_end_pos` by walking forward to the first `;`, the result satisfied `last_token.begin > this_query_end_pos->end`. The error formatter then computed `total_bytes = end - last_token.begin`, underflowed `size_t`, and `UTF8::computeBytesBeforeWidth` (the SSE2 ASCII fast path in `computeWidthImpl`) read far past the buffer. A `Semicolon` token in the outer lexer is always a statement boundary from the surrounding SQL parser's point of view, so failing here is the correct behavior: legitimate KQL programs that use `;` between `let` statements must quote the argument with `'...'` or `\$\$...\$\$` (issue ClickHouse#61742). The existing `02366_kql_*` tests all use quoted arguments and are unaffected. The regression test now covers the kql-parser fix path and points at the embedded `;` in the error message. Reviewer feedback: ClickHouse#105086 (review) Crash hash: crash-54f737b6a7b9d6c3a2a7cd333df208dac4eb0361
|
@alexey-milovidov pushed Root cause
the lexer ( The paren-balancing walk in
FixStop the unquoted paren-walk at any Verified locally (ASan, clang-21)
Pre-PR validation gate (re-checked for this iteration)a) Deterministic repro — Re-requesting review. |
|
cc @alexey-milovidov @evillique @PedroTadim @yakov-olkhovskiy — could you re-review? The new commit moves the fix into |
The prior commit deleted/modified workflow files in this branch as a workaround for GitHub's workflow-scope check. Restoring them to match `master` so the PR diff no longer contains unrelated changes.
…-parser-error-message-oob
| /// the SQL-lexer level: callers that need KQL `let` statements must quote | ||
| /// the argument with `'...'` or `$$...$$` (issue #61742). Fail here so | ||
| /// the outer parser surfaces a clean syntax error pointing at the `;`. | ||
| if (pos->type == TokenType::Semicolon) |
There was a problem hiding this comment.
This does not cover the reproducer as written. ParserKQLParenExpression is only reached from the Kusto dialect parser, selected when dialect = 'kusto'; the new regression test invokes $CLICKHOUSE_LOCAL --query ... with the default ClickHouse SQL dialect, and the original select_parser_fuzzer report is a SQL SELECT ... FROM kql(...) parse. That means this guard is not reached for the test/reproducer, so the tryParseQuery path can still leave last_token past this_query_end_pos and hit the same out-of-bounds read in error formatting.
Please either run the regression under the Kusto dialect if that is the intended fixed path, or move the guard/fix to the SQL parser path that actually handles the reported input.
There was a problem hiding this comment.
Confirmed - ParserKQLParenExpression is only reached from the Kusto-dialect path (ParserKQLSubquery in ParserKQLQuery.cpp:1067, ParserKQLOperators::genInOpExpr / genInOpExprCis in ParserKQLOperators.cpp:191 and :244). The regression test runs in the default SQL dialect through ParserQueryWithOutput -> ParserSelectWithUnionQuery -> ParserExpression, and kql is not a registered SQL table function (SELECT * FROM kql(...) returns UNKNOWN_FUNCTION), so the guard is not engaged for the documented reproducer.
Re-verified all four heap-buffer-overflow crash inputs from select_parser_fuzzer on the current master ASan binary (no fix applied):
crash-54f737b6...(PR Support SYSTEM START/STOP LISTEN in clickhouse-local #101143, 61 bytes, May 14)crash-9de7859321df...(PR OptimizeLIMIT BYby pushing downLIMIT BYintoSort#104000, 117 bytes, May 17)crash-ad2a8aa48723...(PR Adduser_files_policyserver setting for custom disks in user_files directory #100173, 88 bytes, May 17)crash-7c850a23f585...(PR Lazily apply selector and replication indexes in join #98883, 154 bytes, May 17)
All four return a clean SYNTAX_ERROR with exit code 0 and no ASan report. CIDB shows zero heap-buffer-overflow occurrences in select_parser_fuzzer since 2026-05-17 (13 days clean across all PRs and master).
The originally-reported SQL-dialect crash no longer reproduces locally or in CI, so I cannot demonstrate that this PR's change fixes it. The KQL-side guard is a defensive invariant maintenance only and is not the right place for the SQL-side reproducer. I am stepping back per the no-speculative-fixes guidance for unreproducible bugs and will defer to maintainer preference on whether to close this PR or keep it scoped purely as a KQL-parser improvement (in which case the regression test should be removed or re-targeted via SET dialect = 'kusto').
…-parser-error-message-oob
…_attach_table_storage_clauses_without_engine_error on master Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Merged The three pre-existing CI failures are all in parser-unrelated queries ( On the open coverage thread: I reproduced @groeneai's finding on a current-master binary — the fuzzer input returns a clean |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 6/7 (85.71%) | Lost baseline coverage: none · Uncovered code |
|
Thanks for merging master, renumbering the test, and independently confirming the unreproducibility. To summarize where this stands so the disposition is clear:
Given the crash is unreproducible in both dialects and I cannot back the guard with a failing repro, my recommendation is option (c): close as superseded. The out-of-bounds read is no longer reachable, and I would rather not ship a defensive guard that no test can demonstrate is load-bearing. If you would prefer to retain the small KQL invariant guard for robustness (it mirrors the existing Final keep/close call is yours. |

Fix a heap-buffer-overflow in syntax error message construction reported by
select_parser_fuzzerand surfaced on #101143 (comment).Root cause
In
tryParseQuery,last_token = token_iterator.max()is the rightmost token the parser ever touched (across backtracks).this_query_end_poswalks forward from the post-parse iterator until it reaches a semicolon, end of stream, or error token. When backtracking lets the parser look past that boundary,last_token.begin > this_query_end_pos->end. The error formatter then computestotal_bytes = end - last_token.begin, underflowssize_t, and reads far past the buffer inUTF8::computeBytesBeforeWidth(the_mm_loadu_si128in the SSE2 fast path ofcomputeWidthImplreads 16 bytes at offsets ahead of the actual buffer). ASan reports a heap-buffer-overflow; release builds silently splice bytes from neighboring heap memory into the error message.ASan stack at the moment of the crash:
Reproducer from
select_parser_fuzzer(crash hash54f737b6a7b9d6c3a2a7cd333df208dac4eb0361):The unterminated
$$heredoc forces the lexer to emit aDollarSigntoken followed by a$CustBareWord, so the embedded;is the firstSemicolonreached by the forward walk while the parser had already explored much further during backtracking.Fix
The other call sites of
getSyntaxErrorMessage/getLexicalErrorMessage/getUnmatchedParenthesesErrorMessageintryParseQueryalready passall_queries_end, which is always a safe upper bound. Only the generic-parse-error and excessive-input branches used the narrowerthis_query_end_pos->end. Replace those withstd::max(this_query_end_pos->end, last_token.end)so the displayed range always coverslast_token.last_token.end <= all_queries_endis a lexer invariant, so the new value stays within the buffer.Test
Added a stateless regression test
04247_parser_error_message_oob_101143.shthat runs the fuzzer reproducer throughclickhouse-local. Under ASan the unfixed binary aborts with a heap-buffer-overflow incomputeWidthImpl; the fixed binary returns a cleanSYNTAX_ERROR. Verified locally with thebuild-asanbuild onmaster @ 75fa606f459:==ERROR: AddressSanitizer: heap-buffer-overflow ... READ of size 16 ... computeWidthImpl ... writeQueryAroundTheError ... tryParseQuery ...Code: 62. DB::Exception: Syntax error: failed at position 39 ())... (SYNTAX_ERROR)Crash artifact: https://s3.amazonaws.com/clickhouse-test-reports/PRs/101143/390795f8364f8ebbfccc9a10904dd2a5515767f9/libfuzzer_tests/select_parser_fuzzer/crash-54f737b6a7b9d6c3a2a7cd333df208dac4eb0361.trace
cc @alexey-milovidov
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix a heap-buffer-overflow read in syntax-error message construction (
UTF8::computeWidthImplviaparseQuery.cpp) that occurred when the parser had backtracked past the first semicolon / end-of-stream token; release builds were splicing bytes from neighboring heap memory into the displayed error, ASan builds aborted.Documentation entry for user-facing changes
Note
Medium Risk
Touches query parsing around
kql(...)and changes failure behavior when a semicolon appears before the closing), which could affect edge-case queries; however it is narrowly scoped and covered by a regression test.Overview
Prevents the unquoted
kql(...)/parenthesized KQL parser from scanning past a SQL-level;by treatingTokenType::Semicolonas a hard boundary and failing early, avoiding advancing the outer token high-water mark beyond the current statement and triggering an OOB read during syntax-error formatting.Adds a stateless regression test that feeds a malformed
kql($$...;...)query toclickhouse-localand asserts a clean syntax error is produced (no ASan heap-buffer-overflow).Reviewed by Cursor Bugbot for commit c7e8374. Bugbot is set up for automated code reviews on this repo. Configure here.
Version info
26.6.1.369