Fix OptimizedRegularExpression treating a NUL byte in the pattern as end-of-pattern#108427
Conversation
…end-of-pattern
`analyzeImpl` stopped scanning the pattern at the first NUL byte
(`case '\0': pos = end`) - a C-string assumption. But the pattern is a
length-based `std::string_view` and RE2 matches `\0` literally, so this is wrong:
for a pattern that begins with `\0` the analyzer left `is_trivial = true` with an
empty required substring, and `OptimizedRegularExpression::match` then returned an
empty match without ever calling RE2; for a NUL elsewhere the extracted required
substring was truncated. Either way the `strstr` pre-filter could drop rows that
actually match.
For example, on master:
SELECT extract('test@clickhouse.com', '\0*@(.*)$')
returned the empty string instead of `clickhouse.com`. RE2 itself returns
`clickhouse.com` (it is binary-safe); the bug was entirely in the wrapper's
analyzer. Found by the AST fuzzer on the regexp-JIT work, whose JIT matcher
returned the correct result and tripped a debug cross-check against RE2.
Treat a NUL as an ordinary literal byte: it now falls through to the default
(literal) branch. Adds unit cases to `OptimizeRE.analyze` and a stateless
regression test.
Related: #108004
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… byte Now that the analyzer no longer stops at a NUL (`\0`) byte, captures placed after a NUL are visible to `RegexpFunctionRewritePass` (`optimize_rewrite_regexp_functions`, enabled by default). For `extract`, the pass strips a leading `^.*` from the pattern, which changes which occurrence is captured when the part after the prefix can match at more than one offset: the greedy `^.*` selects the last match, while the stripped pattern selects the first. For example, `extract(s, '^.*\0([a-z]+).*$')` over `x\0first y\0second` returned `first` instead of `second` once the NUL fix exposed the capture. Skip the rewrite for patterns containing a NUL byte. The pre-existing `^.*` unsoundness for ordinary (non-NUL) multi-match patterns is a separate issue and is intentionally left untouched here. Found by the AI review of #108427 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fix that makes `OptimizedRegularExpression` treat a NUL (`\0`) byte as an
ordinary literal (consistent with re2) changes results for the four
codec-/regexp-sensitive tests that the `Fast test` flagged. All four changes
are the now-correct behavior:
- `00760_url_functions_overflow`: `match('', '\0')` is `0` (the empty string
contains no NUL byte), not `1`.
- `02024_merge_regexp_assert`: `REGEXP('\0')` / `REGEXP('\0a')` match no
database, so they raise `CANNOT_EXTRACT_TABLE_STRUCTURE` instead of being
truncated to an empty, match-everything pattern that reached
`UNKNOWN_IDENTIFIER`.
- `02370_extractAll_regress`: the malformed pattern `:"\0[^"]*)"` (unbalanced
parenthesis) is now compiled by re2 and correctly rejected with
`CANNOT_COMPILE_REGEXP`; the analyzer still runs first, preserving the
use-after-scope regression coverage.
- `03594_like_perfect_affix_rewrite_pass`: `LIKE`-on-`FixedString` patterns
ending in a literal NUL (e.g. `%a\0`) now correctly match nothing (count `0`);
the patterns are still not rewritten, which is the point of the test.
CI report (Fast test): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=2655ec4ea519e4c601c66a61ace4203f2e2e12e6&name_0=PR&name_1=Fast%20test
PR: #108427
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… byte Now that the analyzer no longer stops at a NUL (`\0`) byte, captures placed after a NUL are visible to `RegexpFunctionRewritePass` (`optimize_rewrite_regexp_functions`, enabled by default). For `extract`, the pass strips a leading `^.*` from the pattern, which changes which occurrence is captured when the part after the prefix can match at more than one offset: the greedy `^.*` selects the last match, while the stripped pattern selects the first. For example, `extract(s, '^.*\0([a-z]+).*$')` over `x\0first y\0second` returned `first` instead of `second` once the NUL fix exposed the capture. Skip the rewrite for patterns containing a NUL byte. The pre-existing `^.*` unsoundness for ordinary (non-NUL) multi-match patterns is a separate issue and is intentionally left untouched here. Found by the AI review of #108427 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 9564b4a)
The fix that makes `OptimizedRegularExpression` treat a NUL (`\0`) byte as an
ordinary literal (consistent with re2) changes results for the four
codec-/regexp-sensitive tests that the `Fast test` flagged. All four changes
are the now-correct behavior:
- `00760_url_functions_overflow`: `match('', '\0')` is `0` (the empty string
contains no NUL byte), not `1`.
- `02024_merge_regexp_assert`: `REGEXP('\0')` / `REGEXP('\0a')` match no
database, so they raise `CANNOT_EXTRACT_TABLE_STRUCTURE` instead of being
truncated to an empty, match-everything pattern that reached
`UNKNOWN_IDENTIFIER`.
- `02370_extractAll_regress`: the malformed pattern `:"\0[^"]*)"` (unbalanced
parenthesis) is now compiled by re2 and correctly rejected with
`CANNOT_COMPILE_REGEXP`; the analyzer still runs first, preserving the
use-after-scope regression coverage.
- `03594_like_perfect_affix_rewrite_pass`: `LIKE`-on-`FixedString` patterns
ending in a literal NUL (e.g. `%a\0`) now correctly match nothing (count `0`);
the patterns are still not rewritten, which is the point of the test.
CI report (Fast test): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=2655ec4ea519e4c601c66a61ace4203f2e2e12e6&name_0=PR&name_1=Fast%20test
PR: #108427
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit d75ec76)
Making `OptimizedRegularExpression::analyze` binary-safe (treat a NUL as a literal byte) also affects the text-index pruning path, which calls `analyze` via `regexpToTokensForQueries` to extract a required substring and prune granules. For `match` this is consistent: it is executed by re2, which is also binary-safe. But `multiMatchAny` is executed by vectorscan, which compiles each pattern as a NUL-terminated C-string (`hs_compile_multi` takes no length array), so it stops at the first NUL. With the binary-safe analyzer the index would extract a required substring spanning the NUL that the matcher never requires, and would wrongly prune granules that `multiMatchAny` matches - an unsound index, returning fewer rows than a full scan. Truncate each `multiMatchAny` pattern at the first NUL byte before analyzing, so the index analyzes exactly what vectorscan sees. This restores the equality between indexed and brute-force results in `04308_text_index_function_multiMatchAny_regex_cases`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Algunenano
left a comment
There was a problem hiding this comment.
LGTM. The new behaviour is consistent with re2 so although there are some changes in tests (thus known behaviour) it's for the better
Master gained tests up to 04488 (and already carries several 04409_* tests), so the new `04489_regexp_pattern_with_nul_byte` test takes the next free number. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Picked this up to advance it:
Review state: approved by @Algunenano and the AI review (✅), with the single inline thread (about The CI red on the previous commit was entirely unrelated flakes:
@groeneai, the |
|
@alexey-milovidov Already fixed: #107438 (approved by alesapin, awaiting merge). Root cause: at process exit CIDB last 7d: 8 hits across 8 distinct unrelated PRs (1 each), 0 on master, all No tracking issue exists yet. #107438 fixes the class at the source, so a dedicated issue is optional; I can open one to track it if you prefer. |
|
@groeneai, provide a fix for |
|
@alexey-milovidov on both asks: 1. 2. Hung check ( This is not the AST-fuzzer hang (#107941): the run has |
|
@alexey-milovidov hung check follow-up (closing the investigation you asked for). Reproduced and root-caused locally (optimized build). The hung query
So no separate fix from me is needed: the engine guard is merged and the harness cap is #108863. The hung check on this PR should stop once #108863 merges; it is a trunk/harness signature, not caused by your change. |
|
Picked this up to advance it. Approved by @Algunenano and the AI review (✅) with no unresolved threads, and the code diff vs the merge-base is unchanged. Triaged the CI on the current commit
The merge-base with |
|
@alexey-milovidov agreed, both reds are pre-existing and the PR is mergeable. Two owner corrections so the reds actually clear:
Hung check (amd_tsan + arm_tsan): not #107941. #107941's fix #108212 is already merged into No regexp in either path, so neither is related to this PR's change. |
The previous commit renumbered the regression test to `04489_regexp_pattern_with_nul_byte` as the next free number, but `master` has since gained `04489_max_threads_auto_parsing_compat` (and `04490`/`04491` are now taken too). Move the test to the next free number `04492` so its prefix is unique on `master`. Test-only change; no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Picked this up to advance it. Pushed one test-only commit (
Review state unchanged: approved by @Algunenano and the AI review (✅), no unresolved threads. The five CI reds on the previous commit
The merge-base with |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 28/30 (93.33%) · Uncovered code |
Reconcile with #108427 ("fix-optimize-re-nul"), which landed on master and independently fixed the same NUL-byte handling this branch carried: - `OptimizedRegularExpression.cpp`, `RegexpFunctionRewritePass.cpp` and the `gtest_optimize_re` cases converged to byte-identical content, so they merge with no divergence. - `MergeTreeIndexConditionText.cpp`: master's fix is more precise - it truncates the pattern at the first NUL only at the `multiMatchAny` call site (matching vectorscan's C-string semantics) while letting `match` (binary-safe re2) still prune. Dropped this branch's blanket `regexpToTokensForQueries` early-return, which declined pruning for both functions and would have regressed `match` pruning back to a full scan. The file is now identical to master's canonical version. - `04492_regexp_pattern_with_nul_byte.sql`: kept this branch's version, a superset of master's that adds the regexp-JIT `[a-\0]` descending character-class-range case (`CANNOT_COMPILE_REGEXP`). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The merge of master brought in #108427's more precise text-index NUL handling: it truncates the pattern at the first NUL only at the `multiMatchAny` call site (matching vectorscan's C-string semantics), while letting `match` (binary-safe re2) still prune granules. This branch's earlier blanket early-return in `regexpToTokensForQueries` declined pruning for *both* functions on any NUL pattern, which would have regressed `match` pruning back to a full scan. Remove it so the file matches master's canonical version; correctness for `multiMatchAny` with a NUL is already covered by master's per-call-site truncation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

OptimizedRegularExpression's pattern analyzer (analyzeImpl) stopped scanning at the first NUL byte in the pattern (case '\0': pos = end) — a C-string assumption. But the pattern is a length-basedstd::string_viewand RE2 matches\0literally, so this is wrong:\0, the analyzer leftis_trivial = truewith an empty required substring, andOptimizedRegularExpression::matchthen returned an empty match without ever calling RE2.strstrpre-filter could drop rows that actually match.For example, on
master:returns the empty string instead of
clickhouse.com. RE2 itself returnsclickhouse.com(it is binary-safe) — the bug was entirely in the wrapper's analyzer.The fix treats a NUL as an ordinary literal byte (it falls through to the default literal branch). All regex functions (
match,extract,extractAll,replaceRegexp*, …) benefit.Found by the AST fuzzer on the regexp-JIT work (#108004), whose JIT matcher computed the correct result and tripped a debug cross-check against the buggy RE2 path.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed regular expression functions (
match,extract,extractAll,replaceRegexpOne,replaceRegexpAll, etc.) silently returning wrong results when the pattern contained a NUL (\0) byte. The NUL is now treated as an ordinary literal byte, consistent with RE2.Related: #108004
Version info
26.7.1.324