Fix OptimizedRegularExpression treating a NUL byte in the pattern as end-of-pattern by alexey-milovidov · Pull Request #108427 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix OptimizedRegularExpression treating a NUL byte in the pattern as end-of-pattern#108427

Merged
alexey-milovidov merged 9 commits into
masterfrom
fix-optimize-re-nul
Jul 1, 2026
Merged

Fix OptimizedRegularExpression treating a NUL byte in the pattern as end-of-pattern#108427
alexey-milovidov merged 9 commits into
masterfrom
fix-optimize-re-nul

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 24, 2026

Copy link
Copy Markdown
Member

OptimizedRegularExpression's pattern analyzer (analyzeImpl) stopped scanning at the first NUL byte in the pattern (case '\0': pos = end) — a C-string assumption. But the pattern is a length-based std::string_view and RE2 matches \0 literally, so this is wrong:

  • For a pattern that begins with \0, the analyzer left is_trivial = true with an empty required substring, and OptimizedRegularExpression::match then returned an empty match without ever calling RE2.
  • For a NUL elsewhere, the extracted required substring was truncated, so the strstr pre-filter could drop rows that actually match.

For example, on master:

SELECT extract('test@clickhouse.com', '\0*@(.*)$')

returns the empty string instead of clickhouse.com. RE2 itself returns clickhouse.com (it is binary-safe) — the bug was entirely in the wrapper's analyzer.

The fix treats a NUL as an ordinary literal byte (it falls through to the default literal branch). All regex functions (match, extract, extractAll, replaceRegexp*, …) benefit.

Found by the AST fuzzer on the regexp-JIT work (#108004), whose JIT matcher computed the correct result and tripped a debug cross-check against the buggy RE2 path.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed regular expression functions (match, extract, extractAll, replaceRegexpOne, replaceRegexpAll, etc.) silently returning wrong results when the pattern contained a NUL (\0) byte. The NUL is now treated as an ordinary literal byte, consistent with RE2.

Related: #108004

Version info

  • Merged into: 26.7.1.324

…end-of-pattern

`analyzeImpl` stopped scanning the pattern at the first NUL byte
(`case '\0': pos = end`) - a C-string assumption. But the pattern is a
length-based `std::string_view` and RE2 matches `\0` literally, so this is wrong:
for a pattern that begins with `\0` the analyzer left `is_trivial = true` with an
empty required substring, and `OptimizedRegularExpression::match` then returned an
empty match without ever calling RE2; for a NUL elsewhere the extracted required
substring was truncated. Either way the `strstr` pre-filter could drop rows that
actually match.

For example, on master:

    SELECT extract('test@clickhouse.com', '\0*@(.*)$')

returned the empty string instead of `clickhouse.com`. RE2 itself returns
`clickhouse.com` (it is binary-safe); the bug was entirely in the wrapper's
analyzer. Found by the AST fuzzer on the regexp-JIT work, whose JIT matcher
returned the correct result and tripped a debug cross-check against RE2.

Treat a NUL as an ordinary literal byte: it now falls through to the default
(literal) branch. Adds unit cases to `OptimizeRE.analyze` and a stateless
regression test.

Related: #108004

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 24, 2026
Comment thread src/Common/OptimizedRegularExpression.cpp
alexey-milovidov and others added 3 commits June 25, 2026 18:10
… byte

Now that the analyzer no longer stops at a NUL (`\0`) byte, captures placed
after a NUL are visible to `RegexpFunctionRewritePass`
(`optimize_rewrite_regexp_functions`, enabled by default). For `extract`, the
pass strips a leading `^.*` from the pattern, which changes which occurrence is
captured when the part after the prefix can match at more than one offset: the
greedy `^.*` selects the last match, while the stripped pattern selects the
first. For example, `extract(s, '^.*\0([a-z]+).*$')` over `x\0first y\0second`
returned `first` instead of `second` once the NUL fix exposed the capture.

Skip the rewrite for patterns containing a NUL byte. The pre-existing `^.*`
unsoundness for ordinary (non-NUL) multi-match patterns is a separate issue and
is intentionally left untouched here.

Found by the AI review of #108427

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fix that makes `OptimizedRegularExpression` treat a NUL (`\0`) byte as an
ordinary literal (consistent with re2) changes results for the four
codec-/regexp-sensitive tests that the `Fast test` flagged. All four changes
are the now-correct behavior:

- `00760_url_functions_overflow`: `match('', '\0')` is `0` (the empty string
  contains no NUL byte), not `1`.
- `02024_merge_regexp_assert`: `REGEXP('\0')` / `REGEXP('\0a')` match no
  database, so they raise `CANNOT_EXTRACT_TABLE_STRUCTURE` instead of being
  truncated to an empty, match-everything pattern that reached
  `UNKNOWN_IDENTIFIER`.
- `02370_extractAll_regress`: the malformed pattern `:"\0[^"]*)"` (unbalanced
  parenthesis) is now compiled by re2 and correctly rejected with
  `CANNOT_COMPILE_REGEXP`; the analyzer still runs first, preserving the
  use-after-scope regression coverage.
- `03594_like_perfect_affix_rewrite_pass`: `LIKE`-on-`FixedString` patterns
  ending in a literal NUL (e.g. `%a\0`) now correctly match nothing (count `0`);
  the patterns are still not rewritten, which is the point of the test.

CI report (Fast test): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=2655ec4ea519e4c601c66a61ace4203f2e2e12e6&name_0=PR&name_1=Fast%20test
PR: #108427

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
alexey-milovidov added a commit that referenced this pull request Jun 29, 2026
… byte

Now that the analyzer no longer stops at a NUL (`\0`) byte, captures placed
after a NUL are visible to `RegexpFunctionRewritePass`
(`optimize_rewrite_regexp_functions`, enabled by default). For `extract`, the
pass strips a leading `^.*` from the pattern, which changes which occurrence is
captured when the part after the prefix can match at more than one offset: the
greedy `^.*` selects the last match, while the stripped pattern selects the
first. For example, `extract(s, '^.*\0([a-z]+).*$')` over `x\0first y\0second`
returned `first` instead of `second` once the NUL fix exposed the capture.

Skip the rewrite for patterns containing a NUL byte. The pre-existing `^.*`
unsoundness for ordinary (non-NUL) multi-match patterns is a separate issue and
is intentionally left untouched here.

Found by the AI review of #108427

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 9564b4a)
alexey-milovidov added a commit that referenced this pull request Jun 29, 2026
The fix that makes `OptimizedRegularExpression` treat a NUL (`\0`) byte as an
ordinary literal (consistent with re2) changes results for the four
codec-/regexp-sensitive tests that the `Fast test` flagged. All four changes
are the now-correct behavior:

- `00760_url_functions_overflow`: `match('', '\0')` is `0` (the empty string
  contains no NUL byte), not `1`.
- `02024_merge_regexp_assert`: `REGEXP('\0')` / `REGEXP('\0a')` match no
  database, so they raise `CANNOT_EXTRACT_TABLE_STRUCTURE` instead of being
  truncated to an empty, match-everything pattern that reached
  `UNKNOWN_IDENTIFIER`.
- `02370_extractAll_regress`: the malformed pattern `:"\0[^"]*)"` (unbalanced
  parenthesis) is now compiled by re2 and correctly rejected with
  `CANNOT_COMPILE_REGEXP`; the analyzer still runs first, preserving the
  use-after-scope regression coverage.
- `03594_like_perfect_affix_rewrite_pass`: `LIKE`-on-`FixedString` patterns
  ending in a literal NUL (e.g. `%a\0`) now correctly match nothing (count `0`);
  the patterns are still not rewritten, which is the point of the test.

CI report (Fast test): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=2655ec4ea519e4c601c66a61ace4203f2e2e12e6&name_0=PR&name_1=Fast%20test
PR: #108427

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit d75ec76)
alexey-milovidov and others added 2 commits June 29, 2026 17:09
Making `OptimizedRegularExpression::analyze` binary-safe (treat a NUL as a
literal byte) also affects the text-index pruning path, which calls `analyze`
via `regexpToTokensForQueries` to extract a required substring and prune
granules.

For `match` this is consistent: it is executed by re2, which is also
binary-safe. But `multiMatchAny` is executed by vectorscan, which compiles
each pattern as a NUL-terminated C-string (`hs_compile_multi` takes no length
array), so it stops at the first NUL. With the binary-safe analyzer the index
would extract a required substring spanning the NUL that the matcher never
requires, and would wrongly prune granules that `multiMatchAny` matches -
an unsound index, returning fewer rows than a full scan.

Truncate each `multiMatchAny` pattern at the first NUL byte before analyzing,
so the index analyzes exactly what vectorscan sees. This restores the
equality between indexed and brute-force results in
`04308_text_index_function_multiMatchAny_regex_cases`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Algunenano Algunenano self-assigned this Jun 29, 2026

@Algunenano Algunenano left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The new behaviour is consistent with re2 so although there are some changes in tests (thus known behaviour) it's for the better

alexey-milovidov and others added 2 commits June 30, 2026 01:05
Master gained tests up to 04488 (and already carries several 04409_* tests),
so the new `04489_regexp_pattern_with_nul_byte` test takes the next free number.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Picked this up to advance it:

  • Merged current master into the branch (it was ~250 commits behind). The merge was clean — no conflicts, and the net code diff is byte-identical to the previously-reviewed version. Re-verified that the three changed translation units (OptimizedRegularExpression.cpp, RegexpFunctionRewritePass.cpp, MergeTreeIndexConditionText.cpp) still compile against the merged tree.
  • Renumbered the new test 04409_regexp_pattern_with_nul_byte04489_regexp_pattern_with_nul_byte. master already carries several 04409_* tests and its highest number is now 04488, so the test takes the next free number.

Review state: approved by @Algunenano and the AI review (✅), with the single inline thread (about RegexpFunctionRewritePass and optimize_rewrite_regexp_functions) already resolved by the 9564b4ada9f commit + the regression in the test.

The CI red on the previous commit was entirely unrelated flakes:

@groeneai, the 03135_keeper_client_find_commands failure on Fast test (arm_darwin) (a having stderror failure) has no tracking issue yet and is flaky across many PRs on master — could you take a look and fix it in a separate PR?

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov Already fixed: #107438 (approved by alesapin, awaiting merge).

Root cause: at process exit keeper-client calls Poco::Net::SocketImpl::shutdown() on an already-closed socket inside Coordination::ZooKeeper::finalize(), raising Net Exception: Socket is not connected (errno 57 / ENOTCONN) on stderr. stdout is correct; Fast test fails on any stderr output. It is macOS-only (Linux does not raise ENOTCONN when shutting down an already-closed socket), which is why it shows up only on arm_darwin. #107438 downgrades that one log path to TRACE so the spurious stderr line is gone.

CIDB last 7d: 8 hits across 8 distinct unrelated PRs (1 each), 0 on master, all Fast test (arm_darwin) - consistent with a benign teardown flake rather than a regexp regression.

No tracking issue exists yet. #107438 fixes the class at the source, so a dedicated issue is optional; I can open one to track it if you prefer.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, provide a fix for 03360_bool_remote and investigate hung check failures.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov on both asks:

1. 03360_bool_remote (Stateless tests (amd_tsan, parallel, 2/2), commit 3696b9d): Code 102 Unexpected packet from server 127.0.0.2:9000 (expected TablesStatusResponse got ProfileInfo), "not reproducible" after 50 reruns. This is the connection-pool desync class (recurrence of #93018 / #93029), reintroduced when #95466 added || was_cancelled to the early-return guard in RemoteQueryExecutor::finish, leaving a cancelled query's pooled connection out of sync. That is exactly what your own open PR #108854 ("Fix flaky distributed query ...") fixes, with the deterministic fail-point test 04489_distributed_table_status_desync_recovery. So 03360_bool_remote should stop failing once #108854 merges. Do you want a separate fix from me, or does #108854 cover it? I will not open a duplicate while #108854 is open.
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=3696b9d7a957d57c4d0bda2a78f8fd7a9b166662&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29

2. Hung check (Stress test (amd_tsan) and Stress test (arm_tsan), same commit 3696b9d): both have a single hung query in the processlist, identical signature:

query:               SELECT formatQuery($$SELECT [[[[[[[[[[[[[[ ... $$)   -- deeply nested array literal
is_cancelled:        1
elapsed:             1515 s (amd) / 1523 s (arm)
query_kind:          Select
peak_threads_usage:  1   thread_ids: [single]

This is not the AST-fuzzer hang (#107941): the run has ast_fuzzer_runs=0, fatal.log is empty, and the #107941 fix (#108212) is already merged and present in this commit (it is an ancestor, behind_by=0). It is also not the GlobalThreadPool shutdown deadlock family (#101680 / #105905, which show a shutdown stack in fatal.log; here the server is up and the query is live in the processlist). It is a cancellation-not-honored hang: a formatQuery() over a pathologically deeply nested array literal spins for ~1500 s while already marked cancelled, because the cancel check is not reached inside the deep AST recursion (same class as the arrayFold runaway #107680, different code path). Parse was fast (QueryParseMicroseconds=704), so the time is in the formatQuery recursion. No tracking issue exists yet. I will reproduce locally and, if confirmed, fix the cancellation/recursion path in a separate PR.
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=3696b9d7a957d57c4d0bda2a78f8fd7a9b166662&name_0=PR&name_1=Stress%20test%20%28amd_tsan%29

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov hung check follow-up (closing the investigation you asked for).

Reproduced and root-caused locally (optimized build). The hung query SELECT formatQuery($$SELECT [[[...]]]$$) is O(depth^2) and is not related to the OptimizedRegularExpression change:

  • Building + formatting a depth-N nested array literal is quadratic (depth 5k / 10k / 20k => 0.6s / 2.1s / 8.8s; a flat 40000-element array formats in 0.24s, so the cost scales with nesting depth, not element count). At the fuzzer's max_parser_depth=100000000 a ~20000-deep literal takes minutes (1515s in CI) and sits cancelled because the quadratic work is in the parse/format recursion, which has no isCancelled() check.
  • The crash / native-stack-overflow variant is already fixed on master by 5ffa0c9ae78 ("Guard recursive Field visitors against deeply nested literals", adds checkStackSize() to the recursive Field visitors + test 04412), and that guard is already present in 3696b9d7. With the default max_parser_depth=1000 the query reports TOO_DEEP_RECURSION (code 306) in 0.4s.
  • The remaining slow-but-guarded hang is purely the harness raising max_parser_depth, and it is owned by the open PR Cap recursion-controlling parser/AST settings in fuzzer and stress test #108863 ("Cap recursion-controlling parser/AST settings in fuzzer and stress test"), which caps max_parser_depth / max_ast_depth / backtracks / AST elements at their defaults in both the AST fuzzer and the stress runner (it explicitly names test 04412).

So no separate fix from me is needed: the engine guard is merged and the harness cap is #108863. The hung check on this PR should stop once #108863 merges; it is a trunk/harness signature, not caused by your change.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=3696b9d7a957d57c4d0bda2a78f8fd7a9b166662&name_0=PR&name_1=Stress%20test%20%28amd_tsan%29

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Picked this up to advance it. Approved by @Algunenano and the AI review (✅) with no unresolved threads, and the code diff vs the merge-base is unchanged.

Triaged the CI on the current commit 3696b9d7 — all five reds are unrelated, pre-existing flakes:

The merge-base with master is from today (< 1 day old) and none of the reds are fixable by merging master, so I did not re-merge. Ready to merge once the flaky stress lanes are rerun/overridden.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov agreed, both reds are pre-existing and the PR is mergeable. Two owner corrections so the reds actually clear:

03360_bool_remote: on 3696b9d7 this is not a generic timeout, it is Code 102 Unexpected packet from server 127.0.0.2:9000 (expected TablesStatusResponse, got ProfileInfo), stack Connection::getTablesStatus -> ConnectionEstablisher::run (last 2 days: 14/14 fails are this exact Code-102, across 13 PRs + master; 50/50 reruns pass = transient). That is the connection-pool desync your own open #108854 targets (the || was_cancelled early-return from #95466), so #108854 will clear this red. No separate fix from us.
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=3696b9d7a957d57c4d0bda2a78f8fd7a9b166662&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29

Hung check (amd_tsan + arm_tsan): not #107941. #107941's fix #108212 is already merged into 3696b9d7 (ancestor) and did not prevent it. The single hung query is SELECT formatQuery($$SELECT [[[ ...~20000-deep array... ]]]$$), is_cancelled=1, elapsed 1515s, one thread. formatQuery over a deeply-nested literal is O(depth^2) in both AST build and FieldVisitorToString, and neither path checks isCancelled(), so it runs on after cancel. The crash variant is already guarded on master (5ffa0c9, checkStackSize on the Field visitors); the remaining slow-but-guarded harness hang is owned by @Algunenano's open #108863 (caps max_parser_depth/max_ast_depth in the fuzzer + stress harness), which will clear this red.
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108427&sha=3696b9d7a957d57c4d0bda2a78f8fd7a9b166662&name_0=PR&name_1=Stress%20test%20%28amd_tsan%29

No regexp in either path, so neither is related to this PR's change.

The previous commit renumbered the regression test to
`04489_regexp_pattern_with_nul_byte` as the next free number, but
`master` has since gained `04489_max_threads_auto_parsing_compat`
(and `04490`/`04491` are now taken too). Move the test to the next
free number `04492` so its prefix is unique on `master`.

Test-only change; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Picked this up to advance it. Pushed one test-only commit (f135d0e70e7):

  • Renumbered the regression test 04489_regexp_pattern_with_nul_byte04492_regexp_pattern_with_nul_byte. The prior commit took 04489 as the next free number, but master has since merged 04489_max_threads_auto_parsing_compat, and 04490/04491 are now taken too (each even has two tests). 04492 is the next free prefix on master, so the test is unique again. Pure file rename — no behavior change, the code diff vs the merge-base is unchanged.

Review state unchanged: approved by @Algunenano and the AI review (✅), no unresolved threads.

The five CI reds on the previous commit 3696b9d7 are all pre-existing, unrelated flakes (no regexp in any of their code paths), each already tracked by an open issue/PR:

The merge-base with master is from today (< 1 day old) and none of the reds are fixable by merging master, so I did not re-merge. CI will re-run on f135d0e70e7. Ready to merge once the flaky stress lanes are rerun/overridden.

@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.70% 92.70% +0.00%
Branches 77.60% 77.70% +0.10%

Changed lines: Changed C/C++ lines covered: 28/30 (93.33%) · Uncovered code

Full report · Diff report

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jul 1, 2026
Merged via the queue into master with commit 998d1fe Jul 1, 2026
343 of 344 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-optimize-re-nul branch July 1, 2026 05:57
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 1, 2026
alexey-milovidov added a commit that referenced this pull request Jul 1, 2026
Reconcile with #108427 ("fix-optimize-re-nul"), which landed on master
and independently fixed the same NUL-byte handling this branch carried:

- `OptimizedRegularExpression.cpp`, `RegexpFunctionRewritePass.cpp` and
  the `gtest_optimize_re` cases converged to byte-identical content, so
  they merge with no divergence.

- `MergeTreeIndexConditionText.cpp`: master's fix is more precise - it
  truncates the pattern at the first NUL only at the `multiMatchAny`
  call site (matching vectorscan's C-string semantics) while letting
  `match` (binary-safe re2) still prune. Dropped this branch's blanket
  `regexpToTokensForQueries` early-return, which declined pruning for
  both functions and would have regressed `match` pruning back to a full
  scan. The file is now identical to master's canonical version.

- `04492_regexp_pattern_with_nul_byte.sql`: kept this branch's version,
  a superset of master's that adds the regexp-JIT `[a-\0]` descending
  character-class-range case (`CANNOT_COMPILE_REGEXP`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
alexey-milovidov added a commit that referenced this pull request Jul 1, 2026
The merge of master brought in #108427's more precise text-index NUL
handling: it truncates the pattern at the first NUL only at the
`multiMatchAny` call site (matching vectorscan's C-string semantics),
while letting `match` (binary-safe re2) still prune granules.

This branch's earlier blanket early-return in `regexpToTokensForQueries`
declined pruning for *both* functions on any NUL pattern, which would
have regressed `match` pruning back to a full scan. Remove it so the
file matches master's canonical version; correctness for `multiMatchAny`
with a NUL is already covered by master's per-call-site truncation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants