Fix `**/` (globstar) to match zero or more directory components by alexey-milovidov · Pull Request #97676 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix **/ (globstar) to match zero or more directory components#97676

Merged
alexey-milovidov merged 54 commits into
masterfrom
fix-globstar-zero-level
Jul 1, 2026
Merged

Fix **/ (globstar) to match zero or more directory components#97676
alexey-milovidov merged 54 commits into
masterfrom
fix-globstar-zero-level

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Feb 23, 2026

Copy link
Copy Markdown
Member

Previously, **/ in glob patterns was not handled as a special case, so data/**/file.txt would not match data/file.txt (zero directory levels). The ** was converted to [^/]*[^{}]* which, followed by a literal /, required at least one path segment.

Pre-process **/ into the regex ([^{}]*/)* which correctly matches zero or more directory components. Standalone ** at end of path retains the original behavior.

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

The **/ glob (any amount of directories) now also matches the same directory. Previously, **/ in glob patterns was not handled as a special case, so data/**/file.txt would not match data/file.txt (zero directory levels).

Version info

  • Merged into: 26.7.1.336

Previously, `**/` in glob patterns was not handled as a special case,
so `data/**/file.txt` would not match `data/file.txt` (zero directory
levels). The `**` was converted to `[^/]*[^{}]*` which, followed by
a literal `/`, required at least one path segment.

Pre-process `**/` into the regex `([^{}]*/)*` which correctly matches
zero or more directory components. Standalone `**` at end of path
retains the original behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread src/Common/parseGlobs.cpp Outdated
@clickhouse-gh

clickhouse-gh Bot commented Feb 23, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-backward-incompatible Pull request with backwards incompatible changes label Feb 23, 2026
Comment thread src/Common/parseGlobs.cpp Outdated
alexey-milovidov and others added 3 commits March 29, 2026 00:53
…kahead

Replace the `\x01` sentinel substitution approach with direct lookahead
in the character-by-character processing loop. This avoids the correctness
issue with paths containing the byte 0x01 and is cleaner overall.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Common/parseGlobs.cpp Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member Author

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

Comment thread src/Common/tests/gtest_makeRegexpPatternFromGlobs.cpp
…s match

Previously the `**/` pattern expanded to `([^{}]*/)*`, which failed to
match paths where a directory component contains `{` or `}`. This was a
regression versus the pre-PR behavior.

Replace `[^{}]` with `[^/]` inside the globstar group so any non-slash
byte is allowed, and add a regression test covering `data/{a}/file.txt`
against the `data/**/file.txt` pattern.
Comment thread src/Common/parseGlobs.cpp Outdated
alexey-milovidov and others added 3 commits April 25, 2026 18:56
The previous comment said the pattern matches "across directories" without
explaining how. Clarify that the combined `[^/]*[^{}]*` works because the
trailing `[^{}]*` permits `/` (while still excluding `{` and `}`), preserving
the original pre-globstar-fix behavior for `**` at end of pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…evel

# Conflicts:
#	programs/keeper-bench/Runner.cpp
#	programs/server/play.html
#	src/Disks/DiskObjectStorage/DiskObjectStorage.cpp
#	src/Functions/initializeAggregation.cpp
#	src/Functions/toInterval.cpp
#	src/IO/WriteHelpers.cpp
#	tests/integration/test_mask_sensitive_info/test.py
#	tests/integration/test_storage_delta/test.py
Comment thread src/Common/parseGlobs.cpp Outdated
The previous rewrite of the final-processing loop changed the emitted
regex for patterns with 3+ consecutive `*` even when there is no `**/`
token, e.g. for `***` it emitted `[^/]*[^{}]*[^/]*` instead of the
legacy `[^/]*[^{}]*[^{}]*`. That is a user-visible behavior change
unrelated to the `**/` zero-level fix.

Restore the original character-by-character logic with `previous`
tracking and only special-case `**/` upfront. Add regression tests
for `***`, `a***b`, and `****` so the legacy regex stays pinned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/parseGlobs.cpp
The `**/` rewrite previously fired at any position where `*` `*` `/`
appeared, including positions inside a run of 3+ consecutive stars.
For input `***/file.txt` the loop would emit `[^/]*` for the first
star and then immediately match `**/` at index 1, producing
`[^/]*([^/]*/)*file\.txt` and silently changing legacy semantics for
patterns that already worked.

Require `previous != '*'` so the `**/` rewrite only triggers when it
starts a fresh star token, and add regression tests for `***/file.txt`
and `a***/b`.
Comment thread tests/queries/0_stateless/04320_globstar_zero_level_directory.sh Outdated
alexey-milovidov and others added 3 commits June 19, 2026 14:03
The `matched_paths` set that deduplicates adjacent globstars (`**/**/*.tsv`)
was declared once in `listFilesWithRegexpMatching` and shared across every
`expandSelectionGlob` alternative. That changed brace-expansion semantics
beyond the intended `**/` contract: independent alternatives that resolve to
the same concrete file (e.g. `file('{top,top}.tsv', ...)`) were silently
collapsed instead of being read once per alternative as before.

Move `matched_paths` inside the per-expansion loop so deduplication stays
scoped to the recursive globstar traversal of a single expanded pattern,
while independent brace-expanded alternatives keep their pre-existing
behavior. Added a regression to `04320_globstar_zero_level_directory`
asserting that `{top,top}.tsv` returns the file twice.

Addresses the review on `src/Storages/StorageFile.cpp`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`04320_globstar_zero_level_directory` only uses its own private
`USER_FILES_PATH/04320` subtree and does not mutate shared server state,
so the `no-parallel` tag is not needed and only serializes the stateless
suite more than necessary.

Addresses the review on the test tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Continued this PR:

  • Addressed the AI review 'Request changes' verdict on the local-path deduplication. The matched_paths set in listFilesWithRegexpMatching was declared once and shared across every expandSelectionGlob alternative, so it changed brace-expansion semantics beyond the **/ contract: independent alternatives resolving to the same concrete file (e.g. file('{top,top}.tsv', ...)) were silently collapsed instead of being read once per alternative. Moved the set inside the per-expansion loop (commit 0d86e8d) so deduplication stays scoped to the recursive globstar traversal of a single expanded pattern, while independent brace-expanded alternatives keep their pre-existing behavior.
  • Added a regression to 04320_globstar_zero_level_directory asserting that {top,top}.tsv returns the file twice (the per-alternative read is preserved), alongside the existing **/**/*.tsv dedup case.
  • Dropped the unnecessary no-parallel tag from the test (commit 569b55e) — it only uses its own USER_FILES_PATH/04320 subtree and does not mutate shared server state.
  • Merged master (was 774 commits behind and red) — clean merge, no conflicts; master had not touched StorageFile.cpp/parseGlobs.cpp since the merge base.

Verified all five glob cases against a freshly built binary via clickhouse local: **/*.tsv matches all levels including the top-level file, **/top.tsv matches at zero level, **/braced.tsv matches the braced directory, **/**/*.tsv returns each file exactly once, and {top,top}.tsv returns the top-level file twice.

The two prior CI failures are unrelated to a glob-matching change: the Stress test (arm_release) Logical error query is a fuzzer-found analyzer issue in SortingAggregatedTransform::prepare() for a GROUP BY ... WITH TOTALS query (a distinct STID/PR each occurrence on master), and the BuzzHouse (amd_debug) ERROR is a harness infra failure (status.tsv not found, 7-21/day across distinct PRs on master).

…_files dir

The test created its fixtures under a fixed `user_files/04320` subdirectory.
After the `no-parallel` tag was dropped, the flaky check (which runs the same
test concurrently many times) had several instances racing on that single
directory: one run's `rm -rf`/`mkdir` would clobber the files another run was
querying, so queries intermittently returned empty or partial results and the
output diverged from the reference. All 26 flaky-check failures across the four
sanitizers were exactly this one test.

Use a unique per-invocation subdirectory named after `$CLICKHOUSE_DATABASE`
instead of a fixed `04320`. `clickhouse-test` generates a fresh random
`test_<...>` database for every test execution (including each flaky-check
rerun), so the directory is now distinct per run and concurrent instances no
longer collide. This is the same isolation pattern already used by tests such
as `00309_formats_case_insensitive` and `01098_msgpack_format`, and keeps the
test parallel-safe without re-introducing the `no-parallel` tag. The reference
is unchanged (the directory name does not appear in query output).

CI report: #97676

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Fixed the only red — the flaky check on 04320_globstar_zero_level_directory (all 26 failures across the four sanitizers were this single test).

Root cause: when the no-parallel tag was dropped in 569b55e, the test was still creating its fixtures under a fixed user_files/04320 subdirectory. The flaky check runs the same test concurrently many times, so independent instances raced on that one directory — one run's rm -rf/mkdir clobbered the files another run was querying, producing intermittently empty/partial results that diverged from the reference.

Fix (cd2d351bc12): use a unique per-invocation subdirectory named after $CLICKHOUSE_DATABASE instead of the fixed 04320. clickhouse-test generates a fresh random test_<...> database for every test execution (including each flaky-check rerun), so the directory is now distinct per run and concurrent instances no longer collide. This is the same isolation pattern already used by e.g. 00309_formats_case_insensitive and 01098_msgpack_format, and keeps the test parallel-safe without re-introducing no-parallel. The reference is unchanged (the directory name never appears in query output).

Verified the path substitution and globbing mechanics end-to-end against a binary (no shell-quoting issues; the unique directory resolves correctly in every file(...) argument). The feature behaviour itself was already green in the non-flaky stateless runs. Did not merge master (merge base is ~2 days old, well under the week boundary, and the red was self-caused rather than a flaky master failure).

Comment thread src/Storages/StorageFile.cpp
alexey-milovidov and others added 5 commits June 27, 2026 04:56
Master merged colliding `04320_*` tests; the highest existing prefix is
now `04488`, so move the globstar regression to `04489`.
The AI review flagged an inconsistency between the two glob-matching code
paths. `makeRegexpPatternFromGlobs` rewrote any non-overlapping `**/` token
into the zero-level form `([^/]*/)*`, so it covered patterns like
`?**/file.txt` and `*?**/file.txt`, but the local `file`/`Filesystem`
listing in `StorageFile` applies zero-level semantics only when a path
segment is exactly `**` (`current_glob == "/**"`). For a pattern such as
`?**/file.txt` the regex helper and the local listing therefore diverged.

Treat `**` as a globstar only when it forms a whole path segment, i.e. when
it is bounded by `/` (or the start of the string) on the left and by `/` on
the right. This matches conventional glob semantics (Bash `globstar`, where
`**` is special only as a complete path component) and aligns the helper
with the already-correct local listing path. A `**` adjacent to other
characters in a segment (`a**`, `?**`, a run of 3+ stars like `***/`) keeps
its legacy expansion, exactly as before this PR.

Updated the helper unit tests for the now-legacy `?**/` and `*?**/` cases,
and added a user-facing stateless regression
`04490_globstar_non_whole_segment` showing that the local `file(...)` path
treats `?**/` as a single-directory-level match (not zero-level),
consistently with the helper.

Resolves the review thread:
#97676 (comment)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Continued this PR.

Addressed the AI review "⚠️ Request changes" verdict

The review flagged that makeRegexpPatternFromGlobs and the local file/Filesystem listing in StorageFile disagreed on **/: the helper rewrote any non-overlapping **/ token into the zero-level form ([^/]*/)* (covering ?**/file.txt, *?**/file.txt), while the local listing applies zero-level semantics only to a path segment that is exactly ** (current_glob == "/**").

I resolved this by narrowing the helper (commit b528b89) rather than complicating the local path: ** is now treated as a globstar only when it forms a whole path segment — bounded by / (or the string boundaries) on the left and by / on the right. This:

  • matches conventional glob semantics (Bash globstar, where ** is special only as a complete path component);
  • makes the helper consistent with the already-correct local listing;
  • leaves every whole-segment case byte-identical (**/file.txt, data/**/file.txt, data/**/sub/**/file.txt, data/**, the ***/a*** runs); only the non-whole-segment ?**/ and *?**/ cases revert to their legacy expansion.

Updated the helper unit tests for the now-legacy ?**//*?**/ cases and added a user-facing stateless regression 04490_globstar_non_whole_segment proving the local file(...) path treats ?**/ as a single-directory-level match (not zero-level), consistently with the helper. The unresolved review thread is resolved.

The branch also carries a previously-unpushed master merge and the test renumber to 04489 from an earlier session.

CI failures (all unrelated to this glob change)

The report was for d50f3fdc; none of the failures touch glob/file/path code:

Verified the new regex outputs for the changed cases with a faithful standalone reproduction of the final-processing loop (the inputs contain no range/enum/bracket characters, so escaping/expansion is a no-op): ?**/file.txt -> [^/][^/]*[^{}]*/file\.txt, *?**/file.txt -> [^/]*[^/][^{}]*[^{}]*/file\.txt, and all whole-segment cases unchanged.

…teral suffixes

The new test `04490_globstar_non_whole_segment` failed because the
whole-segment `**/pick.tsv` query matched only the zero- and one-level
files (`pick.tsv`, `sub/pick.tsv`) but not the two-level
`sub/subsub/pick.tsv`, even though the test (and the PR's goal) require
`**/` to match zero or more directory components.

Root cause: in `listFilesWithRegexpMatchingImpl`, when descending into a
subdirectory under a `**/` segment, the code passed the suffix *after*
`**` (here `/pick.tsv`), dropping the `**`. For a literal suffix this
suffix contains no glob, so the recursive call took the no-glob
exact-match branch and returned without descending further — matching a
single intermediate level only. A glob suffix such as `**/*.tsv` happened
to keep descending because the sticky `recursive` flag re-applied the
glob segment at every level, so only literal suffixes were affected. This
also made the local listing disagree with `makeRegexpPatternFromGlobs`,
which rewrites `**/` to `([^/]*/)*` (any number of levels), and with
existing tests where `**/tex*` and `**/*.csv` already match all depths.

Fix: when the current segment is exactly `**` and a suffix follows,
descend into subdirectories keeping the whole `**/...` pattern, so the
globstar keeps matching at every deeper level. The pre-existing zero-level
branch applies the post-`**` suffix at the current level, so the
combination matches zero, one, or more directory components. The trailing
`**` (no suffix) case is unchanged.

Verified `04490`, `04489_globstar_zero_level_directory`,
`02459_glob_for_recursive_directory_traversal`,
`02707_clickhouse_local_implicit_file_table_function` and
`02816_clickhouse_local_table_name_expressions` all pass against a freshly
built binary.

CI report: #97676

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Continued this PR.

Fixed the only real CI failure: 04490_globstar_non_whole_segment

At b528b8969bb the new test failed on every stateless lane (and triggered "Too many test failures" on the targeted/flaky checks). Its whole-segment case

SELECT * FROM file('<dir>/**/pick.tsv', 'TSV', 'a UInt8, b UInt8') ORDER BY a

matched only the zero- and one-level files (pick.tsv, sub/pick.tsv) but not the two-level sub/subsub/pick.tsv, while the test (and this PR's goal) require **/ to match zero or more directory components.

Root cause (in listFilesWithRegexpMatchingImpl, StorageFile.cpp): when descending into a subdirectory under a **/ segment, the code passed the suffix after ** (here /pick.tsv), dropping the **. For a literal suffix that string contains no glob, so the recursive call took the no-glob exact-match branch and returned without descending further — so only a single intermediate level matched. A glob suffix such as **/*.tsv happened to keep descending because the sticky recursive flag re-applied the glob segment at every level, so only literal suffixes were affected. This also made the local listing disagree with makeRegexpPatternFromGlobs, which rewrites **/ to ([^/]*/)* (any number of levels), and with 02459's **/tex* and 02707's **/*.csv, which already match all depths.

Fix (1 file, +14/-1): when the current segment is exactly ** and a suffix follows, descend into subdirectories keeping the whole **/... pattern, so the globstar keeps matching at every deeper level. The pre-existing zero-level branch applies the post-** suffix at the current level, so the combination matches zero, one, or more directory components. The trailing ** (no suffix) case is unchanged.

Verified against a freshly built binary — all of these pass:

  • 04490_globstar_non_whole_segment (whole-segment **/pick.tsv → 0,1,2; non-whole ?**/pick.tsv → one level)
  • 04489_globstar_zero_level_directory (**/*.tsv, **/top.tsv, braced dir, **/**/*.tsv dedup, {top,top}.tsv)
  • 02459_glob_for_recursive_directory_traversal
  • 02707_clickhouse_local_implicit_file_table_function
  • 02816_clickhouse_local_table_name_expressions

No new test was needed — 04490 itself is the regression test, and the byte-counting / dedup paths from the earlier commits are untouched.

Other CI

The stateless redness traced entirely to 04490. The non-stateless failures were already triaged on the previous SHAs as unrelated/tracked (Stress amd_debug RefreshTask::doScheduling fixed by #108441 which the master merge brought in; arm_msan Hung check = #107941; AST fuzzer UnionStep::updateOutputHeader STID 0993-27f0 = #108142; perf-comparison = CIDB infra). I did not re-merge master: the merge base is from 2026-06-29 (< 1 day), there are no conflicts, and master has not touched parseGlobs.cpp/StorageFile.cpp since then.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Continue-PR pass (no code change): the PR is fully addressed and only blocked on flaky stress reds + draft status.

Review state

  • AI Review: ✅ Approve; 0 unresolved review threads; no outstanding change-requests. LLVM coverage report shows 90/90 (100%) of changed C/C++ lines covered.
  • Diff reviewed end-to-end and confirmed sound: makeRegexpPatternFromGlobs rewrites a whole-segment **/ to ([^/]*/)* (legacy character-by-character expansion preserved for non-whole-segment ** such as ?**/, a**, and ***), and listFilesWithRegexpMatchingImpl applies the post-** suffix at the current directory (zero-level branch) and descends keeping the whole **/... pattern, so the globstar matches zero, one, or more directory levels — with an unordered_set dedup scoped per brace-expansion alternative to avoid double-emission from adjacent globstars while preserving per-alternative reads.

CI

The current report (563701cd) has only two reds, both the chronic Hung check failed, possible deadlock found flake — Stress test (amd_tsan) and Stress test (arm_tsan):

  • Both are the open tracking issue Hung check failed, possible deadlock found #107941 (the CI summary itself links them there). The hung-thread stacks are generic TCPHandler::runexecuteASTFuzzerQueries frames with zero glob/parseGlobs/StorageFile frames — unrelated to this change.
  • The workflow run (28458923351) is still in progress, so the failed jobs cannot be re-run yet (This workflow is already running); they should be re-run once the run completes.

Not re-merging master

The branch is 244 commits behind, but the merge base is from 2026-06-29 (< 1 day old, well under the one-week threshold), master has not touched parseGlobs.cpp/StorageFile.cpp since then, and the only reds are the #107941 flake (not fixed on master, so a merge cannot clear it). A fresh full CI run is currently in progress on this head SHA, which a re-merge would discard. Holding off.

This PR is a draft — it appears ready apart from marking it ready for review and re-running the flaky Hung check jobs.

@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.50% +0.10%
Functions 92.70% 92.60% -0.10%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 97/97 (100.00%) · Uncovered code

Full report · Diff report

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@alexey-milovidov alexey-milovidov marked this pull request as ready for review July 1, 2026 07:28

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@alexey-milovidov alexey-milovidov self-assigned this Jul 1, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jul 1, 2026
Merged via the queue into master with commit c79b0d9 Jul 1, 2026
678 of 684 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-globstar-zero-level branch July 1, 2026 08:01
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 1, 2026
PedroTadim pushed a commit that referenced this pull request Jul 1, 2026
…_files dir

The test created its fixtures under a fixed `user_files/04320` subdirectory.
After the `no-parallel` tag was dropped, the flaky check (which runs the same
test concurrently many times) had several instances racing on that single
directory: one run's `rm -rf`/`mkdir` would clobber the files another run was
querying, so queries intermittently returned empty or partial results and the
output diverged from the reference. All 26 flaky-check failures across the four
sanitizers were exactly this one test.

Use a unique per-invocation subdirectory named after `$CLICKHOUSE_DATABASE`
instead of a fixed `04320`. `clickhouse-test` generates a fresh random
`test_<...>` database for every test execution (including each flaky-check
rerun), so the directory is now distinct per run and concurrent instances no
longer collide. This is the same isolation pattern already used by tests such
as `00309_formats_case_insensitive` and `01098_msgpack_format`, and keeps the
test parallel-safe without re-introducing the `no-parallel` tag. The reference
is unchanged (the directory name does not appear in query output).

CI report: #97676

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PedroTadim pushed a commit that referenced this pull request Jul 1, 2026
The AI review flagged an inconsistency between the two glob-matching code
paths. `makeRegexpPatternFromGlobs` rewrote any non-overlapping `**/` token
into the zero-level form `([^/]*/)*`, so it covered patterns like
`?**/file.txt` and `*?**/file.txt`, but the local `file`/`Filesystem`
listing in `StorageFile` applies zero-level semantics only when a path
segment is exactly `**` (`current_glob == "/**"`). For a pattern such as
`?**/file.txt` the regex helper and the local listing therefore diverged.

Treat `**` as a globstar only when it forms a whole path segment, i.e. when
it is bounded by `/` (or the start of the string) on the left and by `/` on
the right. This matches conventional glob semantics (Bash `globstar`, where
`**` is special only as a complete path component) and aligns the helper
with the already-correct local listing path. A `**` adjacent to other
characters in a segment (`a**`, `?**`, a run of 3+ stars like `***/`) keeps
its legacy expansion, exactly as before this PR.

Updated the helper unit tests for the now-legacy `?**/` and `*?**/` cases,
and added a user-facing stateless regression
`04490_globstar_non_whole_segment` showing that the local `file(...)` path
treats `?**/` as a single-directory-level match (not zero-level),
consistently with the helper.

Resolves the review thread:
#97676 (comment)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PedroTadim pushed a commit that referenced this pull request Jul 1, 2026
…teral suffixes

The new test `04490_globstar_non_whole_segment` failed because the
whole-segment `**/pick.tsv` query matched only the zero- and one-level
files (`pick.tsv`, `sub/pick.tsv`) but not the two-level
`sub/subsub/pick.tsv`, even though the test (and the PR's goal) require
`**/` to match zero or more directory components.

Root cause: in `listFilesWithRegexpMatchingImpl`, when descending into a
subdirectory under a `**/` segment, the code passed the suffix *after*
`**` (here `/pick.tsv`), dropping the `**`. For a literal suffix this
suffix contains no glob, so the recursive call took the no-glob
exact-match branch and returned without descending further — matching a
single intermediate level only. A glob suffix such as `**/*.tsv` happened
to keep descending because the sticky `recursive` flag re-applied the
glob segment at every level, so only literal suffixes were affected. This
also made the local listing disagree with `makeRegexpPatternFromGlobs`,
which rewrites `**/` to `([^/]*/)*` (any number of levels), and with
existing tests where `**/tex*` and `**/*.csv` already match all depths.

Fix: when the current segment is exactly `**` and a suffix follows,
descend into subdirectories keeping the whole `**/...` pattern, so the
globstar keeps matching at every deeper level. The pre-existing zero-level
branch applies the post-`**` suffix at the current level, so the
combination matches zero, one, or more directory components. The trailing
`**` (no suffix) case is unchanged.

Verified `04490`, `04489_globstar_zero_level_directory`,
`02459_glob_for_recursive_directory_traversal`,
`02707_clickhouse_local_implicit_file_table_function` and
`02816_clickhouse_local_table_name_expressions` all pass against a freshly
built binary.

CI report: #97676

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backward-incompatible Pull request with backwards incompatible changes pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants