Fix **/ (globstar) to match zero or more directory components#97676
Conversation
Previously, `**/` in glob patterns was not handled as a special case,
so `data/**/file.txt` would not match `data/file.txt` (zero directory
levels). The `**` was converted to `[^/]*[^{}]*` which, followed by
a literal `/`, required at least one path segment.
Pre-process `**/` into the regex `([^{}]*/)*` which correctly matches
zero or more directory components. Standalone `**` at end of path
retains the original behavior.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kahead Replace the `\x01` sentinel substitution approach with direct lookahead in the character-by-character processing loop. This avoids the correctness issue with paths containing the byte 0x01 and is cleaner overall. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
…s match
Previously the `**/` pattern expanded to `([^{}]*/)*`, which failed to
match paths where a directory component contains `{` or `}`. This was a
regression versus the pre-PR behavior.
Replace `[^{}]` with `[^/]` inside the globstar group so any non-slash
byte is allowed, and add a regression test covering `data/{a}/file.txt`
against the `data/**/file.txt` pattern.
The previous comment said the pattern matches "across directories" without
explaining how. Clarify that the combined `[^/]*[^{}]*` works because the
trailing `[^{}]*` permits `/` (while still excluding `{` and `}`), preserving
the original pre-globstar-fix behavior for `**` at end of pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…evel # Conflicts: # programs/keeper-bench/Runner.cpp # programs/server/play.html # src/Disks/DiskObjectStorage/DiskObjectStorage.cpp # src/Functions/initializeAggregation.cpp # src/Functions/toInterval.cpp # src/IO/WriteHelpers.cpp # tests/integration/test_mask_sensitive_info/test.py # tests/integration/test_storage_delta/test.py
The previous rewrite of the final-processing loop changed the emitted
regex for patterns with 3+ consecutive `*` even when there is no `**/`
token, e.g. for `***` it emitted `[^/]*[^{}]*[^/]*` instead of the
legacy `[^/]*[^{}]*[^{}]*`. That is a user-visible behavior change
unrelated to the `**/` zero-level fix.
Restore the original character-by-character logic with `previous`
tracking and only special-case `**/` upfront. Add regression tests
for `***`, `a***b`, and `****` so the legacy regex stays pinned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `**/` rewrite previously fired at any position where `*` `*` `/` appeared, including positions inside a run of 3+ consecutive stars. For input `***/file.txt` the loop would emit `[^/]*` for the first star and then immediately match `**/` at index 1, producing `[^/]*([^/]*/)*file\.txt` and silently changing legacy semantics for patterns that already worked. Require `previous != '*'` so the `**/` rewrite only triggers when it starts a fresh star token, and add regression tests for `***/file.txt` and `a***/b`.
The `matched_paths` set that deduplicates adjacent globstars (`**/**/*.tsv`)
was declared once in `listFilesWithRegexpMatching` and shared across every
`expandSelectionGlob` alternative. That changed brace-expansion semantics
beyond the intended `**/` contract: independent alternatives that resolve to
the same concrete file (e.g. `file('{top,top}.tsv', ...)`) were silently
collapsed instead of being read once per alternative as before.
Move `matched_paths` inside the per-expansion loop so deduplication stays
scoped to the recursive globstar traversal of a single expanded pattern,
while independent brace-expanded alternatives keep their pre-existing
behavior. Added a regression to `04320_globstar_zero_level_directory`
asserting that `{top,top}.tsv` returns the file twice.
Addresses the review on `src/Storages/StorageFile.cpp`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`04320_globstar_zero_level_directory` only uses its own private `USER_FILES_PATH/04320` subtree and does not mutate shared server state, so the `no-parallel` tag is not needed and only serializes the stateless suite more than necessary. Addresses the review on the test tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Continued this PR:
Verified all five glob cases against a freshly built binary via The two prior CI failures are unrelated to a glob-matching change: the |
…_files dir The test created its fixtures under a fixed `user_files/04320` subdirectory. After the `no-parallel` tag was dropped, the flaky check (which runs the same test concurrently many times) had several instances racing on that single directory: one run's `rm -rf`/`mkdir` would clobber the files another run was querying, so queries intermittently returned empty or partial results and the output diverged from the reference. All 26 flaky-check failures across the four sanitizers were exactly this one test. Use a unique per-invocation subdirectory named after `$CLICKHOUSE_DATABASE` instead of a fixed `04320`. `clickhouse-test` generates a fresh random `test_<...>` database for every test execution (including each flaky-check rerun), so the directory is now distinct per run and concurrent instances no longer collide. This is the same isolation pattern already used by tests such as `00309_formats_case_insensitive` and `01098_msgpack_format`, and keeps the test parallel-safe without re-introducing the `no-parallel` tag. The reference is unchanged (the directory name does not appear in query output). CI report: #97676 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Fixed the only red — the flaky check on Root cause: when the Fix ( Verified the path substitution and globbing mechanics end-to-end against a binary (no shell-quoting issues; the unique directory resolves correctly in every |
Master merged colliding `04320_*` tests; the highest existing prefix is now `04488`, so move the globstar regression to `04489`.
The AI review flagged an inconsistency between the two glob-matching code paths. `makeRegexpPatternFromGlobs` rewrote any non-overlapping `**/` token into the zero-level form `([^/]*/)*`, so it covered patterns like `?**/file.txt` and `*?**/file.txt`, but the local `file`/`Filesystem` listing in `StorageFile` applies zero-level semantics only when a path segment is exactly `**` (`current_glob == "/**"`). For a pattern such as `?**/file.txt` the regex helper and the local listing therefore diverged. Treat `**` as a globstar only when it forms a whole path segment, i.e. when it is bounded by `/` (or the start of the string) on the left and by `/` on the right. This matches conventional glob semantics (Bash `globstar`, where `**` is special only as a complete path component) and aligns the helper with the already-correct local listing path. A `**` adjacent to other characters in a segment (`a**`, `?**`, a run of 3+ stars like `***/`) keeps its legacy expansion, exactly as before this PR. Updated the helper unit tests for the now-legacy `?**/` and `*?**/` cases, and added a user-facing stateless regression `04490_globstar_non_whole_segment` showing that the local `file(...)` path treats `?**/` as a single-directory-level match (not zero-level), consistently with the helper. Resolves the review thread: #97676 (comment) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Continued this PR. Addressed the AI review "
|
…teral suffixes The new test `04490_globstar_non_whole_segment` failed because the whole-segment `**/pick.tsv` query matched only the zero- and one-level files (`pick.tsv`, `sub/pick.tsv`) but not the two-level `sub/subsub/pick.tsv`, even though the test (and the PR's goal) require `**/` to match zero or more directory components. Root cause: in `listFilesWithRegexpMatchingImpl`, when descending into a subdirectory under a `**/` segment, the code passed the suffix *after* `**` (here `/pick.tsv`), dropping the `**`. For a literal suffix this suffix contains no glob, so the recursive call took the no-glob exact-match branch and returned without descending further — matching a single intermediate level only. A glob suffix such as `**/*.tsv` happened to keep descending because the sticky `recursive` flag re-applied the glob segment at every level, so only literal suffixes were affected. This also made the local listing disagree with `makeRegexpPatternFromGlobs`, which rewrites `**/` to `([^/]*/)*` (any number of levels), and with existing tests where `**/tex*` and `**/*.csv` already match all depths. Fix: when the current segment is exactly `**` and a suffix follows, descend into subdirectories keeping the whole `**/...` pattern, so the globstar keeps matching at every deeper level. The pre-existing zero-level branch applies the post-`**` suffix at the current level, so the combination matches zero, one, or more directory components. The trailing `**` (no suffix) case is unchanged. Verified `04490`, `04489_globstar_zero_level_directory`, `02459_glob_for_recursive_directory_traversal`, `02707_clickhouse_local_implicit_file_table_function` and `02816_clickhouse_local_table_name_expressions` all pass against a freshly built binary. CI report: #97676 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Continued this PR. Fixed the only real CI failure:
|
|
Continue-PR pass (no code change): the PR is fully addressed and only blocked on flaky stress reds + draft status. Review state
CIThe current report (
Not re-merging
|
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 97/97 (100.00%) · Uncovered code |
…_files dir The test created its fixtures under a fixed `user_files/04320` subdirectory. After the `no-parallel` tag was dropped, the flaky check (which runs the same test concurrently many times) had several instances racing on that single directory: one run's `rm -rf`/`mkdir` would clobber the files another run was querying, so queries intermittently returned empty or partial results and the output diverged from the reference. All 26 flaky-check failures across the four sanitizers were exactly this one test. Use a unique per-invocation subdirectory named after `$CLICKHOUSE_DATABASE` instead of a fixed `04320`. `clickhouse-test` generates a fresh random `test_<...>` database for every test execution (including each flaky-check rerun), so the directory is now distinct per run and concurrent instances no longer collide. This is the same isolation pattern already used by tests such as `00309_formats_case_insensitive` and `01098_msgpack_format`, and keeps the test parallel-safe without re-introducing the `no-parallel` tag. The reference is unchanged (the directory name does not appear in query output). CI report: #97676 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The AI review flagged an inconsistency between the two glob-matching code paths. `makeRegexpPatternFromGlobs` rewrote any non-overlapping `**/` token into the zero-level form `([^/]*/)*`, so it covered patterns like `?**/file.txt` and `*?**/file.txt`, but the local `file`/`Filesystem` listing in `StorageFile` applies zero-level semantics only when a path segment is exactly `**` (`current_glob == "/**"`). For a pattern such as `?**/file.txt` the regex helper and the local listing therefore diverged. Treat `**` as a globstar only when it forms a whole path segment, i.e. when it is bounded by `/` (or the start of the string) on the left and by `/` on the right. This matches conventional glob semantics (Bash `globstar`, where `**` is special only as a complete path component) and aligns the helper with the already-correct local listing path. A `**` adjacent to other characters in a segment (`a**`, `?**`, a run of 3+ stars like `***/`) keeps its legacy expansion, exactly as before this PR. Updated the helper unit tests for the now-legacy `?**/` and `*?**/` cases, and added a user-facing stateless regression `04490_globstar_non_whole_segment` showing that the local `file(...)` path treats `?**/` as a single-directory-level match (not zero-level), consistently with the helper. Resolves the review thread: #97676 (comment) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…teral suffixes The new test `04490_globstar_non_whole_segment` failed because the whole-segment `**/pick.tsv` query matched only the zero- and one-level files (`pick.tsv`, `sub/pick.tsv`) but not the two-level `sub/subsub/pick.tsv`, even though the test (and the PR's goal) require `**/` to match zero or more directory components. Root cause: in `listFilesWithRegexpMatchingImpl`, when descending into a subdirectory under a `**/` segment, the code passed the suffix *after* `**` (here `/pick.tsv`), dropping the `**`. For a literal suffix this suffix contains no glob, so the recursive call took the no-glob exact-match branch and returned without descending further — matching a single intermediate level only. A glob suffix such as `**/*.tsv` happened to keep descending because the sticky `recursive` flag re-applied the glob segment at every level, so only literal suffixes were affected. This also made the local listing disagree with `makeRegexpPatternFromGlobs`, which rewrites `**/` to `([^/]*/)*` (any number of levels), and with existing tests where `**/tex*` and `**/*.csv` already match all depths. Fix: when the current segment is exactly `**` and a suffix follows, descend into subdirectories keeping the whole `**/...` pattern, so the globstar keeps matching at every deeper level. The pre-existing zero-level branch applies the post-`**` suffix at the current level, so the combination matches zero, one, or more directory components. The trailing `**` (no suffix) case is unchanged. Verified `04490`, `04489_globstar_zero_level_directory`, `02459_glob_for_recursive_directory_traversal`, `02707_clickhouse_local_implicit_file_table_function` and `02816_clickhouse_local_table_name_expressions` all pass against a freshly built binary. CI report: #97676 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Previously,
**/in glob patterns was not handled as a special case, sodata/**/file.txtwould not matchdata/file.txt(zero directory levels). The**was converted to[^/]*[^{}]*which, followed by a literal/, required at least one path segment.Pre-process
**/into the regex([^{}]*/)*which correctly matches zero or more directory components. Standalone**at end of path retains the original behavior.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
The
**/glob (any amount of directories) now also matches the same directory. Previously,**/in glob patterns was not handled as a special case, sodata/**/file.txtwould not matchdata/file.txt(zero directory levels).Version info
26.7.1.336