Support column matcher expansion for default value expressions and index expressions by niyue · Pull Request #105045 · ClickHouse/ClickHouse · GitHub
Support column matcher expansion for default value expressions and index expressions #105045

Open
niyue wants to merge 73 commits into ClickHouse:master from niyue:feat/default-expr-column-matcher
@niyue niyue commented May 15, 2026

Contributor

This PR closes #92266

Support column matchers in column DEFAULT, ALIAS, MATERIALIZED, and EPHEMERAL expressions, and in data skipping index expressions. This allows expressions such as *, COLUMNS('...'), COLUMNS(a, b), EXCEPT, APPLY, and REPLACE to be expanded before expression validation and execution. The change also adds namedTuple function to make matcher-expanded named tuple expressions ergonomic in tests and user queries.

The tests cover these use cases, direct and indirect cyclic default-expression dependency detection, and nested matcher expansion.
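
To make the expansion step concrete, here is a minimal Python sketch of expanding a matcher against a table's column list before the expression is validated. It is a toy model, not ClickHouse code: `expand_matcher`, `expand_call`, and their parameters are hypothetical names, and only `*`, `COLUMNS('regex')`, `COLUMNS(a, b)`, and `EXCEPT` are modeled.

```python
import re

def expand_matcher(columns, pattern=None, names=None, except_=()):
    """Toy model of matcher expansion over an ordered list of column names.

    pattern -> COLUMNS('regex'); names -> COLUMNS(a, b); neither -> `*`.
    except_ models the EXCEPT modifier. All names here are hypothetical.
    """
    if names is not None:
        missing = [n for n in names if n not in columns]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        selected = list(names)
    elif pattern is not None:
        regex = re.compile(pattern)
        # Partial match, so an unanchored pattern can hit any part of the name.
        selected = [c for c in columns if regex.search(c)]
    else:
        selected = list(columns)
    return [c for c in selected if c not in except_]

def expand_call(function, columns, **matcher):
    """Render `function(<matcher>)` with the matcher replaced by concrete columns."""
    return f"{function}({', '.join(expand_matcher(columns, **matcher))})"

table_columns = ["id", "first_name", "last_name"]
print(expand_call("tuple", table_columns))                     # tuple(id, first_name, last_name)
print(expand_call("concat", table_columns, pattern="_name$"))  # concat(first_name, last_name)
print(expand_call("tuple", table_columns, except_=("id",)))    # tuple(first_name, last_name)
```

Because the matcher is replaced by ordinary identifiers first, any later arity or type error comes from the normal analysis of the expanded expression.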

Changelog category:

  • New Feature

Changelog entry:

  • Support column matchers such as * and COLUMNS in column default value expressions (DEFAULT, ALIAS, MATERIALIZED, and EPHEMERAL) and in data skipping index expressions.

Note

About the newly added namedTuple function: I read the previous discussions in [1], [2], and [3], and my impression is that the existing enable_named_columns_in_function_tuple setting is not very discoverable. A separate function name may make the intention clearer and the feature easier to use, especially in this use case. It also avoids changing the behavior of the existing tuple function, so it should not introduce compatibility issues. Please let me know if this direction is not desirable.

[1] #63524
[2] #54921
[3] #54881


Note

Medium Risk
Touches core expression/DDL validation paths (defaults, aliases, ALTER, mutations, skip indexes), which can affect query planning and error behavior if matcher expansion or cycle detection is incorrect.

Overview
Enables column matchers (e.g. *, COLUMNS(...) plus EXCEPT/APPLY/REPLACE) inside column DEFAULT/MATERIALIZED/ALIAS/EPHEMERAL expressions by expanding matchers before validation/execution, honoring asterisk_include_* settings and rejecting qualified matchers.

Updates DDL/default validation, alias expansion, read-order optimization, merge/SELECT paths, and mutation/materialize flows to use a shared cloneAndExpandColumnDefaultExpression helper and adds cycle detection that accounts for matcher-expanded dependencies.

Extends skip index parsing/analysis to normalize matcher and alias usage (including cyclic-alias detection) before TreeRewriter analysis, and adds docs + comprehensive stateless tests covering matcher expansion, errors, and ALTER/mutation/index scenarios.

Reviewed by Cursor Bugbot for commit 48ab80f. Bugbot is set up for automated code reviews on this repo.

@niyue niyue changed the title from "Feat/default expr column matcher" to "Add column matcher expansion for default value expressions" May 15, 2026
@niyue niyue changed the title from "Add column matcher expansion for default value expressions" to "Support column matcher expansion for default value expressions" May 15, 2026
@niyue niyue changed the title from "Support column matcher expansion for default value expressions" to "Support column matcher expansion for default value expressions and index expressions" May 15, 2026
@niyue niyue force-pushed the feat/default-expr-column-matcher branch from 6d0b6df to 308133e May 15, 2026 15:46
@alexey-milovidov alexey-milovidov self-assigned this May 15, 2026
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 15, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 15, 2026

Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label May 15, 2026
@alexey-milovidov

Member

> The change also adds namedTuple function to make matcher-expanded named tuple expressions ergonomic in tests and user queries.

Let's avoid adding it in this PR. We plan to make the ordinary tuple constructor return named tuples whenever possible.

@alexey-milovidov

Member

> the existing enable_named_columns_in_function_tuple setting is not very discoverable

We plan to enable it by default.

Comment thread src/Storages/AlterCommands.cpp Outdated
Comment thread src/Storages/IndicesDescription.cpp Outdated
Comment thread src/Functions/tuple.cpp Outdated
Comment thread docs/en/sql-reference/statements/create/table.md
niyue added 14 commits May 16, 2026 21:15
Allow column matchers in column default expressions and in skip index expressions. Matchers are expanded before validation so normal expression analysis still reports arity and type exceptions.

Issue: ClickHouse#92266
Add `namedTuple` as a strict variant of `tuple` that always returns a named `Tuple` and reports an exception for duplicate or invalid argument names.
Extend the default-expression matcher stateless test to cover `asterisk_include_materialized_columns` and `asterisk_include_alias_columns` during default value evaluation.
Add a stateless test case where a `DEFAULT` expression with `*` expands into an indirect dependency cycle and reports `CYCLIC_ALIASES`.
Build the validation column snapshot with updated `default_desc.kind` values for `ADD COLUMN` and `MODIFY COLUMN`, so `*` and `COLUMNS` expansion sees the post-alter default kinds.
@niyue niyue force-pushed the feat/default-expr-column-matcher branch from 981800b to 2632f1b May 16, 2026 13:17
niyue added 2 commits June 21, 2026 17:11
Run the expanded default-cycle check before collecting identifiers from
missing column defaults in `MergeTree` read dependency discovery. This
keeps old parts with setting-dependent `DEFAULT` / `MATERIALIZED` cycles
reporting `CYCLIC_ALIASES` instead of recursing until `TOO_DEEP_RECURSION`.

Add a stateless regression test for reading a matcher-based `DEFAULT`
column from an old part after adding a dependent `MATERIALIZED` column.
Comment thread src/Processors/Transforms/TTLTransform.cpp
Reject setting-dependent cycles after expanding matcher-based default
expressions in mutation paths before dependency discovery or mutation
expression execution can use the expanded AST.

Cover `MATERIALIZE COLUMN` and `CLEAR COLUMN` with `MATERIALIZED`
columns whose `tuple(*)` expression becomes recursive when
`asterisk_include_materialized_columns` is enabled.
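
The idea behind the two cycle-check commits above (run the check on the expanded dependencies first, so a cycle surfaces as `CYCLIC_ALIASES` rather than as runaway recursion) can be sketched as a depth-first search with visit states. This is a toy Python model: `check_no_cycles` and the `CyclicAliases` exception are hypothetical stand-ins, and the real `validateNoCyclicAliasesAfterExpansion` may work differently.

```python
class CyclicAliases(Exception):
    """Stand-in for the CYCLIC_ALIASES error named in the thread."""

def check_no_cycles(dependencies):
    """Three-colour DFS over {column: [columns its expanded default reads]}."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {column: WHITE for column in dependencies}

    def visit(column, path):
        # Columns without a default expression are treated as already finished.
        if state.get(column, BLACK) == BLACK:
            return
        if state[column] == GREY:
            # Back edge: the column is already on the current path.
            cycle = path[path.index(column):] + [column]
            raise CyclicAliases(" -> ".join(cycle))
        state[column] = GREY
        for dep in dependencies[column]:
            visit(dep, path + [column])
        state[column] = BLACK

    for column in dependencies:
        visit(column, [])

# Hypothetical expanded dependencies: b reads a and x, x reads b.
try:
    check_no_cycles({"b": ["a", "x"], "x": ["b"]})
except CyclicAliases as error:
    print("CYCLIC_ALIASES:", error)  # CYCLIC_ALIASES: b -> x -> b
```

Each column is visited at most once, so the check terminates even when the dependency graph contains a cycle.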
Comment thread src/Storages/AlterCommands.cpp Outdated
niyue added 2 commits June 21, 2026 21:04
Apply `FIRST` and `AFTER` positions to the validation snapshot used
for stored default expressions. This keeps matcher expansion during
`ALTER ADD COLUMN` and `ALTER MODIFY COLUMN` validation aligned with
the schema order that `apply` will persist.

Add a stateless test for order-sensitive `tuple(*)` defaults.
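
Why the position matters for the `FIRST` / `AFTER` commit above: a matcher expands in schema order, so a validation snapshot that appends the new column sees a different argument order than the schema that is persisted. A small Python sketch under stated assumptions: `add_column` and `expand_columns` are hypothetical helpers, and a regex matcher stands in for `*` to keep the default's own column out of the toy.

```python
import re

def add_column(columns, name, first=False, after=None):
    """Return the column order after a toy `ADD COLUMN name [FIRST | AFTER col]`."""
    result = list(columns)
    if first:
        result.insert(0, name)
    elif after is not None:
        result.insert(result.index(after) + 1, name)
    else:
        result.append(name)
    return result

def expand_columns(columns, pattern):
    """Expand a COLUMNS('pattern') matcher in schema order."""
    return [c for c in columns if re.search(pattern, c)]

existing = ["k1", "k2", "t"]  # assume t has a default like tuple(COLUMNS('^k'))

# Snapshot that ignores the position clause: the new column lands at the end.
print(expand_columns(add_column(existing, "k0"), "^k"))              # ['k1', 'k2', 'k0']
# Snapshot that applies FIRST, matching what would be persisted.
print(expand_columns(add_column(existing, "k0", first=True), "^k"))  # ['k0', 'k1', 'k2']
```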
Run `validateNoCyclicAliasesAfterExpansion` after expanding matcher-based
default expressions for column TTL actions and before building expression
actions.

Add a stateless test for a late `ALIAS` cycle after `OPTIMIZE TABLE`.
Comment thread src/Storages/MergeTree/MergeTreeBlockReadUtils.cpp
Comment thread src/Storages/IndicesDescription.cpp
Comment thread src/Storages/ColumnsDescription.cpp
Comment thread src/Interpreters/InterpreterSelectQuery.cpp
niyue and others added 8 commits June 22, 2026 13:05
Replace the root lambda argument when expanding `APPLY` matchers and avoid capturing identifiers that are local to nested lambdas.

Add a stateless test for default matcher expansion with lambda `APPLY`.
Run `validateNoCyclicAliasesAfterExpansion` before replacing nested alias columns in old-analyzer direct-clone paths.

Add a stateless test for matcher-expanded `ALIAS` cycles through `MATERIALIZED` columns.
Collect required columns for expanded default expressions with `RequiredSourceColumnsVisitor` so lambda-local identifiers are not treated as storage dependencies.

Add a stateless test for matcher-based defaults over old `MergeTree` parts with lambda-local names.
When replacing an `ALIAS` inside a lambda, recursively normalize the alias body with caller `private_aliases` cleared so table aliases are not captured by lambda parameters.

Add a stateless test for persisted skip index expressions that reference alias columns inside lambda predicates.
Move lambda argument name extraction from
`RequiredSourceColumnsMatcher::extractNamesFromLambda` to parser-level
`getASTLambdaArgumentNames` so `Parsers` and `Interpreters` code paths reuse
the same validation and extraction logic.

Use the shared helper in column matcher `APPLY` expansion and existing lambda
scope visitors.
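
The lambda-related commits above share one rule: a name bound by a lambda parameter is not a storage dependency. A minimal Python sketch of that rule over a toy AST follows. The tuple-based node shapes and `required_columns` are hypothetical and are not the `RequiredSourceColumnsVisitor` implementation.

```python
def required_columns(node, bound=frozenset()):
    """Collect identifiers that must come from the table, skipping lambda-local names.

    Toy AST: ("ident", name), ("func", name, [args]), ("lambda", [params], body).
    """
    kind = node[0]
    if kind == "ident":
        return set() if node[1] in bound else {node[1]}
    if kind == "lambda":
        _, params, body = node
        # Parameters shadow table columns only inside the lambda body.
        return required_columns(body, bound | frozenset(params))
    if kind == "func":
        _, _name, args = node
        result = set()
        for arg in args:
            result |= required_columns(arg, bound)
        return result
    raise ValueError(f"unknown node kind: {kind}")

# arrayMap(x -> x + a, arr): `x` is lambda-local, `a` and `arr` are table columns.
expr = ("func", "arrayMap", [
    ("lambda", ["x"], ("func", "plus", [("ident", "x"), ("ident", "a")])),
    ("ident", "arr"),
])
print(sorted(required_columns(expr)))  # ['a', 'arr']
```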
The test expected `SELECT b` to raise `CYCLIC_ALIASES` after `OPTIMIZE`
expired the column `b`, relying on a `b` -> `x` -> `b` cycle formed when
`asterisk_include_alias_columns = 1` makes the matcher in `b`'s `DEFAULT`
expand to include the alias column `x`.

This worked under the new analyzer (referencing the table resolves every
`ALIAS` column expression in the query context, expanding `x` and detecting
the cycle), but failed under the old analyzer: `SELECT b` only requires the
`DEFAULT` column `b`, so the old analyzer never expands the unreferenced
alias `x`, and the read-time reconstruction of the expired column runs in
the storage context (server defaults, alias columns excluded from `*`), so
no cycle is formed and the query returns the value.

Read the alias column `x` instead of `b`, with `optimize_respect_aliases = 1`,
so the matcher expansion and the late cycle check run at query time under both
analyzers. This mirrors the existing `04317_alias_matcher_late_cycle_check`.

`OPTIMIZE` runs the `TTL` materialization in the background merge context
(server defaults), which excludes alias columns from `*`, so it forms no
cycle and must succeed; the cycle only manifests at query time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Member

Merged the latest master (clean, no conflicts) and fixed the one real CI failure: 04405_ttl_default_matcher_late_cycle was failing under the old analyzer (the amd_llvm_coverage stateless job).

Root cause: the test did SELECT b and expected CYCLIC_ALIASES. The b → x → b cycle only forms once asterisk_include_alias_columns = 1 makes the matcher in b's DEFAULT expand to include the alias column x.

  • Under the new analyzer this is detected because referencing the table resolves every ALIAS column expression in the query context, so x is expanded and the cycle is found.
  • Under the old analyzer, SELECT b only requires the DEFAULT column b, so the unreferenced alias x is never expanded, and the read-time reconstruction of the TTL-expired column runs in the storage context (server defaults, alias columns excluded from *) — no cycle, so the query returned a value.

Fix: read the alias column x (with optimize_respect_aliases = 1) instead of b, so the matcher expansion and the late-cycle check run at query time under both analyzers. This mirrors the existing, passing 04317_alias_matcher_late_cycle_check. OPTIMIZE runs the TTL materialization in the background merge context (server defaults, no alias columns), so it forms no cycle and must succeed; the cycle only manifests at query time.

For the record, validateNoCyclicAliasesAfterExpansion in TTLTransform is a defensive guard: the merge always expands the default under the background context (alias columns excluded by default), so it does not surface this particular cycle during the merge itself.
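
The setting dependence described above can be reproduced in a small Python toy: the same stored matcher yields different dependency sets depending on whether alias columns are included, and only one of them contains a cycle. Everything here is an assumption for illustration, including the table layout, the `* EXCEPT b` shape of the default, and the helper names.

```python
# Kinds of the columns in a hypothetical table; alias columns are hidden from `*` by default.
KINDS = {"a": "ordinary", "b": "default", "x": "alias"}

def expand_star(except_, include_alias):
    """Expand `* EXCEPT (...)`, optionally including alias columns."""
    return [c for c, kind in KINDS.items()
            if c not in except_ and (kind != "alias" or include_alias)]

def dependencies(include_alias):
    return {
        "b": expand_star(except_={"b"}, include_alias=include_alias),  # b DEFAULT f(* EXCEPT b)
        "x": ["b"],                                                    # x ALIAS g(b)
    }

def has_cycle(deps):
    """True if some column can reach itself through its dependencies."""
    def reachable(start):
        seen, stack = set(), list(deps.get(start, []))
        while stack:
            col = stack.pop()
            if col not in seen:
                seen.add(col)
                stack.extend(deps.get(col, []))
        return seen
    return any(col in reachable(col) for col in deps)

print(has_cycle(dependencies(include_alias=False)))  # False: storage or merge context
print(has_cycle(dependencies(include_alias=True)))   # True: query context with the setting on
```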

alexey-milovidov and others added 2 commits June 30, 2026 06:30
…atIfIndexEstimator`

This PR added a `ContextPtr context` parameter to `createMergeTreeSequentialSource`
(needed to expand column matchers in default expressions during merges), but the
call site in `WhatIfIndexEstimator.cpp` — a separate caller added on `master` — was
not updated, breaking the build with "no matching function for call to
'createMergeTreeSequentialSource'" (requires 13 arguments, but 12 were provided).

Pass the in-scope `context` (the same one already used for alter conversions and
query limits a few lines below) as the final argument.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Member

Merged the latest master (clean, no conflicts — the branch had drifted ~320 commits) and fixed the one real CI failure: the Fast test and Build (arm_tidy) builds were failing to compile WhatIfIndexEstimator.cpp.

Root cause: this PR adds a ContextPtr context parameter to createMergeTreeSequentialSource (so column matchers in default expressions can be expanded during merges/reads). All call sites that existed when the PR was written were updated, but WhatIfIndexEstimator.cpp — a separate caller of createMergeTreeSequentialSource that landed on master — was not, so the build broke with:

WhatIfIndexEstimator.cpp:412:25: error: no matching function for call to 'createMergeTreeSequentialSource'
note: candidate function not viable: requires 13 arguments, but 12 were provided

Fix: pass the in-scope context (the same ContextPtr already used a few lines below for getAlterConversionsForPart and the query limits) as the final argument.

For the record, I checked the rest of the cross-cut against the freshly merged master: every other function this PR changed the signature of (getReadTaskColumnsForMerge, injectRequiredColumns, validateColumnsDefaults/validateColumnsDefaultsAndGetSampleBlock, performRequiredConversions, and the extractNamesFromLambda → getASTLambdaArgumentNames refactor) has all of its callers updated in the merged tree, and WhatIfIndexEstimator.cpp was the only stray caller. A fresh CI run is in progress on the updated head.

Comment thread src/Interpreters/InterpreterCreateQuery.cpp
Column matchers (`*`, `COLUMNS(...)`) in `DEFAULT`/`MATERIALIZED`/`ALIAS`
expressions are stored unexpanded and re-expanded against the table's
`ColumnsDescription` every time the default is evaluated. With the default
`flatten_nested = 1`, `getColumnsDescription` flattens the stored metadata
(`res.flattenNested()`), so at insert time the matcher is expanded against the
flattened columns (`n.x`), while validation in
`validateColumnsDefaultsAndGetSampleBlock` was expanding it against the
un-flattened `columns_for_default_validation` (`n`).

That mismatch let `CREATE TABLE` persist a default that executes differently
from what was validated. For example
`CREATE TABLE t (n Nested(x UInt8), b UInt64 DEFAULT length(COLUMNS('^n$')))`
validated against `n` but, after flattening, expanded to zero columns at insert
time and threw `NUMBER_OF_ARGUMENTS_DOESNT_MATCH`.

Flatten `columns_for_default_validation` under the same condition as `res` so
matcher expansion during validation sees exactly the schema used at execution
time. Add a stateless regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Member

Addressed the AI Review "Request changes" finding (matcher defaults validated against a pre-flatten schema) in b71da69.

Root cause. Column matchers in DEFAULT/MATERIALIZED/ALIAS expressions are stored unexpanded and re-expanded against the table's ColumnsDescription on every evaluation. getColumnsDescription validates defaults (validateColumnsDefaultsAndGetSampleBlock) against columns_for_default_validation, which was built before res.flattenNested(). With flatten_nested = 1 the stored schema is flattened (n → n.x), so at insert time a matcher expands against the flattened columns while validation expanded against the un-flattened ones, letting CREATE TABLE persist a default that executes differently from what was validated.

Fix. Flatten columns_for_default_validation under the same condition as res, so validation-time matcher expansion sees exactly the execution-time schema. This rejects e.g. length(COLUMNS('^n$')) over a Nested column at DDL time (NUMBER_OF_ARGUMENTS_DOESNT_MATCH) and keeps valid flattened-subcolumn matchers (length(COLUMNS('x')) → length(n.x)) both validating and evaluating consistently.

Test. Added 04489_default_expr_matcher_flatten_nested covering the accepted DEFAULT/MATERIALIZED cases and the rejected case with flatten_nested = 1. Verified the matcher-expansion semantics (COLUMNS('x') → n.x; COLUMNS('^n$') → zero columns → code 42) against a master clickhouse-local; the C++ change reuses the existing ColumnsDescription::flattenNested already called a few lines below, so it only affects tables that contain Nested columns.

Net diff vs the previous head is exactly these 3 files (+60). The AI Review thread is resolved; waiting on fresh CI.
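
A small Python toy shows the mismatch using the schema from the example (`n Nested(x UInt8)`, `b UInt64`). The helpers `flatten_nested` and `columns_matching` are hypothetical and only model column names, not types or the actual `ColumnsDescription::flattenNested` code.

```python
import re

def flatten_nested(schema):
    """schema maps a column to its Nested subcolumn names, or None for a plain column."""
    flat = []
    for name, subcolumns in schema.items():
        if subcolumns is None:
            flat.append(name)
        else:
            flat.extend(f"{name}.{sub}" for sub in subcolumns)
    return flat

def columns_matching(columns, pattern):
    """Expand COLUMNS('pattern') with partial-match semantics."""
    return [c for c in columns if re.search(pattern, c)]

schema = {"n": ["x"], "b": None}    # n Nested(x UInt8), b UInt64
unflattened = list(schema)          # what validation used to see
flattened = flatten_nested(schema)  # what execution sees with flatten_nested = 1

print(columns_matching(unflattened, "^n$"))  # ['n']   one argument, validation passes
print(columns_matching(flattened, "^n$"))    # []      zero arguments at insert time
print(columns_matching(flattened, "x"))      # ['n.x'] consistent once both sides are flattened
```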

@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Contributor

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
| --- | --- | --- | --- |
| Lines | 85.40% | 85.40% | +0.00% |
| Functions | 92.70% | 92.70% | +0.00% |
| Branches | 77.60% | 77.70% | +0.10% |

Changed lines: Changed C/C++ lines covered: 705/742 (95.01%)

@alexey-milovidov

Member

Fixed the CH Inc sync failure, which was red with "build failed" (a private-side build break not visible in the public CI).

Root cause: this PR adds a ContextPtr context parameter to getReadTaskColumns so that column matchers in default expressions can be expanded during reads. Every public call site was updated, but a caller that exists only in the internal repository still used the old 10-argument overload, so the internal build failed across all configurations:

StorageMergeTreeParts.cpp:342:31: error: no matching function for call to 'getReadTaskColumns'
MergeTreeBlockReadUtils.h:33:26: note: candidate function not viable: requires 11 arguments, but 10 were provided

Fix: pass the in-scope getContext() as the new argument on the sync branch, mirroring the public MergeTreeReadPoolBase caller (which passes getContext() in the same slot). This is the same class of cross-cut as the earlier WhatIfIndexEstimator.cpp build fix; I verified across the whole internal tree that this is the only caller of any function whose signature this PR changes that the public CI does not cover. A fresh internal CI run is in progress.

The only remaining red is Hung check failed, possible deadlock found (Stress test (arm_asan_ubsan)), which is the endemic AST-fuzzer hung-check flake tracked in #107941 — it reproduces on master directly and does not touch any code path this PR changes.

@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR adds support for column matchers (*, COLUMNS(...)) and transformer modifiers in column DEFAULT/MATERIALIZED/ALIAS and index expressions, plus a refactor of lambda-argument-name extraction. All of that runs at DDL, INSERT-default, mutation, and alias-expansion time — paths only reached for tables that actually declare such columns. The four flagged ClickBench queries (Q4, Q15, Q28, Q34) are plain scan/aggregation SELECTs over tables with no alias or default-matcher columns, so the query-execution hot path they exercise is untouched. The reported improvements (ranging -5.7% to -19%) cannot plausibly be caused by this change and the base measurements were noticeably noisier than the source, so all four are downgraded to not-sure as run-to-run variance rather than real PR effects.

clickbench

⚠️ 4 inconclusive

Flagged queries (4 of 43)
| Query | Verdict | Baseline median (ms) | PR median (ms) | Change | q-value | Hint |
| --- | --- | --- | --- | --- | --- | --- |
| ⚠️ 4 | not_sure | 262 | 212 | -19.1% | <0.0001 | PR only touches DEFAULT/ALIAS/matcher expansion; Q4 scan path untouched, so the -19% is off-path variance |
| ⚠️ 15 | not_sure | 245 | 201 | -18.0% | <0.0001 | Off-path: ClickBench tables have no ALIAS/DEFAULT matchers; Q15 execution can't be changed by this diff |
| ⚠️ 28 | not_sure | 6671 | 6289 | -5.7% | <0.0001 | Off-path: aggregation query, no alias/default expansion involved; -5.7% not attributable to this PR |
| ⚠️ 34 | not_sure | 1461 | 1377 | -5.8% | <0.0001 | Off-path: query exercises scan/agg, not column-matcher/default expansion; -5.8% is run-to-run variance |

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.
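
For readers unfamiliar with the adjustment, the following Python sketch shows the textbook Benjamini-Hochberg step-up procedure for turning raw p-values into q-values. The report only says "BH-FDR adjusted p", so whether MIRAI computes it exactly this way is an assumption. `bh_adjust` is a hypothetical helper and the p-values are made up for illustration.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.20]  # illustrative values, not taken from the report
q_values = bh_adjust(raw_p)
fdr_q = 0.10                      # the default threshold quoted above
flagged = [q < fdr_q for q in q_values]
print(flagged)                    # [True, True, True, False]
```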

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: 347d2724-083e-468d-a5b8-94f3412c658b
  • MIRAI run: 3d7e0b0e-bf52-47f8-a258-02ef6bba5473
  • PR check IDs:
    • clickbench_504125_1782832886
    • clickbench_504132_1782832886
    • clickbench_504142_1782832886
    • tpch_adapted_1_official_504150_1782832887
    • tpch_adapted_1_official_504163_1782832886
    • tpch_adapted_1_official_504174_1782832886


Development

Successfully merging this pull request may close these issues.

Support * and column matchers inside default expressions for columns and indices

2 participants