Support UNIQUE predicate that tests for no duplicate rows in a subquery#99877
Support UNIQUE predicate that tests for no duplicate rows in a subquery#99877RenzoMXD wants to merge 42 commits into
UNIQUE predicate that tests for no duplicate rows in a subquery#99877Conversation
|
@alexey-milovidov Please review my PR again. Thanks. |
|
The current rewrite The key property of Proposed approachCreate an internal aggregate function The critical part: reuse the existing Today, Add two virtual methods to virtual bool isEarlyTerminable() const { return false; }
virtual bool shouldTerminateEarly(AggregateDataPtr place) const { return false; }After Analyzer rewrite becomes simply: Benefits
Implementation scope
The early termination interface is generic and reusable by future aggregate functions. |
|
@amosbird Thanks very much. I'll fix based on your suggestions. |
88ad14c to
d80e1de
Compare
574ed12 to
ef343a9
Compare
ef343a9 to
b275828
Compare
b275828 to
b00ff36
Compare
b00ff36 to
9e0dd1f
Compare
|
@alexey-milovidov Can you take another look on my PR please? Thanks. |
|
I would like to find a better name for this aggregate function, so it will be usable in normal scenarios. For example, |
40b1587 to
1e94863
Compare
- Rename test from 04054_unique_predicate to 04147_unique_predicate to avoid the 04054_* prefix collision with sixteen unrelated tests. - Preserve the original expression as the source AST when constructing the result ConstantNode in non-only_analyze mode, mirroring how the EXISTS branch builds its constant. Without it, EXPLAIN output and error messages lose the original unique(...) shape. - In only_analyze mode, return projection name "unique" instead of the internal name "__unique", so EXPLAIN does not leak the dispatch tag. - Rewrite the fallback comment in the non-analyze path: it does not fire for empty subqueries (count() = uniqExact(*) yields TRUE there), it only defends against a bug where evaluateScalarSubqueryIfNeeded fails to produce a constant. Refs ClickHouse#99877
|
@novikd Could you please continue the review / integration when you're available? Thanks. |
Address review comment on ClickHouse#99877 (comment)... ("Falling back to `1` ... hides analyzer/execution inconsistencies and can silently return a wrong UNIQUE result"). After `evaluateScalarSubqueryIfNeeded`, the rewrite expects a non-NULL scalar constant. If that invariant is violated, fail loudly with `LOGICAL_ERROR` instead of defaulting to true.
|
@novikd Could you review my PR again please? |
|
Dear @novikd, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
…dicate # Conflicts: # src/Analyzer/Resolve/resolveFunction.cpp
The UNIQUE predicate is parsed into the internal function `__unique`, which is only rewritten by the analyzer in `QueryAnalyzer::resolveUniquePredicate`. The old analyzer (`enable_analyzer = 0`) has no handling for it, so a query using UNIQUE previously failed with an obscure `Unknown function __unique` exception. Reject it explicitly in `TreeRewriter` (used only by the old analyzer) with a clear `NOT_IMPLEMENTED` message pointing to `enable_analyzer = 1`, mirroring how QUALIFY and WITH RECURSIVE are handled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`ConstantValue` now requires a `ColumnConstPtr`, so constructing it from a mutable `ColumnUInt8` no longer compiles (`no matching constructor for initialization of boost::intrusive_ptr<const DB::ColumnConst>`). Wrap the result column with `ColumnConst::create`, matching the `EXISTS` rewrite in the same file. Also make the per-column `isNotNull` filter address columns by position instead of by name. The previous rewrite referenced the subquery's projected columns by name, which can bind several filters to the same column when a subquery exposes duplicate column names (e.g. `SELECT t.x, s.x FROM t, s`). The subquery is now wrapped once as `SELECT * FROM (subquery)`, every projected column is renamed to a unique synthetic name, and the NULL filter is built over those names. The matcher-expanded projection nodes keep their position-correct column sources, so only the output names change. This also removes the separate probe clone, resolving the subquery only once. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…names Covers the position-safe NULL filter: subqueries that expose duplicate or qualified column names (via cross joins, repeated columns, or `join_use_nulls`) must still drop rows that contain a NULL in any projected column and compare rows by every projected position. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed three commits (fast-forward, no force-push):
Also merged Verification notes: the changed TU passes |
…ession The `UNIQUE(subquery)` predicate folds to a constant in `resolveUniquePredicate`, but the resulting `ConstantNode` kept the internal `__unique(subquery)` function as its source expression. `PlannerActionsVisitor` short-circuits the action node name of a constant only when its source expression is a `QUERY`/`UNION` node; for any other source it recurses, reaching the raw non-correlated subquery `QueryNode` and throwing `Only correlated QueryNode can be used as action query tree node`. This happened whenever the predicate was nested inside another expression, for example `uniq(UNIQUE((SELECT ...)), b)` in `ORDER BY`. `EXISTS` avoids this only because it has a dedicated special case in `PlannerActionsVisitor`. Use the rewritten subquery `QueryNode` as the source expression instead of the `__unique` function, matching how scalar subqueries fold to a constant. The planner then names the constant by value and never recurses into the subquery. Found by the AST fuzzer (STID 0250-0df5), reproduced and verified locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed one commit (fast-forward, no force-push) to fix the only PR-caused CI failure. The AST fuzzer found a logical error ( Fix: use the rewritten subquery Verified locally on a full build: the exact fuzzer reproducer now returns a result, and The remaining red jobs are master-wide flakes unrelated to this PR: the Azure |
The lambda argument name `__table1.` (a backquoted identifier ending in a dot) makes the targeted AST fuzzer mutate the test table into a column whose ALIAS default type mismatches, which forces error-message formatting to convert the lambda back to an AST. IdentifierNode::toASTImpl then builds an ASTIdentifier whose name parts include an empty trailing part (from splitting `__table1.` on the dot), tripping chassert(!part.empty()) in the ASTIdentifier constructor (pre-existing engine assertion, unrelated to this PR's fix; also seen on unrelated PRs ClickHouse#104436 and ClickHouse#99877). The `__table1.y` lambda-arg cases already cover the same code path in buildShardCollapseFanOut (an unquoted lambda-arg name that looks like a `__tableN.` qualifier but whose tail is not a real column, so its numbering must be left intact), without producing an empty identifier part. Keep those and drop the redundant trailing-dot variant so the test no longer surfaces the unrelated assertion under fuzzing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
In only_analyze mode (EXPLAIN, `CREATE VIEW` validation, distributed shard headers) `resolveUniquePredicate` left a raw non-correlated `QueryNode` in expression position. Consumers that build projection actions from the analyzed tree (for example `InterpreterSelectQueryAnalyzer::getSampleBlock`) then reach it through `PlannerActionsVisitor::visitQuery`, which rejects non-correlated query nodes with "Only correlated QueryNode can be used as an action query tree node". As a result a dry-run path such as `CREATE VIEW v AS SELECT UNIQUE(SELECT number FROM numbers(3))` failed even though the same `SELECT` executes normally. Produce a typed `UInt8` `ConstantNode` with the rewritten subquery as its source expression instead, exactly as the executed path does. Only the result type and structure matter in only_analyze mode; the value is a placeholder, as it is for any scalar subquery in only_analyze mode. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Covers the only_analyze `getSampleBlock` paths (`CREATE VIEW` validation and `EXPLAIN`), including the UNIQUE predicate nested inside another expression, which previously threw "Only correlated QueryNode can be used as an action query tree node". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Advanced this PR (pushed
CI failure triage (all three pre-merge reds are not caused by this feature's logic):
CI will re-run on the merged commit. |
| placeholder_column->getData().push_back(static_cast<UInt8>(0)); | ||
| ConstantValue placeholder_value(ColumnConst::create(std::move(placeholder_column), 1), std::make_shared<DataTypeUInt8>()); | ||
| auto placeholder_const_node = std::make_shared<ConstantNode>(std::move(placeholder_value), new_unique_subquery); | ||
| auto placeholder_projection_name = placeholder_const_node->getValueStringRepresentation(); |
There was a problem hiding this comment.
only_analyze still needs a stable projection/header name here, not the placeholder value. resolveProjectionExpressionNodeList uses the returned ProjectionNames directly for the sample block header, so this branch always names the column 0, while the execute path below returns 0 or 1 from the real UNIQUE result.
That leaks into one of the consumers named in the comment: distributed shard header construction. A query like SELECT UNIQUE((SELECT number FROM numbers(3))) FROM remote(...) can analyze to header 0 UInt8 on the initiator, but execute to 1 UInt8 on the shard when the subquery is actually unique, which is a block-structure mismatch even though the type matches.
Please make the UNIQUE projection name stable across both paths (for example via a fixed synthetic name / __actionName-style wrapper) and add a regression that compares analyze-time and execution-time headers for a distributed or remote case.
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 125/136 (91.91%) · Uncovered code |

Implement the SQL standard
UNIQUEpredicate that returns true when all rows in a subquery are distinct and false when duplicates exist. Empty subqueries return true (vacuously unique).The implementation follows the same pattern as
EXISTS:UniqueLayerclass parsesUNIQUE(subquery)syntaxSELECT count() = uniqExact(*) FROM (subquery)FunctionUniquefor correlated subquery supportCloses #99609
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Added
UNIQUE(subquery)predicate that returns true when all rows in the subquery are distinct. It behaves according to the SQL standard: duplicates yield false, empty subqueries yield true. Closes #99609Documentation entry for user-facing changes