Support `UNIQUE` predicate that tests for no duplicate rows in a subquery by RenzoMXD · Pull Request #99877 · ClickHouse/ClickHouse · GitHub
Skip to content

Support UNIQUE predicate that tests for no duplicate rows in a subquery#99877

Open
RenzoMXD wants to merge 42 commits into
ClickHouse:masterfrom
RenzoMXD:feat-unique-subquery-predicate
Open

Support UNIQUE predicate that tests for no duplicate rows in a subquery#99877
RenzoMXD wants to merge 42 commits into
ClickHouse:masterfrom
RenzoMXD:feat-unique-subquery-predicate

Conversation

@RenzoMXD

@RenzoMXD RenzoMXD commented Mar 18, 2026

Copy link
Copy Markdown
Contributor

Implement the SQL standard UNIQUE predicate that returns true when all rows in a subquery are distinct and false when duplicates exist. Empty subqueries return true (vacuously unique).

The implementation follows the same pattern as EXISTS:

  • Parser: UniqueLayer class parses UNIQUE(subquery) syntax
  • Analyzer: rewrites to SELECT count() = uniqExact(*) FROM (subquery)
  • Stub function: FunctionUnique for correlated subquery support

Closes #99609

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added UNIQUE(subquery) predicate that returns true when all rows in the subquery are distinct. It behaves according to the SQL standard: duplicates yield false, empty subqueries yield true. Closes #99609

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Mar 18, 2026
Comment thread src/Analyzer/Resolve/resolveFunction.cpp Outdated
@clickhouse-gh

clickhouse-gh Bot commented Mar 18, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Mar 18, 2026
@RenzoMXD

Copy link
Copy Markdown
Contributor Author

@alexey-milovidov Please review my PR again. Thanks.

@amosbird

Copy link
Copy Markdown
Collaborator

The current rewrite SELECT count() = uniqExact(*) FROM (subquery) is heavier than necessary — it must scan all data even if a duplicate is found on the 2nd row, and runs two aggregate functions simultaneously.

The key property of UNIQUE is that non-unique can terminate immediately. We should exploit this.

Proposed approach

Create an internal aggregate function __hasNoDuplicates(*) that maintains a hash set. On each add, it inserts the row hash — if the key already exists (duplicate), it sets a flag and short-circuits.

The critical part: reuse the existing Aggregator::executeOnBlock BREAK path to cancel I/O.

Today, max_rows_to_group_by with group_by_overflow_mode = 'break' causes Aggregator::executeOnBlock to return false. AggregatingTransform then sets is_consume_finished = true and closes its input port, cascading upstream through the pipeline to stop storage reads. This path is stable and well-tested.

Add two virtual methods to IAggregateFunction:

virtual bool isEarlyTerminable() const { return false; }
virtual bool shouldTerminateEarly(AggregateDataPtr place) const { return false; }

After Aggregator::executeOnBlock processes a block, check whether any aggregate requests early termination — if so, return false, taking the exact same code path as BREAK. No new processors or transforms needed.

Analyzer rewrite becomes simply:

UNIQUE(subquery) → SELECT __hasNoDuplicates(*) FROM (subquery) WHERE <null filter>

Benefits

Current (count() = uniqExact(*)) Proposed (__hasNoDuplicates)
Best case (dup on row 2) O(n) time + full I/O O(1), stops after 1 block
Worst case (all unique) O(n), 2 aggregates O(n), 1 aggregate
I/O cancellation None Block-level via existing BREAK path
New processors None None
IAggregateFunction changes None 2 virtual methods (default false, backward-compatible)

Implementation scope

  • IAggregateFunction: 2 new virtual methods
  • Aggregator::executeOnBlock: ~5 lines to check early termination after block processing
  • AggregateFunctionHasNoDuplicates: ~80 lines (hash set + early termination interface)
  • resolveFunction.cpp: simpler rewrite than current (1 aggregate instead of 2)

The early termination interface is generic and reusable by future aggregate functions.

@RenzoMXD

Copy link
Copy Markdown
Contributor Author

@amosbird Thanks very much. I'll fix based on your suggestions.

Comment thread src/AggregateFunctions/AggregateFunctionHasNoDuplicates.h Outdated
Comment thread src/Parsers/ExpressionListParsers.cpp
@RenzoMXD RenzoMXD force-pushed the feat-unique-subquery-predicate branch 2 times, most recently from 88ad14c to d80e1de Compare March 19, 2026 14:26
Comment thread src/Core/Settings.cpp Outdated
@RenzoMXD RenzoMXD force-pushed the feat-unique-subquery-predicate branch 3 times, most recently from 574ed12 to ef343a9 Compare March 19, 2026 20:07
Comment thread src/Interpreters/Aggregator.cpp Outdated
@RenzoMXD RenzoMXD force-pushed the feat-unique-subquery-predicate branch from ef343a9 to b275828 Compare March 19, 2026 20:58
Comment thread src/AggregateFunctions/AggregateFunctionHasNoDuplicates.h Outdated
@RenzoMXD RenzoMXD force-pushed the feat-unique-subquery-predicate branch from b275828 to b00ff36 Compare March 19, 2026 21:34
Comment thread src/Analyzer/Resolve/resolveFunction.cpp Outdated
@RenzoMXD RenzoMXD closed this Mar 23, 2026
@RenzoMXD RenzoMXD force-pushed the feat-unique-subquery-predicate branch from b00ff36 to 9e0dd1f Compare March 23, 2026 15:46
@RenzoMXD RenzoMXD reopened this Mar 23, 2026
@clickhouse-gh clickhouse-gh Bot added the manual approve Manual approve required to run CI label Mar 23, 2026
@RenzoMXD

Copy link
Copy Markdown
Contributor Author

@alexey-milovidov Can you take another look on my PR please? Thanks.

@alexey-milovidov

Copy link
Copy Markdown
Member

I would like to find a better name for this aggregate function, so it will be usable in normal scenarios.

For example, allUnique

@RenzoMXD RenzoMXD force-pushed the feat-unique-subquery-predicate branch 2 times, most recently from 40b1587 to 1e94863 Compare March 23, 2026 19:32
Comment thread contrib/SimSIMD
- Rename test from 04054_unique_predicate to 04147_unique_predicate to
  avoid the 04054_* prefix collision with sixteen unrelated tests.
- Preserve the original expression as the source AST when constructing
  the result ConstantNode in non-only_analyze mode, mirroring how the
  EXISTS branch builds its constant. Without it, EXPLAIN output and
  error messages lose the original unique(...) shape.
- In only_analyze mode, return projection name "unique" instead of the
  internal name "__unique", so EXPLAIN does not leak the dispatch tag.
- Rewrite the fallback comment in the non-analyze path: it does not
  fire for empty subqueries (count() = uniqExact(*) yields TRUE there),
  it only defends against a bug where evaluateScalarSubqueryIfNeeded
  fails to produce a constant.

Refs ClickHouse#99877
@RenzoMXD RenzoMXD closed this May 4, 2026
@rschu1ze rschu1ze reopened this May 4, 2026
@rschu1ze

rschu1ze commented May 4, 2026

Copy link
Copy Markdown
Member

@novikd Could you please continue the review / integration when you're available? Thanks.

Comment thread src/Analyzer/Resolve/resolveFunction.cpp Outdated
Address review comment on
ClickHouse#99877 (comment)...
("Falling back to `1` ... hides analyzer/execution inconsistencies and
can silently return a wrong UNIQUE result").

After `evaluateScalarSubqueryIfNeeded`, the rewrite expects a non-NULL
scalar constant. If that invariant is violated, fail loudly with
`LOGICAL_ERROR` instead of defaulting to true.
@RenzoMXD

RenzoMXD commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

@novikd Could you review my PR again please?

@RenzoMXD RenzoMXD requested a review from novikd May 7, 2026 11:32
@clickhouse-gh

clickhouse-gh Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Dear @novikd, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@clickhouse-gh clickhouse-gh Bot assigned novikd and unassigned novikd Jun 1, 2026
alexey-milovidov and others added 2 commits June 15, 2026 01:12
…dicate

# Conflicts:
#	src/Analyzer/Resolve/resolveFunction.cpp
The UNIQUE predicate is parsed into the internal function `__unique`, which
is only rewritten by the analyzer in `QueryAnalyzer::resolveUniquePredicate`.
The old analyzer (`enable_analyzer = 0`) has no handling for it, so a query
using UNIQUE previously failed with an obscure `Unknown function __unique`
exception. Reject it explicitly in `TreeRewriter` (used only by the old
analyzer) with a clear `NOT_IMPLEMENTED` message pointing to `enable_analyzer = 1`,
mirroring how QUALIFY and WITH RECURSIVE are handled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Analyzer/Resolve/resolveFunction.cpp Outdated
alexey-milovidov and others added 3 commits June 20, 2026 11:51
`ConstantValue` now requires a `ColumnConstPtr`, so constructing it from a
mutable `ColumnUInt8` no longer compiles (`no matching constructor for
initialization of boost::intrusive_ptr<const DB::ColumnConst>`). Wrap the
result column with `ColumnConst::create`, matching the `EXISTS` rewrite in
the same file.

Also make the per-column `isNotNull` filter address columns by position
instead of by name. The previous rewrite referenced the subquery's projected
columns by name, which can bind several filters to the same column when a
subquery exposes duplicate column names (e.g. `SELECT t.x, s.x FROM t, s`).
The subquery is now wrapped once as `SELECT * FROM (subquery)`, every
projected column is renamed to a unique synthetic name, and the NULL filter
is built over those names. The matcher-expanded projection nodes keep their
position-correct column sources, so only the output names change. This also
removes the separate probe clone, resolving the subquery only once.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…names

Covers the position-safe NULL filter: subqueries that expose duplicate or
qualified column names (via cross joins, repeated columns, or `join_use_nulls`)
must still drop rows that contain a NULL in any projected column and compare
rows by every projected position.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Pushed three commits (fast-forward, no force-push):

  1. Fix build errorConstantValue now requires a ColumnConstPtr, so the rewrite no longer compiled (no matching constructor for boost::intrusive_ptr<const DB::ColumnConst> at resolveFunction.cpp:393). Wrapped the result with ColumnConst::create, matching the EXISTS rewrite. This was the cause of the Fast test / Build (arm_tidy) failures.
  2. Make the NULL filter position-safe (addresses the AI-review block) — the per-column isNotNull filter no longer references the subquery's columns by name. The subquery is wrapped once as SELECT * FROM (subquery), every projected column is renamed to a unique synthetic name, and the filter is built over those names, so it addresses columns by position. This also removes the separate probe clone (the subquery is resolved only once).
  3. Regression test 04401_unique_predicate_duplicate_columns covering duplicate/qualified column names with NULLs (cross joins, repeated columns, join_use_nulls).

Also merged master (the branch was 1683 commits behind; clean merge, no conflicts).

Verification notes: the changed TU passes clang -fsyntax-only against the merged tree, and the produced rewrite was validated semantically against a master build for all of the test cases. A full local build was not possible in this environment, so CI is the final integration check.

…ession

The `UNIQUE(subquery)` predicate folds to a constant in `resolveUniquePredicate`,
but the resulting `ConstantNode` kept the internal `__unique(subquery)` function as
its source expression. `PlannerActionsVisitor` short-circuits the action node name
of a constant only when its source expression is a `QUERY`/`UNION` node; for any
other source it recurses, reaching the raw non-correlated subquery `QueryNode` and
throwing `Only correlated QueryNode can be used as action query tree node`. This
happened whenever the predicate was nested inside another expression, for example
`uniq(UNIQUE((SELECT ...)), b)` in `ORDER BY`. `EXISTS` avoids this only because it
has a dedicated special case in `PlannerActionsVisitor`.

Use the rewritten subquery `QueryNode` as the source expression instead of the
`__unique` function, matching how scalar subqueries fold to a constant. The planner
then names the constant by value and never recurses into the subquery.

Found by the AST fuzzer (STID 0250-0df5), reproduced and verified locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Pushed one commit (fast-forward, no force-push) to fix the only PR-caused CI failure.

The AST fuzzer found a logical error (Only correlated QueryNode can be used as action query tree node, STID 0250-0df5) that triggers when the UNIQUE predicate is nested inside another expression, e.g. uniq(UNIQUE((SELECT ...)), b) in ORDER BY. The predicate folds to a constant in resolveUniquePredicate, but the resulting ConstantNode kept the internal __unique(subquery) function as its source expression. PlannerActionsVisitor only short-circuits the action-node name of a constant when its source is a QUERY/UNION node; otherwise it recurses into the source and reaches the raw non-correlated subquery QueryNode, throwing. (EXISTS escapes only because it has a dedicated special case there.)

Fix: use the rewritten subquery QueryNode as the source expression instead of the __unique function, matching how scalar subqueries fold to a constant, so the planner names the constant by value and never recurses into the subquery.

Verified locally on a full build: the exact fuzzer reproducer now returns a result, and 04147_unique_predicate plus a new 04410_unique_predicate_nested_in_expression regression test (nested in uniq in ORDER BY/SELECT, scalar contexts, duplicate columns, and the exact fuzzer query) both pass.

The remaining red jobs are master-wide flakes unrelated to this PR: the Azure base64.cpp ASan global-buffer-overflow, Hung check (#107941), Cannot start clickhouse-server, and Unexpected exception in refresh scheduling (STID 2508-34af).

Comment thread src/Analyzer/Resolve/resolveFunction.cpp Outdated
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 28, 2026
The lambda argument name `__table1.` (a backquoted identifier ending in a
dot) makes the targeted AST fuzzer mutate the test table into a column whose
ALIAS default type mismatches, which forces error-message formatting to
convert the lambda back to an AST. IdentifierNode::toASTImpl then builds an
ASTIdentifier whose name parts include an empty trailing part (from splitting
`__table1.` on the dot), tripping chassert(!part.empty()) in the ASTIdentifier
constructor (pre-existing engine assertion, unrelated to this PR's fix; also
seen on unrelated PRs ClickHouse#104436 and ClickHouse#99877).

The `__table1.y` lambda-arg cases already cover the same code path in
buildShardCollapseFanOut (an unquoted lambda-arg name that looks like a
`__tableN.` qualifier but whose tail is not a real column, so its numbering
must be left intact), without producing an empty identifier part. Keep those
and drop the redundant trailing-dot variant so the test no longer surfaces
the unrelated assertion under fuzzing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
alexey-milovidov and others added 3 commits July 3, 2026 09:08
In only_analyze mode (EXPLAIN, `CREATE VIEW` validation, distributed shard
headers) `resolveUniquePredicate` left a raw non-correlated `QueryNode` in
expression position. Consumers that build projection actions from the analyzed
tree (for example `InterpreterSelectQueryAnalyzer::getSampleBlock`) then reach
it through `PlannerActionsVisitor::visitQuery`, which rejects non-correlated
query nodes with "Only correlated QueryNode can be used as an action query tree
node". As a result a dry-run path such as
`CREATE VIEW v AS SELECT UNIQUE(SELECT number FROM numbers(3))` failed even
though the same `SELECT` executes normally.

Produce a typed `UInt8` `ConstantNode` with the rewritten subquery as its source
expression instead, exactly as the executed path does. Only the result type and
structure matter in only_analyze mode; the value is a placeholder, as it is for
any scalar subquery in only_analyze mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Covers the only_analyze `getSampleBlock` paths (`CREATE VIEW` validation and
`EXPLAIN`), including the UNIQUE predicate nested inside another expression,
which previously threw "Only correlated QueryNode can be used as an action
query tree node".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Advanced this PR (pushed 5fc2308f836):

  1. Merged master (as @rschu1ze requested — the branch was ~2 weeks behind). The merge was clean, no conflicts.

  2. Addressed the unresolved review thread on resolveFunction.cpp (the only_analyze path). Previously resolveUniquePredicate left a raw non-correlated QueryNode in expression position in only_analyze mode, so consumers that build projection actions from the analyzed tree — e.g. InterpreterSelectQueryAnalyzer::getSampleBlock, used by CREATE VIEW validation and distributed shard headers — reached it through PlannerActionsVisitor::visitQuery and threw Only correlated QueryNode can be used as an action query tree node. So CREATE VIEW v AS SELECT UNIQUE(SELECT number FROM numbers(3)) failed even though the plain SELECT worked. The only_analyze result now folds to a typed UInt8 ConstantNode with the rewritten subquery as source, exactly as the executed path does. Added regression test 04498_unique_predicate_only_analyze.

CI failure triage (all three pre-merge reds are not caused by this feature's logic):

CI will re-run on the merged commit.

placeholder_column->getData().push_back(static_cast<UInt8>(0));
ConstantValue placeholder_value(ColumnConst::create(std::move(placeholder_column), 1), std::make_shared<DataTypeUInt8>());
auto placeholder_const_node = std::make_shared<ConstantNode>(std::move(placeholder_value), new_unique_subquery);
auto placeholder_projection_name = placeholder_const_node->getValueStringRepresentation();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only_analyze still needs a stable projection/header name here, not the placeholder value. resolveProjectionExpressionNodeList uses the returned ProjectionNames directly for the sample block header, so this branch always names the column 0, while the execute path below returns 0 or 1 from the real UNIQUE result.

That leaks into one of the consumers named in the comment: distributed shard header construction. A query like SELECT UNIQUE((SELECT number FROM numbers(3))) FROM remote(...) can analyze to header 0 UInt8 on the initiator, but execute to 1 UInt8 on the shard when the subquery is actually unique, which is a block-structure mismatch even though the type matches.

Please make the UNIQUE projection name stable across both paths (for example via a fixed synthetic name / __actionName-style wrapper) and add a regression that compares analyze-time and execution-time headers for a distributed or remote case.

@clickhouse-gh

clickhouse-gh Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.50% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 125/136 (91.91%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors manual approve Manual approve required to run CI pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support UNIQUE(subquery) predicate that tests for no duplicate rows

5 participants