Add functions arrayTopK and arrayBottomK by vitlibar · Pull Request #104563 · ClickHouse/ClickHouse · GitHub
Skip to content

Add functions arrayTopK and arrayBottomK#104563

Merged
vitlibar merged 5 commits into
ClickHouse:masterfrom
vitlibar:add-array-top-k-bottom-k
Jun 2, 2026
Merged

Add functions arrayTopK and arrayBottomK#104563
vitlibar merged 5 commits into
ClickHouse:masterfrom
vitlibar:add-array-top-k-bottom-k

Conversation

@vitlibar

@vitlibar vitlibar commented May 11, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add functions arrayTopK and arrayBottomK:

  • arrayTopK(k, array) returns the K largest elements in descending order
  • arrayBottomK(k, array) returns the K smallest elements in ascending order

For example arrayBottomK(3, [1, 5, 2, 7, 3]) returns [1,2,3]

Likewise arraySort() an optional lambda is supported, so
arrayBottomK((x, y) -> y, 2, ['a', 'b', 'c'], [3, 1, 2]) returns ['b', 'c']

Version info

  • Merged into: 26.6.1.320

@clickhouse-gh

clickhouse-gh Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label May 11, 2026
@vitlibar vitlibar requested a review from Copilot May 11, 2026 07:44

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces new higher-order array functions arrayTopK and arrayBottomK to return the top/bottom K elements of an array (with optional lambda sort keys and NULL skipping). It also updates the Prometheus Query-to-SQL converter to leverage these functions for topk/bottomk/limitk, enabling support for non-constant k aligned to the time grid, and adds test coverage for the new behavior.

Changes:

  • Add arrayTopK / arrayBottomK implementations and function registration.
  • Refactor Prometheus applyLimitAggregationOperator to use the new functions and support scalar-grid k, enabling an integration test previously disabled.
  • Add stateless SQL tests + references, and update aspell dictionary for the new function names.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/queries/0_stateless/04105_array_top_k_bottom_k.sql Adds stateless queries covering top/bottom K behavior, NULL handling, lambdas, and errors.
tests/queries/0_stateless/04105_array_top_k_bottom_k.reference Expected outputs for the new stateless tests.
tests/integration/test_prometheus_protocols/test_evaluation.py Re-enables a PromQL topk test with time-varying k now that it’s supported.
src/Storages/TimeSeries/PrometheusQueryToSQL/applyLimitAggregationOperator.cpp Uses arrayTopK/arrayBottomK for index selection and implements scalar-grid k handling.
src/Functions/array/arrayTopK.h Declares arrayTopK/arrayBottomK via FunctionArrayMapped and defines return type / argument checks.
src/Functions/array/arrayTopK.cpp Implements the core top/bottom K selection logic and registers both functions + docs.
ci/jobs/scripts/check_style/aspell-ignore/en/aspell-dict.txt Adds arrayTopK and arrayBottomK to spelling ignore list.

Comment thread src/Functions/array/arrayTopK.h
Comment thread src/Functions/array/arrayTopK.cpp Outdated
Comment thread src/Functions/array/arrayTopK.cpp Outdated
@vitlibar

vitlibar commented May 11, 2026

Copy link
Copy Markdown
Member Author

The closest existing alternative is arrayPartialSort(k, arr), but it returns unsorted elements k+1..N in unspecified order and also doesn't skip NULLs. So arrayBottomK(k, arr) is roughly the same as

arraySlice(k, arrayPartialSort(k, arrayFilter(x -> x IS NOT NULL, arr)), 1, k)

@vitlibar vitlibar marked this pull request as draft May 11, 2026 08:26
@vitlibar vitlibar force-pushed the add-array-top-k-bottom-k branch from d61b373 to 3afdcfe Compare May 11, 2026 09:39
@vitlibar vitlibar requested a review from Copilot May 11, 2026 09:40

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread src/Functions/array/arrayTopK.h
Comment thread src/Functions/array/arrayTopK.cpp Outdated
@vitlibar vitlibar force-pushed the add-array-top-k-bottom-k branch from 3afdcfe to aaf640d Compare May 11, 2026 10:03
@vitlibar vitlibar marked this pull request as ready for review May 11, 2026 10:35
@nihalzp nihalzp self-assigned this May 11, 2026
@vitlibar

Copy link
Copy Markdown
Member Author

@groeneai Investigate CI failures

@groeneai

Copy link
Copy Markdown
Contributor

@vitlibar — investigated CI failures on aaf640d (AST fuzzer ran at 10:35 UTC; perf at later times today). All chronic / infra, none PR-caused — detail below.

1. AST fuzzer (amd_debug, targeted, old_compatibility) — STID 3520-1e53 (Bad cast from type A to B) — CHRONIC family

The specific STID variant is new today (3520-1e53 = 1 hit / this PR), but it's a member of the chronic STID 3520-* family which is already known: sibling variants 3520-3e11 (13 hits / 6 PRs), 3520-4237 (8 hits / 8 PRs), 3520-5bd5 (6 hits / 1 PR), etc. The broader "Bad cast from type A to B" LOGICAL_ERROR has 86 hits / 47 PRs / 4 master hits in 30 days. AST fuzzer mutations regularly produce new STID hashes for variants of the same underlying engine bug; this is one such variant.

The crash path is in PlannerActionsVisitor::visitFunctionIExecutableFunction::executeWithoutLowCardinalityColumns (defaultImplementationForConstantArguments), not in your arrayTopK/arrayBottomK code.

2. Performance Comparison (arm_release, master_head, 1/6 / 5/6 / 6/6)Insert raw query metrics data / Insert historical dataCIDB INFRASTRUCTURE

All 6 shards failed identically with Code: 202. DB::Exception: Too many simultaneous queries. Maximum: 1000. (TOO_MANY_SIMULTANEOUS_QUERIES) or HTTPSConnectionPool(host='play.clickhouse.com', port=443): Read timed out. Unrelated to PR diff — global play.clickhouse.com throughput limit hit (also visible on PR #104564 today).

3. Performance Comparison Check Results/*::old/newNOISE

grace_hash_join, norm_distance, apply_patch_parts_join, function_calculation_after_sorting_and_limit, join_convert_outer_to_inner, set_index — all marked unstable or slower on queries outside arrayTopK/arrayBottomK code paths. Amplified by the CIDB write failures (no historical baseline). Should clear up on a rebase + retry.

Recommendation

Rebase against current master and re-run CI; the perf comparison + AST fuzzer should clear.

Comment thread src/Functions/array/arrayTopK.cpp

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread tests/queries/0_stateless/04105_array_top_k_bottom_k.sql
Comment thread src/Functions/array/arrayTopK.cpp
Comment thread tests/integration/test_prometheus_protocols/test_evaluation.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

src/Functions/array/arrayTopK.cpp:276

  • The implementation and tests define signed negative K values as clamped to 0, but the public function documentation for arrayBottomK does not mention that behavior. Please document this edge case alongside the K argument description.
    FunctionDocumentation::Syntax syntax = "arrayBottomK([f,] K, arr [, arr1, ... ,arrN])";
    FunctionDocumentation::Arguments arguments = {
        {"f(arr[, arr1, ... ,arrN])", "Optional. A lambda function to compute the sort key for each element.", {"Lambda function"}},
        {"K", "The number of smallest elements to return.", {"(U)Int*"}},
        {"arr", "An array.", {"Array(T)"}},

Comment thread tests/queries/0_stateless/04105_array_top_k_bottom_k.sql
Comment thread src/Functions/array/arrayTopK.cpp
Comment thread src/Functions/array/arrayTopK.cpp Outdated
@vitlibar vitlibar force-pushed the add-array-top-k-bottom-k branch from 7c9b2ab to c4d82eb Compare May 16, 2026 17:55
Comment thread src/Functions/array/arrayTopK.cpp Outdated
@vitlibar vitlibar force-pushed the add-array-top-k-bottom-k branch from c4d82eb to 06d75cf Compare May 16, 2026 18:13
Comment thread src/Functions/array/arrayTopK.h Outdated
Comment thread src/Functions/array/arrayTopK.cpp
Comment thread src/Functions/array/arrayTopK.cpp Outdated
Comment thread src/Functions/array/arrayTopK.cpp Outdated
Comment thread src/Functions/array/arrayTopK.cpp Outdated
Comment thread src/Functions/array/arrayTopK.cpp Outdated
Comment thread src/Functions/array/arrayTopK.cpp
Comment thread src/Functions/array/arrayTopK.cpp Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread src/Storages/TimeSeries/PrometheusQueryToSQL/applyLimitAggregationOperator.cpp Outdated
Comment thread src/Functions/array/arrayTopK.h Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment thread src/Functions/array/arrayTopK.h Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

Comment thread src/Functions/array/arrayTopK.cpp Outdated
Comment thread src/Functions/array/arrayTopK.cpp Outdated
@vitlibar

vitlibar commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

@groeneai Investigate CI failures

@groeneai

groeneai commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@vitlibar Triage on head cbb8e0e3:

Check Status Verdict
Stateless tests (amd_msan, WasmEdge, parallel, 2/2) FAILURE Chronic WasmEdge MSAN flake (not PR-caused)
All other 143 checks SUCCESS clean

Failing test: 03469_json_read_subcolumns_combined_2_compact_merge_tree — timeout. CIDB 30-day cross-PR window shows 10 distinct unrelated PRs + 2 master hits failing on the same amd_msan, WasmEdge job (12 of 13 hits), 1 on amd_tsan. Failure spread across 11 days (2026-05-08 to 2026-05-31), well before this PR's latest commit. --diagnose-random-settings confirmed reproducible without minimizing within budget — settings-induced timeout under MSAN+WasmEdge's combined overhead.

Root cause family: the amd_msan, WasmEdge build is a known broken sanitizer/runtime combination — non-deterministic timeouts on slow MSAN-instrumented JSON/MergeTree paths plus the WASM runtime. Fix is being prepared in #105019 (change default x86 target from x86-64-v2 to x86-64-v3) which addresses the underlying __msan_chain_origin + vzeroupper interaction.

Recommendation: rebase + retry on the amd_msan WasmEdge job. If it persists, treat as infra and proceed — this PR has the assignee approval and is functionally clean.

@vitlibar vitlibar force-pushed the add-array-top-k-bottom-k branch from cbb8e0e to afa1c99 Compare June 1, 2026 14:18
@clickhouse-gh

clickhouse-gh Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.40% 84.50% +0.10%
Functions 91.40% 91.40% +0.00%
Branches 77.00% 77.10% +0.10%

Changed lines: Changed C/C++ lines covered by tests: 244/255 (95.69%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@vitlibar vitlibar added this pull request to the merge queue Jun 2, 2026
Merged via the queue into ClickHouse:master with commit cb1182c Jun 2, 2026
166 checks passed
@vitlibar vitlibar deleted the add-array-top-k-bottom-k branch June 2, 2026 09:40
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 2, 2026
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jun 10, 2026
`CoalescingMergeTree` delegates to `SummingSortedAlgorithm` with
`aggregate_all_columns = true`. The dedicated branch for that case in
`defineColumns` (`src/Processors/Merges/Algorithms/SummingSortedAlgorithm.cpp:267`)
wrapped its work in `if (map_name == column.name || !endsWith(map_name, "Map"))`
and then fell through to an unconditional `continue`. For any nested column
whose parent table name ends with `Map` (e.g. `testMap.key`,
`legacy_features_Map.id`), the inner branch was skipped while the `continue`
still fired, so the column was added to neither `columns_to_aggregate` nor
`column_numbers_not_to_aggregate`.

`postprocessChunk` then constructed `res_columns` of size `num_result_columns`
and left those slots `nullptr`. `Chunk::setColumns` -> `Chunk::checkNumRowsIsConsistent`
dereferenced the null `ColumnPtr`, producing a SIGSEGV in release builds, a
`px != 0` assertion in debug builds, and an UBSan `member call on null pointer`
report in `arm_asan_ubsan` (STID 1003-326e on the AST fuzzer for PR ClickHouse#104281 and
PR ClickHouse#104563).

CoalescingMergeTree performs per-column `last_value` coalescing, so nested
`Map` arrays must be coalesced element-wise like any other array column. The
legacy `nestedName.suffix` routing into `discovered_maps` for `sumMap`-by-key
merging is a SummingMergeTree concept that does not apply here, so the
Map-suffix exclusion is dropped from the `aggregate_all_columns = true` branch.
SummingMergeTree code paths are untouched.

Closes ClickHouse#101937
CI report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104563/c2e0f0fe3b92d48b5a47fa90555f7d4aca54c553/ast_fuzzer_arm_asan_ubsan/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants