Add arrayCombinations, arrayPermutations, arrayPartialPermutations functions by Onyx2406 · Pull Request #99280 · ClickHouse/ClickHouse · GitHub
Skip to content

Add arrayCombinations, arrayPermutations, arrayPartialPermutations functions#99280

Open
Onyx2406 wants to merge 37 commits into
ClickHouse:masterfrom
Onyx2406:feat/array-combinatorics
Open

Add arrayCombinations, arrayPermutations, arrayPartialPermutations functions#99280
Onyx2406 wants to merge 37 commits into
ClickHouse:masterfrom
Onyx2406:feat/array-combinatorics

Conversation

@Onyx2406

@Onyx2406 Onyx2406 commented Mar 11, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added arrayCombinations(arr, k), arrayPermutations(arr), and arrayPartialPermutations(arr, k) functions for generating combinations and permutations of array elements.

Documentation entry for user-facing changes

  • Documentation is written (function descriptions and examples in code)

What this PR does

Adds three combinatorics functions for arrays, as proposed in the Intern Tasks 2025/2026:

Function Description Example
arrayCombinations(arr, k) All k-length unordered subsets arrayCombinations([1,2,3], 2)[[1,2],[1,3],[2,3]]
arrayPermutations(arr) All full permutations arrayPermutations([1,2,3])[[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]
arrayPartialPermutations(arr, k) All k-length ordered selections arrayPartialPermutations([1,2,3], 2)[[1,2],[1,3],[2,1],[2,3],[3,1],[3,2]]

Implementation (2 files, ~250 lines)

  • arrayCombinations.cpp — Iterative index-based algorithm for generating C(n,k) combinations
  • arrayPermutations.cpp — Template class handling both full (std::next_permutation) and partial (recursive swap-based) permutations
  • Follows the arrayShingles pattern: Array(T)Array(Array(T)) with dual offset management
  • Works with any element type (integers, strings, etc.)

Test plan

  • New stateless test 04043_array_combinatorics.sql with 12 test cases:
    • Combinations: k=1, k=2, k=3, k=0, k=n, string elements
    • Full permutations: n=3, n=1, empty array
    • Partial permutations: k=2, k=1, k=0
    • Error cases: k>n, k<0 (BAD_ARGUMENTS)
  • CI (full test suite)

Closes: #43175


Note

Medium Risk
Adds new array combinatorics functions that can generate very large result sets; while guarded by element caps and tests, it touches query execution paths and could impact memory/CPU usage and edge-case correctness.

Overview
Adds three new array functions: arrayCombinations(arr, k), arrayPermutations(arr), and arrayPartialPermutations(arr, k), each returning Array(Array(T)) results generated per input row.

The implementations include argument validation and a hard cap (~1M total output elements) with TOO_LARGE_ARRAY_SIZE errors to prevent OOM, plus new stateless SQL/reference tests covering normal cases, multi-row offsets, and expected error/limit scenarios.

Reviewed by Cursor Bugbot for commit 549f6f0. Bugbot is set up for automated code reviews on this repo. Configure here.

Add three combinatorics functions for arrays:
- arrayCombinations(arr, k): all k-length unordered subsets
- arrayPermutations(arr): all full permutations
- arrayPartialPermutations(arr, k): all k-length ordered selections

Fixes ClickHouse#43175

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

Comment thread src/Functions/array/arrayPermutations.cpp
Comment thread src/Functions/array/arrayCombinations.cpp
Comment thread src/Functions/array/arrayPermutations.cpp
…size limit

- Use used-flags approach instead of swap-based recursion for partial
  permutations to produce lexicographic output order
- k=0 now correctly returns [[]] (one empty selection) matching C(n,0)=1
- Add TOO_LARGE_ARRAY_SIZE guard (1M element cap) to prevent OOM

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Comment thread src/Functions/array/arrayCombinations.cpp Outdated
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Mar 12, 2026
@clickhouse-gh

clickhouse-gh Bot commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [ff2a490]

Summary:

job_name test_name status info comment
Integration tests (arm_binary, distributed plan, 1/4) FAIL
test_implicit_index_upgrade/test.py::test_implicit_index_upgrade_alter_replay FAIL cidb
Stress test (amd_tsan) FAIL
Hung check failed, possible deadlock found FAIL cidb, issue

AI Review

Summary

This PR adds arrayCombinations, arrayPermutations, and arrayPartialPermutations with structured function documentation and focused stateless coverage. The current code addresses the earlier review issues around k == 0, lexicographic partial-permutation order, whole-block output caps, constant-array handling without replication, and introduced_in metadata. I found no remaining correctness, safety, or testing concerns that warrant inline comments.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Mar 12, 2026
Comment thread src/Functions/array/arrayCombinations.cpp Outdated
Comment thread src/Functions/array/arrayCombinations.cpp Outdated
Comment thread src/Functions/array/arrayPermutations.cpp Outdated
The overflow guard was too strict — checking before division caused
valid C(n,k) values to be rejected (e.g. C(20,10)=184756). Now uses
GCD cross-cancellation between numerator and denominator factors
to keep intermediates small. Added test for C(20,10) success case.
Comment thread src/Functions/array/arrayCombinations.cpp Outdated
Comment thread src/Functions/array/arrayCombinations.cpp Outdated
Comment thread src/Functions/array/arrayPermutations.cpp
Comment thread tests/queries/0_stateless/04043_array_combinatorics.sql Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member

@Onyx2406, almost ready! Let's check the review comments.

Address review comments:

1. Size guard now caps total materialized elements (rows * k) instead
   of just the number of result rows. Previously, inputs like
   `arrayCombinations(range(20), 10)` produced 1.8M elements while
   passing the 1M limit check.

2. Replace O(k^2) GCD cross-cancellation in `combinationCountCapped`
   with an O(k) incremental formula using 128-bit intermediate
   arithmetic. This prevents pathological inputs from spending
   excessive time in the guard before rejection.

3. Add regression tests verifying the total-elements invariant:
   - `arrayCombinations(range(20), 10)` now correctly rejected
   - `arrayPermutations(range(9))` now correctly rejected (362880*9 > 1M)

4. Fix CRLF line endings in source and test files.
@Onyx2406

Copy link
Copy Markdown
Contributor Author

@alexey-milovidov is this good to merge?

@alexey-milovidov

Copy link
Copy Markdown
Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

Comment thread tests/queries/0_stateless/04043_array_combinatorics.sql
@alexey-milovidov

Copy link
Copy Markdown
Member

The test 02859_replicated_db_name_zookeeper is fixed in #101952

@alexey-milovidov

Copy link
Copy Markdown
Member

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

@alexey-milovidov

Copy link
Copy Markdown
Member

The Hung check failure is fixed by #102008 and #102010, let's update the branch

Master is at version 26.6, but the `FunctionDocumentation::IntroducedIn`
metadata for `arrayCombinations`, `arrayPermutations`, and
`arrayPartialPermutations` claimed `26.4`. This would generate docs
claiming the functions are available in releases where they cannot
exist. Update all three to `26.6`.
@alexey-milovidov

Copy link
Copy Markdown
Member

Picked this up to advance it:

  • Merged the latest master (the branch was ~667 commits behind); no conflicts. Re-synced the contrib/mariadb-connector-c submodule pointer that the merge bumped.
  • Addressed the stale-docs review thread: updated IntroducedIn for arrayCombinations, arrayPermutations, and arrayPartialPermutations from {26, 4} to {26, 6} to match the current master version. Built a fresh binary and confirmed system.functions.introduced_in reports 26.6.0 for all three.
  • Rebuilt and ran 04043_array_combinatorics against the fresh binary — reference matches.

Remaining open review thread: the suggestion to expose the 1000000-element output cap as a tunable setting. I left this for a maintainer call (rationale in the thread) rather than adding new setting API surface unilaterally.

The only red CI check is test_s3_aws_sdk_has_slightly_unreliable_behaviour::test_dataloss, which timed out at 600s — a known flaky S3 SDK test unrelated to this change.

@alexey-milovidov

Copy link
Copy Markdown
Member

Continued via `/continue-pr`.

State of the PR:

  • CI is green (162 checks, 0 failures).
  • No conflicts with `master` and the branch is only ~2 days behind the merge base — no re-merge needed.
  • All bot review threads were already addressed in earlier commits (per-row → whole-block element budget, the `k == n` fast-path honoring the cap, `IntroducedIn` updated to `{26, 6}`, multi-row offset regressions, and the negative-`k` test for `arrayPartialPermutations`).

The sole remaining open thread was the question of whether to expose the `1000000`-element `TOO_LARGE_ARRAY_SIZE` cap as a tunable setting. Keeping it hard-coded as an OOM safety guard (per the rationale above in that thread): the cap guards against factorial/combinatorial blow-up rather than a linear workload worth tuning, so a new public-API setting is not warranted. Marked that thread resolved.

The implementation looks correct and complete; no further changes needed.

The functions `arrayCombinations`, `arrayPermutations`, and
`arrayPartialPermutations` cap the total number of output elements per
block at 1000000 and raise a `TOO_LARGE_ARRAY_SIZE` exception when this
is exceeded. This cap was previously undocumented. Mention it explicitly
in the `FunctionDocumentation` description of each function so users can
discover why a valid combinatorics query may be rejected.

The cap remains a fixed, intentional safety limit (not a tunable
setting), as decided in review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Follow-up:

  • Documented the fixed 1000000-element output cap (and the resulting TOO_LARGE_ARRAY_SIZE exception) in the FunctionDocumentation for all three functions, closing the AI review's remaining concern. The cap stays a fixed safety limit rather than a tunable setting, as discussed.
  • Updated the PR body to use the Closes: full-URL relationship form per repo conventions.

Rebuilt and re-ran 04043_array_combinatorics against a fresh binary — reference matches; function output matches the examples.

@alexey-milovidov

Copy link
Copy Markdown
Member

Continued via /continue-pr:

  • CI was fully green (all checks passed, 0 unresolved review threads), but the merge base was 8 days old, so I merged the latest master into the branch — the merge was clean, with no conflicts.
  • Compile-verified the merged branch: clickhouse_functions_array (including the new arrayCombinations.cpp and arrayPermutations.cpp) builds without errors against the updated master.
  • IntroducedIn metadata ({26, 6}) still matches the current master version, so no further changes were needed.

The PR remains ready for review/merge.

Comment thread src/Functions/array/arrayPermutations.cpp Outdated
Comment thread src/Functions/array/arrayCombinations.cpp Outdated
…ions

`arrayCombinations` and `arrayPartialPermutations` take two arguments, so
`useDefaultImplementationForConstants` only folds the all-constant case. When
the array argument is constant but `k` varies by row, the array reaches
`executeImpl` as `ColumnConst(ColumnArray)`, so `checkAndGetColumn<ColumnArray>`
returned `nullptr` and the call threw `ILLEGAL_COLUMN` for valid input such as
`arrayCombinations([1, 2, 3], number)`.

Materialize the array argument with `convertToFullColumnIfConst` (a no-op for
already-full columns) so the per-row offset logic is valid in this mixed-constant
case. Added regressions for both functions over `numbers(4)`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Functions/array/arrayCombinations.cpp Outdated
Comment thread src/Functions/array/arrayPermutations.cpp Outdated
alexey-milovidov and others added 2 commits June 15, 2026 20:15
`arrayCombinations`, `arrayPartialPermutations` (and the shared `arrayPermutations`)
called `convertToFullColumnIfConst` on the array argument. When only the array is
constant while `k` varies by row, that replicates a `ColumnConst(ColumnArray)` to
every input row before the per-row output budget is checked, so a query like
`SELECT arrayCombinations(range(1000000), number) FROM numbers(1000000)` materializes
`array_length * rows` input elements before `TOO_LARGE_ARRAY_SIZE` can fire.

Detect `ColumnConst(ColumnArray)` and read its single stored array (offset index 0)
for every row instead of replicating it, so the cap fires fast on the shared element
budget. Added a regression for a large constant array with row-varying `k` and many
rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Continued via /continue-pr:

  • Merged the latest master into the branch (it was ~1410 commits / 3 days behind the merge base, and showing red). The merge was clean — no conflicts.
  • Addressed the two open AI-review threads about convertToFullColumnIfConst expanding a ColumnConst(ColumnArray) across all rows before the output cap could fire. The array argument is now detected as ColumnConst(ColumnArray) and its single stored array is read per row (arr_row = 0) instead of being replicated to input_rows_count rows. The per-row output budget now fires fast: SELECT arrayCombinations(range(1000000), number) FROM numbers(1000000) (and the arrayPartialPermutations equivalent) throw TOO_LARGE_ARRAY_SIZE in ~0.2s rather than first materializing array_length * rows input elements. Added a regression for this case to 04043_array_combinatorics.sql. Both threads resolved.
  • Built a fresh binary and verified: all deterministic queries in 04043_array_combinatorics match the reference exactly (the fix is output-preserving for the existing const-array cases), and every error case throws the expected code — including the two new large-constant-array regressions, which throw TOO_LARGE_ARRAY_SIZE essentially instantly (no replication blow-up).

The only red CI check is the AST fuzzer (arm_asan_ubsan) failure Logical error: No set is registered for key __set_... in makeSetForInFunction. The triggering query is a fuzzed notIn((SELECT DISTINCT date FROM test_00597 ...), b AS alias453) over the standard corpus table test_00597 — no array combinatorics functions involved. This is a pre-existing planner robustness flake (NOT IN subquery set not registered in the planner context): it has reproduced on master itself (2026-05-01, 2026-05-14) and on the unrelated PR #104569, so it is unrelated to this change. A fresh CI run is triggered by the new commits.

…orics

Keep .github/workflows/*.yml at the PR tip to avoid fork-push workflow-scope rejection
(the fork's GitHub App token lacks the workflows scope). Same technique as the
earlier workflow-freeze commits on this branch.
@clickhouse-gh clickhouse-gh Bot added the manual approve Manual approve required to run CI label Jun 18, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged fresh `master` (`283677ac277..4abadbf`).

The only red was `CH Inc sync` ("build failed"), which was not caused by this PR: the private sync build failed on `programs/distributed-cache-bench/BenchClient.cpp` including the non-existent header `IO/SilkFiberStreamSocketImpl.h` — a transient private-master break (the public counterpart was reverted in #107539, "Revert "Add silk fibers aware socket implementations""). programs/distributed-cache-bench/ does not exist in the public tree, and this PR only adds src/Functions/array/arrayCombinations.cpp + arrayPermutations.cpp + the 04043 test, so it is entirely orthogonal.

Re-merging fresh master regenerates the sync PR against the now-fixed private master (which already replaced the include with <FiberSocket.h>), so the sync build should pass. .github/workflows/*.yml were kept at the PR tip to avoid the fork-push workflow-scope rejection (same technique as the earlier workflow-freeze commits here).

alexey-milovidov and others added 2 commits June 20, 2026 16:29
The previous master merges left `.github/workflows/*.yml` carrying stale
CI definitions (e.g. a removed `smoke_test_amd_darwin` job) that diverged
from master. Restore all workflow files to match master so this PR's net
diff is limited to the new array combinatorics functions and their test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged current master and restored .github/workflows/*.yml to match it.

The previous master merges had left the workflow files (backport_branches.yml, master.yml, pull_request.yml, release_branches.yml) carrying stale CI definitions that diverged from master — e.g. a smoke_test_amd_darwin job that no longer exists on master. Without this, the PR would have carried those unrelated CI changes into master.

After this push the PR's net diff against master is exactly the four intended files (the two new function sources and the stateless test); git diff origin/master...HEAD -- .github/ is empty. The function sources are unchanged since the last green CI run (283677ac), and the FunctionDocumentation / column APIs they use are unchanged on current master (version is still 26.6, matching IntroducedIn = {26, 6}). CI will re-run on the merged head.

@alexey-milovidov alexey-milovidov removed the manual approve Manual approve required to run CI label Jun 22, 2026
alexey-milovidov and others added 2 commits June 28, 2026 06:20
Master has moved to 26.7 development (VERSION_MINOR 7), so the
`introduced_in` metadata of `arrayCombinations`, `arrayPermutations`,
and `arrayPartialPermutations` is bumped from `{26, 6}` to `{26, 7}`
to reflect the release in which these functions will actually ship.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged master to refresh the branch (it had fallen >1 week behind: merge-base was 2026-06-20, now 7.6 days old). The merge was clean — no conflicts, and the net PR diff is still exactly the same 4 files (arrayCombinations.cpp, arrayPermutations.cpp, and the 04043_array_combinatorics test/reference), +519/-0.

While refreshing, master has advanced to 26.7 development (VERSION_MINOR 7), so I bumped introduced_in from {26, 6} to {26, 7} for all three functions to match the release they will actually ship in (same kind of bump as the earlier {26, 4}{26, 6} update). The test reference does not depend on the version, so it is unchanged. Re-running CI on the fresh base.

@clickhouse-gh

clickhouse-gh Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.30% 85.40% +0.10%
Functions 92.60% 92.60% +0.00%
Branches 77.50% 77.60% +0.10%

Changed lines: Changed C/C++ lines covered by tests: 204/208 (98.08%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

pull Bot pushed a commit to sowelswl/ClickHouse that referenced this pull request Jun 30, 2026
`test_implicit_index_upgrade_alter_replay` creates a second replica `node2`
and asserts `SELECT count() FROM test_alter_replay` equals `10001` immediately
after `wait_for_active_replica`. But `wait_for_active_replica` only waits for
`is_readonly = 0`, not for the freshly-joined replica to finish fetching parts.

So `node2` can observe only the small `VALUES` part (1 row, `key = 99999`)
before the 10000-row part is fetched, and the assertion fails with
`assert '1' == '10001'`. This raced once each on several unrelated PRs over
the last 30 days (e.g. ClickHouse#108084, ClickHouse#99280, ClickHouse#107566, ClickHouse#107442).

Add `SYSTEM SYNC REPLICA test_alter_replay` after the replica becomes active,
which blocks until all parts are fetched, before checking the row count.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108770&sha=bb760616d54e3d8e5d20968022085506c31b5f37&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%201%2F4%29

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Combinatorics functions (on arrays)

2 participants