Add arrayCombinations, arrayPermutations, arrayPartialPermutations functions#99280
Add arrayCombinations, arrayPermutations, arrayPartialPermutations functions#99280Onyx2406 wants to merge 37 commits into
Conversation
Add three combinatorics functions for arrays: - arrayCombinations(arr, k): all k-length unordered subsets - arrayPermutations(arr): all full permutations - arrayPartialPermutations(arr, k): all k-length ordered selections Fixes ClickHouse#43175
…size limit - Use used-flags approach instead of swap-based recursion for partial permutations to produce lexicographic output order - k=0 now correctly returns [[]] (one empty selection) matching C(n,0)=1 - Add TOO_LARGE_ARRAY_SIZE guard (1M element cap) to prevent OOM
|
Workflow [PR], commit [ff2a490] Summary: ❌
AI ReviewSummaryThis PR adds Final VerdictStatus: ✅ Approve |
The overflow guard was too strict — checking before division caused valid C(n,k) values to be rejected (e.g. C(20,10)=184756). Now uses GCD cross-cancellation between numerator and denominator factors to keep intermediates small. Added test for C(20,10) success case.
|
@Onyx2406, almost ready! Let's check the review comments. |
Address review comments: 1. Size guard now caps total materialized elements (rows * k) instead of just the number of result rows. Previously, inputs like `arrayCombinations(range(20), 10)` produced 1.8M elements while passing the 1M limit check. 2. Replace O(k^2) GCD cross-cancellation in `combinationCountCapped` with an O(k) incremental formula using 128-bit intermediate arithmetic. This prevents pathological inputs from spending excessive time in the guard before rejection. 3. Add regression tests verifying the total-elements invariant: - `arrayCombinations(range(20), 10)` now correctly rejected - `arrayPermutations(range(9))` now correctly rejected (362880*9 > 1M) 4. Fix CRLF line endings in source and test files.
|
@alexey-milovidov is this good to merge? |
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
|
The test |
|
The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158 |
…06-feat-array-combinatorics
Master is at version 26.6, but the `FunctionDocumentation::IntroducedIn` metadata for `arrayCombinations`, `arrayPermutations`, and `arrayPartialPermutations` claimed `26.4`. This would generate docs claiming the functions are available in releases where they cannot exist. Update all three to `26.6`.
|
Picked this up to advance it:
Remaining open review thread: the suggestion to expose the The only red CI check is |
|
Continued via `/continue-pr`. State of the PR:
The sole remaining open thread was the question of whether to expose the `1000000`-element `TOO_LARGE_ARRAY_SIZE` cap as a tunable setting. Keeping it hard-coded as an OOM safety guard (per the rationale above in that thread): the cap guards against factorial/combinatorial blow-up rather than a linear workload worth tuning, so a new public-API setting is not warranted. Marked that thread resolved. The implementation looks correct and complete; no further changes needed. |
The functions `arrayCombinations`, `arrayPermutations`, and `arrayPartialPermutations` cap the total number of output elements per block at 1000000 and raise a `TOO_LARGE_ARRAY_SIZE` exception when this is exceeded. This cap was previously undocumented. Mention it explicitly in the `FunctionDocumentation` description of each function so users can discover why a valid combinatorics query may be rejected. The cap remains a fixed, intentional safety limit (not a tunable setting), as decided in review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Follow-up:
Rebuilt and re-ran |
|
Continued via
The PR remains ready for review/merge. |
…ions `arrayCombinations` and `arrayPartialPermutations` take two arguments, so `useDefaultImplementationForConstants` only folds the all-constant case. When the array argument is constant but `k` varies by row, the array reaches `executeImpl` as `ColumnConst(ColumnArray)`, so `checkAndGetColumn<ColumnArray>` returned `nullptr` and the call threw `ILLEGAL_COLUMN` for valid input such as `arrayCombinations([1, 2, 3], number)`. Materialize the array argument with `convertToFullColumnIfConst` (a no-op for already-full columns) so the per-row offset logic is valid in this mixed-constant case. Added regressions for both functions over `numbers(4)`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`arrayCombinations`, `arrayPartialPermutations` (and the shared `arrayPermutations`) called `convertToFullColumnIfConst` on the array argument. When only the array is constant while `k` varies by row, that replicates a `ColumnConst(ColumnArray)` to every input row before the per-row output budget is checked, so a query like `SELECT arrayCombinations(range(1000000), number) FROM numbers(1000000)` materializes `array_length * rows` input elements before `TOO_LARGE_ARRAY_SIZE` can fire. Detect `ColumnConst(ColumnArray)` and read its single stored array (offset index 0) for every row instead of replicating it, so the cap fires fast on the shared element budget. Added a regression for a large constant array with row-varying `k` and many rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Continued via
The only red CI check is the |
…orics Keep .github/workflows/*.yml at the PR tip to avoid fork-push workflow-scope rejection (the fork's GitHub App token lacks the workflows scope). Same technique as the earlier workflow-freeze commits on this branch.
|
Merged fresh `master` (`283677ac277..4abadbf`). The only red was `CH Inc sync` ("build failed"), which was not caused by this PR: the private sync build failed on `programs/distributed-cache-bench/BenchClient.cpp` including the non-existent header `IO/SilkFiberStreamSocketImpl.h` — a transient private-master break (the public counterpart was reverted in #107539, "Revert "Add silk fibers aware socket implementations""). Re-merging fresh master regenerates the sync PR against the now-fixed private master (which already replaced the include with |
The previous master merges left `.github/workflows/*.yml` carrying stale CI definitions (e.g. a removed `smoke_test_amd_darwin` job) that diverged from master. Restore all workflow files to match master so this PR's net diff is limited to the new array combinatorics functions and their test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Merged current The previous master merges had left the workflow files ( After this push the PR's net diff against |
Master has moved to 26.7 development (VERSION_MINOR 7), so the
`introduced_in` metadata of `arrayCombinations`, `arrayPermutations`,
and `arrayPartialPermutations` is bumped from `{26, 6}` to `{26, 7}`
to reflect the release in which these functions will actually ship.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Merged While refreshing, master has advanced to 26.7 development ( |
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered by tests: 204/208 (98.08%) | Lost baseline coverage: none · Uncovered code |
`test_implicit_index_upgrade_alter_replay` creates a second replica `node2` and asserts `SELECT count() FROM test_alter_replay` equals `10001` immediately after `wait_for_active_replica`. But `wait_for_active_replica` only waits for `is_readonly = 0`, not for the freshly-joined replica to finish fetching parts. So `node2` can observe only the small `VALUES` part (1 row, `key = 99999`) before the 10000-row part is fetched, and the assertion fails with `assert '1' == '10001'`. This raced once each on several unrelated PRs over the last 30 days (e.g. ClickHouse#108084, ClickHouse#99280, ClickHouse#107566, ClickHouse#107442). Add `SYSTEM SYNC REPLICA test_alter_replay` after the replica becomes active, which blocks until all parts are fetched, before checking the row count. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108770&sha=bb760616d54e3d8e5d20968022085506c31b5f37&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%201%2F4%29 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Changelog category (leave one):
New Feature
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Added
arrayCombinations(arr, k),arrayPermutations(arr), andarrayPartialPermutations(arr, k)functions for generating combinations and permutations of array elements.Documentation entry for user-facing changes
What this PR does
Adds three combinatorics functions for arrays, as proposed in the Intern Tasks 2025/2026:
arrayCombinations(arr, k)arrayCombinations([1,2,3], 2)→[[1,2],[1,3],[2,3]]arrayPermutations(arr)arrayPermutations([1,2,3])→[[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]arrayPartialPermutations(arr, k)arrayPartialPermutations([1,2,3], 2)→[[1,2],[1,3],[2,1],[2,3],[3,1],[3,2]]Implementation (2 files, ~250 lines)
arrayCombinations.cpp— Iterative index-based algorithm for generating C(n,k) combinationsarrayPermutations.cpp— Template class handling both full (std::next_permutation) and partial (recursive swap-based) permutationsarrayShinglespattern:Array(T)→Array(Array(T))with dual offset managementTest plan
04043_array_combinatorics.sqlwith 12 test cases:Closes: #43175
Note
Medium Risk
Adds new array combinatorics functions that can generate very large result sets; while guarded by element caps and tests, it touches query execution paths and could impact memory/CPU usage and edge-case correctness.
Overview
Adds three new array functions:
arrayCombinations(arr, k),arrayPermutations(arr), andarrayPartialPermutations(arr, k), each returningArray(Array(T))results generated per input row.The implementations include argument validation and a hard cap (~1M total output elements) with
TOO_LARGE_ARRAY_SIZEerrors to prevent OOM, plus new stateless SQL/reference tests covering normal cases, multi-row offsets, and expected error/limit scenarios.Reviewed by Cursor Bugbot for commit 549f6f0. Bugbot is set up for automated code reviews on this repo. Configure here.