Add SHUFFLE clause support for SELECT#100305
Conversation
|
Hi @alexey-milovidov , if you have time, please take a look. if need anything more, please do let me know |
Switch `SHUFFLE LIMIT` to bounded reservoir sampling, add the experimental `allow_experimental_shuffle_query` setting, update docs/tests, and fix style issues from the CI report. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100305&sha=latest&name_0=PR&name_1=Style+check
Remove the global alias restriction for `SHUFFLE` so disabled mode keeps legacy alias parsing, and update regression tests to use unambiguous clause forms. PR comment: ClickHouse#100305
Apply the optimization only for positive limits, add to , normalize , and add a stateless regression test for . PR comment: ClickHouse#100305
Add to the first runnable example so it works under default settings. PR comment: ClickHouse#100305
Keep alias-behavior checks in and move the expected query into a dedicated error-only test so does not fail the whole file with return code . Fast test report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100305&sha=0095e031faec3babd928dbe81c9b6c436a116255&name_0=PR&name_1=Fast%20test&name_2=Tests
- Remove trailing whitespace on line 633 of `CommonParsers.h`
- Move `-- { serverError SUPPORT_IS_DISABLED }` inline with the semicolon in `04054_shuffle_disabled_error.sql` so the test runner correctly associates the expected error with the query
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100305&sha=12f6e74bde27f51c8223a68221c984c4581c3ce1&name_0=PR&name_1=Fast%20test
When `allow_experimental_shuffle_query = 1` but `allow_experimental_analyzer = 0`, `TreeRewriter` previously passed the guard and let the query proceed. The old planner has no `ShuffleStep`, so `SHUFFLE` was silently ignored and non-randomized results were returned — a correctness bug. Fix: after the `SUPPORT_IS_DISABLED` gate, always throw `NOT_IMPLEMENTED` in the old planner path, matching the pattern used for `LIMIT BY ALL`. Add `04055_shuffle_old_interpreter` to cover the setting-enabled / old-planner case.
|
Hi @alexey-milovidov @shankar-iyer is anything more is left to review |
|
@xuir268 Expect delay please, people out on a business trip at the moment. |
Explain `SHUFFLE` primarily as result row order randomization, make `AS SHUFFLE` alias usage explicit, and describe `SHUFFLE LIMIT n` as the random-subset form without overemphasizing sampling terminology. PR comment: ClickHouse#100305
|
Hi @alexey-milovidov @shankar-iyer is anything more is left to review |
I will catch up on this PR this week. |
|
Is anything more left @rschu1ze |
|
|
I really appreciate your work @xuir268 but I will need to close this PR for these reasons (sorry):
For this niche advantage, it seems not worth to add and maintain many hundreds LOC of code going forward. |
|
@rschu1ze @shankar-iyer , Thanks for the review and for clarifying the final decision. I want to be direct about one process issue: this outcome is frustrating because SHUFFLE had been mentioned in the roadmap, and the earlier review discussion focused on implementation details, tests, benchmarks, and documentation rather than on whether the feature itself should exist at all. If the conclusion is that a ClickHouse-specific SHUFFLE clause is not worth the long-term maintenance cost because ORDER BY rand() [LIMIT n] already covers the user-visible functionality, that would have been important to state much earlier. I have to be honest: I do not think the review process was handled well here, because the product-level objection appears only after substantial implementation, benchmarking, testing, and documentation work had already been done. I would appreciate it if future feature reviews call out that kind of product-level objection as early as possible. That would save a lot of unnecessary implementation and review time on both sides. If some of the execution-side work here is still useful independently, for example around improving large-n random-row selection behavior, I would also appreciate guidance on whether that should be pursued in a different form. |
|
@xuir268 Thanks for the feedback, I understand the frustration. I could actually not find SHUFFLE on the 2024, the 2025 or the 2026 roadmaps. DId I miss something?
Fair enought and sorry again. This PR didn't have an associated linked ticket initially and I somehow missed its motivation and context in the beginning.
I'll check back if this makes sense, thanks! |
|
#87836 is the list of intern topics. These are technically not part of the roadmap because of their experimental nature.
Feel free to pick any open ticket, we are experimenting with AI and getting lots of automatic bug reports (e.g. |
|
Shuffle is obviously more performant than |
|
@shankar-iyer you can review the MR |
My concern is still the memory required for a large LIMIT. I am compiling the PR and will give it a spin. |
|
I took the PR on my VM (48 CPU) and ran on a not too big dataset (100M rows). At There is a simple optimization possible in Can you reiterate again when the SHUFFLE implementation in this PR would perform better? Structurally, I feel difficult for a single pass algorithm to perform allround like 2-pass lazy materialization. |
|
@shankar-iyer I updated the Implementation-wise, Local 10M-row
So with the updated semantics, |
|
@xuir268 Checking.. |
Thanks for working through the iterations on this — the design exploration has been useful and our last 2 conversations have made the actual win clear, and I think the scope should shrink to match. Looking at the current state: for So the real optimization the work has surfaced is narrow but genuine: letting the final Given how many revisions we've gone through here, I think the cleanest path forward is to close this PR and open a focused replacement PR that does just:
|
|
@shankar-iyer okay , opening dedicating MR for this focused replacement PR that does just: Pattern-match ORDER BY rand() LIMIT n in the planner. |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add the
SHUFFLEclause forSELECTqueries, including optimizedSHUFFLE LIMITexecution, query plan support, tests, and documentation.Documentation entry for user-facing changes