Fix inexact partitioned TopK sort pushdown#23301
Conversation
There was a problem hiding this comment.
Why need max, inner.output_partitioning().partition_count() should be enough, sort pushdown will not change partition count.
There was a problem hiding this comment.
Done, changed this to use inner.output_partitioning().partition_count() directly.
| // locally sorted partitions. | ||
| let preserve_partitioning = sort_exec.preserve_partitioning(); | ||
| let needs_global_topk = | ||
| sort_exec.preserve_partitioning() && sort_exec.fetch().is_some(); |
There was a problem hiding this comment.
Nit, duplicated call for sort_exec.preserve_partitioning()
let needs_global_topk = preserve_partitioning && sort_exec.fetch().is_some();
There was a problem hiding this comment.
Done, reused the preserve_partitioning local variable for needs_global_topk.
There was a problem hiding this comment.
We have positive tests now to add spm, may be also add some negative test which will not add spm.
There was a problem hiding this comment.
Done, added a single-partition negative regression test that keeps the standalone TopK without adding SortPreservingMergeExec.
d7f5dd1 to
bc9a56e
Compare
zhuqi-lucas
left a comment
There was a problem hiding this comment.
LGTM now, thanks @xudong963 !

Which issue does this PR close?
Rationale for this change
Inexact sort pushdown keeps the
SortExecbecause the source can optimize for the requested ordering but cannot guarantee exact global ordering. When the rewritten standaloneSortExecis a partition-preserving TopK over multiple partitions, each partition applies a local TopK. A later coalesce can then concatenate those local results, which can violateORDER BY ... LIMITsemantics.For example, with two partitions containing
[1, 100]and[2, 3],ORDER BY a ASC LIMIT 3should return1, 2, 3, but the unmerged local TopK path can return1, 100, 2, 3.What changes are included in this PR?
SortPreservingMergeExecand preserve the TopK fetch on the merge.SortPreservingMergeExec -> SortExecand after inserting the standalone merge, so the newly inserted childSortExecis not processed again as a standalone TopK.Are these changes tested?
Yes:
I also verified the new execution regression fails on unmodified
upstream/main: the plan returns1, 100, 2, 3instead of the expected global TopK result1, 2, 3.Are there any user-facing changes?
Yes. This fixes incorrect results for
ORDER BY ... LIMITqueries when inexact sort pushdown produces a standalone partition-preserving TopK over multiple input partitions.