Fix inexact partitioned TopK sort pushdown by xudong963 · Pull Request #23301 · apache/datafusion · GitHub
Skip to content

Fix inexact partitioned TopK sort pushdown#23301

Open
xudong963 wants to merge 1 commit into
apache:mainfrom
massive-com:fix/standalone-inexact-topk-spm
Open

Fix inexact partitioned TopK sort pushdown#23301
xudong963 wants to merge 1 commit into
apache:mainfrom
massive-com:fix/standalone-inexact-topk-spm

Conversation

@xudong963

@xudong963 xudong963 commented Jul 3, 2026

Copy link
Copy Markdown
Member

Which issue does this PR close?

  • None.

Rationale for this change

Inexact sort pushdown keeps the SortExec because the source can optimize for the requested ordering but cannot guarantee exact global ordering. When the rewritten standalone SortExec is a partition-preserving TopK over multiple partitions, each partition applies a local TopK. A later coalesce can then concatenate those local results, which can violate ORDER BY ... LIMIT semantics.

For example, with two partitions containing [1, 100] and [2, 3], ORDER BY a ASC LIMIT 3 should return 1, 2, 3, but the unmerged local TopK path can return 1, 100, 2, 3.

What changes are included in this PR?

  • Wrap standalone multi-partition inexact partition-preserving TopK pushdown with SortPreservingMergeExec and preserve the TopK fetch on the merge.
  • Stop traversal after rewriting inexact SortPreservingMergeExec -> SortExec and after inserting the standalone merge, so the newly inserted child SortExec is not processed again as a standalone TopK.
  • Add a plan regression test that asserts the final merge is inserted.
  • Add an execution regression test using an inexact executable memory source to assert the global TopK result is returned.

Are these changes tested?

Yes:

I also verified the new execution regression fails on unmodified upstream/main: the plan returns 1, 100, 2, 3 instead of the expected global TopK result 1, 2, 3.

Are there any user-facing changes?

Yes. This fixes incorrect results for ORDER BY ... LIMIT queries when inexact sort pushdown produces a standalone partition-preserving TopK over multiple input partitions.

@github-actions github-actions Bot added optimizer Optimizer rules core Core DataFusion crate labels Jul 3, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need max, inner.output_partitioning().partition_count() should be enough, sort pushdown will not change partition count.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, changed this to use inner.output_partitioning().partition_count() directly.

// locally sorted partitions.
let preserve_partitioning = sort_exec.preserve_partitioning();
let needs_global_topk =
sort_exec.preserve_partitioning() && sort_exec.fetch().is_some();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, duplicated call for sort_exec.preserve_partitioning()

let needs_global_topk = preserve_partitioning && sort_exec.fetch().is_some();

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, reused the preserve_partitioning local variable for needs_global_topk.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have positive tests now to add spm, may be also add some negative test which will not add spm.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added a single-partition negative regression test that keeps the standalone TopK without adding SortPreservingMergeExec.

@xudong963 xudong963 force-pushed the fix/standalone-inexact-topk-spm branch from d7f5dd1 to bc9a56e Compare July 3, 2026 06:53

@zhuqi-lucas zhuqi-lucas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, thanks @xudong963 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate optimizer Optimizer rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants