[SPARK-46367][SQL] Support narrowing projection of KeyedPartitioning in PartitioningPreservingUnaryExecNode #55519
Draft
peter-toth wants to merge 1 commit into apache:master from
…in PartitioningPreservingUnaryExecNode

### What changes were proposed in this pull request?

When a `KeyedPartitioning` passes through a `PartitioningPreservingUnaryExecNode` (e.g. `ProjectExec`), the previous implementation projected the partitioning as a whole expression via `multiTransformDown`. If any expression position could not be mapped to an output attribute, the entire `KeyedPartitioning` was silently dropped, resulting in `UnknownPartitioning`.

This PR replaces that approach with a per-position projection algorithm implemented in two new private helpers (`projectKeyedPartitionings` and `projectOtherPartitionings`), with the main `outputPartitioning` reduced to a simple split, project, and combine:

1. For each expression position (0..N-1), collect the unique expressions at that position across all input `KeyedPartitioning`s (using `ExpressionSet` to deduplicate semantically equal expressions), then project each through the output aliases via `projectExpression`.
2. Positions with at least one projected alternative are *projectable*; they define the maximum achievable granularity. Positions that cannot be expressed in the output are dropped (narrowing).
3. The shared `partitionKeys` are projected to the subset of projectable positions via `KeyedPartitioning.projectKeys`.
4. The final `KeyedPartitioning`s are the cross-product of per-position alternatives, computed lazily via `MultiTransform.generateCartesianProduct`, deduplicated, and bounded by a single outer `take(aliasCandidateLimit)`.

All resulting `KeyedPartitioning`s at the same granularity share the same `partitionKeys` object, preserving the invariant required by `GroupPartitionsExec`.

### Why are the changes needed?

Without narrowing, a `ProjectExec` that drops any one of a multi-column partition key causes the entire `KeyedPartitioning` to be lost. This breaks storage-partitioned join optimisations that rely on the partitioning surviving projection.
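
For readers who want to see the four projection steps in miniature, here is a rough, dependency-free sketch in plain Scala (2.13+). It is not the code in this PR: expressions are modelled as plain column-name strings rather than Catalyst expressions, the `aliases` map (input column to the output names that expose it) is a hypothetical stand-in for `projectExpression`, `cartesian` stands in for `MultiTransform.generateCartesianProduct`, and `limit` plays the role of `aliasCandidateLimit`. The sketch also assumes every input has the same arity and shares one `partitionKeys` list.

```scala
object NarrowingSketch {
  // Simplified stand-in: expressions are column names, not Catalyst expressions.
  final case class KeyedPartitioning(expressions: Seq[String], partitionKeys: Seq[Seq[Any]])

  // Lazy cartesian product of the per-position alternatives
  // (hypothetical stand-in for MultiTransform.generateCartesianProduct).
  def cartesian(alternatives: Seq[Seq[String]]): LazyList[Seq[String]] =
    alternatives.foldRight(LazyList(Seq.empty[String])) { (alts, acc) =>
      for (a <- alts.to(LazyList); rest <- acc) yield a +: rest
    }

  def projectKeyedPartitionings(
      inputs: Seq[KeyedPartitioning],
      aliases: Map[String, Seq[String]], // hypothetical: input column -> output names
      limit: Int): Seq[KeyedPartitioning] = {
    val arity = inputs.headOption.map(_.expressions.length).getOrElse(0)

    // Step 1: per position, collect the distinct input expressions and
    // project each one through the output aliases.
    val perPosition: Seq[Seq[String]] = (0 until arity).map { i =>
      inputs.map(_.expressions(i)).distinct
        .flatMap(e => aliases.getOrElse(e, Seq.empty))
        .distinct
    }

    // Step 2: positions with at least one alternative are projectable;
    // the remaining positions are dropped (narrowing).
    val projectable: Seq[Int] = perPosition.indices.filter(i => perPosition(i).nonEmpty)

    if (projectable.isEmpty) {
      Seq.empty
    } else {
      // Step 3: narrow the shared keys once, so every result below
      // references the same key list object.
      val sharedKeys: Seq[Seq[Any]] =
        inputs.head.partitionKeys.map(key => projectable.map(i => key(i)))

      // Step 4: lazy cross-product, deduplicated, bounded by one outer take.
      cartesian(projectable.map(i => perPosition(i)))
        .distinct
        .take(limit)
        .map(exprs => KeyedPartitioning(exprs, sharedKeys))
        .toList
    }
  }
}
```

With an input partitioned on `(a, b)` and a projection that keeps `a` under the names `a` and `a2` but drops `b`, the sketch yields two single-column partitionings, on `a` and on `a2`, that reference one narrowed key list. This mirrors the "2->1 narrowing with alias" case, only in this simplified model.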
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `ProjectedOrderingAndPartitioningSuite` covering:

- Full-granularity alias substitution (existing behaviour, unchanged)
- 2->1 narrowing without aliases
- 2->1 narrowing with alias, verifying shared `partitionKeys` object identity
- 3->2 narrowing with alias
- `PartitioningCollection` where one KP can be fully projected and another cannot

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6
433d560 to 0b3e7bc