[SPARK-46367][SQL] Support narrowing projection of `KeyedPartitioning` in `PartitioningPreservingUnaryExecNode` by peter-toth · Pull Request #55519 · apache/spark · GitHub
Skip to content

[SPARK-46367][SQL] Support narrowing projection of KeyedPartitioning in PartitioningPreservingUnaryExecNode#55519

Draft
peter-toth wants to merge 1 commit intoapache:masterfrom
peter-toth:SPARK-46367-keyedpartitioning-projection
Draft

[SPARK-46367][SQL] Support narrowing projection of KeyedPartitioning in PartitioningPreservingUnaryExecNode#55519
peter-toth wants to merge 1 commit intoapache:masterfrom
peter-toth:SPARK-46367-keyedpartitioning-projection

Conversation

@peter-toth
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

When a KeyedPartitioning passes through a PartitioningPreservingUnaryExecNode (e.g. ProjectExec), the previous implementation projected the partitioning as a whole expression via multiTransformDown. If any expression position could not be mapped to an output attribute, the entire KeyedPartitioning was silently dropped, resulting in UnknownPartitioning.

This PR replaces that approach with a per-position projection algorithm implemented in two new private helpers (projectKeyedPartitionings and projectOtherPartitionings), with the main outputPartitioning reduced to a simple split, project, and combine:

  1. For each expression position (0..N-1), collect the unique expressions at that position across all input KeyedPartitionings (using ExpressionSet to deduplicate semantically equal expressions), then project each through the output aliases via projectExpression.
  2. Positions with at least one projected alternative are projectable; they define the maximum achievable granularity. Positions that cannot be expressed in the output are dropped (narrowing).
  3. The shared partitionKeys are projected to the subset of projectable positions via KeyedPartitioning.projectKeys.
  4. The final KeyedPartitionings are the cross-product of per-position alternatives, computed lazily via MultiTransform.generateCartesianProduct, deduplicated, and bounded by a single outer take(aliasCandidateLimit).

All resulting KeyedPartitionings at the same granularity share the same partitionKeys object, preserving the invariant required by GroupPartitionsExec.

Why are the changes needed?

Without narrowing, a ProjectExec that drops any one of a multi-column partition key causes the entire KeyedPartitioning to be lost. This breaks storage-partitioned join optimisations that rely on the partitioning surviving projection.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests in ProjectedOrderingAndPartitioningSuite covering:

  • Full-granularity alias substitution (existing behaviour, unchanged)
  • 2->1 narrowing without aliases
  • 2->1 narrowing with alias, verifying shared partitionKeys object identity
  • 3->2 narrowing with alias
  • PartitioningCollection where one KP can be fully projected and another cannot

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

…in PartitioningPreservingUnaryExecNode

### What changes were proposed in this pull request?

When a `KeyedPartitioning` passes through a `PartitioningPreservingUnaryExecNode`
(e.g. `ProjectExec`), the previous implementation projected the partitioning as a
whole expression via `multiTransformDown`. If any expression position could not be
mapped to an output attribute, the entire `KeyedPartitioning` was silently dropped,
resulting in `UnknownPartitioning`.

This PR replaces that approach with a per-position projection algorithm implemented
in two new private helpers (`projectKeyedPartitionings` and `projectOtherPartitionings`),
with the main `outputPartitioning` reduced to a simple split, project, and combine:

1. For each expression position (0..N-1), collect the unique expressions at that
   position across all input `KeyedPartitioning`s (using `ExpressionSet` to
   deduplicate semantically equal expressions), then project each through the
   output aliases via `projectExpression`.
2. Positions with at least one projected alternative are *projectable*; they define
   the maximum achievable granularity. Positions that cannot be expressed in the
   output are dropped (narrowing).
3. The shared `partitionKeys` are projected to the subset of projectable positions
   via `KeyedPartitioning.projectKeys`.
4. The final `KeyedPartitioning`s are the cross-product of per-position alternatives,
   computed lazily via `MultiTransform.generateCartesianProduct`, deduplicated, and
   bounded by a single outer `take(aliasCandidateLimit)`.

All resulting `KeyedPartitioning`s at the same granularity share the same
`partitionKeys` object, preserving the invariant required by `GroupPartitionsExec`.

### Why are the changes needed?

Without narrowing, a `ProjectExec` that drops any one of a multi-column partition
key causes the entire `KeyedPartitioning` to be lost. This breaks
storage-partitioned join optimisations that rely on the partitioning surviving
projection.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `ProjectedOrderingAndPartitioningSuite` covering:
- Full-granularity alias substitution (existing behaviour, unchanged)
- 2->1 narrowing without aliases
- 2->1 narrowing with alias, verifying shared `partitionKeys` object identity
- 3->2 narrowing with alias
- `PartitioningCollection` where one KP can be fully projected and another cannot

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant