fix: Support string IDs in GraphFrame powerIterationClustering by SemyonSinchenko · Pull Request #773 · graphframes/graphframes · GitHub

fix: Support string IDs in GraphFrame powerIterationClustering #773

Merged
SemyonSinchenko merged 1 commit into graphframes:main from SemyonSinchenko:757-powerIterationClustering-bug
Jan 20, 2026

Conversation

@SemyonSinchenko
Collaborator

What changes were proposed in this pull request?

The powerIterationClustering algorithm requires integral vertex IDs, but the GraphFrame API supports string IDs. Previously, this method would fail when called on a GraphFrame with string IDs. Now, we internally convert string IDs to long integers, run the clustering algorithm, then map the results back to the original string IDs.

Changes:

  • In GraphFrame.powerIterationClustering:

    • For non-integral ID types (e.g., string), use the precomputed indexedEdges (which contain LONG_SRC/LONG_DST long ID columns) to create a temporary edge DataFrame with long IDs.
    • Preserve the optional weight column if specified.
    • Execute PowerIterationClustering on the long-ID edges.
    • Join the results (which have long IDs) back with indexedVertices to map the long cluster IDs back to the original vertex IDs.
    • For integral ID types, the original behavior is unchanged.
  • Added a test powerIterationClustering string ids in GraphFrameSuite to verify correctness with string IDs.
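The index, run, map-back flow listed above can be sketched outside Spark. This is a minimal plain-Python illustration, not the GraphFrames implementation: `cluster_with_any_ids` and `toy_cluster` are hypothetical names, and `toy_cluster` is only a stand-in for an integer-only algorithm (it labels connected components rather than running power iteration clustering).

```python
from typing import Callable, Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable, float]
LongEdge = Tuple[int, int, float]


def toy_cluster(long_edges: List[LongEdge]) -> Dict[int, int]:
    """Hypothetical stand-in for an algorithm that only accepts integral IDs.

    Labels every vertex with the smallest ID in its connected component.
    """
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst, _weight in long_edges:
        if not (isinstance(src, int) and isinstance(dst, int)):
            raise TypeError("integral vertex IDs required")
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Attach the larger root to the smaller one.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)
    return {v: find(v) for v in parent}


def cluster_with_any_ids(
    edges: List[Edge],
    cluster_long_ids: Callable[[List[LongEdge]], Dict[int, int]],
) -> Dict[Hashable, int]:
    # 1. Give each distinct vertex ID a long (int) index, in first-seen order.
    index_of: Dict[Hashable, int] = {}
    for src, dst, _weight in edges:
        for vertex in (src, dst):
            index_of.setdefault(vertex, len(index_of))

    # 2. Build temporary edges keyed by long IDs, keeping the weight.
    long_edges = [(index_of[s], index_of[d], w) for s, d, w in edges]

    # 3. Run the integer-only algorithm on the long-ID edges.
    long_result = cluster_long_ids(long_edges)

    # 4. Map the long vertex IDs in the result back to the original IDs.
    original_of = {i: v for v, i in index_of.items()}
    return {original_of[i]: cluster for i, cluster in long_result.items()}


edges = [("a", "b", 1.0), ("b", "c", 1.0), ("x", "y", 0.5)]
result = cluster_with_any_ids(edges, toy_cluster)
```

Calling `toy_cluster` directly on the string-ID edges raises `TypeError`, which mirrors the failure described for string IDs; the wrapper avoids it by never exposing the original IDs to the algorithm.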

Why are the changes needed?

Close #757

SemyonSinchenko merged commit d269caa into graphframes:main on Jan 20, 2026
5 checks passed


Development

Successfully merging this pull request may close these issues.

bug: powerIterationClustering failes if src or dst columns are not Int

2 participants