feat: add NeighborhoodAwareCDLP community detection algorithm by SemyonSinchenko · Pull Request #825 · graphframes/graphframes · GitHub
feat: add NeighborhoodAwareCDLP community detection algorithm#825

Open
SemyonSinchenko wants to merge 4 commits into graphframes:main from
SemyonSinchenko:791-community-detection

Conversation

@SemyonSinchenko
Collaborator

@SemyonSinchenko SemyonSinchenko commented Apr 7, 2026

What changes were proposed in this pull request?

  • Introduce NeighborhoodAwareCDLP: a neighborhood-aware variant of label propagation that weights incoming votes by a combination of direct-link strength (a) and neighborhood overlap (c * commonNeighbors).
  • Add implementation at core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala with:
    • approximate common-neighbor estimation using theta sketches,
    • parameters for a, c, initial label column, and sketch size,
    • Pregel-based propagation and integration with GraphFrame options.
  • Expose API on GraphFrame as structureAwareLabelPropagation.
  • Add comprehensive unit tests at core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala covering basic propagation, parameter sensitivity, directed/undirected behavior, isolated vertices, and disconnected components.
  • Bump default Spark version from 3.5.7 to 3.5.8 in build.sbt.
  • Note: the theta-sketch based overlap estimation requires Spark >= 4.1; the implementation checks the Spark version and fails fast on older versions.
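Not part of the diff, but for intuition: a toy, pure-Python sketch of theta-style approximate set intersection, which is the idea behind the common-neighbor estimation above (the real implementation uses Spark's Theta sketch SQL functions; all names below are illustrative only).

```python
import hashlib

def uniform_hash(item):
    """Hash an item to a uniform float in [0, 1)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ThetaSketch:
    """Toy theta-style sketch: retain hash values below a threshold theta."""

    def __init__(self, max_retained=1024):
        self.max_retained = max_retained
        self.theta = 1.0
        self.values = set()

    def update(self, item):
        v = uniform_hash(item)
        if v < self.theta:
            self.values.add(v)
            if len(self.values) > self.max_retained:
                # shrink theta to the largest retained value and drop it
                self.theta = max(self.values)
                self.values.discard(self.theta)

    def estimate(self):
        """Estimated cardinality of the sketched set."""
        return len(self.values) / self.theta

def estimate_common_neighbors(sketch_a, sketch_b):
    """Estimate |A ∩ B|: count shared hashes below the smaller theta."""
    theta = min(sketch_a.theta, sketch_b.theta)
    common = {v for v in sketch_a.values & sketch_b.values if v < theta}
    return len(common) / theta

# Below capacity the sketch is exact: neighbor sets of two vertices
src_neighbors = ThetaSketch()
dst_neighbors = ThetaSketch()
for n in range(50):
    src_neighbors.update(n)
for n in range(25, 75):
    dst_neighbors.update(n)

print(estimate_common_neighbors(src_neighbors, dst_neighbors))  # 25.0 (items 25..49)
```

While the sets fit within the retained capacity the estimate is exact (theta stays 1.0); for large neighbor sets the retained sample gives an approximation whose accuracy is controlled by the sketch size, which is what the `lgNomEntries` parameter tunes.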

Why are the changes needed?

The current CDLP is quite "basic", but it is well optimized for its own problem and I do not want to break it. The new implementation is mostly based on https://arxiv.org/pdf/1105.3264, with my own adjustments.

[image]

(in the picture, c=0 is classical CDLP)

Closes #791
Closes #301
Closes #456 (partially?)

Python?

After I get approval on the core, I will add Python.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new community detection algorithm to GraphFrames: a neighborhood-aware variant of label propagation that weights label “votes” using a direct-link term plus an approximate common-neighbor overlap term (Theta sketches), and exposes it via the GraphFrame API.

Changes:

  • Introduces NeighborhoodAwareCDLP implementation using Pregel and Spark 4.1+ Theta sketch SQL functions.
  • Exposes the algorithm on GraphFrame as structureAwareLabelPropagation.
  • Adds a new Scala test suite for correctness/sensitivity cases and bumps the default Spark version to 3.5.8.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

File Description
core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Implements NeighborhoodAwareCDLP (weighted label propagation with sketch-based overlap) and integrates with Pregel/options.
core/src/main/scala/org/graphframes/GraphFrame.scala Adds a public entrypoint structureAwareLabelPropagation.
core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala Adds unit tests covering propagation behavior, parameter effects, directionality, and edge cases.
build.sbt Bumps default Spark version from 3.5.7 to 3.5.8.


Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala Outdated
SemyonSinchenko and others added 3 commits April 7, 2026 23:16
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@codecov-commenter


Codecov Report

❌ Patch coverage is 0% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.73%. Comparing base (a28a4e8) to head (7845f97).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...la/org/graphframes/lib/NeighborhoodAwareCDLP.scala 0.00% 57 Missing ⚠️
...re/src/main/scala/org/graphframes/GraphFrame.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #825      +/-   ##
==========================================
- Coverage   80.75%   79.73%   -1.02%     
==========================================
  Files          78       79       +1     
  Lines        4421     4486      +65     
  Branches      543      545       +2     
==========================================
+ Hits         3570     3577       +7     
- Misses        851      909      +58     


*
* Valid range is `[4, 24]`. Default: `12`.
*/
def setLgNomEntries(value: Int): this.type = {
Collaborator


could become a mixin.

Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala
/** Return true if the major and minor versions are greater or eq to constraints */
def requireSparkVersionGT(major: Int, minor: Int, sparkVersion: String): Boolean = {
val (gotMajor, gotMinor) = TestUtils.majorMinorVersion(sparkVersion)
(gotMajor >= major) && (gotMinor >= minor)
Collaborator


Suggested change
(gotMajor >= major) && (gotMinor >= minor)
gotMajor > major || (gotMajor == major && gotMinor >= minor)

also change the name to requireSparkVersionGE
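To see why the suggested change matters, here is a quick illustrative Python sketch of the GE check (mirroring the Scala names from the diff, but not the actual implementation):

```python
def require_spark_version_ge(major, minor, spark_version):
    """True if spark_version is >= major.minor (proper component-wise compare)."""
    got_major, got_minor = (int(p) for p in spark_version.split(".")[:2])
    return got_major > major or (got_major == major and got_minor >= minor)

# The original `(gotMajor >= major) && (gotMinor >= minor)` would wrongly
# reject e.g. Spark 5.0 against a >= 4.1 constraint, because 0 < 1.
print(require_spark_version_ge(4, 1, "5.0.0"))  # True
print(require_spark_version_ge(4, 1, "4.0.2"))  # False
print(require_spark_version_ge(4, 1, "4.1.0"))  # True
```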

@@ -0,0 +1,303 @@
package org.graphframes.lib
Collaborator


Apache header? Do we care about these?

* Default: `0.5`.
*/
def setC(value: Double): this.type = {
require(value >= 0.0 && value <= 1.0, "c must be in [0,1]")
Collaborator


what is the importance of keeping these between 0 and 1?

Collaborator Author


I will be honest: I do not know. In the paper this is just 1, without an option to change it. But after some thinking I decided it would be nice to keep it flexible. We could allow values > 1, but I'm not sure how that would work.

*
* Default: `0.5`.
*/
def setC(value: Double): this.type = {
Collaborator


should c and a have more descriptive names? Could be helpful for users; most of them won't be familiar with the paper


// Compute approximate common neighbor counts on edges and materialize.
val enrichedEdges =
computeEdgeApproxCommonNeighbors(edges, lgNomEntries, c, a)
Collaborator


should we make sure !(a == 0 && c == 0)?

Collaborator Author


c == 0 is a valid case: it is just regular CDLP.
a == 0 is a valid case as well: it is a kind of structure-based community detection.

Collaborator


But they shouldn’t BOTH be zero

with Logging {

private var c: Double = 0.5
private var a: Double = 1.0
Collaborator


why does the parameter a really matter? Isn't the only thing that matters the ratio of a to c, which can be managed with various values of c?

Collaborator Author


In the classical CDLP we choose the new community as the most common one across neighbors, which means that classical CDLP treats each connection equally.

This algorithm chooses the new community as the most "weighted" one across neighbors, where a community's weight is the sum of edge weights to nodes from that community, and the edge weight itself is $w_{i,j} = a + c \cdot | N_j \cap N_i |$

In human language:

  • a -- a constant that shows how important the direct connection is: 0 means we ignore direct connections entirely, and 1.0 means we treat connections like in classical CDLP
  • c -- a constant that shows how important common neighbors are: 0 means we ignore common neighbors entirely, and 1.0 means we weight each common neighbor equally to a direct connection

There is a test case for different a and c
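For intuition, a minimal pure-Python sketch of one vertex's voting step under this weighting (not the Pregel implementation; with c=0 it reduces to a classical CDLP majority vote):

```python
from collections import defaultdict

def pick_label(neighbors, a, c):
    """One vote: `neighbors` is a list of (label, common_neighbor_count) pairs.

    Each neighbor contributes weight a + c * common_neighbors to its label;
    the label with the highest total weight wins.
    """
    scores = defaultdict(float)
    for label, common_neighbors in neighbors:
        scores[label] += a + c * common_neighbors
    return max(scores, key=scores.get)

# Three neighbors vote "A" with few shared neighbors, two vote "B" with many.
votes = [("A", 0), ("A", 0), ("A", 1), ("B", 4), ("B", 3)]
print(pick_label(votes, a=1.0, c=0.0))  # 'A'  -- classical CDLP: simple majority
print(pick_label(votes, a=1.0, c=0.5))  # 'B'  -- neighborhood overlap flips the vote
```

The second call shows the point of the algorithm: a minority of structurally well-embedded neighbors (high overlap) can outweigh a bare majority of weakly connected ones.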

Collaborator


Maybe we should have different parameters:

  • the ratio a to c
  • bool for ignore c.
  • bool for ignore a. (I'm not convinced this is a real case we need).

I feel taking the weight as a ratio better models what people want: the relative weight of the direct relation versus a neighbor-of-neighbor relation.

Opinions?

Collaborator Author


The ratio means we assume something like $w_{i,j} = \lambda + (1 - \lambda) \cdot | N_i \cap N_j |$; I was thinking about it, and it is a different story. For example, in the paper the authors try different c with a constant a = 1

Collaborator


with $\lambda = c/a$, the equation would be $w_{i,j} = a + \lambda \cdot | N_i \cap N_j |$ where $a \in \{0, 1\}$

Aside from $a = 0$, there's no reason to allow modifying values of $a$, because you can accomplish the exact same result with some value of $c$. So a boolean $ignoreA$ accomplishes the same thing as allowing $a$ to be set in $(0, 1)$.

You actually don't need $ignoreC$ at all, because it is equivalent to $\lambda = 0$

Collaborator Author


So what would be your suggestion? I do not like the ratio idea, tbh. We could remove a entirely and have the formula from the paper with only c, but I would keep the current flow. I like the "sklearn way": everything is configurable, but you won't touch most of the parameters because they have safe defaults. I like the idea of just having both a and c with the default a = 1

Collaborator

@james-willis james-willis Apr 17, 2026


my proposal is to expose `setC: float` and `ignoreA: bool`. `setC` solves for all relative weights of a and c; `ignoreA` makes a = 0

but I will not block this PR if you disagree and want to keep setA and setC.

Collaborator Author


And in this case we could have `ignoreDirectLinks` instead of `ignoreA`, and something like `structuralSimilarityMultiplier` instead of `c`. What do you think?

Collaborator


ignoreDirectLinks is a good name.

`structuralSimilarityMultiplier` I am not as sure of, but I do not have a better name.

with WithDirection
with Logging {

private var c: Double = 0.5
Collaborator


why 0.5?

Collaborator Author


It is like a safe default.

Comment on lines +110 to +111
* This is the `a` base term in edge weighting: {{ edgeWeight(src, dst) = a + c *
* commonNeighbors(src, dst) }}
Collaborator


Claude says you need triple braces (`{{{ ... }}}`) for Scaladoc code blocks

* Sets weight for the neighborhood-overlap signal (common neighbors).
*
* This is the `c` term in edge weighting: {{ edgeWeight(src, dst) = a + c *
* commonNeighbors(src, dst) }} where `commonNeighbors(src, dst)` is the (approximate) number of
Collaborator


Claude says you need triple braces (`{{{ ... }}}`) for Scaladoc code blocks

private val EDGE_WEIGHT_COL = "edge_weight"

private def aggregateMessages(msgCol: Column, idType: DataType): Column = reduce(
collect_list(msgCol),
Collaborator


collect_list can cause OOMs when there are a lot of neighbors, but the alternative is a lot of work: a native Catalyst UDAF to do the reduction.

Up to you if you want to do the work.

Collaborator Author


As I remember, you dislike custom UDAFs because they do not work with "accelerators" (Photon, Gluten, Comet). I would be happy to use a native UDAF here and in CDLP too, but we need to reconsider the approach to native Catalyst first.

Collaborator


Yeah, sometimes.

collect_list really sucks when you want a fold/reduce aggregation and the cardinality of the group can be high.

Spark really should have a good fold or reduce UDAF; then Comet etc. could just reimplement the method.
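The memory concern can be illustrated outside Spark: collecting all messages before reducing holds O(degree) values per group, while a streaming fold only holds the accumulator (a hypothetical sketch of the two shapes, not the Catalyst UDAF itself):

```python
from collections import Counter

def collect_then_reduce(messages):
    """collect_list-style: materialize every message, then reduce."""
    collected = list(messages)        # O(n) memory per group -- the OOM risk
    counts = Counter(collected)
    return counts.most_common(1)[0][0]

def streaming_fold(messages):
    """UDAF-style: fold messages into an accumulator one at a time.

    Memory is bounded by the number of distinct labels seen, not by the
    number of incoming messages (the vertex degree).
    """
    counts = Counter()
    for msg in messages:
        counts[msg] += 1
    return counts.most_common(1)[0][0]

labels = ["A", "B", "A", "C", "A", "B"]
print(collect_then_reduce(labels))  # 'A'
print(streaming_fold(labels))       # 'A'
```

Both produce the same majority label; only the peak memory per group differs, which is exactly the trade-off discussed above.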

Comment on lines +243 to +247
Collaborator


In a directed graph, this defines a neighbor as one that exists on an outbound edge.

Is this the correct definition? Is there any reason to use some other definition of common neighbor?

The paper is all about undirected graphs, right? So should we clarify what we do for directed graphs?

Sorry if the answers are obvious to someone with more graph experience.

Collaborator Author


It is an interesting question, actually. The undirected case is handled in L175-182 and there is no problem there (it doesn't matter which one to take, src or dst). But your question about the directed case is a good catch! Let me think about what we should do here: group by src and aggregate dst, or group by dst and aggregate src...
