feat: add NeighborhoodAwareCDLP community detection algorithm (#825)
SemyonSinchenko wants to merge 4 commits into graphframes:main
Conversation
- Introduce `NeighborhoodAwareCDLP`: a neighborhood-aware variant of label propagation that weights incoming votes by a combination of direct-link strength (`a`) and neighborhood overlap (`c * commonNeighbors`).
- Add the implementation at `core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala` with:
  - approximate common-neighbor estimation using Theta sketches,
  - parameters for `a`, `c`, the initial label column, and the sketch size,
  - Pregel-based propagation and integration with GraphFrame options.
- Expose the API on `GraphFrame` as `structureAwareLabelPropagation`.
- Add comprehensive unit tests at `core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala` covering basic propagation, parameter sensitivity, directed/undirected behavior, isolated vertices, and disconnected components.
- Bump the default Spark version from 3.5.7 to 3.5.8 in `build.sbt`.
- Note: the Theta-sketch based overlap estimation requires Spark >= 4.1; the implementation checks the Spark version and fails fast on older versions.
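The edge-weighting idea in the summary above can be sketched in miniature. This is a hypothetical plain-Scala illustration (names invented here, exact common-neighbor counts instead of the Theta-sketch estimates the PR actually uses):

```scala
// Hypothetical miniature of the edge-weighting rule described above:
//   edgeWeight(src, dst) = a + c * commonNeighbors(src, dst)
// using exact common-neighbor counts on a small adjacency map.
object EdgeWeightSketch {
  // Count neighbors shared by u and v in an undirected adjacency map.
  def commonNeighbors(adj: Map[Long, Set[Long]], u: Long, v: Long): Int =
    (adj.getOrElse(u, Set.empty) & adj.getOrElse(v, Set.empty)).size

  // The weight of the edge (u, v) given the two parameters a and c.
  def edgeWeight(adj: Map[Long, Set[Long]], a: Double, c: Double)(u: Long, v: Long): Double =
    a + c * commonNeighbors(adj, u, v)
}
```

With `c = 0` every edge gets the same weight `a`, which degenerates to classical CDLP voting; larger `c` makes edges embedded in dense shared neighborhoods count for more.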
Pull request overview
Adds a new community detection algorithm to GraphFrames: a neighborhood-aware variant of label propagation that weights label “votes” using a direct-link term plus an approximate common-neighbor overlap term (Theta sketches), and exposes it via the GraphFrame API.
Changes:
- Introduces the `NeighborhoodAwareCDLP` implementation using Pregel and Spark 4.1+ Theta sketch SQL functions.
- Exposes the algorithm on `GraphFrame` as `structureAwareLabelPropagation`.
- Adds a new Scala test suite for correctness/sensitivity cases and bumps the default Spark version to 3.5.8.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.
Codecov Report: ❌ Patch coverage is

```
@@ Coverage Diff @@
##             main     #825      +/-   ##
==========================================
- Coverage   80.75%   79.73%   -1.02%
==========================================
  Files          78       79       +1
  Lines        4421     4486      +65
  Branches      543      545       +2
==========================================
+ Hits         3570     3577       +7
- Misses        851      909      +58
```
```scala
 *
 * Valid range is `[4, 24]`. Default: `12`.
 */
def setLgNomEntries(value: Int): this.type = {
```
This could become a mixin.
```scala
/** Return true if the major and minor versions are greater or eq to constraints */
def requireSparkVersionGT(major: Int, minor: Int, sparkVersion: String): Boolean = {
  val (gotMajor, gotMinor) = TestUtils.majorMinorVersion(sparkVersion)
  (gotMajor >= major) && (gotMinor >= minor)
```
Suggested change:

```scala
gotMajor > major || (gotMajor == major && gotMinor >= minor)
```

The component-wise check `(gotMajor >= major) && (gotMinor >= minor)` wrongly rejects, for example, Spark 4.0 against a (3, 5) constraint, because `0 >= 5` fails even though the major version is newer. Also change the name to `requireSparkVersionGE`, since the check is "greater or equal".
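The suggestion can be sanity-checked with a small standalone sketch. This is pure Scala with the version comparison inlined; `buggyGE`/`fixedGE` are made-up names, not the PR's helpers:

```scala
// Hypothetical standalone comparison of the two strategies discussed above.
object VersionCheckSketch {
  // Buggy: component-wise comparison rejects (4, 0) against a (3, 5) floor.
  def buggyGE(major: Int, minor: Int, got: (Int, Int)): Boolean =
    got._1 >= major && got._2 >= minor

  // Fixed: lexicographic comparison, matching the suggested change.
  def fixedGE(major: Int, minor: Int, got: (Int, Int)): Boolean =
    got._1 > major || (got._1 == major && got._2 >= minor)
}
```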
```scala
package org.graphframes.lib
```
Apache header? Do we care about these?
```scala
 * Default: `0.5`.
 */
def setC(value: Double): this.type = {
  require(value >= 0.0 && value <= 1.0, "c must be in [0,1]")
```
What is the importance of keeping these between 0 and 1?
I will be honest: I do not know. In the paper this is just 1, without an option to change it. But after some thinking I decided it would be nice to keep it flexible. We could allow values > 1, but I'm not sure how that would work.
```scala
 *
 * Default: `0.5`.
 */
def setC(value: Double): this.type = {
```
Should `c` and `a` have more descriptive names? That could be helpful for users; most of them won't be familiar with the paper.
```scala
// Compute approximate common neighbor counts on edges and materialize.
val enrichedEdges =
  computeEdgeApproxCommonNeighbors(edges, lgNomEntries, c, a)
```
Should we make sure that `!(a == 0 && c == 0)`?
`c == 0` is a valid case: it is just regular CDLP.
`a == 0` is a valid case as well: it is a kind of structure-based community detection.
But they shouldn't BOTH be zero.
```scala
    with Logging {

  private var c: Double = 0.5
  private var a: Double = 1.0
```
Why does the parameter `a` really matter? Isn't the only thing that matters the ratio of `a` to `c`, which can be managed with various values of `c`?
In classical CDLP we choose the new community as the most common one across neighbors. That means classical CDLP treats each connection equally.

This algorithm chooses the new community as the most "weighted" one across neighbors, where the weight of a community is the sum of edge weights to nodes from that community, and the edge weight itself is `a + c * commonNeighbors(src, dst)`.

In human language:
- `a` is a constant that shows how important a direct connection is: 0 means we ignore direct connections entirely, and 1.0 means we treat connections like CDLP does.
- `c` is a constant that shows how important common neighbors are: 0 means we ignore common neighbors entirely, and 1.0 means we count each common neighbor equally to a direct connection.

There is a test case for different `a` and `c`.
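As an illustration only (a made-up miniature, not the PR's Pregel code), the weighted vote described here could look like this in plain Scala:

```scala
// Hypothetical sketch of the weighted label vote.
// Each incoming message carries the neighbor's current label and the
// precomputed edge weight a + c * commonNeighbors(src, dst).
object WeightedVoteSketch {
  final case class Msg(label: Long, weight: Double)

  // Pick the label with the largest total incoming weight;
  // ties broken by the smallest label (one possible deterministic rule).
  def chooseLabel(msgs: Seq[Msg]): Long =
    msgs
      .groupBy(_.label)
      .map { case (label, ms) => (label, ms.map(_.weight).sum) }
      .toSeq
      .maxBy { case (label, total) => (total, -label) }
      ._1
}
```

With every weight equal to 1.0 this reduces to the plain majority vote of classical CDLP.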
Maybe we should have different parameters:
- the ratio of `a` to `c`,
- a bool to ignore `c`,
- a bool to ignore `a` (I'm not convinced this is a real case we need).

I feel taking the weight as a ratio better models what people want: the relative weight of the direct relation and a neighbor-of-neighbor relation.
Opinions?
The ratio parameterization means we assume something like the current `c` with a constant `a = 1`.
So what would be your suggestion? I do not like the ratio idea, to be honest. We could remove `a` entirely and have the formula from the paper with only `c`, but I would keep the current flow. I like the "sklearn way": everything is configurable, but you won't touch most of the parameters because they have safe defaults. I like the idea of just having both `a` and `c` with the default `a = 1`.
My proposal is to expose `setC: Float` and `ignoreA: Boolean`. `setC` solves for all relative weights of `a` and `c`; `ignoreA` makes `a = 0`.
But I will not block this PR if you disagree and want to keep `setA` and `setC`.
And in this case we can have `ignoreDirectLinks` instead of `ignoreA`, and something like `structuralSimilarityMultiplier` instead of `c`. What do you think?
`ignoreDirectLinks` is a good name.
`structuralSimilarityMultiplier` I am not as sure of, but I do not have a better name.
```scala
    with WithDirection
    with Logging {

  private var c: Double = 0.5
```
It is like a safe default.
```scala
 * This is the `a` base term in edge weighting: {{ edgeWeight(src, dst) = a + c *
 * commonNeighbors(src, dst) }}
```
Claude says you need triple braces (`{{{ ... }}}`) for Scaladoc code blocks.
```scala
 * Sets weight for the neighborhood-overlap signal (common neighbors).
 *
 * This is the `c` term in edge weighting: {{ edgeWeight(src, dst) = a + c *
 * commonNeighbors(src, dst) }} where `commonNeighbors(src, dst)` is the (approximate) number of
```
Same here: triple braces are needed for the Scaladoc code block.
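A hedged sketch of how the corrected Scaladoc could look with `{{{ ... }}}`; the class and setter body here are stand-ins for illustration, not the PR's actual code:

```scala
// Hypothetical stand-in class demonstrating only the Scaladoc fix.
class EdgeWeightParams {
  private var c: Double = 0.5

  /**
   * Sets weight for the neighborhood-overlap signal (common neighbors).
   *
   * This is the `c` term in edge weighting:
   * {{{
   * edgeWeight(src, dst) = a + c * commonNeighbors(src, dst)
   * }}}
   *
   * Default: `0.5`.
   */
  def setC(value: Double): this.type = {
    require(value >= 0.0 && value <= 1.0, "c must be in [0,1]")
    c = value
    this
  }

  def getC: Double = c
}
```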
```scala
private val EDGE_WEIGHT_COL = "edge_weight"

private def aggregateMessages(msgCol: Column, idType: DataType): Column = reduce(
  collect_list(msgCol),
```
`collect_list` can cause OOMs when there are a lot of neighbors, but the alternative is a lot of work: a Catalyst-native UDAF to do the reduction.
Up to you if you want to do the work.
You dislike custom UDAFs, as I remember, because they do not work with "accelerators" (Photon, Gluten, Comet). I would be happy to use a native UDAF here, and in the CDLP too, but we need to re-consider the approach to native Catalyst first.
Yeah, sometimes.
`collect_list` really sucks when you want a fold/reduce aggregation and the cardinality of the group can be high.
Really, Spark should have a good fold or reduce UDAF; then Comet etc. could just reimplement the method.
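A plain-Scala analogy (not Spark code) of the memory difference under discussion: collect-then-reduce materializes every message first, while a fold keeps only the running accumulator.

```scala
// Hypothetical analogy in plain Scala collections, not Spark.
object FoldVsCollectSketch {
  // collect-then-reduce: builds the full intermediate list first,
  // like collect_list followed by reduce over the array column.
  def viaCollect(msgs: Iterator[Long]): Long = {
    val all = msgs.toList // materializes every element of the group
    all.reduce((x, y) => math.max(x, y))
  }

  // fold: a single accumulator, constant memory in the group size.
  def viaFold(msgs: Iterator[Long]): Long =
    msgs.foldLeft(Long.MinValue)((acc, x) => math.max(acc, x))
}
```

Both compute the same result; only the peak memory differs, which is why a native fold/reduce UDAF would help for high-cardinality groups.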
In a directed graph, this defines a neighbor as one that exists on an outbound edge.
Is this the correct definition? Is there any reason to use some other definition of common neighbor?
The paper is all about undirected graphs, right? So should we clarify what we do for directed graphs?
Sorry if these answers are obvious to someone with more graph experience.
It is an interesting question, actually. The undirected case is handled in L175-182 and there is no problem (it doesn't matter which one to take, src or dst). But your question about the directed case is a good catch! Let me think about what would be better here: group by src and aggregate dst, or group by dst and aggregate src...
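To make the directed question concrete, here is a tiny made-up example (illustration only, not the PR's code) where the out-neighbor and in-neighbor definitions of "common neighbor" disagree:

```scala
// Hypothetical sketch: common neighbors under two directed definitions.
object CommonNeighborsSketch {
  type Edge = (Long, Long) // (src, dst)

  def outNeighbors(edges: Seq[Edge], v: Long): Set[Long] =
    edges.collect { case (s, d) if s == v => d }.toSet

  def inNeighbors(edges: Seq[Edge], v: Long): Set[Long] =
    edges.collect { case (s, d) if d == v => s }.toSet

  // Common neighbors of u and v via outbound edges (group by src, agg dst).
  def commonOut(edges: Seq[Edge], u: Long, v: Long): Int =
    (outNeighbors(edges, u) & outNeighbors(edges, v)).size

  // Common neighbors of u and v via inbound edges (group by dst, agg src).
  def commonIn(edges: Seq[Edge], u: Long, v: Long): Int =
    (inNeighbors(edges, u) & inNeighbors(edges, v)).size
}
```

For edges 1→3 and 2→3, vertices 1 and 2 share a common out-neighbor but no common in-neighbor, so the choice of definition changes the edge weight in the directed case.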

What changes were proposed in this pull request?
Why are the changes needed?
The current CDLP is very "basic" but well optimized for its own problem. I do not want to break it. The new implementation is mostly based on https://arxiv.org/pdf/1105.3264 with my own adjustments.
(On the picture, c=0 is classical CDLP.)
Close #791
Close #301
Close #456 (partially?)
Python?
After I get an approval on the core, I will add the Python API.