feat: add NeighborhoodAwareCDLP community detection algorithm (#825)
SemyonSinchenko wants to merge 4 commits into graphframes:main
Conversation
- Introduce `NeighborhoodAwareCDLP`: a neighborhood-aware variant of label propagation that weights incoming votes by a combination of direct-link strength (`a`) and neighborhood overlap (`c * commonNeighbors`).
- Add the implementation at `core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala` with:
  - approximate common-neighbor estimation using Theta sketches,
  - parameters for `a`, `c`, the initial label column, and the sketch size,
  - Pregel-based propagation and integration with GraphFrame options.
- Expose the API on `GraphFrame` as `structureAwareLabelPropagation`.
- Add comprehensive unit tests at `core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala` covering basic propagation, parameter sensitivity, directed/undirected behavior, isolated vertices, and disconnected components.
- Bump the default Spark version from 3.5.7 to 3.5.8 in `build.sbt`.
- Note: the Theta-sketch based overlap estimation requires Spark >= 4.1; the implementation checks the Spark version and fails fast on older versions.
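The edge-weighting idea in the summary above can be sketched in miniature. This is a hypothetical plain-Scala illustration (names invented here, exact common-neighbor counts instead of the Theta-sketch estimates the PR actually uses):

```scala
// Hypothetical miniature of the edge-weighting rule described above:
//   edgeWeight(src, dst) = a + c * commonNeighbors(src, dst)
// using exact common-neighbor counts on a small adjacency map.
object EdgeWeightSketch {
  // Count neighbors shared by u and v in an undirected adjacency map.
  def commonNeighbors(adj: Map[Long, Set[Long]], u: Long, v: Long): Int =
    (adj.getOrElse(u, Set.empty) & adj.getOrElse(v, Set.empty)).size

  // The weight of the edge (u, v) given the two parameters a and c.
  def edgeWeight(adj: Map[Long, Set[Long]], a: Double, c: Double)(u: Long, v: Long): Double =
    a + c * commonNeighbors(adj, u, v)
}
```

With `c = 0` every edge gets the same weight `a`, which degenerates to classical CDLP voting; larger `c` makes edges embedded in dense shared neighborhoods count for more.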
Pull request overview
Adds a new community detection algorithm to GraphFrames: a neighborhood-aware variant of label propagation that weights label “votes” using a direct-link term plus an approximate common-neighbor overlap term (Theta sketches), and exposes it via the GraphFrame API.
Changes:
- Introduces the `NeighborhoodAwareCDLP` implementation using Pregel and Spark 4.1+ Theta sketch SQL functions.
- Exposes the algorithm on `GraphFrame` as `structureAwareLabelPropagation`.
- Adds a new Scala test suite for correctness/sensitivity cases and bumps the default Spark version to 3.5.8.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.
Codecov Report: ❌ Patch coverage is

```
@@ Coverage Diff @@
##             main     #825      +/-   ##
==========================================
- Coverage   80.75%   79.73%   -1.02%
==========================================
  Files          78       79       +1
  Lines        4421     4486      +65
  Branches      543      545       +2
==========================================
+ Hits         3570     3577       +7
- Misses        851      909      +58
```
```scala
 *
 * Valid range is `[4, 24]`. Default: `12`.
 */
def setLgNomEntries(value: Int): this.type = {
```
This could become a mixin.
```scala
/** Return true if the major and minor versions are greater or eq to constraints */
def requireSparkVersionGT(major: Int, minor: Int, sparkVersion: String): Boolean = {
  val (gotMajor, gotMinor) = TestUtils.majorMinorVersion(sparkVersion)
  (gotMajor >= major) && (gotMinor >= minor)
```
Suggested change:

```scala
gotMajor > major || (gotMajor == major && gotMinor >= minor)
```

The component-wise check `(gotMajor >= major) && (gotMinor >= minor)` wrongly rejects, for example, Spark 4.0 against a (3, 5) constraint, because `0 >= 5` fails even though the major version is newer. Also change the name to `requireSparkVersionGE`, since the check is "greater or equal".
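The suggestion can be sanity-checked with a small standalone sketch. This is pure Scala with the version comparison inlined; `buggyGE`/`fixedGE` are made-up names, not the PR's helpers:

```scala
// Hypothetical standalone comparison of the two strategies discussed above.
object VersionCheckSketch {
  // Buggy: component-wise comparison rejects (4, 0) against a (3, 5) floor.
  def buggyGE(major: Int, minor: Int, got: (Int, Int)): Boolean =
    got._1 >= major && got._2 >= minor

  // Fixed: lexicographic comparison, matching the suggested change.
  def fixedGE(major: Int, minor: Int, got: (Int, Int)): Boolean =
    got._1 > major || (got._1 == major && got._2 >= minor)
}
```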
```scala
package org.graphframes.lib
```
Apache header? Do we care about these?
```scala
 * Default: `0.5`.
 */
def setC(value: Double): this.type = {
  require(value >= 0.0 && value <= 1.0, "c must be in [0,1]")
```
What is the importance of keeping these between 0 and 1?
I will be honest: I do not know. In the paper this is just 1, without an option to change it. But after some thinking I decided it would be nice to keep it flexible. We could allow values > 1, but I'm not sure how that would work.
```scala
 *
 * Default: `0.5`.
 */
def setC(value: Double): this.type = {
```
Should `c` and `a` have more descriptive names? That could be helpful for users; most of them won't be familiar with the paper.
```scala
// Compute approximate common neighbor counts on edges and materialize.
val enrichedEdges =
  computeEdgeApproxCommonNeighbors(edges, lgNomEntries, c, a)
```
Should we make sure that `!(a == 0 && c == 0)`?
`c == 0` is a valid case: it is just regular CDLP.
`a == 0` is a valid case as well: it is a kind of structure-based community detection.
But they shouldn't BOTH be zero.
```scala
    with Logging {

  private var c: Double = 0.5
  private var a: Double = 1.0
```
Why does the parameter `a` really matter? Isn't the only thing that matters the ratio of `a` to `c`, which can be managed with various values of `c`?
In classical CDLP we choose the new community as the most common one across neighbors. That means classical CDLP treats each connection equally.

This algorithm chooses the new community as the most "weighted" one across neighbors, where the weight of a community is the sum of edge weights to nodes from that community, and the edge weight itself is `a + c * commonNeighbors(src, dst)`.

In human language:
- `a` is a constant that shows how important a direct connection is: 0 means we ignore direct connections entirely, and 1.0 means we treat connections like CDLP does.
- `c` is a constant that shows how important common neighbors are: 0 means we ignore common neighbors entirely, and 1.0 means we count each common neighbor equally to a direct connection.

There is a test case for different `a` and `c`.
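As an illustration only (a made-up miniature, not the PR's Pregel code), the weighted vote described here could look like this in plain Scala:

```scala
// Hypothetical sketch of the weighted label vote.
// Each incoming message carries the neighbor's current label and the
// precomputed edge weight a + c * commonNeighbors(src, dst).
object WeightedVoteSketch {
  final case class Msg(label: Long, weight: Double)

  // Pick the label with the largest total incoming weight;
  // ties broken by the smallest label (one possible deterministic rule).
  def chooseLabel(msgs: Seq[Msg]): Long =
    msgs
      .groupBy(_.label)
      .map { case (label, ms) => (label, ms.map(_.weight).sum) }
      .toSeq
      .maxBy { case (label, total) => (total, -label) }
      ._1
}
```

With every weight equal to 1.0 this reduces to the plain majority vote of classical CDLP.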
Maybe we should have different parameters:
- the ratio of `a` to `c`,
- a bool to ignore `c`,
- a bool to ignore `a` (I'm not convinced this is a real case we need).

I feel taking the weight as a ratio better models what people want: the relative weight of the direct relation and a neighbor-of-neighbor relation.
Opinions?
The ratio parameterization means we assume something like the current `c` with a constant `a = 1`.
So what would be your suggestion? I do not like the ratio idea, to be honest. We could remove `a` entirely and have the formula from the paper with only `c`, but I would keep the current flow. I like the "sklearn way": everything is configurable, but you won't touch most of the parameters because they have safe defaults. I like the idea of just having both `a` and `c` with the default `a = 1`.
My proposal is to expose `setC: Float` and `ignoreA: Boolean`. `setC` solves for all relative weights of `a` and `c`; `ignoreA` makes `a = 0`.
But I will not block this PR if you disagree and want to keep `setA` and `setC`.
And in this case we can have `ignoreDirectLinks` instead of `ignoreA`, and something like `structuralSimilarityMultiplier` instead of `c`. What do you think?
`ignoreDirectLinks` is a good name.
`structuralSimilarityMultiplier` I am not as sure of, but I do not have a better name.
```scala
    with WithDirection
    with Logging {

  private var c: Double = 0.5
```
It is like a safe default.
```scala
 * This is the `a` base term in edge weighting: {{ edgeWeight(src, dst) = a + c *
 * commonNeighbors(src, dst) }}
```
Claude says you need triple braces (`{{{ ... }}}`) for Scaladoc code blocks.
```scala
 * Sets weight for the neighborhood-overlap signal (common neighbors).
 *
 * This is the `c` term in edge weighting: {{ edgeWeight(src, dst) = a + c *
 * commonNeighbors(src, dst) }} where `commonNeighbors(src, dst)` is the (approximate) number of
```
Same here: triple braces are needed for the Scaladoc code block.
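A hedged sketch of how the corrected Scaladoc could look with `{{{ ... }}}`; the class and setter body here are stand-ins for illustration, not the PR's actual code:

```scala
// Hypothetical stand-in class demonstrating only the Scaladoc fix.
class EdgeWeightParams {
  private var c: Double = 0.5

  /**
   * Sets weight for the neighborhood-overlap signal (common neighbors).
   *
   * This is the `c` term in edge weighting:
   * {{{
   * edgeWeight(src, dst) = a + c * commonNeighbors(src, dst)
   * }}}
   *
   * Default: `0.5`.
   */
  def setC(value: Double): this.type = {
    require(value >= 0.0 && value <= 1.0, "c must be in [0,1]")
    c = value
    this
  }

  def getC: Double = c
}
```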
```scala
private val EDGE_WEIGHT_COL = "edge_weight"

private def aggregateMessages(msgCol: Column, idType: DataType): Column = reduce(
  collect_list(msgCol),
```
`collect_list` can cause OOMs when there are a lot of neighbors, but the alternative is a lot of work: a Catalyst-native UDAF to do the reduction.
Up to you if you want to do the work.
You dislike custom UDAFs, as I remember, because they do not work with "accelerators" (Photon, Gluten, Comet). I would be happy to use a native UDAF here, and in the CDLP too, but we need to re-consider the approach to native Catalyst first.
Yeah, sometimes.
`collect_list` really sucks when you want a fold/reduce aggregation and the cardinality of the group can be high.
Really, Spark should have a good fold or reduce UDAF; then Comet etc. could just reimplement the method.
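A plain-Scala analogy (not Spark code) of the memory difference under discussion: collect-then-reduce materializes every message first, while a fold keeps only the running accumulator.

```scala
// Hypothetical analogy in plain Scala collections, not Spark.
object FoldVsCollectSketch {
  // collect-then-reduce: builds the full intermediate list first,
  // like collect_list followed by reduce over the array column.
  def viaCollect(msgs: Iterator[Long]): Long = {
    val all = msgs.toList // materializes every element of the group
    all.reduce((x, y) => math.max(x, y))
  }

  // fold: a single accumulator, constant memory in the group size.
  def viaFold(msgs: Iterator[Long]): Long =
    msgs.foldLeft(Long.MinValue)((acc, x) => math.max(acc, x))
}
```

Both compute the same result; only the peak memory differs, which is why a native fold/reduce UDAF would help for high-cardinality groups.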
In a directed graph, this defines a neighbor as one that exists on an outbound edge.
Is this the correct definition? Is there any reason to use some other definition of common neighbor?
The paper is all about undirected graphs, right? So should we clarify what we do for directed graphs?
Sorry if these answers are obvious to someone with more graph experience.
It is an interesting question, actually. The undirected case is handled in L175-182 and there is no problem (it doesn't matter which one to take, src or dst). But your question about the directed case is a good catch! Let me think about what would be better here: group by src and aggregate dst, or group by dst and aggregate src...
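To make the directed question concrete, here is a tiny made-up example (illustration only, not the PR's code) where the out-neighbor and in-neighbor definitions of "common neighbor" disagree:

```scala
// Hypothetical sketch: common neighbors under two directed definitions.
object CommonNeighborsSketch {
  type Edge = (Long, Long) // (src, dst)

  def outNeighbors(edges: Seq[Edge], v: Long): Set[Long] =
    edges.collect { case (s, d) if s == v => d }.toSet

  def inNeighbors(edges: Seq[Edge], v: Long): Set[Long] =
    edges.collect { case (s, d) if d == v => s }.toSet

  // Common neighbors of u and v via outbound edges (group by src, agg dst).
  def commonOut(edges: Seq[Edge], u: Long, v: Long): Int =
    (outNeighbors(edges, u) & outNeighbors(edges, v)).size

  // Common neighbors of u and v via inbound edges (group by dst, agg src).
  def commonIn(edges: Seq[Edge], u: Long, v: Long): Int =
    (inNeighbors(edges, u) & inNeighbors(edges, v)).size
}
```

For edges 1→3 and 2→3, vertices 1 and 2 share a common out-neighbor but no common in-neighbor, so the choice of definition changes the edge weight in the directed case.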

What changes were proposed in this pull request?
Why are the changes needed?
The current CDLP is very "basic" but well optimized for its own problem. I do not want to break it. The new implementation is mostly based on https://arxiv.org/pdf/1105.3264 with my own adjustments.
(On the picture, c=0 is classical CDLP.)
Close #791
Close #301
Close #456 (partially?)
Python?
After I get an approval on the core, I will add the Python API.