Russell Jurney edited this page Apr 17, 2025 · 6 revisions
The purpose of this document is to reason through the addition of node types to GraphFrames in order to better handle labeled property graph (LPG) data.
GraphFrames without Types
As of version 0.8.4, GraphFrames does not distinguish between types of nodes. Different edge types are supported through the relationship field.
Node required columns: id
Edge required columns: src, dst (relationship is a conventional column that carries the edge type)
While any property of a node can serve as its type, including in features such as network motif finding, there are limitations when dealing with multiple node types in GraphFrames.
Merging Node Types
As described in the Motif Finding Tutorial, representing a labeled property graph (LPG) for motif finding requires adding every field to every node type and then taking the union of the results. No utility does this for you. Users have to work it out themselves, and many will be confused and simply avoid GraphFrames.
from typing import List, Tuple

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame

# Collect the distinct (column name, schema field) pairs across all node DataFrames
all_cols: List[Tuple[str, T.StructField]] = list(
    set(
        list(zip(a.columns, a.schema))
        + list(zip(b.columns, b.schema))
        # ... (one more term per additional DataFrame)
    )
)
all_column_names: List[str] = sorted([x[0] for x in all_cols])

def add_missing_columns(df: DataFrame, all_cols: List[Tuple[str, T.StructField]]) -> DataFrame:
    """Add any missing columns from any DataFrame among several we want to merge."""
    for col_name, schema_field in all_cols:
        if col_name not in df.columns:
            df = df.withColumn(col_name, F.lit(None).cast(schema_field.dataType))
    return df

# Now apply this function to each of your DataFrames to get a consistent schema
a = add_missing_columns(a, all_cols).select(all_column_names)
b = add_missing_columns(b, all_cols).select(all_column_names)
...

# Ensure we got the property merge right...
assert (
    set(a.columns)
    == set(b.columns)
    # ... (one more comparison per additional DataFrame)
)
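To make the effect of this merge concrete, here is a minimal plain-Python sketch that mimics it on lists of dictionaries instead of Spark DataFrames. It is only an illustration of the resulting shape: the helper merge_node_types and the people and companies records are hypothetical, not part of GraphFrames or the tutorial.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_node_types(*node_sets: List[Row]) -> List[Row]:
    """Union several lists of node records, padding absent fields with None."""
    # Collect every field name that appears in any record, in sorted order.
    all_fields = sorted({field for rows in node_sets for row in rows for field in row})
    merged: List[Row] = []
    for rows in node_sets:
        for row in rows:
            # row.get() yields None for fields this node type does not have.
            merged.append({field: row.get(field) for field in all_fields})
    return merged


# Hypothetical node types, invented for illustration.
people = [{"id": "p1", "type": "Person", "name": "Ada"}]
companies = [{"id": "c1", "type": "Company", "ticker": "ACME"}]

nodes = merge_node_types(people, companies)
```

Every merged record ends up with the same set of fields, which is the property the assert in the Spark version checks. The cost is that each node type carries null values for every field that belongs only to other types.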
GraphFrames with Types
The addition of an [optional or required] type field to vertices would work much like relationships for edges.
Node required columns: id, type
Edge required columns: src, dst, relationship
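As a rough sketch of what checking these columns could look like, the plain-Python helper below reports which required columns are absent. The function missing_columns and the require_type flag are hypothetical names chosen for illustration. The flag reflects the open question of whether type should be optional or required.

```python
from typing import Iterable, List

# Column names taken from the proposal above.
EDGE_REQUIRED = ["src", "dst", "relationship"]


def node_required(require_type: bool = True) -> List[str]:
    """Required node columns, with or without the proposed type column."""
    return ["id", "type"] if require_type else ["id"]


def missing_columns(columns: Iterable[str], required: List[str]) -> List[str]:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# A node table that predates the proposal has no type column.
untyped_missing = missing_columns(["id", "name"], node_required())
optional_missing = missing_columns(["id", "name"], node_required(require_type=False))
edge_missing = missing_columns(["src", "dst"], EDGE_REQUIRED)
```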
Type Utilities
Once nodes and edges both have types, there are useful utilities we can build:
GraphFrame.typeDegree(), GraphFrame.typeInDegree() and GraphFrame.typeOutDegree()
Type-aware degree functions that compute the degree of a node partitioned by the relationship types of its edges or by the types of its neighboring nodes, and return these counts in a MapType. It might be useful to compute values for all edge relationships or node types and fill missing types with zeros. This method is recommended in the literature as a replacement for triangle counts when computing clustering coefficients on highly connected graphs.
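The intended semantics can be sketched in plain Python, assuming out-degree only and zero-filling over every relationship or node type seen in the graph. The function names and the toy graph are hypothetical. This is not the proposed GraphFrame API, which would return a MapType column rather than nested dictionaries.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (src, dst, relationship)


def out_degree_by_relationship(node_ids: List[str], edges: List[Edge]) -> Dict[str, Dict[str, int]]:
    """Out-degree of every node, split by edge relationship and zero-filled."""
    relationships = sorted({rel for _, _, rel in edges})
    degrees = {node: {rel: 0 for rel in relationships} for node in node_ids}
    for src, _dst, rel in edges:
        degrees[src][rel] += 1
    return degrees


def out_degree_by_neighbor_type(node_types: Dict[str, str], edges: List[Edge]) -> Dict[str, Dict[str, int]]:
    """Out-degree of every node, split by the type of the destination node."""
    types = sorted(set(node_types.values()))
    degrees = {node: {t: 0 for t in types} for node in node_types}
    for src, dst, _rel in edges:
        degrees[src][node_types[dst]] += 1
    return degrees


# Hypothetical toy graph, invented for illustration.
node_types = {"p1": "Person", "p2": "Person", "c1": "Company"}
edges = [("p1", "p2", "knows"), ("p1", "c1", "works_at"), ("p2", "c1", "works_at")]

by_relationship = out_degree_by_relationship(list(node_types), edges)
by_neighbor_type = out_degree_by_neighbor_type(node_types, edges)
```

Zero-filling gives every node a map with the same keys, so the maps can be compared or turned into fixed-width feature vectors without special handling for absent types.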