iframe-proxy

SemyonSinchenko · 2026-06-09T11:39:22Z

GQL MATCH on PropertyGraphFrame — motivation, philosophy, and the plan model

1. Motivation & philosophy

The guiding principle is don't rebuild what the platform already does well. A property-graph query engine on top of Spark sits between two mature systems, and the design
draws a hard line on either side:

Everything that Spark SQL can do, stays in Spark SQL. Filtering, projection, scalar expressions, arithmetic, string/date/json functions, joins, unions — these are not reimplemented. The engine's job is to lower a GQL query into the Catalyst-backed DataFrame API and then get out of the way. We don't interpret values, we don't build our own predicate evaluator, we don't manage our own execution. A WHERE a.age > 30 becomes col("age") > 30; a RETURN year(x.born) becomes functions.year(...). Catalyst keeps doing the cost-based physical planning, the codegen, the shuffles. We emit logical DataFrame ops and let the existing engine optimize them.
Everything that belongs to graph algorithms, stays in the GraphFrames algorithm collection. This engine is for pattern matching over a known schema — bounded, relational-shaped traversals that compile to joins. It is deliberately not a graph-algorithm runtime. Anything iterative or fixpoint-shaped (connected components, PageRank, shortest paths of unbounded length, BFS/traversal frameworks) is out of scope by construction and remains the job of the GF algorithms API. Variable-length patterns exist, but they are bounded and expand to a finite union of fixed-length joins — not a traversal.

What's left in the middle — and the only thing this engine actually owns — is the schema-aware translation from a graph pattern to a relational join plan with pushing down all the predicates from the GQL. That is the entire value-add: taking (a:Person)-[:KNOWS]->(b) and figuring out, against the declared LPG schema, which concrete vertex/edge groups it can bind to and how they join. This is the part neither Spark SQL nor the algorithm collection provides.

2. Decisions and hard limitations that follow

The philosophy dictates the boundaries directly. These are not gaps to be filled later; they are the shape of the thing:

One statement form: MATCH <linear pattern> [WHERE] [RETURN]. No WITH, no aggregation, no ORDER BY/LIMIT, no multi-MATCH, no OPTIONAL. Anything that is "just SQL on the result" the user can do on the returned DataFrame — so we don't absorb it.
Linear patterns only, no branching. The grammar guarantees node/edge/node alternation. Branching/tree patterns would push us toward a real query planner; that's over the line.
Variable-length is bounded, hop-by-hop enumerated, capped by maxVarLength. Unbounded reachability is an algorithm, not a pattern — so it's refused, not approximated.
Functions are a hard whitelist mapping 1:1 onto spark.sql.functions, fail-fast on unknown name/arity. No UDFs. The whitelist is the scope boundary: if Spark SQL has it, we expose it by name; if it doesn't, we don't invent it. We do not expose all, just a subset that I think is useful for WHERE. Anything else does not make any sense (see motivation): if user wants to apply a function on top of results the result is DataFrame already.
Fixed output schema (start_*, end_*, edge_property_group, path array). We surface the matched topology; arbitrary projection reshaping is the user's to do downstream in SQL.
No cost-based optimization of the join order (yet, but with a placeholder for it). Catalyst still optimizes the emitted joins; we don't second-guess it. We reserve the option to reorder using graph statistics, but the default deliberately defers to written order + Catalyst.
Fail-fast everywhere: unknown labels, bad syntax, oversized enumeration — all rejected before any Spark job. A pattern that's disconnected in the schema returns empty without touching data.

3. The model: query → plan → optimization

Three narrowing IRs, each a typed boundary with one owner. The progression mirrors a compiler, but each stage exists only to get closer to "emit DataFrame ops."

Query (syntactic): GqlAst. A hand-written sealed ADT, completely firewalled from ANTLR — no generated *Context type escapes AstBuilder. Pure syntax; carries no schema knowledge and no precedence in its nodes (precedence lives in the grammar tiers). This is what makes the rest of the engine testable against plain case classes and the grammar replaceable.

Resolved (logical): ResolvedQuery. This is the only schema-aware stage — the heart of the engine. Resolution does two things:

Pattern → concrete SchemaPaths by a bounded DFS over the SchemaGraphSnapshot. An untyped/ambiguous element fans out into one path per compatible vertex/edge group. Direction (traversedForward) is recorded per step so <-[e]- and undirected edges join correctly. Disconnected → zero paths.
WHERE conjunct classification — the cardinality story, decided once (classifyWhere): single node var → scan-local (pushed onto the scan), two adjacent node vars → join predicate, single edge var → edge scan-local, everything else → post-filter. This is exactly the predicate-pushdown placement Spark would want, decided where we have the schema to do it.

Physical: JoinPlan. One self-contained plan per path (path + element-level join order + predicates + projection + the stats that drove it). Self-contained so explain(Physical) renders without re-resolving. JoinOptimizer is the single place order is chosen; it's structured as a Planner + PlanRefiner SPI threading Option[GraphStatistics], but the default planner uses written order and consumes no stats. The seam for CBO exists; the policy today is "trust Catalyst."

Execution. QueryExecutor walks each plan's order, scanning each element once and joining onto the growing frame. Two refinements keep the emitted SQL lean, both consistent with "let Spark do the work but don't hand it garbage":

Carry vs. defer (classifyElementProps): properties that affect cardinality (filter/join/post) ride the join tree; RETURN-only properties are deferred to a terminal id-keyed join-back so they don't widen every shuffle. (Edges are carried wholesale — no join-back, since undirected edges double rows.)
Scan-reuse floor (ScanKey memo): identical (group, scan-filter, carried-cols) scans share one DataFrame reference across the per-path fan-out.

Plans are UNION ALL-ed into one DataFrame with the fixed schema.

4. Notes on the code

ANTLR firewall. AstBuilder is the only file that imports generated types; everything downstream sees GqlAst. Keep it that way — it's what makes the grammar swappable.
Resolver.classifyWhere + areAdjacent is the subtlest logic and the most likely place for edge-case bugs (variables bound at multiple positions, e.g. triangles). Worth the closest read.
QueryExecutor.classifyElementProps encodes the carry/defer rule and the node/edge asymmetry; the comment block above it states the exact set algebra.
The schema DFS is intentionally naive (noted in-code) — fine for realistic group counts, not tuned for thousands of groups.
Suggested reading order: GqlAst → QueryIr/SchemaGraphSnapshot → Resolver.classifyWhere → QueryExecutor.executePlan/classifyElementProps. Grammar + AstBuilder are mechanical.

…xport

SemyonSinchenko · 2026-06-18T10:46:48Z

codecov-commenter · 2026-06-19T08:23:15Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 88.90728% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.33%. Comparing base (28d181c) to head (c6a65b5).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #849      +/-   ##
==========================================
+ Coverage   79.26%   80.33%   +1.06%     
==========================================
  Files          81       90       +9     
  Lines        4712     5293     +581     
  Branches      554      646      +92     
==========================================
+ Hits         3735     4252     +517     
- Misses        977     1041      +64

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

optimizations will follow up

FunctionsRegistry, parsing, tests.

SemyonSinchenko added 5 commits June 1, 2026 12:29

feat: JGraphT as dependency

89cd7dd

wip

b339c90

Merge remote-tracking branch 'graphframes/main' into 829-schema-and-e…

bfbd4a2

…xport

Merge remote-tracking branch 'graphframes/main' into 829-schema-and-e…

359c092

…xport

feat: public API for schema and schemaDOT

ee1da9b

SemyonSinchenko requested a review from james-willis June 9, 2026 11:39

SemyonSinchenko self-assigned this Jun 9, 2026

SemyonSinchenko added the scala label Jun 9, 2026

SemyonSinchenko marked this pull request as draft June 15, 2026 14:27

Merge remote-tracking branch 'graphframes/main' into 829-schema-and-e…

4805369

…xport

SemyonSinchenko changed the title ~~feat: PG schema and DOT-string export~~ feat: GQL support Jun 18, 2026

feat: ANTL4 grammar and custom plugin

a36ef33

SemyonSinchenko added 4 commits June 18, 2026 13:21

fix: docs and switch to visitor

3d2aff6

feat: antl4 + resolver + analysis

918d8c6

fix: fix warnings

ba5562a

fix: scalastyle

198c462

SemyonSinchenko added 10 commits June 19, 2026 22:32

feat: engine itself

16370d5

optimizations will follow up

fix: docs build

c6a65b5

feat: reusing scans

f6188b3

feat: edge filters pushdown and shared edge scans

158e098

fix: codespell

f256d30

feat: undirected edges

1ae2704

fix: tests & codespell

619f0c9

feat: re-generate lock-file

0405746

fix: fix python build

861eb27

feat: functions support

bc44b3f

FunctionsRegistry, parsing, tests.

SemyonSinchenko mentioned this pull request Jun 23, 2026

chore(deps-dev): bump py7zr from 0.22.0 to 1.1.3 in /python #866

Closed

feat: multi-hop patterns

349a6bc

SemyonSinchenko added 2 commits June 24, 2026 18:07

chore: sanytize

5545b1a

fix: remove old test

18ffb3d

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: GQL support#849

feat: GQL support#849
SemyonSinchenko wants to merge 24 commits into
graphframes:mainfrom
SemyonSinchenko:829-schema-and-export

SemyonSinchenko commented Jun 9, 2026 •

edited

Loading

Uh oh!

SemyonSinchenko commented Jun 18, 2026

Uh oh!

codecov-commenter commented Jun 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

SemyonSinchenko commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

GQL MATCH on PropertyGraphFrame — motivation, philosophy, and the plan model

1. Motivation & philosophy

2. Decisions and hard limitations that follow

3. The model: query → plan → optimization

4. Notes on the code

Uh oh!

SemyonSinchenko commented Jun 18, 2026

Uh oh!

codecov-commenter commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SemyonSinchenko commented Jun 9, 2026 •

edited

Loading

codecov-commenter commented Jun 19, 2026 •

edited

Loading