feat: GQL support by SemyonSinchenko · Pull Request #849 · graphframes/graphframes · GitHub
Skip to content

feat: GQL support#849

Draft
SemyonSinchenko wants to merge 24 commits into
graphframes:mainfrom
SemyonSinchenko:829-schema-and-export
Draft

feat: GQL support#849
SemyonSinchenko wants to merge 24 commits into
graphframes:mainfrom
SemyonSinchenko:829-schema-and-export

Conversation

@SemyonSinchenko

@SemyonSinchenko SemyonSinchenko commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

GQL MATCH on PropertyGraphFrame — motivation, philosophy, and the plan model

1. Motivation & philosophy

The guiding principle is don't rebuild what the platform already does well. A property-graph query engine on top of Spark sits between two mature systems, and the design
draws a hard line on either side:

  • Everything that Spark SQL can do, stays in Spark SQL. Filtering, projection, scalar expressions, arithmetic, string/date/json functions, joins, unions — these are not reimplemented. The engine's job is to lower a GQL query into the Catalyst-backed DataFrame API and then get out of the way. We don't interpret values, we don't build our own predicate evaluator, we don't manage our own execution. A WHERE a.age > 30 becomes col("age") > 30; a RETURN year(x.born) becomes functions.year(...). Catalyst keeps doing the cost-based physical planning, the codegen, the shuffles. We emit logical DataFrame ops and let the existing engine optimize them.
  • Everything that belongs to graph algorithms, stays in the GraphFrames algorithm collection. This engine is for pattern matching over a known schema — bounded, relational-shaped traversals that compile to joins. It is deliberately not a graph-algorithm runtime. Anything iterative or fixpoint-shaped (connected components, PageRank, shortest paths of unbounded length, BFS/traversal frameworks) is out of scope by construction and remains the job of the GF algorithms API. Variable-length patterns exist, but they are bounded and expand to a finite union of fixed-length joins — not a traversal.

What's left in the middle — and the only thing this engine actually owns — is the schema-aware translation from a graph pattern to a relational join plan with pushing down all the predicates from the GQL. That is the entire value-add: taking (a:Person)-[:KNOWS]->(b) and figuring out, against the declared LPG schema, which concrete vertex/edge groups it can bind to and how they join. This is the part neither Spark SQL nor the algorithm collection provides.

2. Decisions and hard limitations that follow

The philosophy dictates the boundaries directly. These are not gaps to be filled later; they are the shape of the thing:

  • One statement form: MATCH <linear pattern> [WHERE] [RETURN]. No WITH, no aggregation, no ORDER BY/LIMIT, no multi-MATCH, no OPTIONAL. Anything that is "just SQL on the result" the user can do on the returned DataFrame — so we don't absorb it.
  • Linear patterns only, no branching. The grammar guarantees node/edge/node alternation. Branching/tree patterns would push us toward a real query planner; that's over the line.
  • Variable-length is bounded, hop-by-hop enumerated, capped by maxVarLength. Unbounded reachability is an algorithm, not a pattern — so it's refused, not approximated.
  • Functions are a hard whitelist mapping 1:1 onto spark.sql.functions, fail-fast on unknown name/arity. No UDFs. The whitelist is the scope boundary: if Spark SQL has it, we expose it by name; if it doesn't, we don't invent it. We do not expose all, just a subset that I think is useful for WHERE. Anything else does not make any sense (see motivation): if user wants to apply a function on top of results the result is DataFrame already.
  • Fixed output schema (start_*, end_*, edge_property_group, path array). We surface the matched topology; arbitrary projection reshaping is the user's to do downstream in SQL.
  • No cost-based optimization of the join order (yet, but with a placeholder for it). Catalyst still optimizes the emitted joins; we don't second-guess it. We reserve the option to reorder using graph statistics, but the default deliberately defers to written order + Catalyst.
  • Fail-fast everywhere: unknown labels, bad syntax, oversized enumeration — all rejected before any Spark job. A pattern that's disconnected in the schema returns empty without touching data.

3. The model: query → plan → optimization

Three narrowing IRs, each a typed boundary with one owner. The progression mirrors a compiler, but each stage exists only to get closer to "emit DataFrame ops."

Query (syntactic): GqlAst. A hand-written sealed ADT, completely firewalled from ANTLR — no generated *Context type escapes AstBuilder. Pure syntax; carries no schema knowledge and no precedence in its nodes (precedence lives in the grammar tiers). This is what makes the rest of the engine testable against plain case classes and the grammar replaceable.

Resolved (logical): ResolvedQuery. This is the only schema-aware stage — the heart of the engine. Resolution does two things:

  1. Pattern → concrete SchemaPaths by a bounded DFS over the SchemaGraphSnapshot. An untyped/ambiguous element fans out into one path per compatible vertex/edge group. Direction (traversedForward) is recorded per step so <-[e]- and undirected edges join correctly. Disconnected → zero paths.
  2. WHERE conjunct classification — the cardinality story, decided once (classifyWhere): single node var → scan-local (pushed onto the scan), two adjacent node vars → join predicate, single edge var → edge scan-local, everything else → post-filter. This is exactly the predicate-pushdown placement Spark would want, decided where we have the schema to do it.

Physical: JoinPlan. One self-contained plan per path (path + element-level join order + predicates + projection + the stats that drove it). Self-contained so explain(Physical) renders without re-resolving. JoinOptimizer is the single place order is chosen; it's structured as a Planner + PlanRefiner SPI threading Option[GraphStatistics], but the default planner uses written order and consumes no stats. The seam for CBO exists; the policy today is "trust Catalyst."

Execution. QueryExecutor walks each plan's order, scanning each element once and joining onto the growing frame. Two refinements keep the emitted SQL lean, both consistent with "let Spark do the work but don't hand it garbage":

  • Carry vs. defer (classifyElementProps): properties that affect cardinality (filter/join/post) ride the join tree; RETURN-only properties are deferred to a terminal id-keyed join-back so they don't widen every shuffle. (Edges are carried wholesale — no join-back, since undirected edges double rows.)
  • Scan-reuse floor (ScanKey memo): identical (group, scan-filter, carried-cols) scans share one DataFrame reference across the per-path fan-out.

Plans are UNION ALL-ed into one DataFrame with the fixed schema.

4. Notes on the code

  • ANTLR firewall. AstBuilder is the only file that imports generated types; everything downstream sees GqlAst. Keep it that way — it's what makes the grammar swappable.
  • Resolver.classifyWhere + areAdjacent is the subtlest logic and the most likely place for edge-case bugs (variables bound at multiple positions, e.g. triangles). Worth the closest read.
  • QueryExecutor.classifyElementProps encodes the carry/defer rule and the node/edge asymmetry; the comment block above it states the exact set algebra.
  • The schema DFS is intentionally naive (noted in-code) — fine for realistic group counts, not tuned for thousands of groups.
  • Suggested reading order: GqlAst → QueryIr/SchemaGraphSnapshot → Resolver.classifyWhereQueryExecutor.executePlan/classifyElementProps. Grammar + AstBuilder are mechanical.

@SemyonSinchenko SemyonSinchenko self-assigned this Jun 9, 2026
@SemyonSinchenko SemyonSinchenko marked this pull request as draft June 15, 2026 14:27
@SemyonSinchenko SemyonSinchenko changed the title feat: PG schema and DOT-string export feat: GQL support Jun 18, 2026
@SemyonSinchenko

Copy link
Copy Markdown
Collaborator Author

@codecov-commenter

codecov-commenter commented Jun 19, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 88.90728% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.33%. Comparing base (28d181c) to head (c6a65b5).

Files with missing lines Patch % Lines
...hframes/propertygraph/internal/QueryExecutor.scala 89.44% 23 Missing ⚠️
...raphframes/propertygraph/internal/GqlExplain.scala 77.61% 15 Missing ⚠️
...raphframes/propertygraph/internal/AstBuilder.scala 92.92% 7 Missing ⚠️
...rames/propertygraph/internal/GraphStatistics.scala 12.50% 7 Missing ⚠️
.../graphframes/propertygraph/internal/JoinPlan.scala 25.00% 6 Missing ⚠️
...graphframes/propertygraph/PropertyGraphFrame.scala 95.23% 2 Missing ⚠️
...rg/graphframes/propertygraph/internal/GqlAst.scala 80.00% 2 Missing ⚠️
...s/propertygraph/internal/SchemaGraphSnapshot.scala 95.23% 2 Missing ⚠️
...hframes/propertygraph/property/PropertyGroup.scala 33.33% 2 Missing ⚠️
.../graphframes/propertygraph/internal/Resolver.scala 98.70% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #849      +/-   ##
==========================================
+ Coverage   79.26%   80.33%   +1.06%     
==========================================
  Files          81       90       +9     
  Lines        4712     5293     +581     
  Branches      554      646      +92     
==========================================
+ Hits         3735     4252     +517     
- Misses        977     1041      +64     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants