feat: GQL support #849
Draft
SemyonSinchenko wants to merge 24 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #849 +/- ##
==========================================
+ Coverage 79.26% 80.33% +1.06%
==========================================
Files 81 90 +9
Lines 4712 5293 +581
Branches 554 646 +92
==========================================
+ Hits 3735 4252 +517
- Misses 977 1041 +64
optimizations will follow up
FunctionsRegistry, parsing, tests.

GQL MATCH on PropertyGraphFrame — motivation, philosophy, and the plan model
1. Motivation & philosophy
The guiding principle is: don't rebuild what the platform already does well. A property-graph query engine on top of Spark sits between two mature systems, and the design draws a hard line on either side:
WHERE a.age > 30 becomes col("age") > 30; a RETURN year(x.born) becomes functions.year(...). Catalyst keeps doing the cost-based physical planning, the codegen, the shuffles. We emit logical DataFrame ops and let the existing engine optimize them.
What's left in the middle, and the only thing this engine actually owns, is the schema-aware translation from a graph pattern to a relational join plan, with all predicates from the GQL query pushed down. That is the entire value-add: taking (a:Person)-[:KNOWS]->(b) and figuring out, against the declared LPG schema, which concrete vertex/edge groups it can bind to and how they join. This is the part neither Spark SQL nor the algorithm collection provides.
2. Decisions and hard limitations that follow
The philosophy dictates the boundaries directly. These are not gaps to be filled later; they are the shape of the thing:
MATCH <linear pattern> [WHERE] [RETURN]. No WITH, no aggregation, no ORDER BY/LIMIT, no multi-MATCH, no OPTIONAL. Anything that is "just SQL on the result" the user can do on the returned DataFrame, so we don't absorb it.
spark.sql.functions, fail-fast on unknown name/arity. No UDFs. The whitelist is the scope boundary: if Spark SQL has it, we expose it by name; if it doesn't, we don't invent it. We do not expose everything, just a subset that I think is useful for WHERE. Anything more would not make sense (see motivation): if the user wants to apply a function on top of the results, the result is already a DataFrame.
start_*, end_*, edge_property_group, path array). We surface the matched topology; arbitrary projection reshaping is the user's to do downstream in SQL.
3. The model: query → plan → optimization
Three narrowing IRs, each a typed boundary with one owner. The progression mirrors a compiler, but each stage exists only to get closer to "emit DataFrame ops."
Query (syntactic): GqlAst. A hand-written sealed ADT, completely firewalled from ANTLR: no generated *Context type escapes AstBuilder. Pure syntax; carries no schema knowledge and no precedence in its nodes (precedence lives in the grammar tiers). This is what makes the rest of the engine testable against plain case classes and the grammar replaceable.
Resolved (logical): ResolvedQuery. This is the only schema-aware stage, the heart of the engine. Resolution does two things:
SchemaPaths by a bounded DFS over the SchemaGraphSnapshot. An untyped/ambiguous element fans out into one path per compatible vertex/edge group. Direction (traversedForward) is recorded per step so <-[e]- and undirected edges join correctly. A disconnected pattern yields zero paths.
Physical: JoinPlan. One self-contained plan per path (path + element-level join order + predicates + projection + the stats that drove it). It is self-contained so that explain(Physical) renders without re-resolving. JoinOptimizer is the single place where order is chosen; it's structured as a Planner + PlanRefiner SPI threading Option[GraphStatistics], but the default planner uses written order and consumes no stats. The seam for CBO exists; the policy today is "trust Catalyst."
Execution. QueryExecutor walks each plan's order, scanning each element once and joining onto the growing frame. Two refinements keep the emitted SQL lean, both consistent with "let Spark do the work but don't hand it garbage":
Plans are UNION ALL-ed into one DataFrame with the fixed schema.
4. Notes on the code
GqlAst → QueryIr/SchemaGraphSnapshot → Resolver.classifyWhere → QueryExecutor.executePlan/classifyElementProps. Grammar + AstBuilder are mechanical.
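To make the resolution step from section 3 more concrete, here is a minimal, self-contained sketch of how a bounded DFS might fan a linear pattern out into one schema path per compatible vertex/edge group. It is an illustration only, not the code in this PR: EdgeGroup, SchemaGraph, Hop, Step, SchemaPath and PathEnumeration are hypothetical stand-ins (only the traversedForward field name is borrowed from the description above), and undirected edges are left out for brevity.

```scala
// Hypothetical, simplified types for illustration; NOT the PR's actual classes.
final case class EdgeGroup(name: String, src: String, dst: String)
final case class SchemaGraph(vertexGroups: Set[String], edgeGroups: List[EdgeGroup])

// One pattern hop: optional edge label, optional target vertex label, and
// whether the edge is written as -[e]-> (forward) or <-[e]- (backward).
final case class Hop(edgeLabel: Option[String], vertexLabel: Option[String], forward: Boolean)

// One resolved step of a concrete path; the direction is recorded per step.
final case class Step(edge: String, traversedForward: Boolean, vertex: String)
final case class SchemaPath(start: String, steps: List[Step])

object PathEnumeration {
  // An absent label is compatible with every group (this is the fan-out).
  private def matches(label: Option[String], name: String): Boolean =
    label.forall(_ == name)

  // Depth-first enumeration; the depth is bounded by the number of hops.
  def enumerate(schema: SchemaGraph, startLabel: Option[String], hops: List[Hop]): List[SchemaPath] = {
    def dfs(current: String, remaining: List[Hop]): List[List[Step]] = remaining match {
      case Nil => List(Nil)
      case hop :: rest =>
        for {
          e <- schema.edgeGroups
          if matches(hop.edgeLabel, e.name)
          // A forward hop leaves `current` through e.src; a backward hop through e.dst.
          from = if (hop.forward) e.src else e.dst
          to = if (hop.forward) e.dst else e.src
          if from == current && matches(hop.vertexLabel, to)
          tail <- dfs(to, rest)
        } yield Step(e.name, hop.forward, to) :: tail
    }

    for {
      start <- schema.vertexGroups.toList.sorted
      if matches(startLabel, start)
      steps <- dfs(start, hops)
    } yield SchemaPath(start, steps)
  }
}

object Example {
  val schema: SchemaGraph = SchemaGraph(
    Set("Person", "Company"),
    List(EdgeGroup("KNOWS", "Person", "Person"), EdgeGroup("WORKS_AT", "Person", "Company"))
  )

  // (a:Person)-[]->(b): untyped edge and untyped target fan out per edge group.
  val paths: List[SchemaPath] =
    PathEnumeration.enumerate(schema, Some("Person"), List(Hop(None, None, forward = true)))

  // (c:Company)<-[:WORKS_AT]-(p): the edge group is traversed against its direction.
  val backward: List[SchemaPath] =
    PathEnumeration.enumerate(schema, Some("Company"), List(Hop(Some("WORKS_AT"), None, forward = false)))

  // (c:Company)-[:KNOWS]->(b): no edge group connects, so there are no paths.
  val disconnected: List[SchemaPath] =
    PathEnumeration.enumerate(schema, Some("Company"), List(Hop(Some("KNOWS"), None, forward = true)))
}
```

In this toy schema the untyped pattern resolves to two paths (one through KNOWS, one through WORKS_AT), which under the model described above would become two join plans that are later UNION ALL-ed. The disconnected pattern resolves to none.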