feat(dataflow): P0 schema foundation for interprocedural variable-level model by carlos-alm · Pull Request #1608 · optave/ops-codegraph-tool · GitHub

Merged
carlos-alm merged 8 commits into main from feat/dataflow-vertex-schema-p0 on Jun 19, 2026

Conversation

carlos-alm (Contributor) commented Jun 18, 2026

Summary

Implementation of the interprocedural dataflow plan (phases P0–P2 + P5 B1).

P0 — Schema foundation (migration v18)

  • dataflow_vertices table: param/local/return/receiver data locations
  • dataflow_summary table: per-function transfer functions
  • dataflow table augmented with nullable source_vertex/target_vertex/scope/call_edge_id
  • dataflow_fn backward-compat view (cross-function vertex-linked edges)
  • Also fixes the missing v17 migration in the Rust engine (closes #1607, "fix(native): Rust connection.rs missing migration v17 (edges.technique column)")
  • DataflowVertex type, new DataflowEdgeKind values (def_use, arg_in, return_out)
  • Worker protocol: dataflowVertices field wired into SerializedExtractorOutput
  • hasDataflowVertices helper + export chain
  • parity-compare.mjs --dataflow flag for vertex multiset comparison
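
As a rough illustration of the P0 data model, the vertex and edge kinds listed above could be typed along these lines. Only the kind values come from this PR; the field names (`funcId`, `name`, `paramIndex`) and both helper functions are assumptions made for the sketch, and the real `DataflowVertex` type may look different.

```typescript
// Hypothetical shapes: only the kind values are taken from the PR text.
type VertexKind = "param" | "local" | "return" | "receiver";
type NewEdgeKind = "def_use" | "arg_in" | "return_out";

interface VertexSketch {
  funcId: number;            // assumed: id of the enclosing function node
  kind: VertexKind;
  name: string | null;       // assumed: null for a return vertex
  paramIndex: number | null; // assumed: only set for param vertices
}

interface VertexEdgeSketch {
  kind: NewEdgeKind;
  scope: "intra" | "inter";
  source: VertexSketch;
  target: VertexSketch;
}

// Human-readable label for a vertex, used only for illustration.
function describeVertex(v: VertexSketch): string {
  const label = v.kind === "param" ? `${v.name}[${v.paramIndex}]` : (v.name ?? "<anonymous>");
  return `${v.kind} ${label} in fn#${v.funcId}`;
}

// Human-readable label for a vertex-linked edge.
function describeEdge(e: VertexEdgeSketch): string {
  return `${describeVertex(e.source)} -[${e.kind}/${e.scope}]-> ${describeVertex(e.target)}`;
}
```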

P1 — Variable model for JS/TS/TSX

  • buildDataflowVerticesAndEdges: creates param/return/local vertices + intra def_use edges
  • Internal cast types access visitor's paramName, paramIndex, referencedNames
  • Existing flows_to/returns/mutates edges unchanged (backward compat)

P2 — Interprocedural stitching

  • buildInterproceduralStitch: post-pass over all stitch candidates after per-file processing
  • arg_in inter edge: caller's source vertex → callee's param[j] vertex
  • return_out inter edge: callee's return → caller's capture local (if summary confirms flow)
  • dataflow_summary computation per (func, param): flows_to_return + is_mutated

P5 B1 — C/C++/ObjC/CUDA dataflow rules

  • src/ast-analysis/rules/c.ts: C + C++ rules with nameExtractor for nested declarators
  • DataflowRulesConfig.nameExtractor extension to handle complex function name structures
  • DATAFLOW_RULES: c, cpp, objc, cuda (4 new languages)

Tests

  • 34/34 integration tests pass (23 original + 11 new in tests/integration/dataflow-vertices.test.ts)
  • New tests verify: vertex creation, def_use edges, summaries, arg_in stitching, dataflow_fn view

Issues filed for remaining phases

Test plan

  • npx vitest run tests/integration/dataflow.test.ts tests/integration/dataflow-vertices.test.ts — 34/34 pass
  • npm run lint — clean
  • npx tsc --noEmit — clean
  • node scripts/parity-compare.mjs --langs javascript --dataflow — requires built dist + native addon

greptile-apps (Bot, Contributor) commented Jun 18, 2026

Comment thread scripts/parity-compare.mjs Outdated
Comment on lines +188 to +191
} catch {
  // table absent in pre-v18 DBs; empty multiset = no diffs
  vertices.set('__TOTAL_ROWS__', 0);
}

Contributor

P2 Bare catch silences all query errors, not just "table absent"

readDataflowVerticesMultiset swallows every exception — including genuine SQL errors (malformed query, schema drift, corrupt DB). If the same bug hits both the WASM and native builds, both return an empty multiset and the comparison reports no diffs, hiding a real divergence. Consider re-throwing any error whose message doesn't include "no such table".

Contributor Author

Fixed in commit 38589c5. The catch now checks the error message for 'no such table' and re-throws anything else, so genuine SQL errors (schema drift, corrupt DB, malformed query) are no longer silently swallowed.
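
The narrowed catch described in this reply can be sketched as follows. `readMultisetOrEmpty` and its `runQuery` parameter are hypothetical stand-ins for the body of `readDataflowVerticesMultiset`; only the "no such table" check and the `__TOTAL_ROWS__` key come from the thread.

```typescript
// Hypothetical wrapper: runQuery stands in for the real SQL read.
function readMultisetOrEmpty(runQuery: () => Map<string, number>): Map<string, number> {
  try {
    return runQuery();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Only a missing table (pre-v18 DB) is treated as "no vertices";
    // every other failure is re-thrown so it cannot mask a divergence.
    if (!message.includes("no such table")) throw err;
    return new Map([["__TOTAL_ROWS__", 0]]);
  }
}
```

With this shape, a malformed query fails the run on both engines instead of producing two matching empty multisets.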

Comment on lines +277 to +280
dfVertexCount: dataflow ? base.dfVertices.get('__TOTAL_ROWS__') : undefined,
nodeDiffs,
edgeDiffs,
dfVertexDiffs,

Contributor

P2 dfVertexDiffs is always emitted in JSON; dfVertexCount is not — inconsistent shape

When --dataflow is not passed, dfVertexCount is set to undefined (serialised as absent in JSON) while dfVertexDiffs is always present as []. JSON consumers can't use the presence/absence of dfVertexDiffs to detect whether a dataflow comparison was actually attempted.

Suggested change
- dfVertexCount: dataflow ? base.dfVertices.get('__TOTAL_ROWS__') : undefined,
- nodeDiffs,
- edgeDiffs,
- dfVertexDiffs,
+ dfVertexCount: dataflow ? base.dfVertices.get('__TOTAL_ROWS__') : undefined,
+ nodeDiffs,
+ edgeDiffs,
+ dfVertexDiffs: dataflow ? dfVertexDiffs : undefined,

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Contributor Author

Fixed in commit 38589c5. dfVertexDiffs is now undefined (not []) when --dataflow is not passed, matching the convention already used by dfVertexCount. The ok check was updated to treat undefined as 'no comparison attempted'.
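
A minimal sketch of the resulting JSON shape, assuming a simplified report object. `ParityReport`, `buildReport` and `isOk` are illustrative names, not the script's own; the point is that `JSON.stringify` omits properties whose value is `undefined`, and the ok check must treat `undefined` as "no comparison attempted".

```typescript
// Simplified, hypothetical report shape.
interface ParityReport {
  nodeDiffs: string[];
  edgeDiffs: string[];
  dfVertexCount: number | undefined;
  dfVertexDiffs: string[] | undefined;
}

function buildReport(dataflow: boolean, dfDiffs: string[], dfCount: number): ParityReport {
  return {
    nodeDiffs: [],
    edgeDiffs: [],
    // Both dataflow fields follow the same convention: absent unless requested.
    dfVertexCount: dataflow ? dfCount : undefined,
    dfVertexDiffs: dataflow ? dfDiffs : undefined,
  };
}

// undefined means the dataflow comparison was skipped, so it cannot fail the run.
function isOk(r: ParityReport): boolean {
  return (
    r.nodeDiffs.length === 0 &&
    r.edgeDiffs.length === 0 &&
    (r.dfVertexDiffs === undefined || r.dfVertexDiffs.length === 0)
  );
}
```

A consumer can then test for the presence of `dfVertexDiffs` in the parsed JSON to tell whether `--dataflow` was in effect.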

Comment thread src/db/migrations.ts
Comment on lines +300 to +312
CREATE VIEW IF NOT EXISTS dataflow_fn AS
SELECT
sv.func_id AS source_id,
tv.func_id AS target_id,
d.kind,
d.param_index,
d.expression,
d.line,
d.confidence
FROM dataflow d
JOIN dataflow_vertices sv ON d.source_vertex = sv.id
JOIN dataflow_vertices tv ON d.target_vertex = tv.id
WHERE sv.func_id != tv.func_id;

Contributor

P2 dataflow_fn view is described as "backward-compat" but excludes all pre-v18 rows

The INNER JOINs on dataflow_vertices mean any dataflow row where source_vertex IS NULL (i.e., every row inserted before v18) is invisible in this view. Code that migrates from querying dataflow.source_id/target_id to querying dataflow_fn.source_id/target_id will silently drop all historical inter-procedural flows. If the view is purely for the new vertex model (not a bridge for legacy data), the word "backward-compat" in the PR and commit messages will mislead future maintainers.

Contributor Author

Fixed in commit 38589c5. Added a comment to the CREATE VIEW statement clarifying that dataflow_fn intentionally uses INNER JOINs and is NOT a backward-compat replacement for direct dataflow queries — legacy consumers that migrate to it would silently drop all pre-v18 rows.
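
To make the INNER JOIN behaviour concrete, here is an in-memory imitation of the view in TypeScript. It is not the migration code: the row shapes are simplified and `dataflowFnView` is a hypothetical name. It shows that a row with NULL vertex columns never appears in the result.

```typescript
interface VertexRow { id: number; funcId: number }
interface DataflowRow {
  sourceId: number;
  targetId: number;
  sourceVertex: number | null; // NULL for rows written before v18
  targetVertex: number | null;
  kind: string;
}

// Imitates: dataflow JOIN vertices sv JOIN vertices tv WHERE sv.func_id != tv.func_id
function dataflowFnView(rows: DataflowRow[], vertices: VertexRow[]) {
  const byId = new Map(vertices.map((v) => [v.id, v] as const));
  const out: { sourceId: number; targetId: number; kind: string }[] = [];
  for (const row of rows) {
    // An inner join cannot match a NULL key, so legacy rows are dropped here.
    if (row.sourceVertex === null || row.targetVertex === null) continue;
    const sv = byId.get(row.sourceVertex);
    const tv = byId.get(row.targetVertex);
    if (!sv || !tv || sv.funcId === tv.funcId) continue;
    out.push({ sourceId: sv.funcId, targetId: tv.funcId, kind: row.kind });
  }
  return out;
}
```

Code that still needs pre-v18 flows would keep querying the `dataflow` table directly, as the added comment on the view says.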

github-actions (Bot, Contributor) commented Jun 18, 2026

Codegraph Impact Analysis

63 functions changed; 21 callers affected across 12 files

  • usage in scripts/parity-compare.mjs:33 (1 transitive callers)
  • readDataflowVerticesMultiset in scripts/parity-compare.mjs:170 (2 transitive callers)
  • readMultisets in scripts/parity-compare.mjs:198 (1 transitive callers)
  • unwrapDeclarator in src/ast-analysis/rules/c.ts:22 (2 transitive callers)
  • extractCFunctionName in src/ast-analysis/rules/c.ts:30 (0 transitive callers)
  • findFuncDeclarator in src/ast-analysis/rules/c.ts:61 (1 transitive callers)
  • getCParamListNode in src/ast-analysis/rules/c.ts:72 (0 transitive callers)
  • extractCParamName in src/ast-analysis/rules/c.ts:77 (0 transitive callers)
  • LanguageRules.nameExtractor in src/ast-analysis/visitor-utils.ts:16 (3 transitive callers)
  • functionName in src/ast-analysis/visitor-utils.ts:49 (2 transitive callers)
  • enterFunctionScope in src/ast-analysis/visitors/dataflow-visitor.ts:430 (1 transitive callers)
  • PurgeStmts.dataflowByVertex in src/db/repository/build-stmts.ts:8 (0 transitive callers)
  • PurgeStmts.dataflowSummary in src/db/repository/build-stmts.ts:9 (0 transitive callers)
  • PurgeStmts.dataflowVertices in src/db/repository/build-stmts.ts:10 (0 transitive callers)
  • preparePurgeStmts in src/db/repository/build-stmts.ts:28 (8 transitive callers)
  • runPurge in src/db/repository/build-stmts.ts:88 (8 transitive callers)
  • hasDataflowVertices in src/db/repository/dataflow.ts:26 (0 transitive callers)
  • SerializedExtractorOutput.dataflowVertices in src/domain/wasm-worker-protocol.ts:81 (0 transitive callers)
  • VisitorParam.funcName in src/features/dataflow.ts:193 (0 transitive callers)
  • VisitorParam.paramName in src/features/dataflow.ts:194 (0 transitive callers)

…el model

docs check acknowledged

Establishes the database schema and wire-format plumbing for the
interprocedural dataflow expansion (plan §8 P0).

Migration v18 (TS + Rust, mirrored):
- dataflow_vertices table: addressable data locations (param/local/return/
  receiver) keyed to an enclosing function node
- dataflow_summary table: per-function transfer function (param→return
  reachability, mutation flag) for inter-procedural stitching
- Nullable source_vertex/target_vertex/scope/call_edge_id columns on the
  existing dataflow table (additive; old rows keep source_id/target_id)
- dataflow_fn view: backward-compat function-level projection over the new
  vertex-linked inter-scope edges (empty until P1 populates vertices)

Also fixes missing v17 migration in the Rust engine (edges.technique column
was never added; tracked as issue #1607).

Supporting changes:
- DataflowVertex type in types.ts; new DataflowEdgeKind values (def_use,
  arg_in, return_out) for the vertex-level edge taxonomy
- dataflowVertices field wired into SerializedExtractorOutput (worker
  protocol, populated in P1)
- build-stmts.ts: purge statements for dataflow_vertices/summary/vertex-
  linked dataflow rows (correct cascade order)
- hasDataflowVertices helper + export chain (db/repository → db/index)
- parity-compare.mjs --dataflow flag: enables dataflow build + compares
  dataflow_vertices multisets between WASM and native engines
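
The multiset comparison behind `--dataflow` can be pictured with the sketch below. The function names and the way row keys are built are assumptions; the `{ key, base, other }` diff shape follows the fields the script prints for each `[df-vertex]` line.

```typescript
interface MultisetDiff { key: string; base: number; other: number }

// Count how often each row key occurs.
function toMultiset(keys: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const k of keys) counts.set(k, (counts.get(k) ?? 0) + 1);
  return counts;
}

// Report every key whose count differs between the two engines.
function diffMultisets(base: Map<string, number>, other: Map<string, number>): MultisetDiff[] {
  const keys = Array.from(new Set([...Array.from(base.keys()), ...Array.from(other.keys())])).sort();
  const diffs: MultisetDiff[] = [];
  for (const key of keys) {
    const b = base.get(key) ?? 0;
    const o = other.get(key) ?? 0;
    if (b !== o) diffs.push({ key, base: b, other: o });
  }
  return diffs;
}
```

Comparing counts rather than sets means a vertex emitted twice by one engine and once by the other still shows up as a difference.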

All existing integration tests pass (23/23).
…ra def_use edges

docs check acknowledged

Implements the WASM/JS-path vertex extraction phase of the interprocedural
dataflow plan (§8 P1).

buildDataflowVerticesAndEdges (new, in features/dataflow.ts):
- Creates 'param' vertices from each function parameter (name + index)
- Creates one 'return' vertex per function that has a return statement
- Creates 'local' vertices for variables assigned from call-return results
- Emits 'def_use' / 'intra' edges from param/local → return when the
  variable name appears in the return expression's referencedNames set
- All new rows use the source_vertex/target_vertex/scope columns added
  in v18; source_id/target_id are set to the enclosing function for
  backward compatibility with existing queries
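
The vertex and edge rules listed above can be sketched roughly as follows. `FnFacts` is a hypothetical, simplified stand-in for the visitor output (the real code reads `paramName`, `paramIndex` and `referencedNames` through cast types), and database writes are left out entirely.

```typescript
// Hypothetical, simplified per-function visitor output.
interface FnFacts {
  funcId: number;
  params: string[];            // parameter names in declaration order
  callLocals: string[];        // locals assigned from call-return results
  returnRefs: string[] | null; // names referenced by the return expression, null if no return
}
interface Vertex { funcId: number; kind: "param" | "local" | "return"; name: string | null; paramIndex: number | null }
interface Edge { kind: "def_use"; scope: "intra"; source: Vertex; target: Vertex }

function buildVerticesAndEdges(fn: FnFacts): { vertices: Vertex[]; edges: Edge[] } {
  const vertices: Vertex[] = [];
  fn.params.forEach((name, i) => { vertices.push({ funcId: fn.funcId, kind: "param", name, paramIndex: i }); });
  fn.callLocals.forEach((name) => { vertices.push({ funcId: fn.funcId, kind: "local", name, paramIndex: null }); });

  const edges: Edge[] = [];
  if (fn.returnRefs !== null) {
    const ret: Vertex = { funcId: fn.funcId, kind: "return", name: null, paramIndex: null };
    const refs = new Set(fn.returnRefs);
    // A def_use edge exists only when the variable name is referenced by the return expression.
    for (const v of vertices) {
      if (v.name !== null && refs.has(v.name)) edges.push({ kind: "def_use", scope: "intra", source: v, target: ret });
    }
    vertices.push(ret);
  }
  return { vertices, edges };
}
```

For a function like `helper(x, y) { return y; }` this yields two param vertices, one return vertex, and a single `def_use` edge from `y`, which is the negative case the tests describe for `x`.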

Internal cast types (VisitorParam, VisitorReturn, VisitorAssignment) allow
safe access to the richer visitor output (paramName, paramIndex,
referencedNames) without changing the public DataflowResult contract.

Existing flows_to/returns/mutates edges are unchanged. The native bulk-
insert fast path is left untouched — native vertex emission tracked
separately.

Tests: 9 new assertions in tests/integration/dataflow-vertices.test.ts —
param/return/local vertex creation, def_use edge creation, negative test
(param not in return expression → no edge), backward-compat flows_to edge,
dataflow_fn view empty pre-P2.

32/32 integration tests pass (23 existing + 9 new).
…mmaries

docs check acknowledged

Implements the interprocedural stitch post-pass and summary computation
for the variable-level dataflow model (plan §8 P2).

buildInterproceduralStitch (new):
- Post-pass that runs after all per-file vertices + summaries are committed
- For each resolved argFlow (A calls B with arg x → B.param[j]):
  - Finds source vertex x in caller (via binding.type='param'|'local')
  - Finds B.param[j] vertex in callee
  - Emits 'arg_in' scope='inter' edge: x → B.param[j]
  - If B's summary shows B.param[j] flows_to_return: emits 'return_out'
    edge: B.return → A's capture local (if any)
- Resolves call_edge_id from the edges table for each stitch site
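
A simplified sketch of the stitch rules above, operating on in-memory arrays instead of database rows. `ArgFlow`, `stitch` and the `flowsToReturn` callback are hypothetical names, and `call_edge_id` resolution is omitted.

```typescript
interface Vertex { id: number; funcId: number; kind: "param" | "local" | "return"; name: string | null; paramIndex: number | null }
// Hypothetical resolved call-site fact: caller passes argName as callee's param[paramIndex].
interface ArgFlow { callerId: number; calleeId: number; argName: string; paramIndex: number; captureLocal: string | null }
interface InterEdge { kind: "arg_in" | "return_out"; scope: "inter"; source: number; target: number }

function stitch(
  flows: ArgFlow[],
  vertices: Vertex[],
  flowsToReturn: (calleeId: number, paramIndex: number) => boolean,
): InterEdge[] {
  const find = (funcId: number, pred: (v: Vertex) => boolean) =>
    vertices.find((v) => v.funcId === funcId && pred(v));
  const edges: InterEdge[] = [];
  for (const f of flows) {
    const src = find(f.callerId, (v) => (v.kind === "param" || v.kind === "local") && v.name === f.argName);
    const param = find(f.calleeId, (v) => v.kind === "param" && v.paramIndex === f.paramIndex);
    if (!src || !param) continue;
    edges.push({ kind: "arg_in", scope: "inter", source: src.id, target: param.id });

    // return_out is emitted only when the callee's summary confirms the flow
    // and the caller actually captures the result in a local.
    if (f.captureLocal === null || !flowsToReturn(f.calleeId, f.paramIndex)) continue;
    const ret = find(f.calleeId, (v) => v.kind === "return");
    const capture = find(f.callerId, (v) => v.kind === "local" && v.name === f.captureLocal);
    if (ret && capture) edges.push({ kind: "return_out", scope: "inter", source: ret.id, target: capture.id });
  }
  return edges;
}
```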

buildDataflowVerticesAndEdges (updated):
- Now also computes dataflow_summary (flows_to_return, is_mutated per param)
  using the def_use edges just committed (same transaction)
- Collects and returns StitchCandidate[] + ReturnCapture[] for the post-pass
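
One way to picture the per-(func, param) summary is the sketch below. It assumes that only a direct param-to-return `def_use` edge counts as reaching the return, and that mutation facts arrive as a precomputed set of parameter indices; both are simplifications, and `summarize` is a hypothetical name.

```typescript
interface DefUseEdge { funcId: number; sourceKind: "param" | "local"; paramIndex: number | null }
interface SummaryRow { funcId: number; paramIndex: number; flowsToReturn: 0 | 1; isMutated: 0 | 1 }

// One row per parameter: does it reach the return, and is it mutated?
function summarize(funcId: number, paramCount: number, edges: DefUseEdge[], mutatedParams: Set<number>): SummaryRow[] {
  const rows: SummaryRow[] = [];
  for (let i = 0; i < paramCount; i++) {
    const reaches = edges.some((e) => e.funcId === funcId && e.sourceKind === "param" && e.paramIndex === i);
    rows.push({ funcId, paramIndex: i, flowsToReturn: reaches ? 1 : 0, isMutated: mutatedParams.has(i) ? 1 : 0 });
  }
  return rows;
}
```

For `helper(x, y)` with a single edge from `y` to the return, this gives `flowsToReturn` 0 for `x` and 1 for `y`, matching the summary assertion in the test list.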

buildDataflowEdges (updated):
- Accumulates stitch candidates across all files
- Calls buildInterproceduralStitch as a second-pass transaction

Tests: 11 P1+P2 assertions passing:
- param/return/local vertex creation
- def_use intra edges (positive + negative cases)
- summary computation: helper.param[y] flows_to_return=1, param[x]=0
- arg_in inter edge verified in both dataflow_fn view and raw query
- scope='inter', correct vertex kinds (param→param)

34/34 integration tests pass (23 original + 11 new).
…r extension

docs check acknowledged

Adds DATAFLOW_RULES for the C-family batch (B1) of the 26 new languages.

Infrastructure (needed by C/C++ and future complex languages):
- DataflowRulesConfig.nameExtractor optional override: when present,
  used by functionName() in visitor-utils.ts before the nameField path —
  handles languages where the function name is nested inside declarators
- DataflowRulesConfig/DATAFLOW_DEFAULTS/LanguageRules updated consistently

src/ast-analysis/rules/c.ts (new):
- extractCFunctionName: unwraps C/C++ function_definition.declarator →
  function_declarator.declarator → identifier, handling pointer/array/
  reference/parenthesized/qualified_identifier wrappers
- extractCParamName: extracts identifier from parameter_declaration,
  unwrapping pointer/reference declarators
- dataflow (C): covers function_definition, call_expression,
  field_expression (member access), return_statement, init_declarator
- dataflowCpp (C++): extends C with function_declaration and STL
  mutating method names
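
The declarator unwrapping described above might look roughly like this. `CNode` is a hypothetical minimal node shape, not the tree-sitter API, and `extractName` is an illustrative stand-in for `extractCFunctionName`. The sketch already unwraps a wrapper around the function declarator, so it handles pointer-returning functions such as `int *foo()`.

```typescript
// Hypothetical minimal node: a type tag, optional text, optional "declarator" child.
interface CNode { type: string; text?: string; declarator?: CNode }

const WRAPPERS = new Set([
  "pointer_declarator",
  "reference_declarator",
  "parenthesized_declarator",
  "array_declarator",
]);

// Skip wrapper declarators until something else (or nothing) is reached.
function unwrap(node: CNode | undefined): CNode | undefined {
  let cur = node;
  while (cur && WRAPPERS.has(cur.type)) cur = cur.declarator;
  return cur;
}

// function_definition.declarator -> function_declarator.declarator -> identifier
function extractName(fnDef: CNode): string | null {
  const fnDecl = unwrap(fnDef.declarator);
  if (!fnDecl || fnDecl.type !== "function_declarator") return null;
  const id = unwrap(fnDecl.declarator);
  if (!id) return null;
  const isName = id.type === "identifier" || id.type === "qualified_identifier";
  return isName ? (id.text ?? null) : null;
}
```

A rules object would then supply such a function as its `nameExtractor`, which `functionName()` consults before falling back to the plain name-field lookup.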

DATAFLOW_RULES additions:
  'c' → c.dataflow
  'cpp' → c.dataflowCpp
  'objc' → c.dataflow  (C-compatible functions; ObjC message sends TODO)
  'cuda' → c.dataflowCpp  (CUDA inherits C++ grammar)

34/34 integration tests pass.
Three bugs in the new C/C++ dataflow extraction path (P5 B1):

1. extractCFunctionName dropped pointer/reference-returning functions
   (int *foo(), T &bar()): the direct declarator child is a
   pointer_declarator wrapper — now unwrapped one level before checking
   for function_declarator.

2. Parameter list was unreachable for all C/C++ functions: params live
   on function_definition→declarator→parameters (nested), not directly
   on function_definition. Added getParamListNode optional override to
   DataflowRulesConfig/DATAFLOW_DEFAULTS/enterFunctionScope; C rules use
   getCParamListNode which traverses through optional wrappers to reach
   function_declarator.parameters.

3. dataflowCpp.functionNodes included function_declaration (forward
   declarations without bodies): these produce spurious param vertices
   with flows_to_return=0 and can overwrite correct dataflow_summary rows
   via INSERT OR REPLACE when processed after the definition.

Adds 29 passing tests covering all three paths for C and C++.
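
The overwrite hazard in bug 3 can be imitated with a `Map`, whose `set` replaces by key much like `INSERT OR REPLACE` replaces by primary key. The key format and the `Summary` shape are assumptions for illustration, as is the premise that a forward declaration and its definition resolve to the same function key.

```typescript
interface Summary { func: string; paramIndex: number; flowsToReturn: 0 | 1 }

// Map.set overwrites an existing key, mimicking INSERT OR REPLACE on (func, param).
function upsertAll(rows: Summary[]): Map<string, Summary> {
  const table = new Map<string, Summary>();
  for (const row of rows) table.set(`${row.func}:${row.paramIndex}`, row);
  return table;
}

const fromDefinition: Summary = { func: "id", paramIndex: 0, flowsToReturn: 1 };
// A forward declaration has no body, so nothing can flow to a return.
const fromForwardDecl: Summary = { func: "id", paramIndex: 0, flowsToReturn: 0 };

// Before the fix: the declaration processed last clobbers the correct row.
const clobbered = upsertAll([fromDefinition, fromForwardDecl]);
// After the fix: forward declarations are not visited at all.
const fixed = upsertAll([fromDefinition]);
```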
…y-compare

Two issues in scripts/parity-compare.mjs:

- readDataflowVerticesMultiset swallowed all SQL exceptions, hiding
  schema drift, malformed queries, and corrupt-DB errors; now only
  suppresses "no such table" (pre-v18 DBs), re-throwing everything else.

- dfVertexDiffs was always [] in JSON output when --dataflow was not
  passed, while dfVertexCount was absent (undefined). JSON consumers
  couldn't distinguish "comparison attempted, no diffs" from "comparison
  not attempted". dfVertexDiffs is now undefined when --dataflow is off,
  matching the dfVertexCount convention; ok computation updated to match.

Also clarifies the dataflow_fn view comment in src/db/migrations.ts:
the view INNER JOINs are intentional (only vertex-linked v18+ rows), not
a backward-compat bridge — migrating code from direct dataflow queries to
dataflow_fn would silently drop pre-v18 rows.
@carlos-alm carlos-alm force-pushed the feat/dataflow-vertex-schema-p0 branch from 6c8d044 to 38589c5 on June 19, 2026 02:26
@carlos-alm

Contributor Author

@greptileai

Comment thread scripts/parity-compare.mjs Outdated
Comment on lines +311 to +313
for (const d of dfVertexDiffs) {
  console.log(` [df-vertex] ${d.key} wasm=${d.base} ${variantName}=${d.other}`);
}

Contributor

P1 dfVertexDiffs is undefined when --dataflow is not passed. Iterating over undefined with for...of throws TypeError: undefined is not iterable. Any run without --dataflow that encounters node or edge divergences will crash at this line instead of printing the diff details.

Contributor Author

Fixed in 1e2e254 — wrapped the loop in an 'if (dfVertexDiffs)' guard so runs without --dataflow no longer throw when iterating.
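
The guard described here can be sketched as below. `formatDfVertexDiffs` is a hypothetical helper that returns the lines instead of logging them, which keeps the example testable; the message format follows the loop quoted in the review.

```typescript
interface Diff { key: string; base: number; other: number }

function formatDfVertexDiffs(diffs: Diff[] | undefined, variantName: string): string[] {
  const lines: string[] = [];
  // diffs is undefined when --dataflow was not passed; for...of over undefined would throw.
  if (diffs) {
    for (const d of diffs) {
      lines.push(` [df-vertex] ${d.key} wasm=${d.base} ${variantName}=${d.other}`);
    }
  }
  return lines;
}
```

Runs without `--dataflow` then fall through to the node and edge diff output instead of crashing.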

@carlos-alm carlos-alm merged commit 23ac553 into main Jun 19, 2026
33 checks passed
@carlos-alm carlos-alm deleted the feat/dataflow-vertex-schema-p0 branch June 19, 2026 03:30
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 19, 2026

Development

Successfully merging this pull request may close these issues.

fix(native): Rust connection.rs missing migration v17 (edges.technique column)

1 participant