perf(db): bulk CFG and dataflow DB writes via rusqlite by carlos-alm · Pull Request #653 · optave/ops-codegraph-tool · GitHub
Skip to content

perf(db): bulk CFG and dataflow DB writes via rusqlite#653

Merged
carlos-alm merged 11 commits into
mainfrom
perf/ast-node-bulk-insert
Mar 27, 2026
Merged

perf(db): bulk CFG and dataflow DB writes via rusqlite#653
carlos-alm merged 11 commits into
mainfrom
perf/ast-node-bulk-insert

Conversation

@carlos-alm

Copy link
Copy Markdown
Contributor

Summary

  • Add bulk_insert_cfg() and bulk_insert_dataflow() Rust napi functions that perform CFG block/edge and dataflow edge inserts via rusqlite in single transactions, bypassing JS iteration loops
  • Add native fast paths in buildCFGData() and buildDataflowEdges() that collect batches from pre-computed native extraction results and delegate to Rust bulk inserters
  • Both paths fall back to existing JS code when the native addon is unavailable (WASM engine)

Details

Same optimization pattern as bulk_insert_ast_nodes (6.9). The DB write phases for CFG (161ms) and dataflow (125ms) were identical JS code on both engines — this moves them to Rust when native engine is active.

Rust side:

  • cfg_db.rs — resolves function node IDs, deletes stale data, inserts blocks+edges with block index→row ID mapping
  • dataflow_db.rs — pre-builds node resolution cache (local-first, global fallback), inserts flows_to/returns/mutates edges

Target: cfgMs + dataflowMs < 50ms combined on native full builds (current ~286ms)

Test plan

  • CFG integration tests pass (11/11)
  • Dataflow integration tests pass (23/23)
  • CFG all-langs parser tests pass (40/40)
  • Dataflow JS parser tests pass (13/13)
  • Dataflow engine parity tests pass (13/13)
  • Biome lint clean on all changed files
  • Verify Rust compilation in CI (local MSVC linker env issue prevented native build)
  • Benchmark cfgMs + dataflowMs on native full builds after CI produces the addon

Move AST node SQLite inserts from per-row JS iteration to a single
native Rust transaction via napi-rs + rusqlite. The new
bulkInsertAstNodes function opens the DB directly from Rust,
pre-fetches parent node definitions, and inserts all rows in one
transaction — eliminating the JS-native FFI overhead per row.

The JS-side buildAstNodes tries the native fast path first (when
all files have native astNodes arrays), falling back to the existing
JS loop for WASM or mixed-engine scenarios.

Target: astMs < 50ms on native full builds (was ~393ms).
The Rust connection omitted busy_timeout = 5000 which the JS-side
connection.ts sets. Without it, SQLITE_BUSY is returned immediately
on WAL contention instead of retrying for 5 seconds.
)

bulkInsertAstNodes returns 0 for both "nothing to insert" and hard
errors (DB open failure, SQLITE_BUSY, etc). Compare expected vs actual
count and fall through to the JS path on mismatch so errors don't
silently drop all AST nodes.
Explain why bundled is intentional: Windows CI lacks system SQLite,
and dual-instance WAL coordination is OS-safe.
The Rust find_parent_id skipped definitions with end_line = NULL,
but the JS findParentDef treats them as always-enclosing with a
negative span (preferred over wider defs). This caused parent_node_id
mismatches between native and JS paths.
Return 0 immediately on any insert_stmt.execute() failure so the
transaction drops and rolls back, ensuring all-or-nothing semantics.
Previously, .is_ok() silently swallowed row-level errors which could
commit partial data and misfire the JS fallback causing duplicate rows.
Move CFG block/edge and dataflow edge inserts from JS iteration to Rust
bulk operations, following the same pattern as bulk_insert_ast_nodes (6.9).

Rust side:
- cfg_db.rs: bulk_insert_cfg() resolves function node IDs, deletes stale
  data, inserts blocks+edges in a single rusqlite transaction
- dataflow_db.rs: bulk_insert_dataflow() pre-builds node resolution cache
  (local-first, global fallback), inserts edges in a single transaction

JS side:
- cfg.ts: native fast path collects CfgFunctionBatch[] and delegates to
  Rust when all CFG is pre-computed by the native engine
- dataflow.ts: native fast path converts DataflowResult (argFlows,
  assignments, mutations) into FileDataflowBatch[] for Rust insertion
- Both fall back to existing JS paths when native addon is unavailable

Target: cfgMs + dataflowMs < 50ms combined (from ~286ms with JS iteration)
@greptile-apps

greptile-apps Bot commented Mar 27, 2026

Copy link
Copy Markdown
Contributor

Comment thread src/features/cfg.ts Outdated
Comment on lines +357 to +363
if (!needsJsFallback) {
const processed = native.bulkInsertCfg(db.name, batches);
const withBlocks = batches.filter((b) => b.blocks.length > 0).length;
if (processed > 0) {
info(`CFG: ${withBlocks} functions analyzed (native bulk)`);
}
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent data loss on bulk-insert failure — no JS fallback

bulk_insert_cfg returning 0 indicates a Rust-side error (DB connection failure, missing table, failed transaction, etc.). The current code unconditionally returns after calling it, meaning any such failure silently drops all CFG writes without attempting the JS fallback path. This diverges from the established pattern in ast.ts, which explicitly checks the count and falls through on failure.

Compare ast.ts:

if (inserted === expectedNodes) {
  debug(`AST extraction (native bulk): ${inserted} nodes stored`);
  return;
}
debug(`AST extraction (native bulk): expected ${expectedNodes}, got ${inserted} — falling back to JS`);
// fall through to JS path

The CFG path should follow the same pattern — check the return value and only return on success:

if (!needsJsFallback) {
  const processed = native.bulkInsertCfg(db.name, batches);
  const withBlocks = batches.filter((b) => b.blocks.length > 0).length;
  if (processed > 0 || withBlocks === 0) {
    if (processed > 0) info(`CFG: ${withBlocks} functions analyzed (native bulk)`);
    return;
  }
  debug(`CFG: bulk insert returned 0 — falling back to JS`);
  // fall through
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — cfg.ts now checks processed === batches.length and falls through to the JS path on mismatch, matching the ast.ts reference pattern. Also handles the empty-blocks edge case by checking expectedFunctions === 0.

Comment thread src/features/dataflow.ts
Comment on lines +333 to +337
if (!needsJsFallback) {
const inserted = native.bulkInsertDataflow(db.name, batches);
info(`Dataflow: ${inserted} edges inserted (native bulk)`);
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent data loss on bulk-insert failure — no JS fallback

Same issue as in cfg.ts: bulk_insert_dataflow returning 0 silently drops all dataflow writes without attempting the JS fallback. If all batches have edges but Rust returns 0, no edges are written and no fallback is attempted while the caller observes a clean return.

The ast.ts pattern protects against this by checking inserted === expectedNodes. The dataflow path should do the same:

if (!needsJsFallback) {
  const inserted = native.bulkInsertDataflow(db.name, batches);
  const expectedEdges = batches.reduce((s, b) => s + b.edges.length, 0);
  if (inserted === expectedEdges || expectedEdges === 0) {
    info(`Dataflow: ${inserted} edges inserted (native bulk)`);
    return;
  }
  debug(
    `Dataflow: bulk insert expected ${expectedEdges} edges, got ${inserted} — falling back to JS`,
  );
  // fall through to JS path
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — dataflow.ts now checks inserted === expectedEdges and falls through to the JS path on mismatch, matching the ast.ts reference pattern.

Match the ast.ts reference pattern: check the Rust return value
against the expected count and fall through to the JS path when
they disagree, preventing silent data loss on Rust-side errors.
@carlos-alm

Copy link
Copy Markdown
Contributor Author

Addressed all Greptile feedback in 1a20f46:

  • P1 (cfg.ts): Added count-mismatch fallback — processed === batches.length check before returning, with debug() log and fall-through to JS on mismatch.
  • P1 (dataflow.ts): Same pattern — inserted === expectedEdges check before returning.
  • P2 (cfg_db.rs empty-blocks count): The JS-side check now compares against batches.length (total functions including empty-block ones), so the Rust return value aligns correctly. The empty-blocks edge case is handled by the expectedFunctions === 0 guard.

All 11 CFG and 23 dataflow integration tests pass.

@carlos-alm

Copy link
Copy Markdown
Contributor Author

@carlos-alm carlos-alm merged commit b302648 into main Mar 27, 2026
27 checks passed
@carlos-alm carlos-alm deleted the perf/ast-node-bulk-insert branch March 27, 2026 07:51
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 27, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant