cuda.core: graph slot table for node attachment lifetimes (draft) by Andy-Jost · Pull Request #2280 · NVIDIA/cuda-python · GitHub
Skip to content

cuda.core: graph slot table for node attachment lifetimes (draft)#2280

Draft
Andy-Jost wants to merge 9 commits into
NVIDIA:mainfrom
Andy-Jost:ajost/graph-slots
Draft

cuda.core: graph slot table for node attachment lifetimes (draft)#2280
Andy-Jost wants to merge 9 commits into
NVIDIA:mainfrom
Andy-Jost:ajost/graph-slots

Conversation

@Andy-Jost

Copy link
Copy Markdown
Contributor

Summary

Draft PR for self-review of the graph slot-table work that was held back while #2008 and #2026 landed.

This branch is rebased onto current main and currently includes the GraphBuilder.graph_definition commit from #2026 as well (intentional duplication for review). After #2026 merges, I plan to rebase again to drop that overlap before marking this ready for review.

The slot table uses CUDA graph user objects to retain resources attached to graph nodes through instantiation and launch:

  • Kernel launch arguments and kernel handles
  • Event record/wait nodes
  • Host callbacks (capture and explicit construction paths)
  • Memcpy/memset operands, including Buffer inputs with optional dst_owner / src_owner overrides

Changes

  • Add slot-table infrastructure in resource_handles (graph_set_slot, make_opaque_py, make_opaque_malloc).
  • Wire graph node creation paths in _graph_node.pyx to populate slots.
  • Split host-callback attachment into _host_callback.pyx (replacing _utils.pyx).
  • Add _weak_handles.pyx test helper for observing DevicePtrHandle lifetimes.
  • Expand test_graph_definition_lifetime.py with retention and capture-path coverage.

Test plan

  • Local pip install -v . in cuda_core
  • pytest cuda_core/tests/graph/test_graph_definition_lifetime.py -v
  • Full pytest cuda_core/tests/graph/ -v on a GPU machine

Related

Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit
`GraphDefinition` view that shares ownership of the underlying `CUgraph`.
The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the
user-facing surface and locks in the state-guard rules.

State semantics:

- PRIMARY builder: only valid after `end_building()`. Before
  `begin_building()` no graph exists; during capture the driver is the
  sole writer, so explicit access is unsafe.
- CONDITIONAL_BODY builder: valid both before `begin_building()` (the
  body graph is allocated at conditional-node creation time) and after
  `end_building()`. This enables a hybrid flow where a conditional body
  is populated entirely via the explicit API, with no capture at all.
- FORKED builder: never valid. Forked builders share the primary's
  graph; access through the primary instead.

Tests cover the happy path, both hybrid flows on conditional bodies
(populate-via-explicit-API and capture-then-augment), the three error
states (forked, capturing, primary pre-capture), and the
shared-ownership guarantee (the `GraphDefinition` survives the
builder's `close()`).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost added this to the cuda.core next milestone Jun 30, 2026
@Andy-Jost Andy-Jost added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Jun 30, 2026
@Andy-Jost Andy-Jost self-assigned this Jun 30, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

@Andy-Jost

Copy link
Copy Markdown
Contributor Author

/ok to test

Introduce OpaqueHandle and a per-graph slot table retained on the CUgraph
as a user object, preparing to replace ad-hoc per-resource user objects when
wiring graph node attachments in a follow-up change.
@github-actions

Copy link
Copy Markdown

Replace the per-resource CUDA user objects attached at each graph node
with the per-graph slot table from phase 1. Kernel, event-record,
event-wait, and host-callback nodes now store their owning handles in
node slots via graph_set_slot. Stream-captured callbacks map the
just-captured host node from cuStreamGetCaptureInfo and use the same
path; forked builders share the primary's graph handle so their
attachments reach the same table.

Refine the phase 1 surface to support this: the slot table is created
lazily on first attachment, so conditional-branch bodies (ref handles)
get one too, and graph_set_slot returns CUresult for HANDLE_RETURN-style
error checking. Removes _attach_user_object and the per-type heap-copy
deleters.
Rename graph/_utils to graph/_host_callback now that it holds only
host-callback machinery (the trampoline, _is_py_host_trampoline, and
_resolve_host_callback), matching the concept-named files around it, and
update the three cimport sites. Add _attach_host_callback_owners to share
the "callback -> slot 0, user_data -> slot 1" attachment between the eager
(GN_callback) and capture (add_callback) paths. Guard a zero-length
user_data copy against malloc(0) and hoist the per-call ctypes import.

Attach the kernel-argument tuple to the kernel node's slot 1 so the Python
objects backing the arguments -- notably device Buffers -- outlive the
graph. The driver copies argument values into the node at add time but does
not keep the referenced device memory alive, so without this a kernel node
could be left with a stale device pointer. This is the slot-table port of
the user-object fix from NVIDIA#2041 (currently only on main).
GraphNode.memcpy/memset (and the GraphDefinition pass-throughs) now accept a
Buffer or a raw int for each address. A new _resolve_ptr helper reads the
device pointer from a Buffer and returns it as an owner; a raw int casts
through with no owner. GN_memcpy attaches a Buffer dst to slot 0 and src to
slot 1, and GN_memset attaches dst to slot 0, so buffers passed by value
outlive the graph. Raw ints behave exactly as before (caller owns the
lifetime), so this is backward compatible.

Document the stream-capture lifetime contract on GraphBuilder: operations
recorded during capture reference caller-owned memory and are not retained,
unlike explicit GraphDefinition construction. Host callbacks are the one
exception, retained on both the capture and explicit paths.
… capture callbacks

Cover GraphDefinition memset/memcpy with Buffer operands (including clone),
and GraphBuilder capture host callbacks retained after dropping Python refs.
Keyword-only *_owner args retain arbitrary objects for raw pointer
operands; Buffer+owner combinations are rejected. Strengthen owner tests
with weakref retention checks and add src_owner rejection test.
Store DevicePtrHandle in slot table instead of Buffer wrappers so
reset/close cannot release memory while a graph still references it.
Add test-only weak_handle() for deterministic allocation lifetime
checks and extend graph lifetime tests accordingly.
@Andy-Jost Andy-Jost force-pushed the ajost/graph-slots branch from 13dab04 to 621ade8 Compare June 30, 2026 17:29
@Andy-Jost

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module feature New feature or request P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant