{{ message }}
cuda.core: graph slot table for node attachment lifetimes (draft)#2280
Draft
Andy-Jost wants to merge 9 commits into
Draft
cuda.core: graph slot table for node attachment lifetimes (draft)#2280Andy-Jost wants to merge 9 commits into
Andy-Jost wants to merge 9 commits into
Conversation
Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit `GraphDefinition` view that shares ownership of the underlying `CUgraph`. The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the user-facing surface and locks in the state-guard rules. State semantics: - PRIMARY builder: only valid after `end_building()`. Before `begin_building()` no graph exists; during capture the driver is the sole writer, so explicit access is unsafe. - CONDITIONAL_BODY builder: valid both before `begin_building()` (the body graph is allocated at conditional-node creation time) and after `end_building()`. This enables a hybrid flow where a conditional body is populated entirely via the explicit API, with no capture at all. - FORKED builder: never valid. Forked builders share the primary's graph; access through the primary instead. Tests cover the happy path, both hybrid flows on conditional bodies (populate-via-explicit-API and capture-then-augment), the three error states (forked, capturing, primary pre-capture), and the shared-ownership guarantee (the `GraphDefinition` survives the builder's `close()`). Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor
Contributor
Author
|
/ok to test |
Introduce OpaqueHandle and a per-graph slot table retained on the CUgraph as a user object, preparing to replace ad-hoc per-resource user objects when wiring graph node attachments in a follow-up change.
|
Replace the per-resource CUDA user objects attached at each graph node with the per-graph slot table from phase 1. Kernel, event-record, event-wait, and host-callback nodes now store their owning handles in node slots via graph_set_slot. Stream-captured callbacks map the just-captured host node from cuStreamGetCaptureInfo and use the same path; forked builders share the primary's graph handle so their attachments reach the same table. Refine the phase 1 surface to support this: the slot table is created lazily on first attachment, so conditional-branch bodies (ref handles) get one too, and graph_set_slot returns CUresult for HANDLE_RETURN-style error checking. Removes _attach_user_object and the per-type heap-copy deleters.
Rename graph/_utils to graph/_host_callback now that it holds only host-callback machinery (the trampoline, _is_py_host_trampoline, and _resolve_host_callback), matching the concept-named files around it, and update the three cimport sites. Add _attach_host_callback_owners to share the "callback -> slot 0, user_data -> slot 1" attachment between the eager (GN_callback) and capture (add_callback) paths. Guard a zero-length user_data copy against malloc(0) and hoist the per-call ctypes import. Attach the kernel-argument tuple to the kernel node's slot 1 so the Python objects backing the arguments -- notably device Buffers -- outlive the graph. The driver copies argument values into the node at add time but does not keep the referenced device memory alive, so without this a kernel node could be left with a stale device pointer. This is the slot-table port of the user-object fix from NVIDIA#2041 (currently only on main).
GraphNode.memcpy/memset (and the GraphDefinition pass-throughs) now accept a Buffer or a raw int for each address. A new _resolve_ptr helper reads the device pointer from a Buffer and returns it as an owner; a raw int casts through with no owner. GN_memcpy attaches a Buffer dst to slot 0 and src to slot 1, and GN_memset attaches dst to slot 0, so buffers passed by value outlive the graph. Raw ints behave exactly as before (caller owns the lifetime), so this is backward compatible. Document the stream-capture lifetime contract on GraphBuilder: operations recorded during capture reference caller-owned memory and are not retained, unlike explicit GraphDefinition construction. Host callbacks are the one exception, retained on both the capture and explicit paths.
… capture callbacks Cover GraphDefinition memset/memcpy with Buffer operands (including clone), and GraphBuilder capture host callbacks retained after dropping Python refs.
Keyword-only *_owner args retain arbitrary objects for raw pointer operands; Buffer+owner combinations are rejected. Strengthen owner tests with weakref retention checks and add src_owner rejection test.
Store DevicePtrHandle in slot table instead of Buffer wrappers so reset/close cannot release memory while a graph still references it. Add test-only weak_handle() for deterministic allocation lifetime checks and extend graph lifetime tests accordingly.
13dab04 to
621ade8
Compare
Contributor
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Draft PR for self-review of the graph slot-table work that was held back while #2008 and #2026 landed.
This branch is rebased onto current
mainand currently includes theGraphBuilder.graph_definitioncommit from #2026 as well (intentional duplication for review). After #2026 merges, I plan to rebase again to drop that overlap before marking this ready for review.The slot table uses CUDA graph user objects to retain resources attached to graph nodes through instantiation and launch:
Bufferinputs with optionaldst_owner/src_owneroverridesChanges
resource_handles(graph_set_slot,make_opaque_py,make_opaque_malloc)._graph_node.pyxto populate slots._host_callback.pyx(replacing_utils.pyx)._weak_handles.pyxtest helper for observingDevicePtrHandlelifetimes.test_graph_definition_lifetime.pywith retention and capture-path coverage.Test plan
pip install -v .incuda_corepytest cuda_core/tests/graph/test_graph_definition_lifetime.py -vpytest cuda_core/tests/graph/ -von a GPU machineRelated
graph_definition).