Custom-Devin-Demos · devin-ai-integration · Dec 5, 2025
diff --git a/.github/workflows/test-wheel-linux.yml b/.github/workflows/test-wheel-linux.yml
@@ -261,6 +261,13 @@ jobs:
           LOCAL_CTK: ${{ matrix.LOCAL_CTK }}
         run: run-tests core
 
+      - name: Run cuda.core performance regression tests
+        if: ${{ matrix.GPU == 'a100' }}
+        run: |
+          pushd ./cuda_core
+          pytest -v tests/test_performance_regression.py -m performance
+          popd
+
       - name: Ensure cuda-python installable
         run: |
           if [[ "${{ matrix.LOCAL_CTK }}" == 1 ]]; then

diff --git a/PERFORMANCE_THRESHOLDS.md b/PERFORMANCE_THRESHOLDS.md
@@ -0,0 +1,88 @@
+# Performance Regression Test Thresholds
+
+This document describes the performance thresholds used in the cuda.core performance regression tests and provides guidance on how to interpret and update them.
+
+## Overview
+
+Performance regression tests are critical for maintaining the quality of the cuda.core API. These tests detect when changes to the codebase inadvertently slow down critical CUDA operations, allowing developers to catch and fix performance issues before they reach production.
+
+## Current Thresholds
+
+### Kernel Launch Overhead
+
+The kernel launch test measures the time to launch a minimal empty kernel, which represents the pure overhead of the kernel launch mechanism.
+
+**Threshold: 50 microseconds per launch**
+
+This threshold was determined by benchmarking on A100 GPU hardware. A well-optimized kernel launch path should complete in under 20 microseconds on modern hardware, so the 50 microsecond threshold leaves a margin of at least 2.5x for normal system variance while still catching major regressions.
+
+### Memory Transfer Bandwidth
+
+The memory transfer test measures host-to-device transfer bandwidth for various buffer sizes.
+
+| Buffer Size | Threshold (GB/s) | Rationale |
+|-------------|------------------|-----------|
+| 1 KB        | 0.1              | Small transfers are dominated by fixed per-transfer overhead, not bandwidth |
+| 1 MB        | 5                | Medium transfers begin to show bandwidth characteristics |
+| 64 MB       | 10               | Large transfers should achieve significant bandwidth |
+
+These thresholds are calibrated for an A100 GPU with PCIe Gen4 connectivity. A PCIe Gen4 x16 link has a theoretical peak of roughly 32 GB/s in each direction, but practical measurements are typically lower due to system overhead, DMA setup time, and other factors.
+
+## Why A100 Hardware?
+
+Performance tests run exclusively on A100 GPUs for several reasons:
+
+1. **Consistency**: Running on a single GPU model avoids false positives caused by hardware variation
+2. **Availability**: A100 GPUs are available in the CI infrastructure on arm64 architecture
+3. **Representative**: A100 represents modern datacenter GPU performance characteristics
+4. **Stability**: A100 drivers and hardware are mature and stable
+
+## Updating Thresholds
+
+Thresholds may need to be updated in the following scenarios:
+
+### Hardware Changes
+
+If the CI infrastructure moves to different GPU hardware, thresholds should be recalibrated:
+
+1. Run the performance tests multiple times on the new hardware
+2. Calculate the mean and standard deviation of results
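+3. Set thresholds approximately 3 standard deviations on the failing side of the mean: mean + 3 standard deviations for upper limits (launch time), mean - 3 standard deviations for lower limits (bandwidth)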
+4. Document the new hardware and calibration methodology
+
+### Legitimate Performance Improvements
+
+If code changes legitimately improve performance, thresholds can be tightened:
+
+1. Verify the improvement is consistent across multiple runs
+2. Update thresholds to reflect the new baseline
+3. Document the change and the improvement that enabled it
+
+### False Positives
+
+If tests fail due to system noise rather than actual regressions:
+
+1. Investigate whether the failure is reproducible
+2. Consider increasing the number of iterations to reduce variance
+3. If necessary, slightly relax thresholds with documentation
+
+## Running Performance Tests Locally
+
+To run performance tests locally:
+
+```bash
+cd cuda_core
+pytest -v tests/test_performance_regression.py -m performance
+```
+
+Note that local results may differ from CI results due to hardware differences.
+
+## Adding New Performance Tests
+
+When adding new performance tests:
+
+1. Use the `@pytest.mark.performance` marker
+2. Document the threshold rationale in this file
+3. Ensure tests run on consistent hardware (A100 in CI)
+4. Include warm-up iterations to avoid cold-start effects
+5. Use sufficient iterations to reduce measurement variance
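The recalibration steps under "Hardware Changes" (repeat the runs, compute the mean and standard deviation, place the threshold three standard deviations from the mean) could be sketched roughly as follows. The helper name `calibrate_threshold`, its parameters, and the sample values are hypothetical and are not part of this PR; the sample numbers are made up for illustration and are not measurements. The margin is added for metrics where lower is better (launch time) and subtracted where higher is better (bandwidth).

```python
import statistics


def calibrate_threshold(samples, higher_is_better, num_sigmas=3.0):
    """Suggest a pass/fail threshold from repeated measurements.

    Hypothetical helper: places the threshold num_sigmas sample standard
    deviations away from the mean, on the side where a regression would fall.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a standard deviation")
    mean = statistics.mean(samples)
    margin = num_sigmas * statistics.stdev(samples)
    # Bandwidth regresses downward, so its limit is a floor below the mean;
    # latency regresses upward, so its limit is a ceiling above the mean.
    return mean - margin if higher_is_better else mean + margin


# Made-up sample data for illustration only (not measured on any GPU).
launch_times_us = [10.0, 12.0, 11.0, 13.0, 14.0]
bandwidths_gb_s = [20.0, 22.0, 21.0, 23.0, 24.0]

launch_limit_us = calibrate_threshold(launch_times_us, higher_is_better=False)
bandwidth_floor_gb_s = calibrate_threshold(bandwidths_gb_s, higher_is_better=True)
```

With real data, the suggested values would then be rounded to a convenient number and recorded in PERFORMANCE_THRESHOLDS.md along with the hardware they were measured on.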
diff --git a/cuda_core/tests/conftest.py b/cuda_core/tests/conftest.py
@@ -154,3 +154,7 @@ def mempool_device_x3():
 
 
 skipif_need_cuda_headers = pytest.mark.skipif(helpers.CUDA_INCLUDE_PATH is None, reason="need CUDA header")
+
+
+def pytest_configure(config):
+    config.addinivalue_line("markers", "performance: mark test as a performance regression test")
diff --git a/cuda_core/tests/test_performance_regression.py b/cuda_core/tests/test_performance_regression.py
@@ -0,0 +1,103 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+import time
+
+import numpy as np
+import pytest
+from cuda.core.experimental import Device, LaunchConfig, LegacyPinnedMemoryResource, Program, launch
+
+
+@pytest.mark.performance
+def test_kernel_launch_performance_regression(init_cuda):
+    """Test that kernel launch performance doesn't regress.
+
+    This test measures the overhead of launching a minimal empty kernel
+    to detect any regressions in the kernel launch path.
+    """
+    device = Device()
+    stream = device.create_stream()
+
+    kernel_code = """
+    extern "C" __global__ void empty_kernel() {
+        // No operations, just measure launch overhead
+    }
+    """
+
+    prog = Program(kernel_code, "c++")
+    obj = prog.compile("cubin")
+    kernel = obj.get_kernel("empty_kernel")
+
+    num_launches = 1000
+    config = LaunchConfig(grid=1, block=1)
+
+    # Warm up so that one-time initialization (e.g. lazy module loading) is not timed
+    launch(stream, config, kernel)
+    stream.sync()
+    start = time.perf_counter()
+    for _ in range(num_launches):
+        launch(stream, config, kernel)
+    stream.sync()
+    end = time.perf_counter()
+    avg_launch_time_us = (end - start) * 1e6 / num_launches
+
+    # Threshold (50 us) is calibrated for A100 hardware; see PERFORMANCE_THRESHOLDS.md
+    assert avg_launch_time_us < 50, f"Kernel launch too slow: {avg_launch_time_us:.2f}us > 50us"
+
+
+@pytest.mark.performance
+def test_memory_transfer_performance_regression(init_cuda):
+    """Test that memory transfer performance doesn't regress.
+
+    This test measures host-to-device memory transfer bandwidth for
+    various buffer sizes to detect any regressions in the memory
+    transfer path.
+    """
+    device = Device()
+    stream = device.create_stream()
+
+    # Test different transfer sizes with corresponding bandwidth thresholds (GB/s)
+    # Thresholds are calibrated for A100 GPU with PCIe Gen4
+    # See PERFORMANCE_THRESHOLDS.md for details on threshold selection
+    test_cases = [
+        (1024, 0.1),  # 1KB - small transfers have high overhead, low effective bandwidth
+        (1024 * 1024, 5),  # 1MB - medium transfers
+        (64 * 1024 * 1024, 10),  # 64MB - large transfers should achieve higher bandwidth
+    ]
+
+    pinned_mr = LegacyPinnedMemoryResource()
+
+    for size, threshold in test_cases:
+        # Allocate pinned host memory and device memory
+        host_buffer = pinned_mr.allocate(size)
+        device_buffer = device.allocate(size, stream=stream)
+
+        # Initialize host data
+        host_array = np.from_dlpack(host_buffer).view(np.uint8)
+        host_array[:] = np.random.randint(0, 256, size, dtype=np.uint8)
+
+        # Scale iterations down for large buffers, but keep at least 10 to limit timing noise
+        num_iterations = max(10, 100 // max(1, size // (1024 * 1024)))
+
+        # Warm up
+        device_buffer.copy_from(host_buffer, stream=stream)
+        stream.sync()
+
+        # Benchmark host-to-device transfer
+        start = time.perf_counter()
+        for _ in range(num_iterations):
+            device_buffer.copy_from(host_buffer, stream=stream)
+        stream.sync()
+        end = time.perf_counter()
+
+        bytes_transferred = size * num_iterations
+        bandwidth_gb_s = bytes_transferred / (end - start) / 1e9
+
+        # Clean up
+        host_buffer.close()
+        device_buffer.close()
+
+        size_str = f"{size // 1024}KB" if size < 1024 * 1024 else f"{size // (1024 * 1024)}MB"
+        assert bandwidth_gb_s > threshold, (
+            f"Host-to-device transfer too slow for {size_str}: {bandwidth_gb_s:.2f} GB/s < {threshold} GB/s"
+        )
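Both tests follow the same measurement pattern that PERFORMANCE_THRESHOLDS.md recommends for new tests: warm up, run many iterations, synchronize, then divide the elapsed time. A minimal sketch of that pattern as a reusable helper is shown below. The name `time_per_call_us` and its parameters are hypothetical and do not exist in this PR or in cuda.core; the example call uses a plain Python callable so that the block runs without a GPU.

```python
import time


def time_per_call_us(fn, iterations=1000, warmup=10, sync=None):
    """Return the average wall-clock time per call of fn, in microseconds.

    Hypothetical helper. `sync` is an optional callable that blocks until
    asynchronously queued work has finished (for example a stream's sync).
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    # Warm-up calls absorb one-time costs so they are not measured.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain warm-up work before the timer starts
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    if sync is not None:
        sync()  # include completion of all queued work in the timing
    elapsed = time.perf_counter() - start
    return elapsed * 1e6 / iterations


# CPU-only demonstration with a trivial callable.
calls = []
avg_us = time_per_call_us(lambda: calls.append(1), iterations=100, warmup=5)
```

In the kernel launch test above, the equivalent call would presumably look like `time_per_call_us(lambda: launch(stream, config, kernel), sync=stream.sync)`, using the `launch`, `stream`, `config`, and `kernel` names already defined in that test.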