Add performance regression tests for cuda.core by devin-ai-integration[bot] · Pull Request #2 · Custom-Devin-Demos/cuda-python · GitHub
7 changes: 7 additions & 0 deletions .github/workflows/test-wheel-linux.yml
88 changes: 88 additions & 0 deletions PERFORMANCE_THRESHOLDS.md
@@ -0,0 +1,88 @@
# Performance Regression Test Thresholds

This document describes the performance thresholds used in the cuda.core performance regression tests and provides guidance on how to interpret and update them.

## Overview

Performance regression tests are critical for maintaining the quality of the cuda.core API. These tests detect when changes to the codebase inadvertently slow down critical CUDA operations, allowing developers to catch and fix performance issues before they reach production.

## Current Thresholds

### Kernel Launch Overhead

The kernel launch test measures the time to launch a minimal empty kernel, which represents the pure overhead of the kernel launch mechanism.

**Threshold: 50 microseconds per launch**

This threshold was determined by benchmarking on A100 GPU hardware. A well-optimized kernel launch path should complete in under 20 microseconds on modern hardware, so the 50 microsecond threshold leaves room for normal system variance while still catching major regressions.
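
A minimal sketch of how such a check could be structured. The `launch` callable is a hypothetical stand-in for the real cuda.core launch (plus whatever synchronization the test needs), and the helper name `mean_time_us` and the iteration counts are illustrative assumptions, not taken from the test file.

```python
import time

THRESHOLD_US = 50.0  # per-launch threshold from this document


def mean_time_us(fn, warmup=10, iterations=1000):
    """Return the mean wall-clock time per call of fn, in microseconds."""
    # Warm-up calls absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6


def launch():
    """Hypothetical stand-in for launching an empty kernel."""
    pass


per_launch_us = mean_time_us(launch)
within_threshold = per_launch_us < THRESHOLD_US
print(f"{per_launch_us:.3f} us per call, within threshold: {within_threshold}")
```

Averaging over many iterations keeps timer resolution and one-off scheduling hiccups from dominating a measurement that is only a few microseconds long.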

### Memory Transfer Bandwidth

The memory transfer test measures host-to-device transfer bandwidth for various buffer sizes.

| Buffer Size | Minimum Bandwidth (GB/s) | Rationale |
|-------------|--------------------------|-----------|
| 1 KB | 0.1 | Small transfers are dominated by fixed per-transfer overhead, not bandwidth |
| 1 MB | 5 | Medium transfers begin to show bandwidth characteristics |
| 64 MB | 10 | Large transfers should achieve significant bandwidth |

These thresholds are calibrated for an A100 GPU with PCIe Gen4 connectivity. PCIe Gen4 x16 has a theoretical peak of roughly 32 GB/s per direction, but practical measurements are typically lower due to system overhead, DMA setup time, and other factors.
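
As a rough illustration of how a measured transfer time maps onto the table above, the sketch below converts bytes and seconds into GB/s and compares the result with the threshold for that buffer size. The function names are hypothetical, and the unit conventions (1 KB = 1024 bytes, 1 GB/s = 1e9 bytes per second) are assumptions; the test file may use different ones.

```python
# Thresholds from the table above, keyed by buffer size in bytes.
# Assumption: KB/MB are 1024-based and GB/s means 1e9 bytes per second.
THRESHOLDS_GBPS = {
    1024: 0.1,
    1024**2: 5.0,
    64 * 1024**2: 10.0,
}


def bandwidth_gbps(num_bytes, seconds):
    """Convert a transfer size and elapsed time into GB/s."""
    return num_bytes / seconds / 1e9


def meets_threshold(num_bytes, seconds):
    """True if the measured bandwidth is at or above the threshold for this size."""
    return bandwidth_gbps(num_bytes, seconds) >= THRESHOLDS_GBPS[num_bytes]


# Example: a 64 MB copy that takes 4 ms.
size = 64 * 1024**2
print(f"{bandwidth_gbps(size, 4e-3):.2f} GB/s")  # prints "16.78 GB/s"
print(meets_threshold(size, 4e-3))               # prints "True"
```

Under these conventions, the 1 KB threshold of 0.1 GB/s corresponds to about 10 microseconds per transfer, which is why that row reflects fixed overhead rather than link bandwidth.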

## Why A100 Hardware?

Performance tests run exclusively on A100 GPUs for several reasons:

1. **Consistency**: Running on consistent hardware eliminates false positives from hardware variations
2. **Availability**: A100 GPUs are available in the CI infrastructure on arm64 architecture
3. **Representative**: A100 represents modern datacenter GPU performance characteristics
4. **Stability**: A100 drivers and hardware are mature and stable

## Updating Thresholds

Thresholds may need to be updated in the following scenarios:

### Hardware Changes

If the CI infrastructure moves to different GPU hardware, thresholds should be recalibrated:

1. Run the performance tests multiple times on the new hardware
2. Calculate the mean and standard deviation of results
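3. Set latency thresholds (upper bounds) at approximately mean + 3 standard deviations, and bandwidth thresholds (lower bounds) at approximately mean - 3 standard deviations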
4. Document the new hardware and calibration methodology
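
The calibration steps can be sketched as follows. The `calibrate` helper and the sample values are made up for illustration. For a metric where higher is better, such as bandwidth, the margin is subtracted rather than added, since the threshold acts as a lower bound.

```python
import statistics


def calibrate(samples, higher_is_better=False, k=3.0):
    """Suggest a threshold k standard deviations away from the sample mean."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation; needs >= 2 samples
    if higher_is_better:
        # Lower bound, e.g. a minimum acceptable bandwidth.
        return mean - k * stdev
    # Upper bound, e.g. a maximum acceptable latency.
    return mean + k * stdev


# Made-up launch latencies in microseconds, for illustration only.
latencies_us = [18.0, 20.0, 22.0, 19.0, 21.0]
print(f"latency threshold: {calibrate(latencies_us):.2f} us")

# Made-up bandwidth samples in GB/s, for illustration only.
bandwidths_gbps = [20.0, 21.0, 19.0, 20.5, 19.5]
print(f"bandwidth threshold: {calibrate(bandwidths_gbps, higher_is_better=True):.2f} GB/s")
```

More runs give a more trustworthy standard deviation; with only a handful of samples the suggested threshold should be treated as a starting point.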

### Legitimate Performance Improvements

If code changes legitimately improve performance, thresholds can be tightened:

1. Verify the improvement is consistent across multiple runs
2. Update thresholds to reflect the new baseline
3. Document the change and the improvement that enabled it

### False Positives

If tests fail due to system noise rather than actual regressions:

1. Investigate whether the failure is reproducible
2. Consider increasing the number of iterations to reduce variance
3. If necessary, relax thresholds slightly and document the reason

## Running Performance Tests Locally

To run performance tests locally:

```bash
cd cuda_core
pytest -v tests/test_performance_regression.py -m performance
```

Note that local results may differ from CI results due to hardware differences.

## Adding New Performance Tests

When adding new performance tests:

1. Use the `@pytest.mark.performance` marker
2. Document the threshold rationale in this file
3. Ensure tests run on consistent hardware (A100 in CI)
4. Include warm-up iterations to avoid cold-start effects
5. Use sufficient iterations to reduce measurement variance
4 changes: 4 additions & 0 deletions cuda_core/tests/conftest.py
@@ -154,3 +154,7 @@ def mempool_device_x3():


skipif_need_cuda_headers = pytest.mark.skipif(helpers.CUDA_INCLUDE_PATH is None, reason="need CUDA header")


def pytest_configure(config):
    config.addinivalue_line("markers", "performance: mark test as a performance regression test")
103 changes: 103 additions & 0 deletions cuda_core/tests/test_performance_regression.py