PR08: C++ Platform Backends #10441
Open
agibsonccc wants to merge 7 commits into
Part of the 22-PR split of the `ag_new_release_updates_2` branch. Merge layer: 3 (native features). Files: 382. See `pr-plans/00-master-plan.md` for the full split plan and merge order.
This was referenced Jun 15, 2026
… debug output
Remove the debug printf("F2 opType:[%i]\n", opNum) blocks from
NativeOpExecutioner_indexreduce.cu and NativeOpExecutioner_reduce.cu entirely,
following the no-ad-hoc-printf rule for C++.
Gate all printf diagnostic output in the llamacpp backend behind
sd::Environment::getInstance().isVerbose(): printAllBackendCapabilities()
in backend_capabilities.cpp, and printArrayLocalityInfo() plus the locality
mismatch warning in device_locality.cpp.
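
A minimal sketch of the gating pattern this commit describes, assuming a process-wide environment singleton that exposes an `isVerbose()` flag. The `Environment` class below is a hypothetical stand-in for `sd::Environment`, and `printLocalityWarning` is an illustrative name rather than a function from this PR.

```cpp
#include <cstdio>

// Hypothetical stand-in for sd::Environment: a process-wide singleton
// that holds a verbosity flag. Not the real libnd4j class.
class Environment {
 public:
  static Environment& getInstance() {
    static Environment instance;  // constructed once on first use
    return instance;
  }
  bool isVerbose() const { return verbose_; }
  void setVerbose(bool v) { verbose_ = v; }

 private:
  Environment() = default;
  bool verbose_ = false;
};

// Illustrative diagnostic: prints only when verbose mode is enabled.
// Returns true if it printed, so the gate can be observed by callers.
bool printLocalityWarning(int expectedDevice, int actualDevice) {
  if (!Environment::getInstance().isVerbose()) return false;
  std::printf("locality mismatch: expected device %d, got %d\n",
              expectedDevice, actualDevice);
  return true;
}
```

With the flag left at its default, the function stays silent; after `Environment::getInstance().setVerbose(true)` the same call prints the warning.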
…backends and CUDA helpers
Add Metal Performance Shaders implementations for all major op categories: attention, blas, comparison, conv, elementwise, embedding, image, loss, math, matrix, normalization, reductions, rnn, sorting, transform. Add MpsVersionProvider for runtime Metal feature detection. Refactor mpsUtils for shared utilities.
Apply THROW_EXCEPTION macro consolidation to platform backends.

Summary
PR 08 of 22 PRs in the `ag_new_release_updates_2` branch split. Merge after Layer 2 (native core ops + helpers).

- oneDNN: `FlashAttentionCache` + `SDPACache` using oneDNN Graph API compiled partitions; FP32/FP16/BF16; thread-local stream avoids per-call overhead
- llama.cpp/GGML: `GgmlContextGuard` RAII; zero-copy `createGgmlTensor()`; quantized GEMM (Q4_0/Q4_1/Q8_0); KV cache, MoE, RWKV, Mamba, Gemma4; 20 CUDA variants
- cuDNN: `flash_attention.cu` stub always declines (no cuDNN SDPA primitive), routes to cuBLAS fallback
- Apple MPS: `.mm` files; zero-copy `MTLBuffer` wrapping on Unified Memory; SDPA via `MPSGraphScaledDotProductAttentionOp`; LSTM/GRU via `MPSLSTM`
- VLM: `VlmBackendManager` singleton probes backends at startup (CUDA>METAL>CPU); full ViT pipeline: preprocess → patch embed → vision encode → project → cross-attention → multimodal fusion
- `DECLARE_PLATFORM` / `PLATFORM_IMPL` / `PLATFORM_CHECK` triple; `PLATFORM_CHECK` gates on compile-time feature flags and runtime dtype/shape compatibility

What Changed
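
The backends in this section share the check-then-implement pattern named in the Summary. The sketch below illustrates the idea only: it does not reproduce the real `DECLARE_PLATFORM` / `PLATFORM_IMPL` / `PLATFORM_CHECK` macros, and every name in it (`PlatformHelper`, `OpContext`, `DEMO_HAVE_FLOAT_BACKEND`, `dispatch`) is hypothetical. It assumes a helper is skipped either because it was compiled out or because its runtime check rejects the dtype or shape.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class DType { FP32, FP16, BF16, INT32 };

// Hypothetical op context: only the fields the checks below need.
struct OpContext {
  DType dtype;
  int rank;
};

// Hypothetical analogue of a check/impl pair: `check` decides whether the
// backend can take this call, `impl` runs it and reports which path ran.
struct PlatformHelper {
  std::string name;
  std::function<bool(const OpContext&)> check;
  std::function<std::string(const OpContext&)> impl;
};

// Stand-in for a compile-time feature flag set by the build system.
#ifndef DEMO_HAVE_FLOAT_BACKEND
#define DEMO_HAVE_FLOAT_BACKEND 1
#endif

std::vector<PlatformHelper> makeHelpers() {
  std::vector<PlatformHelper> helpers;
  // A stub whose check always declines, so dispatch falls through.
  helpers.push_back({"always_declines",
                     [](const OpContext&) { return false; },
                     [](const OpContext&) { return std::string("always_declines"); }});
#if DEMO_HAVE_FLOAT_BACKEND
  // Accepts only floating-point dtypes on 3D or 4D inputs.
  helpers.push_back({"float_only",
                     [](const OpContext& c) {
                       bool dtypeOk = c.dtype == DType::FP32 ||
                                      c.dtype == DType::FP16 ||
                                      c.dtype == DType::BF16;
                       return dtypeOk && (c.rank == 3 || c.rank == 4);
                     },
                     [](const OpContext&) { return std::string("float_only"); }});
#endif
  return helpers;
}

// Try each helper in order; use the generic path if every check declines.
std::string dispatch(const OpContext& ctx) {
  for (const auto& h : makeHelpers()) {
    if (h.check(ctx)) return h.impl(ctx);
  }
  return "generic";
}
```

A stub whose check always returns false never runs its implementation, which is how a declining helper can hand the call to a fallback without the caller knowing.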
ARM Compute Library (ACL): platform/armcompute/ (~124 files)

- `armcomputeUtils.h`: ACL tensor/layout conversion utilities
- `ArmComputeVersionProvider.h`: build-time version detection
- `deconv2d.cpp`: deconvolution via NEDeconvolutionLayer

Apple Accelerate: platform/accelerate/ (28 files)

- `AccelerateVersionProvider.h`: version detection
- `accelerateUtils.h/.cpp`: BLAS/vDSP bridge for NDArray
- `matmul.cpp`, `gemv.cpp`, `dot.cpp`, `axpy.cpp`, `blas_extra.cpp`: GEMM/GEMV/dot via cblas
- `fft.cpp`: FFT via vDSP_fft_zrip (split-complex format)
- `conv1d.cpp`, `conv2d.cpp`: convolution via vDSP_conv
- `batchnorm.cpp`, `layer_norm.cpp`: normalization via vDSP vector ops
- `elementwise.cpp`, `arithmetic.cpp`, `transcendental.cpp`, `trigonometric.cpp`: element-wise via vDSP and Accelerate math (vvsin, vvcos, vvexp, etc.)
- `reductions.cpp`, `pooling2d.cpp`, `comparison.cpp`, `cumulative.cpp`, `linalg.cpp`, `batch_ops.cpp`: full coverage of pooling, SVD/solve, batch ops

cuDNN: platform/cudnn/ (20 files)

- `CudnnVersionProvider.h`: version detection
- `activations.cu` / `activations_extended.cu`: relu/sigmoid/tanh/elu/gelu/softplus and swish/mish/hardswish/hardsigmoid via cudnnActivationForward
- `biasadd.cu`: bias-add via cuDNN tensor add API
- `conv1d.cu`: 1D convolution via cuDNN 2D conv (expanded dims trick)
- `cudnnUtils.h/.cu`: centralized per-stream cuDNN handle caching
- `deconv2d.cu` / `deconv3d.cu`: transposed convolution via cudnnConvolutionBackwardData
- `dropout.cu`: dropout via cudnnDropoutForward with stateful RNG descriptor
- `flash_attention.cu`: stub; `PLATFORM_CHECK` always false; routes to cuBLAS FlashAttentionHelper
- `global_pooling.cu`: global max/avg pool via cudnnPoolingForward
- `gru.cu`: GRU forward via cudnnRNNForward
- `instancenorm.cu`, `layernorm.cu`, `log_softmax.cu`, `lrn.cu`, `softmax.cu`, `reduce.cu`, `sconv2d.cu`, `simple_rnn.cu`, `spatial_transformer.cu`: additional norm/attention/RNN ops

llama.cpp/GGML: platform/llamacpp/ (60 files)

- `GgmlVersionProvider.h`: version detection
- `llamacppUtils.h/.cpp`: GgmlContextGuard RAII (64MB context); `createGgmlTensor()` wraps NDArray buffer zero-copy; `executeGgmlGraph()` runs ggml_cgraph; `copyGgmlToNDArray()` copies result back
- `matmul.cpp`, `quantized_matmul.cpp`: GEMM via ggml_mul_mat; Q4_0/Q4_1/Q8_0 quantized variants
- `rms_norm.cpp`, `rope.cpp`: RMSNorm and RoPE via GGML primitives
- `kv_cache_ops.cpp`: KV cache update via ggml_set (in-place scatter at sequence position)
- `grouped_query_attention.cpp`, `flash_attention.cpp`: GQA and Flash Attention
- `moe_ops.cpp`, `rwkv_ops.cpp`, `ssm_ops.cpp`, `gated_delta_ops.cpp`, `gemma4_ops.cpp`: MoE, RWKV, Mamba-style SSM, gated delta rule, Gemma4
- `model_parallel.cpp`, `device_locality.cpp`: tensor parallelism and NUMA-aware placement
- `cuda/` (20 .cu files): CUDA variants using GGML CUDA backend

MIOpen/AMD: platform/miopen/ (5 files)

- `miopenUtils.h`: ZLUDA/MIOpen bridge for ENGINE_ZLUDA_AMD
- `activations.cpp`, `batchnorm.cpp`, `conv2d.cpp`, `softmax.cpp`: MIOpen GPU ops via HIP runtime

oneDNN/MKL-DNN: platform/mkldnn/ (80+ files)

- `OnednnVersionProvider.h`: version detection
- `mkldnnUtils.h/.cpp`: thread-local dnnl::stream, Graph API helpers
- `flash_attention.cpp`: Flash Attention via oneDNN Graph API; `FlashAttentionCache` keyed on (batch, seqQ, seqKV, numHeads, headDim, dtype, isCausal, is3D); FP32/FP16/BF16
- `sdpa.cpp`: SDPA via oneDNN Graph API; `SDPACache` with 4D/3D and additive-bias; thread-local stream
- `gru.cpp`, `global_pooling.cpp`: GRU and global pooling
- `eltwise_*.cpp` (5 files): element-wise ops grouped by category via oneDNN eltwise primitive

MLIR: platform/mlir/ (17 files)

- `mlirUtils.h`: MLIR platform utility header

Apple MPS: platform/mps/ (21 .mm files)

- `MpsVersionProvider.h`, `mpsUtils.h/.mm`: MPS bridge for NDArray to MPSMatrix/MPSNDArray; Metal command buffer lifecycle
- `mps_blas.mm`: GEMM/GEMV via MPSMatrixMultiplication
- `mps_conv.mm`: MPSCNNConvolution and depthwise conv
- `mps_activations.mm` / `mps_activations_ext.mm`: MPSCNNNeuron variants
- `mps_attention.mm`: SDPA via MPSGraphScaledDotProductAttentionOp with batched matmul fallback
- `mps_normalization.mm`: MPSCNNBatchNormalization
- `mps_rnn.mm`: LSTM/GRU via MPSLSTM
- `mps_comparison.mm`, `mps_elementwise.mm`, `mps_embedding.mm`, `mps_image.mm`, `mps_loss.mm`, `mps_math.mm`, `mps_matrix.mm`, `mps_reductions.mm`, `mps_sorting.mm`, `mps_transform.mm`: full op coverage via MPSGraph

PJRT/TPU: platform/pjrt/ (10 files)

- `pjrtUtils.h/.cpp`: PJRT C API bridge for ENGINE_TPU (matmul, element-wise, activations, conv2d, pooling, reductions, shape ops)
- `matmul.cpp`, `batchnorm.cpp`, `conv2d.cpp`, `elementwise.cpp`, `pooling.cpp`, `reductions.cpp`, `shape_ops.cpp`: XLA HLO computations via PJRT client

VLM: platform/vlm/ (12 files)

- `vlmBackend.h/.cpp`: `VlmBackendManager` singleton; backend priority: CUDA=1 > METAL=2 > CPU=0; AUTO=99; 512MB context default
- `vlmUtils.h/.cpp`: image preprocessing, NDArray ↔ GGML tensor conversion
- `vlm_image_preprocess.cpp`: resize, normalize, patch tokenization for ViT
- `vlm_image_embed.cpp`: patch embedding via GGML vision encoder weights
- `vlm_vision_encode.cpp`: ViT-style transformer forward pass
- `vlm_vision_projection.cpp`: project vision features to language model embedding dimension
- `vlm_cross_attention.cpp`: cross-attention between visual features and language tokens (SmolDocling/InternVL)
- `vlm_multimodal_fusion.cpp`: combine visual and textual representations
- `vlm_patch_embed.cpp` / `vlm_2d_position_encode.cpp`: patch and 2D position embeddings
- `cuda/vlmCudaUtils.cu`, `cuda/vlm_core_ops.cu`: CUDA kernels for VLM memory and patch embedding

Dependencies
Merge Order
These 22 PRs must merge in layer order. Each layer depends on the layers above it being merged first. PRs within the same layer are independent and can merge in parallel.
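
Returning to the oneDNN backend: the description says `FlashAttentionCache` is keyed on (batch, seqQ, seqKV, numHeads, headDim, dtype, isCausal, is3D). The sketch below shows one way such a keyed cache could work, so that an expensive object is built once per distinct configuration. The field names come from the description; the field types, `AttentionKeyHash`, `CompiledKernel`, and `KernelCache` are assumptions for illustration and are not the PR's classes or the oneDNN API.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>

enum class DType : std::uint8_t { FP32, FP16, BF16 };

// Key fields follow the PR description; their types are assumed.
struct AttentionKey {
  std::int64_t batch, seqQ, seqKV, numHeads, headDim;
  DType dtype;
  bool isCausal, is3D;

  bool operator==(const AttentionKey& o) const {
    return batch == o.batch && seqQ == o.seqQ && seqKV == o.seqKV &&
           numHeads == o.numHeads && headDim == o.headDim &&
           dtype == o.dtype && isCausal == o.isCausal && is3D == o.is3D;
  }
};

// Combines the hashes of all key fields into one value.
struct AttentionKeyHash {
  std::size_t operator()(const AttentionKey& k) const {
    std::size_t h = 0;
    auto mix = [&h](std::size_t v) {
      h ^= v + 0x9e3779b9u + (h << 6) + (h >> 2);
    };
    std::hash<std::int64_t> hi;
    mix(hi(k.batch));
    mix(hi(k.seqQ));
    mix(hi(k.seqKV));
    mix(hi(k.numHeads));
    mix(hi(k.headDim));
    mix(static_cast<std::size_t>(k.dtype));
    mix(k.isCausal ? 1u : 0u);
    mix(k.is3D ? 1u : 0u);
    return h;
  }
};

// Placeholder for an expensive-to-build object such as a compiled partition.
struct CompiledKernel {
  int id;
};

class KernelCache {
 public:
  // Returns the cached kernel for `key`, building it on first use only.
  std::shared_ptr<CompiledKernel> getOrCompile(const AttentionKey& key) {
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;
    auto kernel = std::make_shared<CompiledKernel>(CompiledKernel{++compileCount_});
    cache_.emplace(key, kernel);
    return kernel;
  }
  int compileCount() const { return compileCount_; }

 private:
  std::unordered_map<AttentionKey, std::shared_ptr<CompiledKernel>,
                     AttentionKeyHash> cache_;
  int compileCount_ = 0;
};
```

Two calls with the same shape, dtype, and flags return the same object; changing any single field, such as `isCausal`, produces a new entry. This sketch is not thread-safe; a real cache shared across threads would need a lock or a per-thread instance.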