ROCm ç©éµä¹æ³æºä»£ç | AIçæåç¿»è¯
ç°å¨æå¯¹ROCmçç©éµä¹æ³æºä»£ç æäºéå½»ççè§£ã以䏿¯å ¨é¢çè§£éï¼
ROCmå¦ä½å¨æºä»£ç ä¸å®ç°ç©éµä¹æ³
ROCm主è¦éè¿ä¸ä¸ªåºæ¥å®ç°ç©éµä¹æ³ï¼æ¯ä¸ªåºæä¾ä¸å屿¬¡çæ½è±¡ï¼
1. rocBLAS â BLASåºå ¥å£ç¹
æä»¶ï¼ library/src/blas3/rocblas_gemm.hpp
rocBLASå®ç°äºæ åçBLAS GEMMï¼éç¨ç©éµä¹æ³ï¼APIï¼C = α·op(A)·op(B) + β·Cã
主æºç«¯æµç¨ä»rocblas_gemm_impl()ï¼ä½äºrocblas_gemm_imp.hppï¼å¼å§ãå¨éªè¯åæ°å¹¶å°Î±/βå¤å¶å°ä¸»æºåï¼å®è°ç¨ï¼
status = ROCBLAS_API(rocblas_internal_gemm_template)(handle, trans_a, trans_b,
m, n, k, alpha, A, 0, lda, 0, B, 0, ldb, 0, beta, C, 0, ldc, 0, 1);
è¿ä¼ååå°Tensileå端ã
2. Tensile â å æ ¸çæå¼æï¼ä¸»è¦GEMMå端ï¼
Tensileæ¯ä¸ä¸ªYAML驱å¨çå æ ¸çæå¨ï¼ç¨äºå建æå·¥ä¼åçæ±ç¼å æ ¸ãrocBLAS â Tensileçæ¥å£ä½äºï¼
æä»¶ï¼ library/src/tensile_host.cpp
ååæµç¨
rocblas_gemm()
â rocblas_internal_gemm_template()
â runContractionProblem() [tensile_host.cpp]
â ConstructTensileProblem() æå»ºTensile::ContractionProblem
â GetTensileInputs() 设置AãBãCãDçGPUæé
â library->findBestSolution() ä»é¢è°ä¼åºä¸éæ©æä½³å
æ ¸
â adapter.launchKernels() å¯å¨GPUå
æ ¸
å
³é®ä»£ç ï¼æ¥èªtensile_host.cppï¼ï¼
// é对æ¤é®é¢è§æ¨¡æ¾å°æä½³GPUå
æ ¸
solution = library->findBestSolution(tensile_prob, *hardware, fitness_query);
// å¯å¨å
æ ¸
hipError_t hip_status = adapter.launchKernels(
solution->solve(tensile_prob, GetTensileInputs(prob), *hardware),
handle->get_stream(), ...);
Tensileé¢ç¼è¯äºæ°å个ç»è¿è°ä¼çå
æ ¸åä½ï¼é对ä¸åçMãNãK大å°ãæ°æ®ç±»åãGPUæ¶æï¼ï¼å¹¶å°å
¶åå¨ä¸º.co代ç 对象æä»¶ï¼ä½äº/opt/rocm/lib/rocblas/library/ã
Tensileå æ ¸ç¼åå¨ï¼æ±ç¼ï¼
æä»¶ï¼ Tensile/KernelWriterAssembly.py
Tensileçæå®é
çGCN/AMDGPUæ±ç¼ï¼.sæä»¶ï¼ãå
æ ¸ç¼åå¨ä¼çæç±»ä¼¼v_mfma_f32_16x16x4f32çMFMAæä»¤ãä¾å¦ï¼
# æ¥èªKernelWriterAssembly.py
class KernelWriterAssembly(KernelWriter):
def __init__(self, ...):
self.do["MAC"] = True # ä¹å è¿ç®
self.do["GlobalReadA"] = True
self.do["GlobalReadB"] = True
self.do["LocalWrite"] = True
self.do["GlobalWrite"] = True
å®ä¼çæç±»ä¼¼ä»¥ä¸çæ±ç¼ä»£ç ï¼
v_mfma_f32_16x16x4f32 v[0:3], v4, v5, v[0:3] // C += A * B
3. Composable Kernel (CK) â ç°ä»£C++模æ¿åºï¼è¾æ°æ¹æ³ï¼
ä»åºï¼ https://github.com/ROCm/composable_kernel
CKæ¯ä¸ç§åºäºç°ä»£C++模æ¿çæ¹æ³ãå®éç¨åºäºtileçç¼ç¨æ¨¡åï¼æå»ºå¨AMDGPUå å»ºå½æ°ä¹ä¸ã
ä¸å±å±æ¬¡ç»æ
第ä¸å± â ç½æ ¼çº§GEMMï¼å æ ¸å ¥å£ï¼ï¼
GridGemm
ââ BlockGemm ï¼æ¯ä¸ªçº¿ç¨åï¼
ââ WarpGemm ï¼æ¯ä¸ªæ³¢åï¼
ââ MFMA / WMMAæä»¤
第äºå± â å级GEMMï¼åºäºå ±äº«å åï¼ï¼
æä»¶ï¼ ck/tutorial/ck_tile/gemm/01_naive_gemm/block_gemm_asmem_bsmem_creg.hpp
// C += A * B ï¼AåBæ¥èªå
±äº«å
åï¼Cå¨å¯åå¨ä¸ï¼
template <typename Problem, typename Policy>
struct BlockGemmASmemBSmemCReg {
template <typename CBlockTensor, typename ABlockWindow, typename BBlockWindow>
CK_TILE_DEVICE void operator()(CBlockTensor& c,
const ABlockWindow& a,
const BBlockWindow& b) const {
// å¨å
å±å¾ªç¯ä¸éåKï¼
static_for<0, KIterPerWarp, 1>{}([&](auto kIter) {
static_for<0, MIterPerWarp, 1>{}([&](auto mIter) {
AWarpTensor a_warp = load_tile(a_warp_windows(mIter)(kIter));
static_for<0, NIterPerWarp, 1>{}([&](auto nIter) {
BWarpTensor b_warp = load_tile(b_warp_windows(nIter)(kIter));
WarpGemm{}(c_warp, a_warp, b_warp); // â å®é
çä¹å è¿ç®
});
});
});
}
};
第ä¸å± â æ³¢å级GEMMï¼å®é çMFMAæä»¤ï¼ï¼
æä»¶ï¼ ck/include/ck_tile/ops/gemm/warp/warp_gemm_attribute_mfma_impl.hpp
è¿éæ¯å®é çGPUç©éµä¹æ³æä»¤è¢«è°ç¨çå°æ¹ï¼
// F32 GEMMï¼ä½¿ç¨MFMAç16x16x4 tile
struct WarpGemmAttributeMfmaImplF32F32F32M16N16K4 {
// c_vec += a_vec * b_vec
template <typename... Params>
CK_TILE_DEVICE void operator()(CVecType& c_vec,
const AVecType& a_vec,
const AVecType& b_vec) const {
#if defined(__gfx9__)
// ç´æ¥ä½¿ç¨å
å»ºå½æ°
c_vec = __builtin_amdgcn_mfma_f32_16x16x4f32(a_vec[0], b_vec[0], c_vec, 0, 0, 0);
#else
// æä½¿ç¨å
èæ±ç¼
asm volatile("v_mfma_f32_16x16x4f32 %0, %1, %2, %3\n"
: "+v"(c_vec)
: "v"(a_vec), "v"(b_vec), "v"(c_vec));
#endif
}
};
4. hipBLASLt â è½»é级å¤ç¨å端
rocBLASè¿æ¯æhipBLASLtä½ä¸ºè½»é级å端ï¼ä½äºhipblaslt_host.cppï¼ãè¿æä¾äºä½¿ç¨hipBLASLtçhipblasLtMatmul() APIçæ¿ä»£è·¯å¾ã
5. rocWMMA â æ³¢åç©éµä¹å API
æä»¶ï¼ rocwmma/internal/mma_impl.hpp
rocWMMAæä¾äºä¸ä¸ªåºçº§å«çWMMA APIï¼å°è£ äºç¡¬ä»¶MFMAæä»¤ï¼æ¯æä¸åçæ°æ®å¸å±ï¼è¡ä¸»åº/å主åºï¼ã
æ¶ææ»ç»
PyTorch / TensorFlow / ç¨æ·åºç¨
â
ââââââ¼âââââ
â rocBLAS â â C = α·A·B + β·C (BLAS GEMM API)
ââââââ¬âââââ
â
ââââââ¼âââââââââââ
â Tensile â â YAMLå®ä¹ãé¢è°ä¼çæ±ç¼å
æ ¸
â (主è¦) â å è½½ç®æ GPUç.co代ç 对象
ââââââ¬âââââââââââ
â
ââââââ¼âââââââââââââââââââ
â Composable Kernel (CK)â â ç°ä»£C++模æ¿åº
â (è¾æ°/å¯é) â åºäºtileï¼ä½¿ç¨å
å»ºå½æ°
ââââââ¬âââââââââââââââââââ
â
ââââââ¼âââââââââââââ
â AMDGPU ISA â
â v_mfma / v_wmma â â 硬件ç©éµæä»¤
âââââââââââââââââââ
å¨AMD GPUä¸ï¼å®é
çç©éµä¹æ³æç»å½ç»ä¸ºMFMAï¼ç©éµèåä¹å ï¼æä»¤ââè¿æ¯NVIDIA Tensor CoreçAMDçæç©ââéè¿å
èæ±ç¼æç¼è¯å¨å
å»ºå½æ°ï¼å¦__builtin_amdgcn_mfma_f32_16x16x4f32ï¼æ¥è°ç¨ã
