çè§£ROCmå æ ¸ä¼å | AIçæåç¿»è¯
æç°å¨å¯¹æ´ä¸ªä»£ç åºæäºéå½»ççè§£ã以䏿¯å ¨é¢çæ¦åµï¼
ROCm å æ ¸ä¼åï¼å ¨è²æ¦è§
æ¬ä»åºï¼/Users/lzwjava/projects/ROCmï¼æ¯ä¸ä¸ªå
ä»åºââå³éè¿ Google ç repo å·¥å
·ç®¡ç约 30+ å项ç®çæ¸
åæä»¶ï¼default.xmlï¼ãå®é
çå
æ ¸æºä»£ç ä½äºå个ç¬ç«çä»åºä¸ï¼å¦ llvm-projectãcomposable_kernelãrocm-libraries çï¼ï¼éè¦éè¿ repo sync æåã以䏿¯æ ¹æ®ææ¡£æ»ç»ç ROCm çæç³»ç»ä¸å
æ ¸ä¼åçè¿ä½æ¹å¼ï¼
- å æ ¸ç¼åââéæ©ä½ çå·¥å ·
å¨ ROCm ä¸ç¼åå æ ¸ä¸»è¦æä¸æ¡è·¯å¾ï¼
HIP (C++)ï¼åçæ¹å¼ãä½¿ç¨ HIP çè¿è¡æ¶ API åå æ ¸è¯è¨ç¼å C++ å æ ¸ãå¯åå© hipify-clang ç§»æ¤ç°æ CUDA 代ç ãå æ ¸éè¿ AMDGPU LLVM å端ç¼è¯ä¸º GCN/RDNA ISAã
OpenCL (C)ï¼è·¨å¹³å°æ åãROCm æ¯æ AMD GPU ä¸ç OpenCL 2.0+ã
Triton (Python)ï¼ç¨ Python ç¼å GPU å æ ¸ãTriton éè¿ MLIR â LLVM IR â AMDGPU ISA çç¼è¯æµç¨ãå¨ AI å·¥ä½è´è½½ä¸è¶æ¥è¶æµè¡ãææ¡£ä¸æå¤§éå ³äº Triton ä¼åçç« èï¼åè§ workload.rst 第 1276-1532 è¡ï¼ã
- å æ ¸ä¼åæµæ°´çº¿
ç¬¬ä¸æ¥ï¼å åæ§è½åæ ââââââââââââââââââââ æ°¸è¿ä¸è¦çæµç¶é¢å¨åªéãROCm æä¾äºåå±çæ§è½åæå·¥å ·æ ï¼
PyTorch Profiler â é«çº§æ¶é´çº¿ï¼å¯¼åºå° Perfetto UIï¼ ROCm Systems Profiler â CPU+GPU è·è¸ªãå åãä¸ä¸æåæ¢ ROCProfiler (rocprof) â åå§ GPU 硬件计æ°å¨ï¼ææ¬/CSV è¾åºï¼ ROCm Compute Profiler â å¼å¯¼å¼åæï¼roofline 模åãå éãå åå¾ãåºçº¿å¯¹æ¯ï¼GUI + CLIï¼ ROCr Debug Agent â å åæ éæè·ãwavefront 转å¨
å¿«éæ§è½åæç¤ºä¾ï¼ rocprof âstats ./my_kernel_app # æ¶éææè®¡æ°å¨ rocprof âhip-trace ./my_kernel_app # HIP API è·è¸ª
ç¬¬äºæ¥ï¼è¯å«ç¶é¢ âââââââââââââââââââââââââââ æ§è½æ¯è¯è¡¨ï¼docs/reference/glossary/performance.rstï¼å®ä¹äºå ³é®æ¦å¿µï¼
Compute-bound â å æ ¸åç®æ¯ååééå¶ï¼ALU å¿ç¢ï¼ Memory-bound â å æ ¸å HBM 带宽éå¶ï¼å è½½/åå¨å ä¸»å¯¼ï¼ Occupancy â æ´»è· wavefront æ°ä¸æ¯ä¸ª CU æå¤§å¯è½æ°çæ¯å¼ Register pressure â VGPR è¿å¤ = æ¯ä¸ª CU ç wave åå°ï¼éèå»¶è¿è½åä¸é Bank conflicts â LDS 访é®ä¸²è¡åèéå¹¶è¡å Wavefront divergence â åä¸ wave å ç线ç¨èµ°ä¸å忝
妿 GPU æ¯ç¶é¢ï¼èé CPU/å æ ¸å¯å¨å¼éï¼ï¼åè¿å ¥å æ ¸çº§æ§è½åæã ROCm Compute Profiler ä¼å¤æ¬¡è¿è¡ä½ çå æ ¸ï¼æ¶éä¸åç计æ°å¨éåï¼ç¶åç»åº roofline 模åï¼ç²¾ç¡®æ¾ç¤ºä½ å¤äºåªä¸ªä½ç½®ã
ç¬¬ä¸æ¥ï¼èªå¨è°ä¼ï¼æç®å â æè´¹åï¼ âââââââââââââââââââââââââââââââââââââââââ
Level 1 â å¼å¯èªå¨è°ä¼ï¼é¶ä»£ç ä¿®æ¹ï¼ï¼
PyTorch TunableOpï¼ä» rocBLAS/hipBLASLt ä¸å°è¯æ°å个 GEMM å æ ¸
PYTORCH_TUNABLEOP_ENABLED=1 python my_model.py
ç¶ååæ¾æä½³é ç½®
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 python my_model.py
TorchInductor max-autotuneï¼è°ä¼ Triton GEMM/å·ç§¯ tile 尺寸
TORCHINDUCTOR_MAX_AUTOTUNE=1 python my_model.py
MIOpen autotuneï¼å¯»æ¾æä½³å·ç§¯å æ ¸
MIOPEN_FIND_ENFORCE=3 MIOPEN_FIND_MODE=1 python my_model.py
Level 2 â Composable Kernel (CK) å端ï¼
å®è£ CK Python å è£ å¨ï¼å° CK å å ¥èªå¨è°ä¼å端
pip install git+https://github.com/rocm/composable_kernel@develop TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=âTRITON,CK,ATENâ
Level 3 â hipBLASLt æå¨è°ä¼ï¼TensileLiteï¼ï¼
为è·å¾æå¤§ GEMM æ§è½ï¼è°ä¼æ±ç¼å端çæå¨
cd hipBLASLt/tensilelite ./Tensile/bin/Tensile config.yaml output_path
7 æ¥è°ä¼æµæ°´çº¿ï¼åºåæµè¯å¸¸ç¨åæ° â fork â join â æç»
Level 4 â å¨ Triton æ HIP ä¸ç¼åèªå®ä¹è°ä¼å æ ¸ï¼
Triton èªå¨å¯è°åæ°ï¼å ³é®æé®ï¼ï¼ BLOCK_M, BLOCK_N, BLOCK_K â tile 尺寸ï¼å¹³è¡¡è®¡ç®ä¸å åï¼ num_stages = 2 â æµæ°´çº¿é¶æ®µï¼å GEMM 设为 2ï¼ num_warps â æ¯ä¸ªå·¥ä½ç»ä¸ç wave æ°ï¼å½±å occupancyï¼ waves_per_eu â æç¤ºç¼è¯å¨åå° VGPR ä½¿ç¨ matrix_instr_nonkdim = 16 â MFMA æä»¤å°ºå¯¸ï¼å¨ MI300X ä¸ 16x16 ä¼äº 32x32ï¼
- æ·±åº¦å æ ¸ä¼åææ¯
å å访é®ä¼åï¼
- åå¹¶å ¨å±å å访é®ï¼ä¼å ä½¿ç¨ 128 åèäºå¡ï¼
- æå¤§åå©ç¨ LDSï¼çä¸å ±äº«å åï¼ââ MI300X 䏿¯ä¸ª CU 64KB
- æå°åå ¨å±âLDS æ°æ®ä¼ è¾ï¼ä½¿ç¨åå/é»å¡ï¼
- é¿å LDS ä¸ç bank å²çªï¼å¡«å å ±äº«å åæ°ç»ï¼
- åéåï¼ä½¿ç¨ global_load_dwordx4ï¼128 ä½å è½½ï¼èéæ éå è½½
- å¯¹äº MI300X GEMMï¼é¿å æ¥é¿ä¸º 512 åèçåæ°ï¼Tagram çç¹é®é¢ï¼
计ç®ä¼åï¼
- MI300Xï¼ä¼å ä½¿ç¨ mfma_16x16 èé mfma_32x32ï¼æ´å¥½çè½æï¼
- bf16 ç©éµè¿ç®ææ¾å¿«äº f16
- ç®æ occupancyï¼ç½æ ¼ä¸è³å° 1024 个线ç¨åï¼å·¥ä½ç»ï¼
- MI300X æ 304 ä¸ªæ´»è· CUï¼8 个 XCD à æ¯ä¸ª XCD 38 ä¸ªæ´»è· CUï¼
- ä½¿ç¨ WorkGroupMapping 为 8 çåæ°ï¼XCD æ°éï¼ä»¥æé« L2 ç¼åæç
Occupancy 计ç®ï¼workload.rst 第 1643-1690 è¡ï¼ï¼
- ä» ISA 䏿¾å° .vgpr_countï¼N
- æ¾å° LDS åé ï¼ä» MLIR 转å¨ä¸ grep âtriton_gpu.sharedâ â L åè
- æ¾å° num_warpsï¼ä» MLIR ä¸ grep âtriton_gpu.num-warpsâ â nW
- occ_vgpr = ä» VGPR/occupancy è¡¨ä¸æ¥æ¾
- occ_lds = floor(65536 / L)
- occ = min(floor(occ_vgpr à 4 / nW), occ_lds) à nW / 4
ISA æ±ç¼åæï¼
- 设置 export AMDGCN_ENABLE_DUMP=1 è½¬å¨ ISA
- æ£æ¥ global_load_dwordx4ï¼åéåå è½½ï¼
- æ£æ¥ LDS å è½½/å卿¯å¦ä½¿ç¨ _b128 åç¼ï¼å尿令æ°ï¼
- æ£æ¥ s_waitcnt(lgkmcnt, vmcnt) ç忥æç
- éå æä»¤ä»¥éèå»¶è¿
MLIR åæï¼
- 设置 export MLIR_ENABLE_DUMP=1 æ¥ç Triton ä¸é´è¡¨ç¤º
- è¯å«åä½ç LDS å¾è¿ï¼å¦å è½½ã转置ãéæ°åå¨ï¼
- æ£æ¥æ°æ®å¸å±ï¼blocked â shared â transpose â blocked â dot_op
- ç³»ç»çº§å æ ¸ä¼å
CU æ©ç ï¼docs/how-to/setting-cus.rstï¼ï¼ # å°ç¹å® CU ç»å®å°ç¹å® GPU HSA_CU_MASK=â0:0-19;1:20-39â ./my_app # GPU 0 ä½¿ç¨ CU 0-19ï¼GPU 1 ä½¿ç¨ 20-39 ROC_GLOBAL_CU_MASK=â0:0-15,32-47â ./my_app # HIP 级æ©ç
å¤ GPUï¼RCCLï¼ï¼ - å§ç»ä½¿ç¨å ¨é¨ 8 个 GPU è¿è¡éåéä¿¡ï¼MI300X å ¨è¿æ¥ææï¼ - ç¦ç¨ NUMA èªå¨å¹³è¡¡ï¼sudo sysctl kernel.numa_balancing=0 - ä½¿ç¨æ¯ä¸ª GPU ä¸ä¸ªè¿ç¨ï¼é¿å 串è¡åå æ ¸å¯å¨ï¼ - TORCH_NCCL_HIGH_PRIORITY=1 GPU_MAX_HW_QUEUES=2 ç¨äº FSDP - ç产工ä½è´è½½ä½¿ç¨ NCCL_MIN_NCHANNELS=112
ç¯å¢åééæ¥è¡¨ï¼ PYTORCH_TUNABLEOP_ENABLED=1 èªå¨éæ©æä½³ GEMM å æ ¸ TORCHINDUCTOR_MAX_AUTOTUNE=1 è°ä¼ Triton tile é ç½® TORCHINDUCTOR_FREEZING=1 å èæéãæå 常éï¼æ¨çï¼ MIOPEN_FIND_ENFORCE=3 èªå¨è°ä¼ MIOpen å·ç§¯å æ ¸ FLASH_ATTENTION_TRITON_AMD_ENABLE=FALSE ä½¿ç¨ CK å端è¿è¡ Flash Attention HSA_DISABLE_FRAGMENT_ALLOCATOR=1 ç¦ç¨å åç¼åï¼è°è¯ï¼ AMDGCN_ENABLE_DUMP=1 è½¬å¨ ISA ç¨äºåæ MLIR_ENABLE_DUMP=1 è½¬å¨ Triton MLIR IR
- å ³é®å æ ¸åºï¼å·²ä¼åï¼ä¼å 使ç¨ï¼
rocBLAS/hipBLASLt â GEMMï¼æ¯æåå¤çï¼ReLUãGELUãåç½®çï¼ Composable Kernel â åºäºæ¨¡æ¿çå æ ¸çæï¼ç¨äº GEMM/å·ç§¯ï¼ MIOpen â å·ç§¯åè¯ï¼WinogradãFFTãç´æ¥ãéå¼ GEMMï¼ RCCL â å¤ GPU éåéä¿¡ï¼all-reduceãall-gather çï¼ rocWMMA â wave ç©éµä¹ç´¯å ï¼å¼ éæ ¸å¿æä½ï¼ rocFFT â FFT è¿ç® rocRAND â GPU ä¸çéæºæ°çæ rocPRIM/rocThrust â å¹¶è¡åè¯ï¼æåºãæ«æãè§çº¦ï¼
æ»ç»ï¼ä¼åè·¯å¾ä¸ºï¼æ§è½åæ â è¯å«ç¶é¢ â é¦å å°è¯èªå¨è°ä¼ â 妿ä»ä¸å¤ï¼æå¨è°ä¼ tile 尺寸/occupancy/æä»¤éæ© â 妿éè¦æè´æ§è½ï¼æ·±å ¥ ISA/MLIR åæå¹¶ç¼åèªå®ä¹å æ ¸ã大夿°ç¨æ·ä» éè¿ TunableOp + TorchInductor max-autotune + Flash Attention å°±è½è·å¾å·¨å¤§æåï¼èæ éç¼åä¸è¡å æ ¸ä»£ç ã
