è°è¯MI300X VF GPUå¯ç¨æ§ | AIçæåç¿»è¯
以䏿¯ææåææå·¥ä½ç端å°ç«¯åè§£ï¼
第ä¸é¶æ®µï¼åå§ä¾¦æ¥
ç®æ ï¼äºè§£æºç®±å ç硬件以åæåªäºå¯ç¨å·¥å ·ã
æ¥éª¤ 1 â å°è¯ææ¾çå·¥å ·
ssh root@134.199.199.108 rocm-smi
ç»æï¼command not foundãAMD ä¸ç¸å½äº nvidia-smi çå·¥å
·æªå®è£
ã
æ¥éª¤ 2 â å¯»æ¾æ¿ä»£æ¹æ¡
æç´¢äº amd-smiãrocminfoãclinfoï¼å¹¶æ£æ¥äº /opt/rocm*/ã䏿 æè·ãä»
åå¨ Debian 软件å
ä¸ç libdrm-amdgpu1ââè¿æ¯æåºæ¬ç DRM ç¨æ·ç©ºé´åºã
æ¥éª¤ 3 â éè¿ PCI è¯å« GPU
lspci | grep -iE 'vga|3d|display|amd|ati'
æ¾å°ï¼83:00.0 Processing accelerators: AMD/ATI Aqua Vanjaram [Instinct MI300X VF]
å
³é®åç°ï¼GPU å±äºç±»å« 0x12ï¼å¤çå éå¨ï¼ï¼èé 0x03ï¼VGA/æ¾ç¤ºï¼ãä»
æ¥æ¾æ¾ç¤ºç±»è®¾å¤çæ å GPU æ£æµèæ¬ä¼éæ¼å®ã
æ¥éª¤ 4 â ç´æ¥è¯»å PCI sysfs
cat /sys/bus/pci/devices/0000:83:00.0/{vendor,device,class}
- ååï¼
0x1002ï¼AMDï¼ - 设å¤ï¼
0x74b5ï¼MI300X VFï¼ - ç±»å«ï¼
0x120000ï¼å¤çå éå¨ï¼ - PCIeï¼
32.0 GT/s x16ï¼Gen 5ï¼æå¤§å®½åº¦ï¼ - å
å BARï¼
0x4000000000å¤ 256 GB
æ¥éª¤ 5 â æ£æ¥ DRM/KFD ææ
cat /sys/class/kfd/kfd/topology/nodes/*/properties
ä»
èç¹ 0ï¼CPUï¼ä¸ simd_count=0ãKFD ææä¸æ²¡æ GPU èç¹ââè¿æ¯ SR-IOV èæåè½çæ å¿ï¼å
æ ¸è®¡ç®é©±å¨ç¨åºæ æ³æä¸¾è¯¥è®¾å¤ã
è¿æ£æ¥äº /sys/class/drm/card*/device/ ä¸ amdgpu ç¹å®çç»è®¡ä¿¡æ¯ï¼gpu_busy_percentãmem_info_vram_*ãhwmon/temp*ï¼ââå
¨é¨ä¸ºç©ºãVF æªéè¿æ å DRM sysfs è·¯å¾æ´é²ç®¡çæ¥å£ã
第äºé¶æ®µï¼æå»ºå¤ç¨å·¥å
·ï¼amd-smiï¼
ç®æ ï¼å¨ä¿®å¤å®æ´å æ çåæ¶ï¼ç»ç¨æ·ä¸ä¸ªå¯ä»¥è¿è¡ä»¥æ¥ç GPU ä¿¡æ¯çå·¥å ·ã
æ¥éª¤ 6 â ç¼å Python èæ¬ï¼ä»æ¿ä»£æ°æ®æºè¯»åï¼å 为 rocm-smi æ æ³çå° VFï¼
é¨ç½²å¤±è´¥åæç»æåçæ¨¡å¼ï¼
å°è¯ 1ï¼éè¿ SSH ç bash heredoc â è¯æ³é误ï¼heredoc åé符ä¸åµå¥å¼å·å²çªï¼
å°è¯ 2ï¼éè¿ SSH ç Python heredoc â 被å®å ¨è¿æ»¤é»æ¢ï¼heredoc < PYEOF 模å¼ï¼
å°è¯ 3ï¼æåï¼ï¼éè¿ write_file å°èæ¬æ¬å°åå
¥ /tmp/amd-smi.pyï¼ç¶åéè¿ scp å¤å¶å°æå¡å¨ãè¿æ¯å¯é çè·¨æºå¨é¨ç½²æ¨¡å¼ï¼æ¬å°åå
¥ â scp â è¿ç¨å®è£
ã
æ¥éª¤ 7 â Bug ä¿®å¤ï¼é¦æ¬¡è¿è¡æ¾ç¤º Device ID: 0x0x74b5ââPCI sysfs å¼å·²å
å« 0x åç¼ãä½¿ç¨ .removeprefix("0x") ä¿®å¤ã
第ä¸é¶æ®µï¼å®è£ ROCmï¼çæ£çå æ ï¼
ç®æ ï¼è®© rocm-smiãrocminfo å hipcc æ£å¸¸å·¥ä½ï¼ä½¿ GPU çæ£å¯ç¨äºè®¡ç®ã
æ¥éª¤ 8 â æ·»å AMD ç apt ä»åº
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest noble main' > /etc/apt/sources.list.d/rocm.list
å¨ Ubuntu 25.10ï¼pluckyï¼ä¸ä½¿ç¨ ânobleâï¼Ubuntu 24.04ï¼è½¯ä»¶å ãAMD ä» å®æ¹æ¯æ LTS çæ¬ï¼ä½ç¨æ·ç©ºé´è½¯ä»¶å æ¯å ¼å®¹çã
æ¥éª¤ 9 â 馿¬¡å®è£
å°è¯ï¼rocm-hip-sdk å
å
å ä¾èµå°ç±å¤±è´¥âârocm-cmake 0.14.0ï¼æ¥èª AMDï¼ä¸ rocm-cmake 6.4.3ï¼æ¥èª Ubuntu ç universe ä»åºï¼å²çªãApt æç»é级 6.4.3 â 0.14.0ï¼å 为尽管 AMD ççæ¬å·æ¹æ¡ä¸åï¼ä½ Ubuntu å
ççæ¬å·çèµ·æ¥æ´æ°ã
æ¥éª¤ 10 â ç¬¬äºæ¬¡å°è¯ï¼rocm-hip-runtimeï¼ä¸å« -devï¼
ç¸åç rocm-cmake å²çªãæ ¹æ¬é®é¢ï¼Ubuntu 25.10 å¨ universe 䏿ä¾ç ROCm ç»ä»¶ä¸ AMD èªå·±ä»åºä¸çå
å²çªã
æ¥éª¤ 11 â åç°ï¼çæ¬å软件å
apt-cache search rocm | grep '7.2.3'
Ubuntu 25.10 æä¾äºçæ¬å软件å
ï¼rocm-hip-runtime7.2.3ãhsa-rocr7.2.3ãcomgr7.2.3 çãè¿äºè½¯ä»¶å
åç§°ä¸åï¼å æ¤ä¸ Ubuntu çéçæ¬å rocm-cmake å
±åãè¿æ¯å¹²åçè·¯å¾ã
æ¥éª¤ 12 â å®è£ çæ¬åè¿è¡æ¶
apt-get install hsa-rocr7.2.3 comgr7.2.3 rocm-core7.2.3 rocm-language-runtime7.2.3 rocminfo7.2.3 rocm-hip-runtime7.2.3 hip-runtime-amd
å¤±è´¥ï¼æä»¶å²çªãéçæ¬å软件å
ï¼hsa-rocrãcomgrãhip-runtime-amdï¼ä½ä¸º rocm-smiï¼æ©æå®è£
ï¼çä¼ éä¾èµè¢«æåï¼å®ä»¬å¨ /opt/rocm-7.2.3/lib/* ä¸çæä»¶ä¸çæ¬å软件å
éå ã
æ¥éª¤ 13 â å¼ºå¶æ¸ 餿æå²çªè½¯ä»¶å
dpkg --purge --force-depends --force-remove-reinstreq rocm-core hsa-rocr comgr hip-runtime-amd rocprofiler-register [以åå®ä»¬ç 7.2.3 åä½]
è¿æç ´äºä¾èµæ»éââé¨åå®è£ ççæ¬å软件å ä¾èµäºæ£å¨è¢«ç§»é¤çéçæ¬å软件å ï¼å½¢æäºå¾ªç¯å¤±è´¥ã
æ¥éª¤ 14 â 使ç¨å®æ´ä¾èµæ è¿è¡å¹²åéè£
apt-get install rocm-core7.2.3 hsa-rocr7.2.3 comgr7.2.3 hip-runtime-amd7.2.3 rocprofiler-register7.2.3 rocm-device-libs7.2.3 openmp-extras-runtime7.2.3 rocm-language-runtime7.2.3 rocminfo7.2.3 rocm-hip-runtime7.2.3
æåââææè½¯ä»¶å åæ å²çªå®è£ ã
æ¥éª¤ 15 â å®è£ HIP ç¼è¯å¨
apt-get install hipcc7.2.3 hipify-clang7.2.3 hip-dev7.2.3
hipcc --version â HIP 7.2.53211ï¼AMD clang 22.0.0ã
æ¥éª¤ 16 â ä¿®å¤ libxml2 ABI ä¸å¹é
HIP ç¼è¯å¤±è´¥ï¼lld: error while loading shared libraries: libxml2.so.2: cannot open shared object fileãUbuntu 25.10 æä¾äº libxml2-16ï¼ABI .so.16ï¼ï¼è ROCm ç龿¥å¨ææ .so.2ã
ln -sf /lib/x86_64-linux-gnu/libxml2.so.16 /lib/x86_64-linux-gnu/libxml2.so.2
ldconfig
è¿æ¯ä¸ä¸ªå ¼å®¹æ§ç¬¦å·é¾æ¥ââè¾æ°ç ABI ååå ¼å®¹è¾æ§ç APIã
æ¥éª¤ 17 â HIP æµè¯ç¼è¯æå使¾ç¤º 0 个设å¤
HIP devices: 0
GPU å¨ HSA 级å«å¯è§ï¼rocm_agent_enumerator â gfx942ï¼ï¼ä½ HIP çè®¾å¤æä¸¾è¿å 0ãKFD ææä»ç¶åªæ¾ç¤º CPU èç¹ã
第åé¶æ®µï¼çæ£æ ¹å ââ缺å°åºä»¶
æ¥éª¤ 18 â æ£æ¥ dmesg ä¸ç GPU åå§åé误
dmesg | grep -i 'amdgpu.*83:00'
å ³é®é误ï¼
Direct firmware load for amdgpu/psp_13_0_6_ta.bin failed with error -2
Direct firmware load for amdgpu/gc_9_4_3_rlc.bin failed with error -2
Direct firmware load for amdgpu/sdma_4_4_2.bin failed with error -2
Direct firmware load for amdgpu/vcn_4_0_3.bin failed with error -2
amdgpu: Fatal error during GPU init
amdgpu: amdgpu: finishing device.
amdgpu 驱å¨ç¨åºå·²ç»å®å°è®¾å¤ï¼ä½æ æ³åå§åï¼å 为 MI300X ç IP åçåºä»¶ blob 缺å°äº /lib/firmware/amdgpu/ã
æ¥éª¤ 19 â å®è£ åºä»¶
apt-get install linux-firmware
éªè¯æä»¶åå¨ï¼gc_9_4_3_rlc.bin.zstãpsp_13_0_6_ta.bin.zstãsdma_4_4_2.bin.zstãvcn_4_0_3.bin.zstï¼ä½¿ç¨ zstd å缩ââå
æ ¸çåºä»¶å è½½å¨éæå°å¤çæ¤æ ¼å¼ï¼ã
æ¥éª¤ 20 â éæ°ç»å® GPU 驱å¨ç¨åºï¼å¼ºå¶åºä»¶éæ°å è½½ï¼
echo '0000:83:00.0' > /sys/bus/pci/drivers/amdgpu/unbind
sleep 2
echo '0000:83:00.0' > /sys/bus/pci/drivers/amdgpu/bind
dmesg 确认ï¼[drm] Initialized amdgpu 3.64.0 for 0000:83:00.0 on minor 1
第äºé¶æ®µï¼éªè¯ââä¸åæ£å¸¸
æ¥éª¤ 21 â 宿´éªè¯
rocm-smi: Device 0 | 37°C | 154W | 139MHz SCLK | 900MHz MCLK | 750W cap
rocminfo: AMD Instinct MI300X VF (gfx942) | 304 CUs | 191 GB HBM3
HIP test: HIP devices: 1 | 191 GB | 304 CUs | 2100 MHz
KFD topology: Node 1 with simd_count=1216, gfx_target_version=90402
æ¥éª¤ 22 â æä¹ ç¯å¢è®¾ç½®
/etc/profile.d/rocm.sh:
ROCM_PATH=/opt/rocm-7.2.3
PATH=$ROCM_PATH/bin:$PATH
LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
å ³é®ç»éª
1. SR-IOV VF å¨åºä»¶å è½½ä¹å对æ åå·¥å
·ä¸å¯è§ã
VF éè¦åºä»¶æ¥åå§åå
¶ IP åï¼PSP â å®å
¨ï¼GC â 计ç®ï¼SDMA â DMAï¼ã没æåºä»¶ï¼KFD æ æ³æä¸¾å®ï¼rocm-smi æ¾ç¤ºä¸ºç©ºï¼HIP è¿å 0 个设å¤ã
2. çæ¬å软件å
æ¯é LTS Ubuntu çéçè±ã
Ubuntu 25.10 å° ROCm 7.2.3 ä½ä¸º *7.2.3 软件å
æä¾ï¼è¿äºè½¯ä»¶å
ä¸ç³»ç»è½¯ä»¶å
å
±åãéçæ¬åçå
å
ä¼è§¦åä¸ Ubuntu èªå·±ç rocm-cmake çä¾èµå²çªã
3. å¼ºå¶æ¸
é¤å¯æç ´ dpkg ä¾èµæ»éã
å½é¨åå®è£
ççæ¬å软件å
ä¾èµäºæ£å¨è¢«ç§»é¤çéçæ¬å软件å
æ¶ï¼éè¦ä½¿ç¨ dpkg --purge --force-dependsãæ®éç apt-get purge 伿ç»ï¼å 为å®è¯å¾æ»¡è¶³æ£å¨è¢«å é¤çä¾èµå
³ç³»ã
4. 缺失 .so çæ¬ç ABI 符å·é¾æ¥ã
Ubuntu 25.10 å° libxml2 ç ABI ä» 2 åçº§å° 16ãROCm 龿¥å¨ï¼ä¸º 24.04 æå»ºï¼ææ .so.2ãå建符å·é¾æ¥å³å¯ä¿®å¤ï¼å ä¸ºè¾æ°ç ABI ååå
¼å®¹ã
5. è¿ç¨é¨ç½²æ¨¡å¼ï¼write_file â scp â chmodã SSH heredoc ä¼å åµå¥å¼å·èå´©æºãå°èæ¬æ¬å°åå ¥ç¶åå¤å¶å¯é¿å ææ shell 转ä¹é®é¢ã
6. PCI sysfs å¼ä¸ç 0x åç¼ã
/sys/bus/pci/devices/*/vendor è¿å 0x1002ï¼å¸¦åç¼ï¼ãæ ¼å¼å代ç å¿
é¡»èèè¿ä¸ç¹ï¼å¦åä½ ä¼å¾å° 0x0x1002ã
7. å§ç»æ£æ¥ dmesg 以è·å驱å¨ç¨åºåå§å失败信æ¯ã
amdgpu 驱å¨ç¨åºä¸ç´å 载并ç»å®å°è®¾å¤ââå®å¨åºä»¶å è½½æ¶éé»å¤±è´¥ãdmesg æç¤ºäº lspci å sysfs 仿ªæç¤ºç缺失æä»¶ã
