Multi-node CPU training status | AI-generated and translated
Here is a summary of our progress and the current status:
Multi-node CPU distributed training setup: status report
What we found
Machine inventory:
192.168.1.36 (lzw@): master node
CPU: i9-13900KF, 24 cores / 32 threads
Memory: 62 GB
GPU: RTX 4070 12GB (CUDA)
Torch: 2.9.1+cu128
192.168.1.47 (xiaoxin@xx): worker node
CPU: i5-12450H, 8 cores / 12 threads
Memory: 16 GB
GPU: Intel UHD (integrated graphics, no CUDA)
Torch: 2.9.1+cpu (freshly installed)
Key architectural insight: Nanochat does not use PyTorch's DDP wrapper. Instead, it has a custom DistMuonAdamW optimizer that handles gradient synchronization directly with dist.all_reduce / dist.reduce_scatter_tensor / dist.all_gather_into_tensor, which is a ZeRO-2 style approach.
Code patches applied (4 files)
1. nanochat/common.py: compute_init()
Added Gloo backend support for CPU DDP:
elif is_ddp_requested and device_type == "cpu":
    device = torch.device("cpu")
    dist.init_process_group(backend="gloo")
    dist.barrier()
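The backend choice behind this patch can be sketched as a small lookup. This is only an illustration: pick_backend is a hypothetical helper, not a function in nanochat, and the string it returns is what would be passed as backend= to dist.init_process_group.

```python
def pick_backend(device_type: str) -> str:
    """Return a torch.distributed backend name for a device type.

    Hypothetical helper for illustration; nanochat's compute_init()
    makes this decision inline.
    """
    if device_type == "cuda":
        return "nccl"  # GPU-to-GPU collectives
    if device_type == "cpu":
        return "gloo"  # CPU collectives over TCP
    raise ValueError(f"no distributed backend chosen for {device_type!r}")


print(pick_backend("cpu"))   # gloo
print(pick_backend("cuda"))  # nccl
```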
2. nanochat/flash_attention.py: _resolve_impl()
Added a NANOCHAT_FORCE_SDPA=1 environment variable override. Flash Attention only works on CUDA; CPU training must use the PyTorch SDPA fallback.
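A minimal sketch of how such an override might be resolved is shown here. The function name, its arguments, and the returned strings "flash" and "sdpa" are assumptions made for illustration; only the NANOCHAT_FORCE_SDPA variable comes from the patch, and the real _resolve_impl() may be structured differently.

```python
import os


def resolve_attention_impl(cuda_available, flash_installed, env=None):
    """Pick an attention implementation name (hypothetical helper).

    The environment override wins; otherwise Flash Attention is used
    only when both CUDA and the flash kernels are available.
    """
    env = os.environ if env is None else env
    if env.get("NANOCHAT_FORCE_SDPA") == "1":
        return "sdpa"
    if cuda_available and flash_installed:
        return "flash"
    return "sdpa"


# A CPU-only worker always lands on the SDPA fallback.
print(resolve_attention_impl(False, False, env={}))  # sdpa
# The override forces SDPA even where Flash Attention could run.
print(resolve_attention_impl(True, True, env={"NANOCHAT_FORCE_SDPA": "1"}))  # sdpa
```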
3. nanochat/optim.py: DistMuonAdamW
Added the _AsyncWorkWrapper + _async_op() helpers, because Gloo does not support Work.get_future() (NCCL does). All 5 .get_future() calls are now wrapped:
class _AsyncWorkWrapper:
    def __init__(self, work):
        self._work = work

    def wait(self):
        self._work.wait()


def _async_op(work):
    try:
        return work.get_future()
    except RuntimeError:
        return _AsyncWorkWrapper(work)
4. nanochat/dataset.py
Added a NANOCHAT_DATA_DIR environment variable override so that both machines can use the same subset of data shards.
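One way such an override could be resolved is sketched below. resolve_data_dir is a hypothetical helper, and the fallback to NANOCHAT_BASE_DIR plus a base_data_climbmix subdirectory is an assumption inferred from the paths in this report, not a description of the actual dataset.py logic.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None, default_subdir="base_data_climbmix"):
    """Return the shard directory, honoring NANOCHAT_DATA_DIR if set.

    Hypothetical helper: the default layout is assumed, not confirmed.
    """
    env = os.environ if env is None else env
    override = env.get("NANOCHAT_DATA_DIR")
    if override:
        return Path(override).expanduser()
    base = env.get("NANOCHAT_BASE_DIR", "~/.cache/nanochat")
    return Path(base).expanduser() / default_subdir


# With the override set, the default directory is ignored.
print(resolve_data_dir({"NANOCHAT_DATA_DIR": "/data/base_data_small"}))
```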
What works
Single-process CPU training on 1.36: ✅
step 00001 | loss: 10.397800 | tok/sec: 45
Single-node, two-process CPU DDP (Gloo) on 1.36: ✅
step 00001 | loss: 10.397759 | tok/sec: 89 (roughly 2x speedup!)
The Gloo connection succeeds, gradient synchronization works, and checkpoint saving works.
What remains to be done
1. Data consistency: both machines need identical parquet files. Current state:
- 1.36: ~/.cache/nanochat/base_data_climbmix/ holds 201 shards (18 GB)
- 1.47: ~/.cache/nanochat/base_data_climbmix/ holds 4 shards (352 MB)
- A base_data_small/ directory of symlinks containing 4 shards has been created on 1.36
- Both machines need to run with NANOCHAT_DATA_DIR=~/.cache/nanochat/base_data_small
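To confirm that both machines really see the same shards, a small manifest can be computed on each and the digests compared by eye. This is a sketch under assumptions: it expects the shards to be *.parquet files directly inside the directory, and it compares only file names and sizes, not file contents. Both function names are invented for this example.

```python
import hashlib
from pathlib import Path


def shard_manifest(data_dir):
    """Map each .parquet file name to its size in bytes (symlinks are followed)."""
    root = Path(data_dir).expanduser()
    return {p.name: p.stat().st_size for p in sorted(root.glob("*.parquet"))}


def manifest_digest(manifest):
    """Hash the sorted (name, size) pairs into one comparable string."""
    h = hashlib.sha256()
    for name, size in sorted(manifest.items()):
        h.update(f"{name}:{size}\n".encode())
    return h.hexdigest()


if __name__ == "__main__":
    manifest = shard_manifest("~/.cache/nanochat/base_data_small")
    print(len(manifest), "shards, digest", manifest_digest(manifest))
```

Running it on 1.36 and 1.47 should print the same shard count and digest if the two directories match by name and size.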
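2. Port 29500 access: the master node (1.36) must make sure 1.47 can reach port 29500. General network connectivity is verified (SSH on port 22 is reachable), but port 29500 itself still needs to be confirmed.
3. The final multi-node test has not been run yet. All patches have been synced to both machines.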
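The port 29500 check from item 2 can be done with a plain TCP connection attempt from the worker. This is a minimal sketch; port_open is a name invented here. Note that a connection only succeeds while something is listening on the port, so the check is meaningful once torchrun (or any test listener) is running on the master.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Run on 1.47 while the master process is listening on 1.36.
    print("29500 reachable:", port_open("192.168.1.36", 29500))
```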
How to run multi-node training
On 1.36 (master node), terminal 1:
cd /mnt/data/nanochat && source .venv/bin/activate
export NANOCHAT_BASE_DIR=~/.cache/nanochat
export NANOCHAT_DATA_DIR=~/.cache/nanochat/base_data_small
export NANOCHAT_FORCE_SDPA=1
export OMP_NUM_THREADS=8
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
--master_addr=192.168.1.36 --master_port=29500 \
-m scripts.base_train -- \
--device-type cpu --depth 4 --max-seq-len 512 \
--device-batch-size 2 --total-batch-size 2048 \
--num-iterations 20 --eval-every 10 --tracker none --run dist-cpu-2node
On 1.47 (worker node), terminal 2:
cd ~/projects/nanochat && source .venv/bin/activate
export NANOCHAT_BASE_DIR=~/.cache/nanochat
export NANOCHAT_FORCE_SDPA=1
export OMP_NUM_THREADS=4
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 \
--master_addr=192.168.1.36 --master_port=29500 \
-m scripts.base_train -- \
--device-type cpu --depth 4 --max-seq-len 512 \
--device-batch-size 2 --total-batch-size 2048 \
--num-iterations 20 --eval-every 10 --tracker none --run dist-cpu-2node
The launch scripts have also been saved:
- On 1.36: /mnt/data/nanochat/run_dist_cpu_master.sh
- On 1.47: ~/projects/nanochat/run_dist_cpu_worker.sh
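Both commands place -- between the module name and the script flags so that --run reaches the training script instead of the launcher. One plausible mechanism for the clash is argparse's prefix matching of long options; that is an assumption here, not something verified against torchrun's source. The toy parser below is not torchrun's parser and only demonstrates the general argparse behavior.

```python
import argparse

# Toy launcher with a --run-path flag and free-form script arguments.
parser = argparse.ArgumentParser(prog="toy-launcher")
parser.add_argument("--run-path", action="store_true")
parser.add_argument("script_args", nargs="*")

# Without a separator, --run is accepted as an abbreviation of --run-path.
swallowed = parser.parse_args(["--run", "dist-cpu-2node"])
print(swallowed.run_path, swallowed.script_args)  # True ['dist-cpu-2node']

# After --, everything is treated as a positional and passed through intact.
passed = parser.parse_args(["--", "--run", "dist-cpu-2node"])
print(passed.run_path, passed.script_args)  # False ['--run', 'dist-cpu-2node']
```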
Key takeaways
- Backend choice: NCCL (GPU to GPU, high bandwidth) vs. Gloo (CPU to CPU, over TCP)
- No DDP wrapper needed: Nanochat synchronizes gradients at the optimizer level with a custom optimizer (ZeRO-2 style)
- Async operations: NCCL supports get_future() for asynchronous communication; Gloo needs a synchronous fallback
- Flash Attention is CUDA-only: CPU training must force the SDPA fallback
- torchrun's -- separator: the script argument --run collides with torchrun's --run-path; put -- before the script arguments to separate them
- Batch size arithmetic: total_batch_size must be divisible by device_batch_size × seq_len × world_size
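The divisibility rule can be checked before launching. In the sketch below, grad_accum_steps is a hypothetical helper, and reading the quotient as the number of gradient accumulation steps is an assumption. The numbers are the ones from the launch commands: 2 sequences per device, 512 tokens per sequence, 2 processes.

```python
def grad_accum_steps(total_batch_size, device_batch_size, seq_len, world_size):
    """Return total_batch_size / (device_batch_size * seq_len * world_size).

    Raises ValueError if the division is not exact.
    """
    tokens_per_step = device_batch_size * seq_len * world_size
    if total_batch_size % tokens_per_step != 0:
        raise ValueError(
            f"total_batch_size={total_batch_size} is not divisible by "
            f"{device_batch_size} x {seq_len} x {world_size} = {tokens_per_step}"
        )
    return total_batch_size // tokens_per_step


# Two-node run from this report: 2 x 512 x 2 = 2048 tokens per step.
print(grad_accum_steps(2048, 2, 512, 2))  # 1
# The same settings on a single process would need two accumulation steps.
print(grad_accum_steps(2048, 2, 512, 1))  # 2
```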
