Maximizing Safe GPU Memory Usage | AI-generated and translated
Analysis: pushing the 4070's VRAM usage from about 7.5 GB to about 10-11 GB
Where the VRAM goes
At batch_size=4, block_size=1024, usage is 7538 MiB. The rough breakdown:
- Fixed part (weights + gradients + AdamW state (m, v)): about 2 GB for 124M parameters, assuming four fp32 values per parameter at 4 bytes each
- Activations: the remainder (about 5.5 GB), which grows linearly with batch_size and roughly quadratically with block_size
So activations cost about 1.4 GB per unit of batch_size. Reaching about 10.5 GB takes roughly 3 GB more activations, which means batch_size ≈ 6, or 7 if you are willing to be aggressive.
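The estimate above can be reproduced with a small linear model. This is only a sketch built from the figures quoted in this section (about 2 GB fixed, about 7.5 GB measured at batch_size=4); the constant and function names are made up for illustration, and real usage will deviate because of allocator fragmentation and compile-time spikes.

```python
# Rough VRAM model built from the figures quoted above (all values approximate).
FIXED_GB = 2.0            # weights + gradients + AdamW m, v for ~124M params
MEASURED_TOTAL_GB = 7.5   # observed at batch_size=4, block_size=1024
MEASURED_BATCH = 4

# Activations are whatever is left after the fixed part, split evenly per sample.
ACT_PER_SAMPLE_GB = (MEASURED_TOTAL_GB - FIXED_GB) / MEASURED_BATCH


def estimate_vram_gb(batch_size: int) -> float:
    """Linear estimate of steady-state VRAM at block_size=1024 (hypothetical helper)."""
    return FIXED_GB + ACT_PER_SAMPLE_GB * batch_size


if __name__ == "__main__":
    for bs in (4, 5, 6, 7):
        print(f"batch_size={bs}: ~{estimate_vram_gb(bs):.1f} GB")
```

Under this model batch_size=6 lands a little above 10 GB and batch_size=7 comes close to the card's limit, which is why 7 counts as the aggressive option.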
Main knob: batch_size (lines 15 and 17)
Keep the tokens per step (about 524,288) unchanged, so that the learning-rate schedule, warmup_iters, and max_iters need no retuning.
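I would start with batch_size=6, gradient_accumulation_steps=86 and watch nvidia-smi. If there is still headroom, try 7. With 6 the tokens per step will not match exactly (6 × 86 × 1024 = 528,384, about 0.8% above 524,288). That is fine: the learning-rate schedule is robust to deviations of a few percent.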
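The accumulation count for a given batch size can be derived rather than guessed. The sketch below assumes the step count is rounded up so that tokens per step never fall below the target; that rounding choice reproduces the 86 used here but is an assumption about how it was picked. The helper name `accumulation_steps` is hypothetical.

```python
BLOCK_SIZE = 1024
TARGET_TOKENS = 4 * 128 * BLOCK_SIZE  # current setup: 524,288 tokens per optimizer step


def accumulation_steps(target_tokens: int, batch_size: int, block_size: int) -> int:
    """Smallest number of micro-steps that reaches at least target_tokens per optimizer step."""
    tokens_per_micro_step = batch_size * block_size
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_tokens // tokens_per_micro_step)


if __name__ == "__main__":
    for bs in (4, 6, 7):
        steps = accumulation_steps(TARGET_TOKENS, bs, BLOCK_SIZE)
        actual = bs * steps * BLOCK_SIZE
        drift_pct = 100.0 * (actual - TARGET_TOKENS) / TARGET_TOKENS
        print(f"batch_size={bs}: steps={steps}, tokens={actual:,}, drift={drift_pct:+.2f}%")
```

Rounding down instead would give 85 steps at batch_size=6, slightly under the target; either choice stays within one percent.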
Why not the other knobs
- block_size: tied to the GPT-2 124M architecture and to how the data is tokenized and sharded. Leave it alone.
- n_layer / n_head / n_embd: these change the model, not just the memory footprint.
- compile=True: already on.
- Precision: nanoGPT's train.py already uses bf16/AMP by default. Worth verifying, but no config change is needed.
Caveats
- The display / Xorg takes about 115 MiB, so the usable budget is about 12,160 MiB rather than 12,288.
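- The first torch.compile step spikes VRAM. An OOM that occurs only during compilation is not a steady-state problem, but the run still fails. Stay conservative.
- Dropping from 128 to 86 gradient accumulation steps means fewer, larger micro-batches per optimizer update, while the effective batch stays essentially the same, so the loss curve should remain consistent. Throughput (tokens/sec) should rise slightly because per-token Python/PyTorch overhead shrinks.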
Summary
Raise batch_size from 4 to 6 and lower gradient_accumulation_steps from 128 to 86. Keep everything else unchanged. Expect about 10 GB of VRAM and slightly higher throughput.
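As a sketch, the change amounts to two values in a nanoGPT-style Python config. The file these lines live in is not shown in this section, so treat the fragment as illustrative rather than a drop-in patch; the `tokens_per_step` check at the end is an added convenience and assumes a single GPU.

```python
# Illustrative nanoGPT-style config override (the containing file is an assumption).
batch_size = 6                     # was 4; micro-batch size per forward/backward pass
gradient_accumulation_steps = 86   # was 128; micro-steps per optimizer update
block_size = 1024                  # unchanged

# Sanity check: tokens consumed per optimizer step on a single GPU.
tokens_per_step = batch_size * gradient_accumulation_steps * block_size
print(f"tokens per optimizer step: {tokens_per_step:,}")
```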
