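Remotely Monitoring Training Progress | AI-generated and translated
Your intuition is quite accurate: training a model really is like a long "gestation" process. Unlike in biology, though, this process can be fully monitored and predicted.
Let's cover this in two parts: (1) how to check progress remotely and (2) how to estimate the remaining time from logs, with or without AI.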
1. How to check training progress (works overnight too)
A. Log-based monitoring (most important)
nanoGPT prints log lines like this:
iter 12000: loss 2.34, time 120ms
From this you can read off:
- The current iteration
- The loss trend
- The time per iteration
👉 You should redirect the log to a file:
python train.py > train.log 2>&1
To watch it remotely:
tail -f train.log
This alone gives you real-time visibility.
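If you want the three values above in a script rather than by eye, a minimal sketch like the following could work. It assumes the line shape shown in the sample (`iter N: loss X, time Yms`); the names `LogEntry` and `parse_log_line` are illustrative and not part of nanoGPT.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape, taken from the sample line: "iter 12000: loss 2.34, time 120ms".
LOG_LINE = re.compile(r"iter (\d+): loss ([\d.]+), time ([\d.]+)ms")


class LogEntry(NamedTuple):
    """Hypothetical container for one parsed log line."""
    iteration: int
    loss: float
    ms_per_iter: float


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if this is not an iteration line."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    return LogEntry(int(m.group(1)), float(m.group(2)), float(m.group(3)))


entry = parse_log_line("iter 12000: loss 2.34, time 120ms")
print(entry)
```

Lines that do not match (for example evaluation summaries) come back as `None`, so they can simply be skipped when reading `train.log`.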
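B. Remote access (simple setup)
If the training machine is remote:
- Connect over SSH:
ssh your_server
- Use tmux or screen (very important):
tmux new -s training
This ensures that:
- Training keeps running even if your connection drops
- You can reattach at any time: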
tmux attach -t training
C. GPU monitoring (health check)
watch -n 1 nvidia-smi
This shows:
- GPU utilization (should stay high, e.g. above 90%)
- Memory usage
- Temperature
If GPU utilization drops to 0%, training has probably stalled or crashed.
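The same stall check can be automated. The sketch below polls `nvidia-smi` through its CSV query mode and flags a stall when every GPU stays near idle across all samples. The 5% threshold and the function names are assumptions for illustration, and `read_utilization` only works on a machine with an NVIDIA driver installed.

```python
import subprocess
from typing import List

# CSV query mode of nvidia-smi: one utilization percentage per GPU, one per line.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output: str) -> List[int]:
    """Parse one integer percentage per non-empty output line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def looks_stalled(samples: List[List[int]], threshold: int = 5) -> bool:
    """True if every GPU stayed below `threshold` percent in every sample."""
    return bool(samples) and all(u < threshold for s in samples for u in s)


def read_utilization() -> List[int]:
    """Run nvidia-smi once and return the utilization of each GPU."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)
```

Requiring several consecutive low samples, rather than one, avoids false alarms during brief pauses such as checkpoint saves.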
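D. Optional: web dashboard (a nicer experience)
For a more "AI-native" monitoring experience:
- TensorBoard (if your run writes TensorBoard event files to out):
tensorboard --logdir=out
- Or log to:
  - Weights & Biases (wandb)
  - MLflow
This lets you check progress from anywhere, even on your phone.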
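2. Estimating when training will finish
This is where the idea of "using AI to predict the finish time" comes in, and it is entirely feasible.
Method 1: simple deterministic estimate
From the log, take:
- Current iteration = i
- Maximum iterations = max_iters
- Average time per iteration = t
Then compute:
remaining time = (max_iters - i) * t
Example: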
- max_iters = 600,000
- Current iteration = 120,000
- Time per iteration = 0.12 seconds
remaining time = (600,000 - 120,000) * 0.12 = 57,600 seconds = 16 hours
👉 This is already quite accurate, provided the time per iteration stays stable.
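The arithmetic in this example can be written as a small helper. This is a sketch with illustrative names (`remaining_seconds`, `format_duration`), using the same numbers as the worked example.

```python
def remaining_seconds(current_iter: int, max_iters: int, sec_per_iter: float) -> float:
    """Remaining time = (max_iters - current_iter) * seconds per iteration."""
    return max(max_iters - current_iter, 0) * sec_per_iter


def format_duration(seconds: float) -> str:
    """Render a duration in seconds as 'Hh MMm', rounded to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Same inputs as the example: 480,000 iterations left at 0.12 s each.
eta = remaining_seconds(current_iter=120_000, max_iters=600_000, sec_per_iter=0.12)
print(format_duration(eta))
```

Clamping at zero keeps the estimate sensible if the log is read after the final iteration.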
Method 2: smarter estimate (AI-style)
You can:
- Parse train.log
- Compute a moving average of the iteration time
- Detect slowdowns (e.g. checkpoint saves, evaluation phases)
and then refine the estimate.
You can even write a small script:
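import re

iters = []
times = []

# Collect the iteration number and per-iteration time (ms) from each log line.
with open("train.log") as f:
    for line in f:
        m = re.search(r"iter (\d+).*time ([\d.]+)ms", line)
        if m:
            iters.append(int(m.group(1)))
            times.append(float(m.group(2)))

if not iters:
    raise SystemExit("No iteration lines found in train.log")

# Average over the most recent entries (up to 100), converted from ms to seconds.
recent = times[-100:]
avg_time = sum(recent) / len(recent) / 1000  # seconds

current_iter = iters[-1]
max_iters = 600000
remaining = (max_iters - current_iter) * avg_time
print("Estimated remaining time (hours):", remaining / 3600)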
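A plain average is pulled upward by the occasional slow iteration (a checkpoint save or an evaluation pass). One possible refinement, sketched here as an assumption rather than anything nanoGPT provides, is to use the median of the recent timings instead. The function name and the default window of 100 are illustrative.

```python
from statistics import median
from typing import Sequence


def robust_eta_seconds(
    times_ms: Sequence[float],
    current_iter: int,
    max_iters: int,
    window: int = 100,
) -> float:
    """Estimate remaining seconds from the median of the last `window` timings."""
    recent = times_ms[-window:]
    if not recent:
        raise ValueError("need at least one iteration timing")
    sec_per_iter = median(recent) / 1000.0  # ms -> seconds
    return max(max_iters - current_iter, 0) * sec_per_iter


# Nine ordinary iterations plus one slow outlier (for example, a checkpoint save).
timings = [120.0] * 9 + [3000.0]
hours = robust_eta_seconds(timings, current_iter=120_000, max_iters=600_000) / 3600
print(hours)
```

The trade-off is that the median ignores the real cost of those periodic slow steps, so it will slightly underestimate the total when they are frequent; the plain average errs in the other direction right after a spike.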
Method 3: ask an AI (as you suggested)
You can paste the recent log lines and ask:
"Based on the current training speed trend, when will this finish?"
An AI can:
- Smooth noisy timing data
- Spot anomalies
- Estimate the finish time
This works surprisingly well.
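3. Advanced: making it feel "alive" (echoing your analogy)
If you like the "raising a child" feeling, you can track:
- Loss curve → "learning progress"
- Validation loss → "generalization maturity"
- Generated samples → "emerging personality"
nanoGPT also includes a sampling script that you can run periodically against the latest checkpoint: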
python sample.py
It's like saying:
"Let me see how my child is talking now."
4. Practical advice (recommended minimal setup)
For a lean setup:
- tmux → keeps the job running
- train.log → persistent log
- A simple Python script → estimates the finish time
- Optionally, wandb → remote visualization
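Key takeaways
- Estimating the finish time needs no complex tools; the log already contains everything you need
- Given the iteration speed and the remaining steps, the estimate is straightforward
- Adding AI only makes the estimate more adaptive; it doesn't fundamentally change it
If you'd like, I can help you build a one-command monitoring script that can:
- Parse the log
- Show the estimated finish time
- Show the training speed trend
- Alert you when training stalls
That would make your "midnight check-ins" much smoother.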
