Tokenization and File Compression Ratios Explained | AI-generated and translated
Yes, this is completely normal. Here is why:
The math
Raw text → tokenized (uint16) conversion:
Original: "The cat sat on the mat"
Characters: 22 characters × 1 byte = 22 bytes
Tokens: ~6 tokens × 2 bytes = 12 bytes (GPT-2 BPE tokenization)
Compression ratio (characters to tokens): ~2-4x
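The arithmetic for the example sentence can be checked with a short sketch. It does not call a real tokenizer: the token split in `assumed_tokens` is an assumption about how a GPT-2-style BPE would treat common words (each word kept whole, with its leading space), so the counts are illustrative only.

```python
text = "The cat sat on the mat"

# Assumed split: one token per common word, leading space included.
# This is a hand-written stand-in, not output from a real tokenizer.
assumed_tokens = ["The", " cat", " sat", " on", " the", " mat"]

# Raw size: ASCII text takes 1 byte per character in UTF-8.
raw_bytes = len(text.encode("utf-8"))

# Tokenized size: each token ID is stored as a uint16, i.e. 2 bytes.
token_bytes = len(assumed_tokens) * 2

chars_per_token = len(text) / len(assumed_tokens)
byte_ratio = raw_bytes / token_bytes

print(f"raw: {raw_bytes} bytes, tokenized: {token_bytes} bytes")
print(f"chars/token: {chars_per_token:.2f}, byte ratio: {byte_ratio:.2f}x")
```

Note the two different ratios: characters per token describes how many characters each token covers, while the byte ratio also accounts for the 2 bytes spent on every token ID.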
Your data
*Assumes English text averages ~4 chars/token with the GPT-2 tokenizer
Why so small?
- Text → Tokens: English compresses ~2-4x when tokenized (common words become single tokens)
- uint16: each token is stored as 2 bytes (versus 4 bytes for a 32-bit integer)
- Partial data: you only prepared a 477MB sample of the 61GB file (under 1%)
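The uint16 point can be illustrated with a minimal sketch. It uses Python's standard `struct` module as a stand-in for whatever array library a preparation script actually uses, and the token IDs are arbitrary placeholders rather than real tokenizer output.

```python
import struct

GPT2_VOCAB_SIZE = 50257   # size of the GPT-2 BPE vocabulary
UINT16_MAX = 2**16 - 1    # largest value an unsigned 16-bit integer can hold

# The largest token ID is vocab_size - 1, so every ID fits in a uint16.
fits_in_uint16 = GPT2_VOCAB_SIZE - 1 <= UINT16_MAX

# Placeholder token IDs, chosen only to span the valid range.
token_ids = [10, 200, 3000, 40000, 50256]

# Pack the same IDs as little-endian uint16 ("H") and uint32 ("I").
as_uint16 = struct.pack(f"<{len(token_ids)}H", *token_ids)
as_uint32 = struct.pack(f"<{len(token_ids)}I", *token_ids)

print(f"fits in uint16: {fits_in_uint16}")
print(f"uint16: {len(as_uint16)} bytes, uint32: {len(as_uint32)} bytes")
```

Because the vocabulary fits below 65,536, the narrower type halves the file size with no loss of information.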
Real-world comparison
nanoGPT's standard Shakespeare dataset:
- Raw text: ~1MB
- Tokenized train.bin: ~1MB

FineWeb full (10B tokens): ~20GB tokenized
So a 27MB train.bin from a 477MB text sample is ~18x compression. This is normal. If you prepare the full 61GB, you will get roughly 30GB of .bin files in total.
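The size figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes about 1 byte per character and the ~4 chars/token average stated earlier. The helper names `estimate_tokens` and `tokens_to_bin_bytes` are hypothetical and exist only for this illustration.

```python
GB = 10**9  # decimal gigabyte, used for rough sizing only


def estimate_tokens(text_bytes, chars_per_token=4.0):
    """Rough token count, assuming ~1 byte per character of English text."""
    return text_bytes / chars_per_token


def tokens_to_bin_bytes(num_tokens, bytes_per_token=2):
    """Size of a .bin file that stores one uint16 (2 bytes) per token."""
    return num_tokens * bytes_per_token


# FineWeb: 10B tokens at 2 bytes each.
fineweb_bin = tokens_to_bin_bytes(10 * 10**9)

# Full 61GB text file at ~4 chars/token, 2 bytes per token.
full_bin = tokens_to_bin_bytes(estimate_tokens(61 * GB))

# Ratio between the 477MB sample and the 27MB train.bin.
observed_ratio = 477 / 27

print(f"FineWeb 10B tokens: {fineweb_bin / GB:.1f} GB")
print(f"Full 61GB text:     {full_bin / GB:.1f} GB")
print(f"Sample ratio:       {observed_ratio:.1f}x")
```

Changing `chars_per_token` shows how sensitive the projection is to the tokenizer and the kind of text being processed.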
