Tokenization and File Compression Ratios Explained | Generated by AI
Yes, this is completely normal! Here’s why:
The Math
Raw text → Tokenized (uint16) conversion:
Original: "The cat sat on the mat"
Chars: 20 characters × 1 byte = 20 bytes
Tokens: ~5 tokens × 2 bytes = 10 bytes (GPT-2 BPE tokenization)
Compression ratio: ~2-4x from text to tokens
Your Numbers
*Assuming ~4 chars/token average for English text with GPT-2 tokenizer
Why So Small?
- Text → Tokens: English compresses ~2-4x when tokenized (common words become single tokens)
- uint16: Each token stored as 2 bytes (vs ~4 bytes for Python int)
- Partial data: You only prepared a 477MB sample of the 61GB file (less than 1%)
Real-World Comparison
nanoGPT’s standard Shakespeare dataset:
- Raw text: ~1MB
- Tokenized
train.bin: ~1MB - FineWeb full (10B tokens): ~20GB tokenized
So 27MB train.bin from 477MB text sample = ~18x compression. That’s normal. If you prepared the full 61GB, you’d get ~30GB of .bin files total.
