Training Dataset and Token Statistics

Generated by AI

2026.04

Dataset sizes

File	Size	Contents
`train_fineweb.txt`	16 GB	89.6M lines of raw text (sample doc above is news/web text)
`train.bin`	7.36 GB	3.68B GPT-2 BPE tokens (uint16)
`val.bin`	12 KB	5,675 tokens; tiny, which is why val loss is jittery
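
The token counts in the table can be rechecked from file sizes alone, since each token is stored as a 2-byte uint16. The sketch below assumes the `.bin` files are flat arrays with no header; the helper names `token_count` and `expected_size_gb` are illustrative and not part of any training script.

```python
import os

BYTES_PER_TOKEN = 2  # uint16, as stated in the table


def token_count(path: str, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Number of tokens in a flat binary token file (assumes no header)."""
    size = os.path.getsize(path)
    if size % bytes_per_token != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of {bytes_per_token}")
    return size // bytes_per_token


def expected_size_gb(n_tokens: float, bytes_per_token: int = BYTES_PER_TOKEN) -> float:
    """On-disk size in decimal GB for a given token count."""
    return n_tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    # 3.68B tokens at 2 bytes each should match the 7.36 GB reported for train.bin.
    print(f"train.bin expected size: {expected_size_gb(3.68e9):.2f} GB")
    # 5,675 tokens at 2 bytes each is about 11 KB, in line with the ~12 KB reported.
    print(f"val.bin expected size: {5675 * BYTES_PER_TOKEN / 1e3:.1f} KB")
```

Running `token_count("val.bin")` on the real file is a quick way to confirm whether the validation split really holds only a few thousand tokens.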

Training coverage

Notes

Compression ratio ~2.2× (16 GB text → 7.36 GB tokens) is normal for GPT-2 BPE: 16 GB over 3.68B tokens works out to ~4.3 bytes of text per token, in the usual range for English web text.
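
The ratio can be rechecked from the table figures. This is a rough sketch that assumes the reported 16 GB is a decimal, rounded value; it gives bytes of text per token, which equals characters per token only for pure-ASCII text and is an upper bound otherwise.

```python
# Figures taken from the dataset table; both are rounded, so results are approximate.
text_bytes = 16e9          # train_fineweb.txt, assumed decimal GB
n_tokens = 3.68e9          # train.bin token count
bytes_per_token_on_disk = 2  # uint16

# How many bytes of raw text each token covers on average.
text_bytes_per_token = text_bytes / n_tokens

# Ratio of raw text size to tokenized file size.
size_ratio = text_bytes / (n_tokens * bytes_per_token_on_disk)

print(f"text bytes per token: {text_bytes_per_token:.2f}")
print(f"text size / token file size: {size_ratio:.2f}x")
```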
val.bin at 5,675 tokens is suspiciously small: only about five non-overlapping sequences at block_size=1024, which is fewer than one batch for any batch size above 5. With eval_iters=200, evaluation keeps re-sampling overlapping windows from the same few thousand tokens. If you re-prepare the data, carve out a real holdout (e.g., 0.5% of train.bin, ~18M tokens) so the val loss numbers are more trustworthy.
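
One way to carve out such a holdout is to stream the tail of the existing token file into a new validation file. The sketch below is a minimal standard-library version under several assumptions: the file is a flat uint16 array with no header, taking the last 0.5% of tokens is acceptable (the cut may fall mid-document, and the tail may not be representative if the data is ordered), and the function name and output paths are hypothetical.

```python
import os


def _copy_bytes(src_file, dst_path: str, n_bytes: int, chunk_bytes: int) -> None:
    """Copy exactly n_bytes from an open file into dst_path, chunk by chunk."""
    remaining = n_bytes
    with open(dst_path, "wb") as out:
        while remaining > 0:
            chunk = src_file.read(min(chunk_bytes, remaining))
            if not chunk:
                raise EOFError("source file ended before the expected byte count")
            out.write(chunk)
            remaining -= len(chunk)


def split_holdout(src: str, train_out: str, val_out: str,
                  val_fraction: float = 0.005, bytes_per_token: int = 2,
                  chunk_bytes: int = 1 << 24) -> tuple[int, int]:
    """Split a flat token file into a train part and a tail holdout.

    Returns (train_tokens, val_tokens). The split lands on a token boundary.
    """
    size = os.path.getsize(src)
    if size % bytes_per_token != 0:
        raise ValueError("file size is not a whole number of tokens")
    total_tokens = size // bytes_per_token
    val_tokens = int(total_tokens * val_fraction)
    train_tokens = total_tokens - val_tokens
    with open(src, "rb") as f:
        # The first part becomes the new training file, the tail becomes validation.
        _copy_bytes(f, train_out, train_tokens * bytes_per_token, chunk_bytes)
        _copy_bytes(f, val_out, val_tokens * bytes_per_token, chunk_bytes)
    return train_tokens, val_tokens
```

A call such as `split_holdout("train.bin", "train_new.bin", "val_new.bin")` writes new files rather than overwriting the originals, so it needs roughly another 7.4 GB of free disk space. Splitting at document boundaries before tokenization would be cleaner if the raw text is being re-prepared anyway.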
For reference: Chinchilla-optimal training for a 124M model is ~2.5B tokens (roughly 20 tokens per parameter), so at 6.3B tokens you’re already past compute-optimal. The ceiling on this run is model size, not data: without scaling the model, training longer is unlikely to lower val loss much.
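
The arithmetic behind that comparison can be sketched as follows. The 20-tokens-per-parameter figure is the commonly quoted rule of thumb and is an assumption here, not an exact law; the other numbers come from this page.

```python
# Assumption: ~20 training tokens per parameter as the compute-optimal heuristic.
TOKENS_PER_PARAM = 20

n_params = 124e6        # model size
tokens_trained = 6.3e9  # tokens seen during this run
dataset_tokens = 3.68e9  # tokens in train.bin

optimal_tokens = n_params * TOKENS_PER_PARAM          # about 2.48e9
overshoot = tokens_trained / optimal_tokens           # multiple of the heuristic budget
tokens_per_param = tokens_trained / n_params          # actual tokens per parameter
passes_over_data = tokens_trained / dataset_tokens    # approximate epochs over train.bin

print(f"compute-optimal budget: {optimal_tokens / 1e9:.2f}B tokens")
print(f"run is {overshoot:.2f}x that budget ({tokens_per_param:.1f} tokens/param)")
print(f"about {passes_over_data:.2f} passes over train.bin")
```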
