Training Dataset and Token Statistics | Generated by AI
Dataset sizes
Training coverage
- Tokens seen: 12,000 iters × 524,288 = 6.29B tokens
- Epochs over train.bin: ~1.71 (each token seen <2× on average)
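The coverage figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the original calculation: it assumes train.bin stores each token as a 2-byte uint16 (the usual convention for GPT-2 vocabularies) and that "GB" means 10^9 bytes. The variable names are illustrative.

```python
# Assumption: tokens are stored as uint16 (2 bytes each), and GB means 1e9 bytes.
BYTES_PER_TOKEN = 2

iters = 12_000
tokens_per_iter = 524_288        # tokens consumed per training iteration
train_bin_bytes = 7.36e9         # size of train.bin quoted in the notes

tokens_seen = iters * tokens_per_iter
train_tokens = train_bin_bytes / BYTES_PER_TOKEN
epochs = tokens_seen / train_tokens

print(f"tokens seen:  {tokens_seen / 1e9:.2f}B")   # 6.29B
print(f"train tokens: {train_tokens / 1e9:.2f}B")  # 3.68B
print(f"epochs:       {epochs:.2f}")               # 1.71
```

If the file were stored with a different dtype (for example uint32), the token count and the epoch figure would change accordingly.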
Notes
- Compression ratio ~2.2× (16 GB text → 7.36 GB tokens) is normal for GPT-2 BPE (~3.7 chars/token).
- val.bin at 5,675 tokens is suspiciously small (only ~5 non-overlapping blocks at block_size=1024). With eval_iters=200, evaluation will mostly re-sample the same chunk. If you re-prepare the data, carve out a real holdout (e.g., 0.5% of train.bin = ~18M tokens); your val loss numbers would be more trustworthy.
- For reference: Chinchilla-optimal training for a 124M model is ~2.5B tokens (roughly 20 tokens per parameter), so at 6.3B tokens you're already past compute-optimal. The ceiling on this run is the model size, not the data: training longer won't push val loss much lower unless you scale the model.
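One way to carve out the larger holdout suggested above is to split the existing token file by byte offset. The sketch below is an assumption-laden illustration rather than the original preparation script: it assumes 2-byte uint16 tokens, and `split_holdout`, `_copy_bytes`, and the output file names are hypothetical. It takes the last 0.5% of tokens as validation data.

```python
import os

BYTES_PER_TOKEN = 2  # assumption: token ids stored as uint16


def _copy_bytes(src, dst, n_bytes, chunk=1 << 20):
    """Copy exactly n_bytes (or fewer at EOF) from src to dst in chunks."""
    remaining = n_bytes
    while remaining > 0:
        buf = src.read(min(chunk, remaining))
        if not buf:
            break
        dst.write(buf)
        remaining -= len(buf)


def split_holdout(src_path, train_out, val_out, val_fraction=0.005):
    """Write the last val_fraction of tokens to val_out, the rest to train_out.

    Returns (n_train, n_val) as token counts.
    """
    n_tokens = os.path.getsize(src_path) // BYTES_PER_TOKEN
    n_val = int(n_tokens * val_fraction)
    n_train = n_tokens - n_val
    with open(src_path, "rb") as src:
        with open(train_out, "wb") as out:
            _copy_bytes(src, out, n_train * BYTES_PER_TOKEN)
        with open(val_out, "wb") as out:
            _copy_bytes(src, out, n_val * BYTES_PER_TOKEN)
    return n_train, n_val
```

Example call (hypothetical output names): `split_holdout("train.bin", "train_new.bin", "val_new.bin")`. Note that a tail split is only representative if the documents were shuffled before tokenization; if the corpus is ordered by source or date, splitting at the document level during data preparation is likely the safer choice.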
