DeepSeek Model Training Time and Efficiency DeepSeek Model Training Time and Efficiency | Zhiwei Li

DeepSeek Model Training Time and Efficiency | Generated by AI

Home 2026.04

To answer your question directly: training a model like DeepSeek-V2 (and its successor V3) takes about two months of continuous compute time.

Because DeepSeek uses a relatively small but highly optimized cluster of GPUs, they don’t need a year-long training cycle. Here is the specific breakdown of how that time is used:

Training Timeline Breakdown

For DeepSeek-V3 (the most recent and larger version), the training was completed in several phases:

  • Pre-training: Less than 2 months (specifically around 55–60 days).
  • Rate of Progress: On their cluster of 2,048 H800 GPUs, they were able to process 1 trillion tokens every 3.7 days.
  • Context Extension: An additional few days to teach the model how to handle long documents (up to 128k tokens).
  • Post-training (SFT/RL): A very short period (days) compared to the main pre-training.

Why it doesn’t take longer

You might wonder why a “smaller” cluster of 2,048 GPUs can finish so quickly when other companies use 20,000+ GPUs.

  1. Mixture-of-Experts (MoE): Since the model only “turns on” about $10\% - 15\%$ of its brain for any given word, the math moves much faster.
  2. FP8 Training: They used a data format called FP8 (8-bit floating point), which essentially doubles the speed of the hardware compared to the traditional 16-bit format used by older models.
  3. High Stability: DeepSeek reported that their training run was “remarkably stable” with zero irrecoverable loss spikes. In AI training, “crashes” or “glitches” often force engineers to restart from a week-old backup; DeepSeek avoided these delays entirely.

The “500 Million USD” Comparison

To put your $500M estimate into perspective:

  • DeepSeek’s Run: ~2 months on 2,048 GPUs $\approx$ $5.6$ million.
  • Meta’s Llama 3 (405B): ~Month on 16,000 GPUs $\approx$ Hundreds of millions.

DeepSeek didn’t spend less because they trained for less time; they spent less because their “car” (the model architecture) gets 10x better gas mileage than the competition.


References: