# Example: Quantization-Aware Training (QAT) with Unsloth + torchao.
from unsloth import FastLanguageModel
from torchao.quantization import quantize_
from torchao.quantization.qat import QATConfig

# Load the base model in 16-bit so QAT fake-quantization ops can be inserted
# during fine-tuning (QAT trains against simulated quantization noise).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length = 2048,
    load_in_16bit = True,
)

# Attach LoRA adapters to the attention and MLP projections, with a QAT
# scheme so the forward pass simulates the target quantization format.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 32,
    # We support fp8-int4, fp8-fp8, int8-int4, int4
    qat_scheme = "int4",
)

# After training, convert the fake-quantized modules into real quantized ones.
quantize_(model, QATConfig(step = "convert"))
# Save the quantized model. Use the exact same config as QAT (convenient function).
# NOTE(review): Unsloth's save_pretrained_torchao takes (save_directory, tokenizer,
# torchao_config=...); the original text had the first two arguments garbled as
# (model, "tokenizer") — confirm against the installed Unsloth version.
model.save_pretrained_torchao(
    "model", tokenizer,
    torchao_config = model._torchao_config.base_config,
)

# Int4 QAT
from torchao.quantization import Int4WeightOnlyConfig
model.save_pretrained_torchao(
    "model", tokenizer,
    torchao_config = Int4WeightOnlyConfig(),
)

# Int8 QAT
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
model.save_pretrained_torchao(
    "model", tokenizer,
    torchao_config = Int8DynamicActivationInt8WeightConfig(),
)

# Float8 (per-row dynamic activation + weight quantization)
from torchao.quantization import PerRow
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig
torchao_config = Float8DynamicActivationFloat8WeightConfig(granularity = PerRow())
model.save_pretrained_torchao("model", tokenizer, torchao_config = torchao_config)

# Required packages (run these in a shell, not in Python):
#   pip install --upgrade --no-cache-dir --force-reinstall unsloth unsloth_zoo
#   pip install torchao==0.14.0 fbgemm-gpu-genai==1.3.0






