Official PyTorch implementation of the paper "EfficientQAT: Efficient Quantization-Aware Training for Large Language Models"
- [2025/11] 🔥 We open-source INT vs. FP, a framework to compare low-bit integer and floating-point formats, including MXFP8/MXFP6/MXFP4/NVFP4 and MXINT8/MXINT6/MXINT4/NVINT4.
- [2025/05] 🔥 We explore the Scaling Law for Quantization-Aware Training, which offers insights and guidance for LLM QAT.
- [2025/05] 🌟 Our EfficientQAT paper has been accepted for ACL 2025 Main Conference! 🎉 Cheers!
- [2024/10] 🔥 We release a new weight-activation quantization algorithm, PrefixQuant, which proposes an efficient method to isolate sink tokens (token-wise outliers).
- [2024/08] The new inference backend T-MAC from Microsoft supports EfficientQAT models.
- [2024/08] We support the quantization of Mistral-Large-Instruct. With our EfficientQAT, W2g64 compresses the 123B model to 35 GB with only 4 points of accuracy degradation.
- [2024/07] New features! We support transferring EfficientQAT quantized models into GPTQ v2 format and BitBLAS format, which can be directly loaded through GPTQModel.
- [2024/07] We release EfficientQAT, which pushes the limits of uniform (INT) quantization in an efficient manner.
- Clone this repository and navigate to EfficientQAT folder
git clone https://github.com/OpenGVLab/EfficientQAT.git
cd EfficientQAT
- Install package
conda create -n efficientqat python==3.11
conda activate efficientqat
pip install -r requirements.txt
We provide a number of pre-quantized EfficientQAT models as follows:
- WikiText2 PPL is measured with a 2048 context length.
- Avg. Accuracy indicates the average accuracy on 5 zero-shot reasoning tasks (WinoGrande, PIQA, HellaSwag, ARC-Easy, ARC-Challenge) with lm-eval v0.4.2.
- 1GB = $10^9$ bytes
- Hub Link: EQAT indicates the original checkpoints. We also transfer the checkpoints into GPTQ and BitBLAS formats, which can be loaded directly through GPTQModel. (PS: GPTQModel is an official bug-fixed fork of AutoGPTQ, which is expected to be merged into AutoGPTQ in the future.)
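As a back-of-envelope check on the model sizes above, the effective bits per weight of group-wise quantization can be estimated as the weight bit-width plus the per-group overhead. The sketch below assumes one fp16 scale and one fp16 zero-point per group, which is an assumption about the storage layout, not the exact EfficientQAT checkpoint format:

```python
# Estimate storage cost of group-wise weight quantization.
# Assumption: each group of `group_size` weights stores one fp16 scale
# and one fp16 zero-point (2 * 16 bits of overhead per group).
def effective_bits(wbits, group_size, param_bits=16):
    return wbits + 2 * param_bits / group_size

def model_size_gb(n_params, wbits, group_size):
    # 8 bits per byte, 1 GB = 1e9 bytes
    return n_params * effective_bits(wbits, group_size) / 8 / 1e9

print(effective_bits(2, 64))      # 2.5 bits per weight for w2g64
print(model_size_gb(7e9, 2, 64))  # ~2.19 GB for a 7B model (weights only)
```

This counts quantized weights only; embeddings, norms, and other fp16 parameters add to the final checkpoint size.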
EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). Detailed training scripts can be found in ./examples. Below we give training script examples for Llama-2-7B with w2g64 quantization.
- Block-AP
You should modify --model in the script to point to the folder of the full-precision model before running the following command.
bash examples/block_ap/Llama-2-7b/w2g64.sh
Specifically, --weight_lr is 2e-5 for 2-bit and 1e-5 for 3-/4-bit in our experiments.
Some other important arguments:
- --train_size: number of training data samples, 4096 by default
- --val_size: number of validation data samples, 64 by default
- --off_load_to_disk: save the training dataset to disk, saving CPU memory but possibly reducing training speed
- E2E-QP
Then, you can load the quantized model from Block-AP for further E2E-QP. Specifically, E2E-QP can adapt to different scenarios by changing the training dataset. You should modify --quant_model_path in the script to point to the folder of the quantized model before running the following command.
1) Train on RedPajama
bash examples/e2e_qp/Llama-2-7b/w2g64-redpajama.sh
2) Train on Alpaca
bash examples/e2e_qp/Llama-2-7b/w2g64-alpaca.sh
Specifically, --learning_rate is 2e-5 for 2-bit and 1e-5 for 3-/4-bit in our experiments. You can decrease --per_device_train_batch_size to reduce the memory footprint during training, making sure that --gradient_accumulation_steps increases by the same multiple to maintain the same total batch size.
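The batch-size bookkeeping above can be sketched as follows: the effective batch size is the per-device batch size times the gradient-accumulation steps (times the number of GPUs), so halving one while doubling the other keeps training equivalent:

```python
# Effective batch size for gradient accumulation: halving the per-device
# batch while doubling the accumulation steps leaves training unchanged.
def effective_batch(per_device_train_batch_size, gradient_accumulation_steps, n_gpus=1):
    return per_device_train_batch_size * gradient_accumulation_steps * n_gpus

print(effective_batch(4, 8))   # 32
print(effective_batch(2, 16))  # 32, same effective batch with less memory
```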
- Download the pre-quantized EfficientQAT models from Huggingface
pip install huggingface_hub
huggingface-cli download ChenMnZ/Llama-2-7b-EfficientQAT-w2g64 --local-dir ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64
- Evaluate the pre-quantized EfficientQAT model
CUDA_VISIBLE_DEVICES=0 python main_block_ap.py \
--resume_quant ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64 \
--net Llama-2 \
--wbits 2 \
--group_size 64 \
--output_dir ./output/inference_results/ \
--eval_ppl \
--eval_tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande
First, you should install the gptqmodel package to support the GPTQ and BitBLAS quantization formats:
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
bash install.sh
- In our experiments, we tested with gptqmodel v0.9.8.
Then, we offer three types of transfer as follows:
- Transfer EfficientQAT checkpoints to GPTQ format
bash examples/model_transfer/efficientqat_to_gptq/llama-2-7b.sh
- Note: AutoGPTQ currently has overflow bugs for asymmetric quantization, so we use the official bug-fixed fork GPTQModel to transfer our asymmetrically quantized models. As a result, the GPTQ models provided by this repo can only be loaded successfully through GPTQModel, not AutoGPTQ.
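For context, the asymmetric (zero-point) quantization referred to here can be sketched as below. This is the generic textbook formulation with per-group fp scales and rounded zero-points, not the repo's exact kernel:

```python
import numpy as np

def asym_quantize(w, bits=2, group_size=64):
    """Group-wise asymmetric quantization: map each group's [min, max]
    onto the integer range [0, 2^bits - 1] via a scale and zero-point."""
    w = w.reshape(-1, group_size)
    qmax = 2**bits - 1
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    zero = np.round(-wmin / scale)           # integer zero-point per group
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return q, scale, zero

def asym_dequantize(q, scale, zero):
    return (q - zero) * scale

w = np.random.randn(256).astype(np.float32)
q, scale, zero = asym_quantize(w, bits=2, group_size=64)
w_hat = asym_dequantize(q, scale, zero)
# Reconstruction error is bounded by roughly one quantization step per group.
print(np.abs(w_hat - w.reshape(-1, 64)).max() <= scale.max())
```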
- Transfer EfficientQAT checkpoints to BitBLAS format
bash examples/model_transfer/efficientqat_to_bitblas/llama-2-7b.sh
- The BitBLAS speedup currently has some problems; refer to this issue for details.
- Transfer fp32 data in EfficientQAT checkpoints to half-precision counterparts. Some parameters are saved as fp32 for training; you can transfer them to half-precision to further reduce model size after training.
bash examples/model_transfer/fp32_to_16/llama-2-7b.sh
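The idea of that conversion can be sketched with NumPy arrays standing in for checkpoint tensors (the repo's script operates on PyTorch state dicts; the tensor names here are hypothetical):

```python
import numpy as np

# Stand-in for fp32 quantization parameters kept in a checkpoint
# (names "scales"/"zeros" are illustrative, not the actual keys).
state_dict = {
    "scales": np.random.rand(128).astype(np.float32),
    "zeros": np.random.rand(128).astype(np.float32),
}

# Cast every fp32 tensor to fp16, halving its storage.
half = {k: v.astype(np.float16) for k, v in state_dict.items()}

print(sum(v.nbytes for v in state_dict.values()))  # 1024 bytes in fp32
print(sum(v.nbytes for v in half.values()))        # 512 bytes in fp16
```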
Below is an example of inference with the GPTQ or BitBLAS quantized formats.
from transformers import AutoTokenizer
from gptqmodel import GPTQModel
quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ"
# quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-BitBLAS"
# or local path
tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)
# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_dir)
# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("Model quantization is", return_tensors="pt").to(model.device))[0]))

If you found this work useful, please consider citing:
@article{efficientqat,
title={EfficientQAT: Efficient Quantization-Aware Training for Large Language Models},
author={Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Qiao, Yu and Luo, Ping},
journal={arXiv preprint arXiv:2407.11062},
year={2024}
}
