Add CPU QMoE 2-bit support and LUT GEMM fast path by tianleiwu · Pull Request #28185 · microsoft/onnxruntime · GitHub


Draft
tianleiwu wants to merge 8 commits into main from tlwu/qmoe_2bit_cpu
PR: Add CPU QMoE 2-bit support and LUT GEMM fast path

Description

This PR adds expert_weight_bits=2 support to the CPU QMoE operator and introduces a fast path for supported block-wise shapes using MLAS LUT GEMM. It also tightens CPU-side validation, expands test coverage for non-trivial 2-bit behavior, and adds implementation notes for the CPU QMoE kernel.

Summary of Changes

CPU QMoE Kernel

| File | Change |
| --- | --- |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc | Adds CPU 2-bit dequant support, 2-bit LUT GEMM eligibility checks, LUT prepack/cache support, and LUT execution for FC1/FC2 on supported block-wise shapes. Refactors the compute flow so the 2-bit LUT path is isolated while routing and accumulation remain shared. |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h | Adds CPU-side state for LUT prepacked buffers and shared compute inputs. |
| onnxruntime/contrib_ops/cpu/moe/moe_helper.h | Tightens shape validation, including hidden_size % pack_size == 0 and inferred inter_size divisibility checks. |
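To make the 2-bit dequant and the hidden_size % pack_size == 0 check above concrete, here is a minimal standalone sketch. It is not the code from this PR: the names kPackSize, IsPackable, and DequantizeRow2Bit are hypothetical, and the bit order (lowest bits first), the zero point of 2, and the one-scale-per-block layout are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Assumption: four 2-bit codes share one byte, lowest bits first.
constexpr std::size_t kPackSize = 4;

// Hypothetical helper in the spirit of the tightened shape validation:
// the packed dimension must fill a whole number of bytes.
inline bool IsPackable(std::size_t hidden_size) {
  return hidden_size % kPackSize == 0;
}

// Dequantize one packed row as w = (q - zero_point) * scale, using one
// scale per block of block_size elements (block-wise quantization).
// The default zero point of 2 is an assumption, not taken from the PR.
std::vector<float> DequantizeRow2Bit(const std::vector<std::uint8_t>& packed,
                                     const std::vector<float>& scales,
                                     std::size_t hidden_size,
                                     std::size_t block_size,
                                     int zero_point = 2) {
  if (!IsPackable(hidden_size) || block_size == 0 ||
      hidden_size % block_size != 0) {
    throw std::invalid_argument(
        "hidden_size must be divisible by the pack size and the block size");
  }
  if (packed.size() != hidden_size / kPackSize ||
      scales.size() != hidden_size / block_size) {
    throw std::invalid_argument("buffer sizes do not match hidden_size");
  }

  std::vector<float> out(hidden_size);
  for (std::size_t i = 0; i < hidden_size; ++i) {
    const std::uint8_t byte = packed[i / kPackSize];
    // Extract the 2-bit code for element i from its byte.
    const int q = (byte >> (2 * (i % kPackSize))) & 0x3;
    out[i] = static_cast<float>(q - zero_point) * scales[i / block_size];
  }
  return out;
}
```

Under these assumptions, the byte 0xE4 holds the codes 0, 1, 2, 3, so with a scale of 0.5 it dequantizes to -1, -0.5, 0, 0.5. A hidden_size of 6 is rejected because it does not fill whole bytes.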

Schema and Documentation

| File | Change |
| --- | --- |
| onnxruntime/core/graph/contrib_ops/contrib_defs.cc | Updates QMoE schema/docs to allow CPU-side 2-bit weights. |
| docs/contrib_ops/cpu/qmoe.md | Adds CPU QMoE implementation notes covering routing, quantization layouts, prepack behavior, LUT fast paths, fallbacks, and current limitations. |

Tests

| File | Change |
| --- | --- |
| onnxruntime/test/contrib_ops/moe_test.cc | Adds CPU 2-bit smoke, validation, non-zero functional, and LUT-eligible block-wise identity tests. |
| onnxruntime/test/python/transformers/test_qmoe_cpu.py | Extends Python-side QMoE parity coverage for 2-bit row-wise and block-wise packing paths. |

Testing

  • Built the provider object:
    • ninja -C build/cu128/Release CMakeFiles/onnxruntime_providers.dir/home/tlwu/git/onnxruntime/onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc.o
  • Built the provider test object:
    • ninja -C build/cu128/Release CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/git/onnxruntime/onnxruntime/test/contrib_ops/moe_test.cc.o
  • Added CPU-side test coverage for:
    • 2-bit validation failures
    • non-trivial non-zero 2-bit outputs
    • LUT-eligible 2-bit block-wise identity behavior
  • Full end-to-end provider gtest execution was not run from this checkout because the available top-level test binary does not expose the MoETest suite here.

Motivation and Context

This work addresses CPU-provider support for QMoE 2-bit expert weights, matching the issue request for QMoE 2 bits on CPU. The PR also aligns the CPU implementation with how MLAS currently exposes optimized 2-bit execution: block-wise 2-bit shapes can use LUT GEMM, while unsupported shapes continue to use dequantize-plus-GEMM fallback paths.
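The difference between the LUT path and the dequantize-plus-GEMM fallback can be illustrated with a toy matrix-vector product. This is a sketch of the general table-lookup idea only, not the MLAS kernel or its API: LutGemv2Bit and DequantGemv2Bit are hypothetical names, a single scale stands in for per-block scales, and the bit order and zero point are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Table-lookup path: for each group of four activations, precompute
// sum_j x[j] * code_j for all 256 possible packed bytes. The tables are
// built once and reused by every output row.
std::vector<float> LutGemv2Bit(const std::vector<float>& x,         // K activations
                               const std::vector<std::uint8_t>& w,  // n rows of K/4 bytes
                               std::size_t n, float scale, int zero_point) {
  if (x.size() % 4 != 0 || w.size() != n * (x.size() / 4)) {
    throw std::invalid_argument("shape mismatch");
  }
  const std::size_t groups = x.size() / 4;

  std::vector<std::array<float, 256>> tables(groups);
  float x_sum = 0.0f;
  for (std::size_t g = 0; g < groups; ++g) {
    const float* a = x.data() + 4 * g;
    for (int byte = 0; byte < 256; ++byte) {
      float sum = 0.0f;
      for (int j = 0; j < 4; ++j) {
        sum += a[j] * static_cast<float>((byte >> (2 * j)) & 0x3);
      }
      tables[g][byte] = sum;
    }
    x_sum += a[0] + a[1] + a[2] + a[3];
  }

  std::vector<float> y(n);
  for (std::size_t r = 0; r < n; ++r) {
    float code_dot = 0.0f;
    for (std::size_t g = 0; g < groups; ++g) {
      code_dot += tables[g][w[r * groups + g]];  // one lookup per packed byte
    }
    // sum_k x_k * (q_k - zp) * s == s * (sum_k x_k * q_k - zp * sum_k x_k)
    y[r] = scale * (code_dot - static_cast<float>(zero_point) * x_sum);
  }
  return y;
}

// Fallback path: dequantize every weight, then multiply-accumulate.
std::vector<float> DequantGemv2Bit(const std::vector<float>& x,
                                   const std::vector<std::uint8_t>& w,
                                   std::size_t n, float scale, int zero_point) {
  if (x.size() % 4 != 0 || w.size() != n * (x.size() / 4)) {
    throw std::invalid_argument("shape mismatch");
  }
  const std::size_t k = x.size();
  std::vector<float> y(n, 0.0f);
  for (std::size_t r = 0; r < n; ++r) {
    for (std::size_t i = 0; i < k; ++i) {
      const int q = (w[r * (k / 4) + i / 4] >> (2 * (i % 4))) & 0x3;
      y[r] += x[i] * static_cast<float>(q - zero_point) * scale;
    }
  }
  return y;
}
```

In this sketch the per-row inner loop of the LUT path touches one table entry per packed byte instead of unpacking and multiplying four weights. That is the kind of saving a lookup-table GEMM aims for when many output rows share the same activations. The fallback remains the reference for shapes that a LUT kernel does not cover.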

Checklist

  • Tests added/updated
  • Documentation updated
  • No breaking changes
  • CI passes
