
tianleiwu · 2026-04-22T16:35:05Z

PR: Add CPU QMoE 2-bit support and LUT GEMM fast path

Description

This PR adds expert_weight_bits=2 support to the CPU QMoE operator and introduces a fast path for supported block-wise shapes using MLAS LUT GEMM. It also tightens CPU-side validation, expands test coverage for non-trivial 2-bit behavior, and adds implementation notes for the CPU QMoE kernel.
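For readers new to 2-bit expert weights, here is a minimal Python sketch of block-wise 2-bit dequantization. This is roughly the work a dequantize-plus-GEMM path must do before the multiply. The sketch assumes:

- four codes per byte, packed lowest bits first;
- one scale per block along the reduction dimension;
- a default zero point of 2.

These choices and the helper names (`pack_2bit`, `unpack_2bit`, `dequantize_blockwise_2bit`) are illustrative only. They are not taken from the QMoE or MLAS code.

```python
# Hypothetical illustration (not the QMoE/MLAS layout): 2-bit codes packed four per
# byte, lowest bits first, with one scale and zero point per block of weights.
PACK_SIZE = 4  # 8 bits per byte / 2 bits per weight


def pack_2bit(codes):
    """Pack 2-bit codes (ints in 0..3) into bytes, lowest bits first (assumed order)."""
    if len(codes) % PACK_SIZE != 0:
        # Same spirit as a hidden_size % pack_size == 0 shape check.
        raise ValueError("number of weights must be a multiple of the pack size (4)")
    packed = bytearray()
    for i in range(0, len(codes), PACK_SIZE):
        byte = 0
        for j, code in enumerate(codes[i:i + PACK_SIZE]):
            if not 0 <= code <= 3:
                raise ValueError(f"2-bit code out of range: {code}")
            byte |= code << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed):
    """Inverse of pack_2bit: four 2-bit codes per byte, lowest bits first."""
    return [(byte >> (2 * j)) & 0x3 for byte in packed for j in range(PACK_SIZE)]


def dequantize_blockwise_2bit(packed, scales, zero_points, block_size):
    """Dequantize one row: w[k] = scales[b] * (q[k] - zero_points[b]), b = k // block_size."""
    codes = unpack_2bit(packed)
    num_blocks = len(codes) // block_size
    if len(codes) % block_size != 0 or len(scales) != num_blocks:
        raise ValueError("expected exactly one scale per block")
    if zero_points is None:
        zero_points = [2] * num_blocks  # assumed symmetric default: midpoint of 0..3
    if len(zero_points) != num_blocks:
        raise ValueError("expected exactly one zero point per block")
    return [scales[k // block_size] * (q - zero_points[k // block_size])
            for k, q in enumerate(codes)]


row = dequantize_blockwise_2bit(pack_2bit([0, 1, 2, 3, 3, 2, 1, 0]), [0.5, 1.0], None, 4)
print(row)  # [-1.0, -0.5, 0.0, 0.5, 1.0, 0.0, -1.0, -2.0]
```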

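The LUT GEMM idea can be illustrated with a generic sketch. This is not the MLAS kernel. It works in three steps:

1. Group the activations in pairs.
2. Precompute a 16-entry table of `a0*q0 + a1*q1` for every pair of 2-bit codes.
3. Let each packed code pair index that table instead of doing two multiplies.

The block scale and zero point are applied once at the end via `scale * (sum(a*q) - zp * sum(a))`, which is algebraically equal to dequantizing first. The group size of 2 and all function names are assumptions for illustration.

```python
# Generic LUT-based dot product for 2-bit codes (a sketch of the technique, not the
# MLAS kernel). A group size of 2 gives a 4 * 4 = 16-entry table per activation pair.


def build_lut(a0, a1):
    """Partial sums a0*q0 + a1*q1 for every pair of 2-bit codes; index = q0 | (q1 << 2)."""
    return [a0 * (idx & 0x3) + a1 * (idx >> 2) for idx in range(16)]


def lut_dot_block(activations, codes, scale, zero_point):
    """One block of dot(activations, dequantized weights) using table lookups."""
    if len(activations) != len(codes) or len(codes) % 2 != 0:
        raise ValueError("block length must be even and match the activations")
    acc = 0.0
    for i in range(0, len(codes), 2):
        # The table depends only on the activations, so a real GEMM can build it once
        # per activation row and reuse it for every output column; it is rebuilt
        # here for simplicity.
        lut = build_lut(activations[i], activations[i + 1])
        acc += lut[codes[i] | (codes[i + 1] << 2)]
    # scale * sum(a * (q - zp)) == scale * (sum(a * q) - zp * sum(a))
    return scale * (acc - zero_point * sum(activations))


def reference_dot_block(activations, codes, scale, zero_point):
    """Dequantize-then-multiply reference, conceptually what a fallback path computes."""
    return sum(a * scale * (q - zero_point) for a, q in zip(activations, codes))


acts = [1.0, 2.0, 3.0, 4.0]
codes = [3, 3, 0, 1]
print(lut_dot_block(acts, codes, 0.5, 2), reference_dot_block(acts, codes, 0.5, 2))  # -3.5 -3.5
```

Production LUT kernels often use wider groups and vectorized table lookups. This sketch only shows that lookup-and-accumulate reproduces the dequantized dot product.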
Summary of Changes
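On the kernel side, the changes pair stricter shape validation with a per-shape choice between the LUT fast path and the dequantize-plus-GEMM fallback. The hypothetical sketch below models that decision for 2-bit weights only. All of the following are assumptions:

- the `QMoEShapeInfo` fields;
- that the packed FC2 width equals `inter_size / pack_size`;
- the `block_size == 0` row-wise convention;
- the block divisibility rule;
- the `lut_kernel_available` flag, a stand-in for whatever eligibility check MLAS exposes.

The real checks in `moe_helper.h` and `moe_quantization_cpu.cc` may differ, and other bit widths may have their own optimized paths.

```python
from dataclasses import dataclass


@dataclass
class QMoEShapeInfo:
    """Hypothetical subset of QMoE shape metadata, for illustration only."""
    hidden_size: int
    fc2_packed_cols: int     # assumed: last dim of packed FC2 weights = inter_size / pack_size
    expert_weight_bits: int
    block_size: int          # assumed convention: 0 means row-wise (no block-wise scales)


def validate_and_infer(info):
    """Return (pack_size, inter_size), or raise ValueError for unsupported shapes."""
    if info.expert_weight_bits not in (2, 4, 8):
        raise ValueError("expert_weight_bits must be 2, 4, or 8")
    pack_size = 8 // info.expert_weight_bits
    if info.hidden_size % pack_size != 0:
        raise ValueError("hidden_size must be divisible by pack_size")
    inter_size = info.fc2_packed_cols * pack_size  # inferred from the packed FC2 width
    if info.block_size > 0 and (info.hidden_size % info.block_size != 0
                                or inter_size % info.block_size != 0):
        raise ValueError("hidden_size and inter_size must be multiples of block_size")
    return pack_size, inter_size


def choose_gemm_path(info, lut_kernel_available):
    """Pick the LUT fast path only for 2-bit block-wise shapes the backend accepts."""
    validate_and_infer(info)
    if info.expert_weight_bits == 2 and info.block_size > 0 and lut_kernel_available:
        return "lut_gemm"
    return "dequantize_then_gemm"


print(choose_gemm_path(QMoEShapeInfo(64, 32, 2, 32), True))  # lut_gemm
print(choose_gemm_path(QMoEShapeInfo(64, 32, 2, 0), True))   # dequantize_then_gemm
```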

CPU QMoE Kernel

| File | Change |
|---|---|
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc` | Adds CPU 2-bit dequant support, 2-bit LUT GEMM eligibility checks, LUT prepack/cache support, and LUT execution for FC1/FC2 on supported block-wise shapes. Refactors the compute flow so the 2-bit LUT path is isolated while routing and accumulation remain shared. |
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h` | Adds CPU-side state for LUT prepacked buffers and shared compute inputs. |
| `onnxruntime/contrib_ops/cpu/moe/moe_helper.h` | Tightens shape validation, including `hidden_size % pack_size == 0` and inferred `inter_size` divisibility checks. |

Schema and Documentation

| File | Change |
|---|---|
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Updates the QMoE schema and docs to allow 2-bit expert weights on CPU. |
| `docs/contrib_ops/cpu/qmoe.md` | Adds CPU QMoE implementation notes covering routing, quantization layouts, prepack behavior, LUT fast paths, fallbacks, and current limitations. |

Tests

| File | Change |
|---|---|
| `onnxruntime/test/contrib_ops/moe_test.cc` | Adds CPU 2-bit smoke, validation, non-zero functional, and LUT-eligible block-wise identity tests. |
| `onnxruntime/test/python/transformers/test_qmoe_cpu.py` | Extends Python-side QMoE parity coverage for 2-bit row-wise and block-wise packing paths. |

Testing

- Built the provider object:
  - `ninja -C build/cu128/Release CMakeFiles/onnxruntime_providers.dir/home/tlwu/git/onnxruntime/onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc.o`
- Built the provider test object:
  - `ninja -C build/cu128/Release CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/git/onnxruntime/onnxruntime/test/contrib_ops/moe_test.cc.o`
- Added CPU-side test coverage for:
  - 2-bit validation failures
  - non-trivial non-zero 2-bit outputs
  - LUT-eligible 2-bit block-wise identity behavior
- The full end-to-end provider gtest run was not executed from this checkout, because the top-level test binary available here does not expose the MoETest suite.

Motivation and Context

This work adds CPU-provider support for 2-bit QMoE expert weights, as requested in the feature request for 2-bit QMoE on CPU (#28163). It also aligns the CPU implementation with how MLAS currently exposes optimized 2-bit execution. Block-wise 2-bit shapes can use LUT GEMM, while unsupported shapes continue to use the dequantize-plus-GEMM fallback paths.

Checklist

- Tests added/updated
- Documentation updated
- No breaking changes
- CI passes

tianleiwu added 3 commits April 22, 2026 08:27

- Add 2 bit QMoE (b7e9fd8)
- Add LuT GEMM for 2 bits (ae555a3)
- Add doc (16ca1f9)

tianleiwu mentioned this pull request Apr 22, 2026

[Feature Request] QMoE: support 2-bits quantized expert Weights #28163 (Open)

tianleiwu marked this pull request as draft April 22, 2026 16:39

tianleiwu added 5 commits April 22, 2026 14:40

- Add doc gen (620d057)
- upload docs (c459709)
- Merge remote-tracking branch 'origin/main' into tlwu/win_gpu_doc_gen_workflow (c059653)
- Merge remote-tracking branch 'origin/main' into tlwu/qmoe_2bit_cpu (e1f2436)
- Merge remote-tracking branch 'origin/tlwu/win_gpu_doc_gen_workflow' into tlwu/qmoe_2bit_cpu (9eeb277)


Add CPU QMoE 2-bit support and LUT GEMM fast path #28185

tianleiwu wants to merge 8 commits into main from tlwu/qmoe_2bit_cpu
