Kimi K2.6 - How to Run Locally | Unsloth Documentation

🥝Kimi K2.6 - How to Run Locally

Step-by-step guide to running Kimi-K2.6 on your own local device.

Kimi K2.6 is an open model by Moonshot that delivers SOTA performance across vision, coding, agentic, long-context and chat tasks. The 1T-parameter hybrid thinking model has a 256K context length; full precision requires 610GB of disk space, while Dynamic 2-bit requires 350GB (a 43% size reduction). Run Kimi K2.6 via the Unsloth Dynamic Kimi-K2.6 GGUFs on Unsloth Studio or llama.cpp.

Dynamic 2-bit upcasts important layers to 8-bit and needs a 350GB+ VRAM/RAM setup. For lossless Kimi K2.6, use Q8 (UD-Q8_K_XL), which is only 10GB larger than Q4 (UD-Q4_K_XL). All uploads use Dynamic 2.0 for SOTA quantization performance. The Kimi-K2.6 GGUFs also support vision.

Table: Hardware requirements (units = total memory: RAM + VRAM, or unified memory)

| Measurement | Dynamic 2-bit | Q4 | Q8 (Lossless) |
| --- | --- | --- | --- |
| Disk Space | 340 GB | 584 GB | 595 GB |
| Perplexity | 2.4131 | 1.8420 | 1.8419 |

📊 Quantization Analysis

UD-Q8_K_XL is lossless because Kimi uses INT4 for the MoE weights and BF16 for everything else, and Q8_K_XL preserves that layout. UD-Q4_K_XL is similar except the remaining tensors are Q8_0, so it is near full precision; it requires about 600GB of RAM/VRAM. GGUFs from other providers may follow the UD-Q4_K_XL approach rather than the truly lossless UD-Q8_K_XL one.

We followed jukofyork's finding of using const float d = max / -7; instead of the default const float d = max / -8; during quantization, applied only to the MoE layers. This bijection patch on INT4-native MoEs allows the Q4_0 quant type to reduce the absolute error from 1.8% to near 0% (machine epsilon).
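To illustrate why dividing by -7 is exact on INT4-native weights, here is a minimal NumPy sketch of a Q4_0-style quantize/dequantize round trip (this is an illustration of the idea, not llama.cpp's actual code; the block values and scale are made up):

```python
import numpy as np

def q4_0_roundtrip(block: np.ndarray, divisor: float) -> np.ndarray:
    """Quantize one block to 4-bit integers with a single scale, then dequantize.

    Q4_0-style: take the signed element with the largest magnitude, set the
    per-block scale d = max / divisor, round block / d into [-8, 7], and
    reconstruct the weights as q * d.
    """
    max_val = block[np.argmax(np.abs(block))]   # signed, largest-magnitude value
    d = max_val / divisor
    q = np.clip(np.round(block / d), -8, 7)
    return q * d

scale = 0.037                                    # hypothetical per-block scale
ints = np.array([-7, -3, -1, 0, 2, 4, 5, 7] * 4) # INT4-native values in {-7, ..., 7}
w = ints * scale                                 # one 32-element block

err_default = np.abs(q4_0_roundtrip(w, -8.0) - w).max()  # d = max / -8
err_patched = np.abs(q4_0_roundtrip(w, -7.0) - w).max()  # d = max / -7
print(f"max / -8 error: {err_default:.4f}, max / -7 error: {err_patched:.2e}")
```

With d = max / -8 the INT4 grid lands on multiples of 8/7, so most values round off (and the +7 extreme even clips), while with d = max / -7 every INT4 value maps back exactly, leaving only floating-point epsilon.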

However, we must keep the other layers in BF16, and we show below the error plots for both quants versus the BF16 baseline. UD-Q8_K_XL is truly "lossless", with only a machine-epsilon difference from converting Q4_0 to BF16. The perplexity for UD-Q8_K_XL was 1.8419 ± 0.00721 and for UD-Q4_K_XL 1.8420 ± 0.00720. Note the error plot below shows RMSE divided by bfloat16 epsilon, so the error scale is small.

See the difference between Q4_K_XL (blue) and Q8_K_XL (orange), which is lossless and 10GB larger.

⚙️ Usage Guide

Thinking and non-thinking mode require different settings:

| Setting | Default (Thinking Mode) | Instant Mode |
| --- | --- | --- |
| temperature | 1.0 | 0.6 |
| top_p | 0.95 | 0.95 |

  • Suggested context length = 98,304 (up to 262,144)

If the model fits, you will get >40 tokens/s when using B200s. We recommend UD-Q2_K_XL (350GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
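The rule of thumb above is simple arithmetic; as a minimal sketch (the function name and the example memory figures are ours):

```python
def fits_fast(ram_gb: float, vram_gb: float, quant_gb: float) -> bool:
    """Rule of thumb: a quant runs at full speed when RAM + VRAM covers its size;
    otherwise it still runs, just slower, due to offloading."""
    return ram_gb + vram_gb >= quant_gb

# 256 GB RAM + 96 GB VRAM = 352 GB total vs. the 350 GB UD-Q2_K_XL quant
print(fits_fast(ram_gb=256, vram_gb=96, quant_gb=350))
```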

Chat Template for Kimi K2.6

Running tokenizer.apply_chat_template([{"role": "user", "content": "What is 1+1?"},]) produces:

Run Kimi K2.6 Guide

🦥 Run Kimi-K2.6 in Unsloth Studio

Kimi K2.6 can run in Unsloth Studio, an open-source web UI for local AI. Unsloth Studio automatically offloads to RAM and detects multi-GPU setups. With Unsloth Studio, you can run models locally on macOS, Windows and Linux:

1

Install and Launch Unsloth

To install, run in your terminal:

MacOS, Linux, WSL:

Windows PowerShell:

Launch Unsloth

MacOS, Linux, WSL and Windows:

Then open http://localhost:8888 in your browser.

2

Search and download Kimi-K2.6

On first launch you will need to create a password to secure your account, which you will use to sign in again later.

Then go to the Studio Chat tab, search for Kimi-K2.6 in the search bar, and download your desired model and quant. Ensure you have enough compute to run the model.

3

Run Kimi-K2.6

Inference parameters are auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

Example of Kimi-K2.6 running with tool-calling

🦙 Run Kimi K2.6 in llama.cpp

For this guide we'll be running the UD-Q2_K_XL quant, which requires at least 350GB of RAM. Feel free to change the quantization type. GGUF: Kimi-K2.6-GGUF

For these tutorials, we will be using llama.cpp for fast local inference, which works even if you only have a CPU.

1

Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and continue as usual; Metal support is on by default.
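As a sketch, a typical CUDA build of llama.cpp looks like the following (package names assume a Debian/Ubuntu system; adjust -DGGML_CUDA as described above):

```shell
# Build dependencies (Debian/Ubuntu; adapt for your distro)
apt-get update
apt-get install -y build-essential cmake curl libcurl4-openssl-dev

# Fetch and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```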

2

If you want to use llama.cpp directly to load models, you can do the below: (:Q2_K_XL) is the quantization type. You can also download via Hugging Face (step 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save models to a specific location. The model has a maximum context length of 262,144 tokens.

Use one of the specific commands below, according to your use-case:

Thinking mode:

Non-thinking mode (Instant):
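A sketch of both invocations, assuming the repo id unsloth/Kimi-K2.6-GGUF (the sampling values follow the Usage Guide above; only the temperature differs between modes):

```shell
export LLAMA_CACHE="unsloth"   # optional: pin the download location

# Thinking mode (default): temperature 1.0, top_p 0.95
./llama.cpp/llama-cli -hf unsloth/Kimi-K2.6-GGUF:Q2_K_XL \
    --ctx-size 98304 --temp 1.0 --top-p 0.95

# Non-thinking mode (Instant): temperature 0.6, top_p 0.95
./llama.cpp/llama-cli -hf unsloth/Kimi-K2.6-GGUF:Q2_K_XL \
    --ctx-size 98304 --temp 0.6 --top-p 0.95
```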

3

Download the model via the code below (after running pip install huggingface_hub hf_transfer). If downloads get stuck, see: Hugging Face Hub, XET debugging.
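A sketch using huggingface_hub's snapshot_download, assuming the repo id unsloth/Kimi-K2.6-GGUF and the UD-Q2_K_XL file naming; allow_patterns keeps you from pulling every quant in the repo:

```python
import os
from huggingface_hub import snapshot_download

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # use hf_transfer for faster downloads

snapshot_download(
    repo_id="unsloth/Kimi-K2.6-GGUF",   # assumed repo id
    local_dir="Kimi-K2.6-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],    # only the Dynamic 2-bit shards
)
```

Note this is a multi-hundred-GB download, so make sure you have the disk space before running it.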

4

Then run the model in conversation mode:
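For example, a conversation-mode launch on the downloaded shards might look like this (the .gguf path, GPU-layer count and offload pattern are placeholders to adapt to your setup):

```shell
# Point --model at the first shard of your downloaded quant (filename will vary).
# -ot offloads the MoE expert tensors to RAM when VRAM is tight.
./llama.cpp/llama-cli \
    --model Kimi-K2.6-GGUF/UD-Q2_K_XL/<first-shard>.gguf \
    --ctx-size 98304 --temp 1.0 --top-p 0.95 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --conversation
```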

📊 Benchmarks

You can view the benchmarks in table format below:
