Kimi K2.6 - How to Run Locally | Unsloth Documentation

🥝Kimi K2.6 - How to Run Locally

Step-by-step guide to running Kimi-K2.6 on your own local device.

Kimi K2.6 is an open model by Moonshot that delivers SOTA performance across vision, coding, agentic, long-context and chat tasks. The 1T-parameter hybrid thinking model has a 256K context length; full precision requires 610GB of disk space, while Dynamic 2-bit requires 350GB (a 43% size reduction). Run Kimi K2.6 via the Unsloth Dynamic Kimi-K2.6 GGUFs on Unsloth Studio or llama.cpp.

Dynamic 2-bit upcasts important layers to 8-bit and needs a 350GB+ VRAM/RAM setup. For lossless Kimi K2.6, use Q8 (UD-Q8_K_XL), which is only 10GB larger than Q4 (UD-Q4_K_XL). All uploads use Dynamic 2.0 for SOTA quantization performance. The Kimi-K2.6 GGUFs also support vision.

Table: Hardware requirements (units = total memory: RAM + VRAM, or unified memory)

| Measurement | Dynamic 2-bit | Q4 | Q8 (Lossless) |
| --- | --- | --- | --- |
| Disk Space | 340 GB | 584 GB | 595 GB |
| Perplexity | 2.4131 | 1.8420 | 1.8419 |

📊 Quantization Analysis

UD-Q8_K_XL is lossless because Kimi uses INT4 for the MoE weights and BF16 for everything else, and Q8_K_XL preserves that layout. UD-Q4_K_XL is similar except the remaining tensors are Q8_0, so it is near full precision; it requires about 600GB of RAM/VRAM. GGUFs from other providers may follow the UD-Q4_K_XL approach rather than the truly lossless UD-Q8_K_XL one.

We followed jukofyork's finding of using const float d = max / -7; instead of the default const float d = max / -8; during quantization, applied only to the MoE layers. This bijection patch on INT4-native MoEs allows the Q4_0 quant type to reduce the absolute error from 1.8% to near 0% (machine epsilon).
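To illustrate why dividing by -7 is exact on INT4-native weights, here is a minimal NumPy sketch of a Q4_0-style quantize/dequantize round trip (this is an illustration of the idea, not llama.cpp's actual code; the block values and scale are made up):

```python
import numpy as np

def q4_0_roundtrip(block: np.ndarray, divisor: float) -> np.ndarray:
    """Quantize one block to 4-bit integers with a single scale, then dequantize.

    Q4_0-style: take the signed element with the largest magnitude, set the
    per-block scale d = max / divisor, round block / d into [-8, 7], and
    reconstruct the weights as q * d.
    """
    max_val = block[np.argmax(np.abs(block))]   # signed, largest-magnitude value
    d = max_val / divisor
    q = np.clip(np.round(block / d), -8, 7)
    return q * d

scale = 0.037                                    # hypothetical per-block scale
ints = np.array([-7, -3, -1, 0, 2, 4, 5, 7] * 4) # INT4-native values in {-7, ..., 7}
w = ints * scale                                 # one 32-element block

err_default = np.abs(q4_0_roundtrip(w, -8.0) - w).max()  # d = max / -8
err_patched = np.abs(q4_0_roundtrip(w, -7.0) - w).max()  # d = max / -7
print(f"max / -8 error: {err_default:.4f}, max / -7 error: {err_patched:.2e}")
```

With d = max / -8 the INT4 grid lands on multiples of 8/7, so most values round off (and the +7 extreme even clips), while with d = max / -7 every INT4 value maps back exactly, leaving only floating-point epsilon.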

However, we must keep the other layers in BF16, and we show below the error plots for both quants versus the BF16 baseline. UD-Q8_K_XL is truly "lossless", with only a machine-epsilon difference from converting Q4_0 to BF16. The perplexity for UD-Q8_K_XL was 1.8419 ± 0.00721 and for UD-Q4_K_XL 1.8420 ± 0.00720. Note the error plot below shows RMSE divided by bfloat16 epsilon, so the error scale is small.

See the difference between Q4_K_XL (blue) and Q8_K_XL (orange), which is lossless and 10GB larger.

⚙️ Usage Guide

Thinking and non-thinking mode require different settings:

| Setting | Default (Thinking Mode) | Instant Mode |
| --- | --- | --- |
| temperature | 1.0 | 0.6 |
| top_p | 0.95 | 0.95 |

  • Suggested context length = 98,304 (up to 262,144)

If the model fits, you will get >40 tokens/s when using B200s. We recommend UD-Q2_K_XL (350GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
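The rule of thumb above is simple arithmetic; as a minimal sketch (the function name and the example memory figures are ours):

```python
def fits_fast(ram_gb: float, vram_gb: float, quant_gb: float) -> bool:
    """Rule of thumb: a quant runs at full speed when RAM + VRAM covers its size;
    otherwise it still runs, just slower, due to offloading."""
    return ram_gb + vram_gb >= quant_gb

# 256 GB RAM + 96 GB VRAM = 352 GB total vs. the 350 GB UD-Q2_K_XL quant
print(fits_fast(ram_gb=256, vram_gb=96, quant_gb=350))
```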

Chat Template for Kimi K2.6

Running tokenizer.apply_chat_template([{"role": "user", "content": "What is 1+1?"},]) produces:

Run Kimi K2.6 Guide

🦥 Run Kimi-K2.6 in Unsloth Studio

Kimi K2.6 can run in Unsloth Studio, an open-source web UI for local AI. Unsloth Studio automatically offloads to RAM and detects multi-GPU setups. With Unsloth Studio, you can run models locally on macOS, Windows and Linux:

1

Install and Launch Unsloth

To install, run in your terminal:

MacOS, Linux, WSL:

Windows PowerShell:

Launch Unsloth

MacOS, Linux, WSL and Windows:

Then open http://localhost:8888 in your browser.

2

Search and download Kimi-K2.6

On first launch you will need to create a password to secure your account, which you will use to sign in again later.

Then go to the Studio Chat tab, search for Kimi-K2.6 in the search bar, and download your desired model and quant. Ensure you have enough compute to run the model.

3

Run Kimi-K2.6

Inference parameters are auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

Example of Kimi-K2.6 running with tool-calling

🦙 Run Kimi K2.6 in llama.cpp

For this guide we'll be running the UD-Q2_K_XL quant, which requires at least 350GB of RAM. Feel free to change the quantization type. GGUF: Kimi-K2.6-GGUF

For these tutorials, we will be using llama.cpp for fast local inference, which works even if you only have a CPU.

1

Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and continue as usual; Metal support is on by default.
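As a sketch, a typical CUDA build of llama.cpp looks like the following (package names assume a Debian/Ubuntu system; adjust -DGGML_CUDA as described above):

```shell
# Build dependencies (Debian/Ubuntu; adapt for your distro)
apt-get update
apt-get install -y build-essential cmake curl libcurl4-openssl-dev

# Fetch and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```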

2

If you want to use llama.cpp directly to load models, you can do the below: (:Q2_K_XL) is the quantization type. You can also download via Hugging Face (step 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save models to a specific location. The model has a maximum context length of 262,144 tokens.

Use one of the specific commands below, according to your use-case:

Thinking mode:

Non-thinking mode (Instant):
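A sketch of both invocations, assuming the repo id unsloth/Kimi-K2.6-GGUF (the sampling values follow the Usage Guide above; only the temperature differs between modes):

```shell
export LLAMA_CACHE="unsloth"   # optional: pin the download location

# Thinking mode (default): temperature 1.0, top_p 0.95
./llama.cpp/llama-cli -hf unsloth/Kimi-K2.6-GGUF:Q2_K_XL \
    --ctx-size 98304 --temp 1.0 --top-p 0.95

# Non-thinking mode (Instant): temperature 0.6, top_p 0.95
./llama.cpp/llama-cli -hf unsloth/Kimi-K2.6-GGUF:Q2_K_XL \
    --ctx-size 98304 --temp 0.6 --top-p 0.95
```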

3

Download the model via the code below (after running pip install huggingface_hub hf_transfer). If downloads get stuck, see: Hugging Face Hub, XET debugging.
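A sketch using huggingface_hub's snapshot_download, assuming the repo id unsloth/Kimi-K2.6-GGUF and the UD-Q2_K_XL file naming; allow_patterns keeps you from pulling every quant in the repo:

```python
import os
from huggingface_hub import snapshot_download

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # use hf_transfer for faster downloads

snapshot_download(
    repo_id="unsloth/Kimi-K2.6-GGUF",   # assumed repo id
    local_dir="Kimi-K2.6-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],    # only the Dynamic 2-bit shards
)
```

Note this is a multi-hundred-GB download, so make sure you have the disk space before running it.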

4

Then run the model in conversation mode:
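For example, a conversation-mode launch on the downloaded shards might look like this (the .gguf path, GPU-layer count and offload pattern are placeholders to adapt to your setup):

```shell
# Point --model at the first shard of your downloaded quant (filename will vary).
# -ot offloads the MoE expert tensors to RAM when VRAM is tight.
./llama.cpp/llama-cli \
    --model Kimi-K2.6-GGUF/UD-Q2_K_XL/<first-shard>.gguf \
    --ctx-size 98304 --temp 1.0 --top-p 0.95 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --conversation
```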

📊 Benchmarks

You can view the benchmarks in table format below:
