Prompt format with thinking enabled (the system prompt follows the <|think|> token):

<|think|>
You are a careful coding assistant. Explain your answer clearly.

The model's reply places internal reasoning in a thought channel before the final answer:

<|channel>thought
[internal reasoning]
<channel|>
[final answer]
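The turn and channel tokens can be strung together programmatically. A minimal shell sketch, assuming the token spellings exactly as they appear in the examples here (the system and user strings are placeholders, not required values):

```shell
# Assemble a single-turn prompt from the turn/channel tokens shown above.
# Token spellings are copied verbatim from the examples; verify them against
# the model's tokenizer config before relying on this.
system="You are a careful coding assistant. Explain your answer clearly."
user="What is the capital of France?"
prompt=$(printf '<bos><|turn>system\n<|think|>%s<turn|>\n<|turn>user\n%s<turn|>\n<|turn>model\n' "$system" "$user")
printf '%s\n' "$prompt"
```

The trailing <|turn>model line leaves the model turn open so generation continues from there.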
A full single-turn conversation with thinking:

<bos><|turn>system
<|think|><turn|>
<|turn>user
What is the capital of France?<turn|>
<|turn>model
<|channel>thought
The user is asking for the capital of France.
The capital of France is Paris.<channel|>The capital of France is Paris.<turn|>

A multi-turn conversation without thinking:

<bos><|turn>user
What is 1+1?<turn|>
<|turn>model
2<turn|>
<|turn>user
What is 1+1?<turn|>
<|turn>model
2<turn|>

Install Unsloth (Linux/macOS, then Windows PowerShell) and launch the studio server:

curl -fsSL https://unsloth.ai/install.sh | sh
irm https://unsloth.ai/install.ps1 | iex
unsloth studio -H 0.0.0.0 -p 8888

Build llama.cpp with CUDA support:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
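The configure step above targets CUDA GPUs. On a machine without the CUDA toolkit, the same build should work with the GPU backend disabled — a sketch, not part of the original instructions:

```shell
# CPU-only configure: identical to the step above, but with the CUDA backend off.
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF
```

The subsequent cmake --build step is unchanged.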
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Run gemma-4-26B-A4B-it:

export LLAMA_CACHE="unsloth/gemma-4-26B-A4B-it-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
--temp 1.0 \
--top-p 0.95 \
--top-k 64

Run gemma-4-31B-it:

export LLAMA_CACHE="unsloth/gemma-4-31B-it-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
--temp 1.0 \
--top-p 0.95 \
--top-k 64

Run gemma-4-E4B-it:

export LLAMA_CACHE="unsloth/gemma-4-E4B-it-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64

Run gemma-4-E2B-it:

export LLAMA_CACHE="unsloth/gemma-4-E2B-it-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/gemma-4-E2B-it-GGUF:Q8_0 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64

Download the quantized model and the mmproj projector file:

hf download unsloth/gemma-4-26B-A4B-it-GGUF \
--local-dir unsloth/gemma-4-26B-A4B-it-GGUF \
--include "*mmproj-BF16*" \
--include "*UD-Q4_K_XL*" # Use "*UD-Q2_K_XL*" for Dynamic 2-bit

Run llama-cli with the downloaded files:

./llama.cpp/llama-cli \
--model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
--temp 1.0 \
--top-p 0.95 \
--top-k 64

Or launch llama-server:

./llama.cpp/llama-server \
--model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--alias "unsloth/gemma-4-26B-A4B-it-GGUF" \
--port 8001 \
--chat-template-kwargs '{"enable_thinking":true}'

Run on Apple Silicon with MLX:

curl -fsSL https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/scripts/install_gemma4_mlx.sh | sh
source ~/.unsloth/unsloth_gemma4_mlx/bin/activate
python -m mlx_vlm.chat --model unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit

Example text prompt with thinking enabled:

System:
<|think|>
You are a precise reasoning assistant.
User:
A train leaves at 8:15 AM and arrives at 11:47 AM. How long was the journey?

Example vision prompt (single image):

[image first]
Extract all text from this receipt. Return line items, total, merchant, and date as JSON.

Example vision prompt (two images):

[image 1]
[image 2]
Compare these two screenshots and tell me which one is more likely to confuse a new user.

Example audio prompt (transcription):

[audio first]
Transcribe the following speech segment in English into English text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.

Example audio prompt (transcription plus translation):

[audio first]
Transcribe the following speech segment in Spanish, then translate it into English.
When formatting the answer, first output the transcription in Spanish, then one newline, then output the string 'English: ', then the translation in English.

Generic transcription template:

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.

Generic transcription-plus-translation template:

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
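The {LANGUAGE} placeholders in the generic templates can be filled before sending the prompt. A minimal POSIX shell sketch (the variable names are my own):

```shell
# Fill the {LANGUAGE} placeholder in the generic transcription template above.
template='Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.'
language='French'
# sed replaces every occurrence of the literal placeholder with the language name.
prompt=$(printf '%s' "$template" | sed "s/{LANGUAGE}/$language/g")
printf '%s\n' "$prompt"
# → Transcribe the following speech segment in French into French text.
```

The transcription-plus-translation template can be filled the same way, with one sed substitution each for {SOURCE_LANGUAGE} and {TARGET_LANGUAGE}.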