Gemma 4 Cheat Sheet & Quick Reference

Quick start

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user",   "content": "Explain MoE briefly."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(
    text=text, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(
    outputs[0][input_len:], skip_special_tokens=False
)
result = processor.parse_response(response)
# result["thinking"]: internal reasoning chain
# result["response"]: final answer shown to the user

Installation

$ pip install -U transformers torch accelerate

All models are available on Hugging Face as google/<model-id>.

Size	Model ID
E2B	`gemma-4-E2B-it`
E4B	`gemma-4-E4B-it`
31B	`gemma-4-31b-it`
26B A4B	`gemma-4-26b-a4b-it`

Sampling parameters

Parameter	Value
`temperature`	`1.0`
`top_p`	`0.95`
`top_k`	`64`
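
As a rough illustration of how these three settings interact, the sketch below applies temperature scaling, then top-k, then top-p filtering to a toy logit vector in plain Python. The function name `filter_probs` and the toy values are illustrative assumptions; inference libraries do this internally and may order or tie-break the steps differently.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits before the softmax (1.0 leaves them unchanged).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

# Toy distribution (assumed values, not model output).
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
candidates = filter_probs(toy_logits, top_p=0.9)
```

With the defaults from the table, a token is sampled only from the 64 most likely candidates, further trimmed to the smallest set covering 95% of the probability mass.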

Model specifications

Model	Architecture	Total params	Active per token	Layers	Context	Modalities
E2B	Dense+PLE	5.1B (2.3B eff)	2.3B	35	128K	Text+Image+Audio
E4B	Dense+PLE	8B (4.5B eff)	4.5B	42	128K	Text+Image+Audio
31B	Dense	30.7B	30.7B	60	256K	Text+Image
26B A4B	MoE	25.2B	3.8B	30	256K	Text+Image

Sliding window: 512 tokens (E2B/E4B) · 1024 tokens (31B/26B) · Vocabulary: 262K (same across all models)

Architecture features

Hybrid attention

Local layers: sliding window (512 or 1024 tokens)
Global layers: full context, interleaved with local layers
The final layer is always a global attention layer
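
The difference between the two layer types can be sketched as two causal masks. This is a minimal illustration rather than Gemma's implementation: the helper name `attention_mask` is hypothetical, and it assumes the window counts the current token.

```python
def attention_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives a global causal mask; window=w restricts each query
    to the most recent w positions (assumed to include the query itself).
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look ahead
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask

# Toy sizes; the real windows are 512 or 1024 tokens.
local_mask = attention_mask(6, window=3)
global_mask = attention_mask(6)
```

In a local layer the number of visible keys per query is capped at the window size, while in a global layer it grows with the position.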

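PLE: edge models (E2B/E4B)

Per-Layer Embeddings: each decoder layer has its own small embedding table
The large static tables are read by fast lookup rather than dense matrix multiplication
The effective compute parameter count is far smaller than the total loaded parameter count
Inference can run in under 1.5 GB of VRAM (4-bit quantized)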
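
To make the lookup-versus-matmul point concrete, the toy sketch below shows that reading one row from a table gives the same vector as multiplying a one-hot vector by that table, at a fraction of the cost. All sizes and helper names are assumptions chosen for illustration; how Gemma 4 combines the per-layer vectors with the hidden state is not covered here.

```python
import random

def make_table(vocab_size, dim, seed):
    """Build a toy embedding table with reproducible random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def lookup(table, token_id):
    """Embedding lookup: read one row, cost proportional to dim."""
    return table[token_id]

def one_hot_matmul(table, token_id):
    """Same result via a dense product, cost proportional to vocab_size * dim."""
    vocab_size, dim = len(table), len(table[0])
    one_hot = [1.0 if v == token_id else 0.0 for v in range(vocab_size)]
    return [sum(one_hot[v] * table[v][d] for v in range(vocab_size)) for d in range(dim)]

# Toy sizes (assumed): one small table per decoder layer.
NUM_LAYERS, VOCAB, DIM = 4, 10, 3
per_layer_tables = [make_table(VOCAB, DIM, seed=layer) for layer in range(NUM_LAYERS)]
per_layer_vectors = [lookup(table, token_id=7) for table in per_layer_tables]
```

Because a lookup touches only one row, the tables add to the loaded parameter count without adding matching compute per token.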

p-RoPE & Shared KV Cache

Global layers use Proportional RoPE (p-RoPE) to improve long-range consistency
Global layers share the KV cache, lowering peak memory use
Performance stays stable at 256K context

MoE: 26B A4B

128 experts + 1 always-active shared expert
8 experts are activated per token at inference time
Speed close to a 4B dense model, quality close to a 30B model
Vision encoder has about 550M parameters (same as 31B)
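
The routing pattern described above (128 routed experts, 8 active per token, plus one shared expert that always runs) can be sketched with scalar toy experts. The normalisation used here (softmax over the selected experts) and the unweighted addition of the shared expert are assumptions for illustration, not confirmed details of the 26B A4B router.

```python
import math
import random

NUM_EXPERTS = 128
TOP_K = 8

def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts and softmax-normalise their scores (assumed scheme)."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, shared_expert, router_logits):
    """Shared expert always contributes; routed experts are mixed by weight."""
    out = shared_expert(x)
    for index, weight in route(router_logits):
        out += weight * experts[index](x)
    return out

rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts = [lambda x, scale=i: x * scale for i in range(NUM_EXPERTS)]  # toy scalar experts
shared_expert = lambda x: x
selected = route(router_logits)
y = moe_forward(1.0, experts, shared_expert, router_logits)
```

Only 9 of the 129 expert functions run for a given token, which is why the active parameter count is much lower than the total.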

VRAM requirements

Model	BF16 (16-bit)	8-bit	4-bit
E2B	9.6 GB	4.6 GB	3.2 GB
E4B	15 GB	7.5 GB	5 GB
31B	58.3 GB	30.4 GB	17.4 GB
26B A4B	48 GB	25 GB	15.6 GB

Figures cover base model weights only and exclude the additional memory used by the KV cache.
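
A back-of-the-envelope check of the table is parameters times bits per parameter. The sketch below uses the total parameter counts from the specifications table and assumes the figures are in GiB, which the source does not state. The estimate lands close to the BF16 column and somewhat below the quantized columns, presumably because quantized checkpoints keep some tensors at higher precision.

```python
GIB = 2 ** 30

def weight_memory_gib(total_params, bits_per_param):
    """Raw weight storage: parameters * bits / 8 bytes, expressed in GiB."""
    return total_params * bits_per_param / 8 / GIB

# Total parameter counts taken from the specifications table.
TOTAL_PARAMS = {"E2B": 5.1e9, "E4B": 8e9, "31B": 30.7e9, "26B A4B": 25.2e9}

estimates = {
    name: {bits: round(weight_memory_gib(count, bits), 1) for bits in (16, 8, 4)}
    for name, count in TOTAL_PARAMS.items()
}
```

Treat the result as a lower bound: runtime buffers and the KV cache come on top.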

Model selection guide

Available VRAM	Recommended model
< 5 GB	E2B (4-bit)
5–8 GB	E4B (4-bit)
15–20 GB	E4B (BF16)
24–32 GB	31B (4-bit)
48–80 GB	31B (BF16)
High-throughput workloads	26B A4B
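
The VRAM rows of the guide can be turned into a small lookup. The table leaves gaps (for example 8 to 15 GB); this sketch assumes that a value inside a gap falls back to the largest tier at or below it. The high-throughput row is a workload choice rather than a VRAM tier, so it is left out.

```python
# (minimum VRAM in GB, recommendation), in ascending order, from the table above.
TIERS = [
    (0, "E2B (4-bit)"),
    (5, "E4B (4-bit)"),
    (15, "E4B (BF16)"),
    (24, "31B (4-bit)"),
    (48, "31B (BF16)"),
]

def recommend(vram_gb):
    """Return the largest tier whose lower bound does not exceed vram_gb."""
    choice = TIERS[0][1]
    for min_gb, name in TIERS:
        if vram_gb >= min_gb:
            choice = name
    return choice
```

For example, `recommend(16)` selects the E4B BF16 tier, and `recommend(10)` falls back to E4B at 4-bit under the gap assumption.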

Core benchmarks

Benchmark	31B	26B A4B	E4B	E2B	Gemma 3 27B
MMLU Pro	85.2%	82.6%	69.4%	60.0%	67.6%
MMMLU (multilingual)	88.4%	86.3%	76.6%	67.4%	70.7%
AIME 2026 (math)	89.2%	88.3%	42.5%	37.5%	20.8%
GPQA Diamond	84.3%	82.3%	58.6%	43.4%	42.4%
LiveCodeBench v6	80.0%	77.1%	52.0%	44.0%	29.1%
Codeforces ELO	2150	1718	940	633	110
BigBench Extra Hard	74.4%	64.8%	33.1%	21.9%	19.3%
Tau2 avg (agentic)	76.9%	68.2%	42.2%	24.5%	16.2%
HLE no tools	19.5%	8.7%	-	-	-
HLE with search	26.5%	17.2%	-	-	-

All results above are for the instruction-tuned models with thinking mode enabled.

Vision benchmarks

Benchmark	31B	26B A4B	E4B	E2B
MMMU Pro	76.9%	73.8%	52.6%	44.2%
MATH-Vision	85.6%	82.4%	59.5%	52.4%
MedXPertQA MM	61.3%	58.1%	28.7%	23.5%
OmniDocBench ↓	0.131	0.149	0.181	0.290

OmniDocBench = document edit distance (lower is better).

Long context

Benchmark	31B	26B A4B	E4B	E2B
MRCR v2 128K	66.4%	44.1%	25.4%	19.1%

Arena AI (LMSYS ELO)

Model	ELO	Open Rank
Gemma 4 31B	1452	#3
Gemma 4 26B A4B	1441	#6

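Thinking control

Add <|think|> at the start of the system prompt:

# Reuses `processor` and `model` from the quick start
messages = [
    {
        "role": "system",
        "content": "<|think|>You are a math expert."
    },
    {"role": "user", "content": "Solve: 3x + 7 = 22"}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
# Tokenize the new prompt; reusing the earlier `inputs` would
# generate from the quick-start prompt instead of this one.
inputs = processor(
    text=text, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=2048)
response = processor.decode(
    outputs[0][input_len:], skip_special_tokens=False
)
result = processor.parse_response(response)
# result["thinking"]: step-by-step reasoning
# result["response"]: final answer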

Thinking output structure

<|channel>thought
[internal step-by-step reasoning, hidden from the user]
<channel|>
[final answer visible to the user]

Behavior when thinking is disabled

On 31B/26B A4B, empty tags are still emitted even with enable_thinking=False:

<|channel>thought
<channel|>
[final answer]

E2B/E4B skip the empty tags entirely when thinking is disabled.
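
When working with raw decoded text, for example from a server that returns the tags verbatim, the three cases above (full thought block, empty tags, no tags) can be handled with one regular expression. This is a minimal sketch and not the implementation of `processor.parse_response`; the helper name `split_thinking` is hypothetical, and the output keys simply mirror the ones used earlier.

```python
import re

# Matches the thought block, including the empty-tag form.
THOUGHT_RE = re.compile(r"<\|channel>thought\n?(.*?)<channel\|>", re.DOTALL)

def split_thinking(raw):
    """Split decoded text into the hidden reasoning and the visible answer."""
    match = THOUGHT_RE.search(raw)
    if match is None:
        thinking, answer = "", raw          # E2B/E4B with thinking disabled
    else:
        thinking, answer = match.group(1), raw[match.end():]
    # Drop a trailing turn-closing token if the decoder kept special tokens.
    answer = answer.split("<turn|>")[0]
    return {"thinking": thinking.strip(), "response": answer.strip()}
```

An empty `thinking` value therefore covers both the empty-tag output of the larger models and the tag-free output of the edge models.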

Control tokens

Token	Purpose
`<|think|>`	Enables thinking when placed in the system prompt
`<|channel>thought\n`	Opens the internal thinking block
`<channel|>`	Closes the thinking block
`<|turn>`	Opens a conversation turn
`<turn|>`	Closes a conversation turn

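Image input

Supports variable aspect ratios and a configurable visual token budget:

Budget	Use case
70	Fast classification, video frames
140	Image captioning, thumbnails
280	General image understanding
560	Charts, diagrams
1120	OCR, PDF parsing, fine-detail recognition

# The image must come before the text (hard requirement)
# `image` is assumed to be an already-loaded image, e.g. a PIL.Image
messages = [{"role": "user", "content": [
  {"type": "image", "image": image},
  {"type": "text",  "text": "Describe this chart."},
]}]
text = processor.apply_chat_template(
  messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
  text=text,
  images=image,
  return_tensors="pt"
).to(model.device)

Vision encoder: about 150M parameters (E2B/E4B) · about 550M parameters (31B/26B)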

Audio input

Supported on E2B and E4B only (audio encoder with about 300M parameters)

Maximum audio length: 30 seconds
Supported tasks: ASR and speech translation

ASR prompt

Transcribe the following speech in
{LANGUAGE} into {LANGUAGE} text.

Translation prompt

Transcribe in {SRC_LANG}, then
translate to {TARGET_LANG}.
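
The two prompt templates and the 30-second limit can be wrapped in small helpers. This sketch assumes the line break inside each template above is only display wrapping, so the prompts are joined onto one line; the helper names and the fixed-length chunking strategy are illustrative assumptions.

```python
ASR_TEMPLATE = "Transcribe the following speech in {language} into {language} text."
TRANSLATE_TEMPLATE = "Transcribe in {src_lang}, then translate to {target_lang}."
MAX_AUDIO_SECONDS = 30

def asr_prompt(language):
    """Fill the ASR template for one language."""
    return ASR_TEMPLATE.format(language=language)

def translation_prompt(src_lang, target_lang):
    """Fill the transcribe-then-translate template."""
    return TRANSLATE_TEMPLATE.format(src_lang=src_lang, target_lang=target_lang)

def chunk_spans(duration_s, max_len=MAX_AUDIO_SECONDS):
    """Split [0, duration_s) into consecutive spans of at most max_len seconds."""
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len, duration_s)
        spans.append((start, end))
        start = end
    return spans
```

A 75-second recording would be split into three spans, each of which can be sent with the same prompt and the transcripts concatenated afterwards.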

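Video input

Processed as a sequence of image frames:

Maximum: 60 seconds; at 1 fps that is 60 frames
Use a low token budget per frame (70-140)
E2B/E4B can process the audio track at the same time

# Pass the frames as a list of images
# `frames` is assumed to be a list of up to 60 images sampled at 1 fps
inputs = processor(
    text=text,
    images=frames,
    return_tensors="pt"
)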
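
The arithmetic behind the per-frame budget advice is simple: frames times budget. The sketch below samples timestamps at 1 fps, caps the clip at the 60-second limit, and computes an upper bound on visual tokens under the assumption that every frame uses its full budget. Helper names are hypothetical.

```python
ALLOWED_BUDGETS = (70, 140, 280, 560, 1120)  # from the budget table
MAX_VIDEO_SECONDS = 60

def frame_timestamps(duration_s, fps=1.0):
    """Timestamps (in seconds) of the sampled frames, capped at the 60 s limit."""
    usable = min(duration_s, MAX_VIDEO_SECONDS)
    count = int(usable * fps)
    return [i / fps for i in range(count)]

def visual_tokens(num_frames, budget_per_frame):
    """Upper bound on visual tokens if every frame spends its whole budget."""
    if budget_per_frame not in ALLOWED_BUDGETS:
        raise ValueError(f"budget must be one of {ALLOWED_BUDGETS}")
    return num_frames * budget_per_frame

stamps = frame_timestamps(90)          # a 90 s clip is truncated to 60 frames
low = visual_tokens(len(stamps), 70)
high = visual_tokens(len(stamps), 1120)
```

At the lowest budget a full-length clip costs a few thousand tokens, while the highest budget would consume a large share of the context window on its own.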

Modality order

Always place images/audio before text within the content list:

# Correct order
content = [
    {"type": "image", "image": img},
    {"type": "text",  "text": "Describe it."},
]

# Wrong: putting text first causes alignment problems
content = [
    {"type": "text",  "text": "Describe it."},
    {"type": "image", "image": img},
]
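
If content lists are assembled programmatically, a small normalising step can enforce this ordering before the chat template is applied. A minimal sketch, with the hypothetical helper name `media_first`; it treats every part whose type is not "text" as media and preserves the relative order within each group.

```python
def media_first(content):
    """Return a copy of the content list with non-text parts moved ahead of text."""
    media = [part for part in content if part.get("type") != "text"]
    text = [part for part in content if part.get("type") == "text"]
    return media + text

# Placeholder string stands in for a real image object.
mixed = [
    {"type": "text", "text": "Describe it."},
    {"type": "image", "image": "img-placeholder"},
]
ordered = media_first(mixed)
```

The input list is left unmodified, so the helper is safe to call on shared message templates.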

HuggingFace

$ pip install -U transformers accelerate

BF16 (default)

from transformers import (
    AutoProcessor, AutoModelForCausalLM
)
import torch

mid = "google/gemma-4-31b-it"
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid,
    dtype=torch.bfloat16,  # same argument name as in the quick start
    device_map="auto"
)

4-bit quantization

from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    mid,
    quantization_config=bnb,
    device_map="auto"
)

Ollama

$ ollama pull gemma4     # 31B (default)
$ ollama pull gemma4:e4b # edge 4B variant
$ ollama pull gemma4:e2b # edge 2B variant
$ ollama run  gemma4     # interactive chat

Custom GGUF (Modelfile)

FROM /path/to/fine-tuned.gguf
SYSTEM "You are a coding assistant."

$ ollama create mygemma -f Modelfile
$ ollama run mygemma

vLLM Server

$ vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

OpenAI-compatible API endpoint: http://localhost:8000/v1
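
A client for this endpoint needs nothing beyond the standard library. The sketch below builds a chat-completions request against the base URL above, using the model name from the `vllm serve` command; the sampling values mirror the parameters table, and the helper names are assumptions. `chat` requires the server to be running, while `build_chat_request` can be inspected offline.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "google/gemma-4-31B-it"  # same name as passed to `vllm serve`

def build_chat_request(messages, max_tokens=512, temperature=1.0, top_p=0.95):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": MODEL,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(messages, **kwargs):
    """Send the request and return the assistant text (server must be up)."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

request = build_chat_request([{"role": "user", "content": "Explain MoE briefly."}])
```

Because the server speaks the OpenAI wire format, the same payload should also work from any OpenAI-compatible SDK pointed at this base URL.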

Cloud & API

Platform	Notes
Gemini API	`gemma-4-31b-it`
AI Studio	Browser playground
Vertex AI	Custom endpoints
Cloud Run	Serverless GPU
GKE + vLLM	Autoscaling

Edge & mobile

Runtime	Use case
AICore (Android)	System-level API
LiteRT-LM	IoT, Raspberry Pi
AI Edge Gallery	On-device evaluation
LM Studio	Desktop GUI
llama.cpp	CPU/GPU hybrid

QLoRA configuration

A single 16 GB GPU (T4 / free Colab / Kaggle) is enough:

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "google/gemma-4-E4B-it",
    load_in_4bit=True,
    max_seq_length=4096
)

model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj",
        "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)
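
To get a feel for what `r=16` and `lora_alpha=16` mean, recall that LoRA adds two low-rank matrices per adapted weight: A with shape (r, d_in) and B with shape (d_out, r), scaled by lora_alpha / r in the standard formulation. The layer dimensions below are hypothetical and not taken from Gemma 4.

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters added to one (d_out x d_in) weight by LoRA."""
    return r * (d_in + d_out)

def lora_scale(lora_alpha, r):
    """Scaling applied to the low-rank update in standard LoRA."""
    return lora_alpha / r

# Hypothetical square projection, for illustration only.
D_IN = D_OUT = 2048
added = lora_added_params(D_IN, D_OUT, r=16)
fraction = added / (D_IN * D_OUT)
scale = lora_scale(lora_alpha=16, r=16)
```

With the rank and alpha set equal, as in the configuration above, the update is applied at scale 1.0, and the adapter trains only a small percentage of each targeted matrix.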

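Vision fine-tuning (E2B / E4B)

from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "google/gemma-4-E4B-it",
    load_in_4bit=True,
)

# The finetune_* switches are arguments of get_peft_model
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,   # freeze the vision encoder
    finetune_language_layers=True,
)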

Data requirements

Requirement	Recommended value
CoT data share	≥ 75% of the training set
Thinking format	Includes the `<|think|>` trigger token
Multimodal order	Images/audio before text
Chat template	ShareGPT or OpenAI format
RL reward	Verifiable final answers
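
The CoT share requirement is easy to check before training. The sketch below assumes a hypothetical dataset schema in which each sample is a dict and a non-empty "thinking" field marks a chain-of-thought example; adapt the predicate to whatever format the dataset actually uses.

```python
def cot_share(samples):
    """Fraction of samples that carry a non-empty reasoning trace."""
    if not samples:
        return 0.0
    with_cot = sum(1 for sample in samples if sample.get("thinking"))
    return with_cot / len(samples)

def meets_cot_target(samples, minimum=0.75):
    """True when the CoT share reaches the recommended minimum."""
    return cot_share(samples) >= minimum

# Toy dataset (assumed schema): three CoT samples and one plain sample.
dataset = [
    {"prompt": "2+2?", "thinking": "2 plus 2 is 4.", "response": "4"},
    {"prompt": "3*3?", "thinking": "3 times 3 is 9.", "response": "9"},
    {"prompt": "10/2?", "thinking": "10 over 2 is 5.", "response": "5"},
    {"prompt": "Hi", "response": "Hello!"},
]
ok = meets_cot_target(dataset)
```

Running the check at data-loading time makes it harder to dilute the reasoning data when extra plain samples are added later.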

Specialized models

Model	Domain	Notes
MedGemma 4B	Medical imaging	Multimodal X-ray/MRI analysis
MedGemma 27B	Clinical text	EHR + medical report reasoning
CodeGemma	Coding	Code completion and refactoring
PaliGemma 2	Vision-language	Fine-grained VLM and visual reasoning
ShieldGemma	Safety	Safety classifier for LLM outputs
DataGemma	Factual data	Grounded in Google Data Commons
FunctionGemma	Tool calling	Low-resource function-call parsing
