Gemma 4 is a family of open-weight multimodal large language models released by Google DeepMind under Apache 2.0. It spans sizes from mobile edge deployments to server-class deployments and offers frontier-level capabilities.
from transformers import AutoProcessor, AutoModelForCausalLM
import torch
MODEL_ID = "google/gemma-4-E4B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
dtype=torch.bfloat16,
device_map="auto"
)
messages = [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Explain MoE briefly."},
]
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
inputs = processor(
text=text, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(
outputs[0][input_len:], skip_special_tokens=False
)
result = processor.parse_response(response)
# result["thinking"] → internal reasoning chain
# result["response"] → final answer shown to the user
Released 2026-03-31 · Apache 2.0 · Built on Gemini 3 research · Supports 140+ languages
$ pip install -U transformers torch accelerate
All models are available on Hugging Face under google/<model-id>.
| Size | Model ID |
|---|---|
| E2B | gemma-4-E2B-it |
| E4B | gemma-4-E4B-it |
| 31B | gemma-4-31b-it |
| 26B A4B | gemma-4-26b-a4b-it |
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
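As a minimal sketch, the sampling values in the table above could be collected into keyword arguments for `model.generate`. The helper name `generation_kwargs` is hypothetical, and `do_sample=True` is an assumption added here (the table does not list it); in Transformers it is what makes temperature, top_p and top_k take effect.

```python
# Sampling values copied from the table above.
# do_sample=True is an added assumption, not part of the table.
SAMPLING_KWARGS = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}


def generation_kwargs(max_new_tokens=512, **overrides):
    """Merge the table's defaults with per-call overrides."""
    kwargs = {"max_new_tokens": max_new_tokens, **SAMPLING_KWARGS}
    kwargs.update(overrides)
    return kwargs


# Usage with an already loaded model and inputs (not executed here):
# outputs = model.generate(**inputs, **generation_kwargs())
```

Keeping the defaults in one dictionary makes it easy to override a single value per call, for example `generation_kwargs(temperature=0.7)`.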
| Model | Architecture | Total params | Active per token | Layers | Context | Modalities |
|---|---|---|---|---|---|---|
| E2B | Dense+PLE | 5.1B (2.3B eff) | 2.3B | 35 | 128K | Text+Image+Audio |
| E4B | Dense+PLE | 8B (4.5B eff) | 4.5B | 42 | 128K | Text+Image+Audio |
| 31B | Dense | 30.7B | 30.7B | 60 | 256K | Text+Image |
| 26B A4B | MoE | 25.2B | 3.8B | 30 | 256K | Text+Image |
Sliding window: 512 tokens (E2B/E4B) · 1024 tokens (31B/26B) · Vocabulary: 262K (same across all models)
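To make the "active per token" column concrete, the short sketch below computes the share of parameters used per token for each model. The numbers are copied from the table above; the dictionary layout and function name are illustrative only.

```python
# (total, active-per-token) parameter counts in billions, copied from the table above.
MODELS = {
    "E2B": (5.1, 2.3),
    "E4B": (8.0, 4.5),
    "31B": (30.7, 30.7),
    "26B A4B": (25.2, 3.8),
}


def active_fraction(name):
    """Share of a model's parameters that are active for each token."""
    total, active = MODELS[name]
    return active / total


for name in MODELS:
    print(f"{name}: {active_fraction(name):.1%} of parameters active per token")
```

For the MoE model this comes out to roughly 15%, which is why its per-token compute is closer to a 4B model than to a 25B one.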
| Model | BF16 (16-bit) | 8-bit | 4-bit |
|---|---|---|---|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| E4B | 15 GB | 7.5 GB | 5 GB |
| 31B | 58.3 GB | 30.4 GB | 17.4 GB |
| 26B A4B | 48 GB | 25 GB | 15.6 GB |
Figures cover base weights only and exclude the additional memory used by the KV cache.
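A rough way to sanity-check weight-only figures like these is parameters × bits ÷ 8. The sketch below applies that rule to the total parameter counts from the architecture table in this document. It is an estimate under simple assumptions (no quantization metadata, no runtime overhead), so it will not reproduce the table exactly, and the quantized columns in particular can be noticeably higher than this estimate.

```python
def weight_memory_gib(total_params, bits_per_weight):
    """Rough weight-only footprint in GiB: parameters x bits / 8 bytes."""
    return total_params * bits_per_weight / 8 / 2**30


# Total parameter counts taken from the architecture table in this document.
PARAMS = {"E2B": 5.1e9, "E4B": 8e9, "31B": 30.7e9, "26B A4B": 25.2e9}

for name, params in PARAMS.items():
    print(f"{name}: ~{weight_memory_gib(params, 16):.1f} GiB at 16-bit")
```

The KV cache comes on top of this and grows with context length and batch size, so leave headroom beyond the weight estimate.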
| Available VRAM | Recommended model |
|---|---|
| < 5 GB | E2B (4-bit) |
| 5–8 GB | E4B (4-bit) |
| 15–20 GB | E4B (BF16) |
| 24–32 GB | 31B (4-bit) |
| 48–80 GB | 31B (BF16) |
| High-throughput workloads | 26B A4B |
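The table above can be turned into a small lookup, sketched here under two assumptions that are not in the source: VRAM values that fall between two listed ranges fall back to the lower tier, and the throughput-oriented 26B A4B row is left out because it is not keyed on VRAM. The function name `recommend` is hypothetical.

```python
# Lower bound (GB) of each VRAM tier from the table above, largest first.
TIERS = [
    (48, "31B (BF16)"),
    (24, "31B (4-bit)"),
    (15, "E4B (BF16)"),
    (5, "E4B (4-bit)"),
    (0, "E2B (4-bit)"),
]


def recommend(vram_gb):
    """Return the largest tier whose lower bound fits in vram_gb."""
    for lower_bound, model in TIERS:
        if vram_gb >= lower_bound:
            return model
    raise ValueError("vram_gb must be non-negative")


print(recommend(16))  # falls in the 15–20 GB row of the table
```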
| Benchmark | 31B | 26B A4B | E4B | E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| MMMLU (multilingual) | 88.4% | 86.3% | 76.6% | 67.4% | 70.7% |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| Tau2 avg (agentic) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |
| HLE no tools | 19.5% | 8.7% | n/a | n/a | n/a |
| HLE with search | 26.5% | 17.2% | n/a | n/a | n/a |
All results above are for instruction-tuned models with thinking mode enabled.
| Benchmark | 31B | 26B A4B | E4B | E2B |
|---|---|---|---|---|
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% |
| MedXPertQA MM | 61.3% | 58.1% | 28.7% | 23.5% |
| OmniDocBench ↓ | 0.131 | 0.149 | 0.181 | 0.290 |
OmniDocBench = document edit distance (lower is better).
| Benchmark | 31B | 26B A4B | E4B | E2B |
|---|---|---|---|---|
| MRCR v2 128K | 66.4% | 44.1% | 25.4% | 19.1% |
| Model | ELO | Open Rank |
|---|---|---|
| Gemma 4 31B | 1452 | #3 |
| Gemma 4 26B A4B | 1441 | #6 |
Add <|think|> at the start of the system prompt:
messages = [
{
"role": "system",
"content": "<|think|>You are a math expert."
},
{"role": "user", "content": "Solve: 3x + 7 = 22"}
]
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True
)
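# Tokenize the rendered prompt and record its length so only new tokens are decoded
inputs = processor(
text=text, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=2048)
response = processor.decode(
outputs[0][input_len:], skip_special_tokens=False
)
result = processor.parse_response(response)
# result["thinking"] → step-by-step reasoning
# result["response"] → final answer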
<|channel>thought
[internal step-by-step reasoning, hidden from the user]
<channel|>
[final answer visible to the user]
For 31B/26B A4B, the empty tags are still emitted even with enable_thinking=False:
<|channel>thought
<channel|>
[final answer]
E2B/E4B skip the empty tags entirely when thinking is disabled.
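To illustrate the output layout described above, here is a minimal regex-based sketch that separates the thinking block from the final answer. It is not the implementation behind `processor.parse_response`; the function name `split_thinking` is hypothetical, and other special tokens that may appear in real decoded output (such as turn markers) are not handled.

```python
import re

# Matches the thinking block delimited by the tokens shown above.
_THOUGHT = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)


def split_thinking(raw):
    """Return (thinking, answer); thinking is None when no block is present."""
    match = _THOUGHT.search(raw)
    if match is None:
        return None, raw.strip()
    return match.group(1).strip(), raw[match.end():].strip()


thinking, answer = split_thinking(
    "<|channel>thought\n3x = 15, so x = 5\n<channel|>\nx = 5"
)
print(thinking)
print(answer)
```

An empty tag pair, as emitted by the larger models with thinking disabled, yields an empty string for the thinking part rather than `None`, so callers can tell "no block" apart from "empty block".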
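| Token | Purpose |
|---|---|
| `<\|think\|>` | Enables thinking in the system prompt |
| `<\|channel>thought\n` | Opens the internal thinking block |
| `<channel\|>` | Closes the thinking block |
| `<\|turn>` | Opens a conversation turn |
| `<turn\|>` | Closes a conversation turn |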
Pass result["response"] back as the model-turn content.
max_new_tokens · enable_thinking=True
Variable aspect ratios are supported, with a configurable visual token budget:
| Budget | Use case |
|---|---|
| 70 | Fast classification, video frames |
| 140 | Image captioning, thumbnails |
| 280 | General image understanding |
| 560 | Charts, diagrams |
| 1120 | OCR, PDF parsing, fine-detail recognition |
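# The image must come before the text (hard requirement)
messages = [{"role": "user", "content": [
{"type": "image", "image": image},
{"type": "text", "text": "Describe this chart."},
]}]
# Render the chat template to a prompt string before calling the processor
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(
text=text,
images=image,
return_tensors="pt"
).to(model.device)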
Vision encoder: ~150M parameters (E2B/E4B) · ~550M parameters (31B/26B)
Audio is supported only on E2B and E4B (~300M-parameter audio encoder)
Transcribe the following speech in {LANGUAGE} into {LANGUAGE} text.
Transcribe in {SRC_LANG}, then translate to {TARGET_LANG}.
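The two prompt templates above can be filled in with ordinary string formatting. This is a small sketch; the helper names `asr_prompt` and `translation_prompt` are hypothetical, and only the template text comes from the source.

```python
# Template text copied from the prompts above.
ASR_TEMPLATE = "Transcribe the following speech in {LANGUAGE} into {LANGUAGE} text."
AST_TEMPLATE = "Transcribe in {SRC_LANG}, then translate to {TARGET_LANG}."


def asr_prompt(language):
    """Build the transcription prompt for one language."""
    return ASR_TEMPLATE.format(LANGUAGE=language)


def translation_prompt(src_lang, target_lang):
    """Build the transcribe-then-translate prompt."""
    return AST_TEMPLATE.format(SRC_LANG=src_lang, TARGET_LANG=target_lang)


print(asr_prompt("English"))
print(translation_prompt("German", "English"))
```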
Video is handled as a sequence of image frames:
# Pass the frames as a list of images
inputs = processor(
text=text,
images=[frame1, frame2, ..., frame60],
return_tensors="pt"
)
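Since the snippet above passes a fixed list of frames, a longer clip needs to be subsampled first. The sketch below picks evenly spaced frame indices. The default of 60 simply mirrors the `frame60` in the example and is an assumption, not a documented limit; the function name is hypothetical.

```python
def sample_frame_indices(total_frames, max_frames=60):
    """Pick up to max_frames evenly spaced frame indices from a clip."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    n = min(total_frames, max_frames)
    # Integer arithmetic keeps indices in range and strictly increasing.
    return [i * total_frames // n for i in range(n)]


indices = sample_frame_indices(600)
print(len(indices), indices[:3])
```

The selected indices can then be used to pull frames from whatever video decoder is in use before handing them to the processor as the `images` list.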
Always place images/audio before text in the content:
# ✅ Correct order
content = [
{"type": "image", "image": img},
{"type": "text", "text": "Describe it."},
]
# ❌ Text first causes alignment problems
content = [
{"type": "text", "text": "Describe it."},
{"type": "image", "image": img},
]
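One way to enforce the ordering rule above is to normalize every content list before building the message. The sketch below does a stable reorder that moves image and audio parts to the front. The helper name `media_first` is hypothetical, and the set of media type names is an assumption based on the `"type"` values used in this document.

```python
# Part types treated as media; "audio" is assumed by analogy with "image".
MEDIA_TYPES = {"image", "audio"}


def media_first(content):
    """Stable reorder: image/audio parts first, everything else after."""
    media = [part for part in content if part.get("type") in MEDIA_TYPES]
    rest = [part for part in content if part.get("type") not in MEDIA_TYPES]
    return media + rest


content = media_first([
    {"type": "text", "text": "Describe it."},
    {"type": "image", "image": "img"},
])
print([part["type"] for part in content])
```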
$ pip install -U transformers accelerate
from transformers import (
AutoProcessor, AutoModelForCausalLM
)
import torch
mid = "google/gemma-4-31b-it"
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
mid,
dtype=torch.bfloat16,
device_map="auto"
)
from transformers import BitsAndBytesConfig
bnb = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
mid,
quantization_config=bnb,
device_map="auto"
)
$ ollama pull gemma4 # 31B (default)
$ ollama pull gemma4:e4b # edge 4B variant
$ ollama pull gemma4:e2b # edge 2B variant
$ ollama run gemma4 # interactive chat
FROM /path/to/fine-tuned.gguf
SYSTEM "You are a coding assistant."
$ ollama create mygemma -f Modelfile
$ ollama run mygemma
$ vllm serve google/gemma-4-31B-it \
--max-model-len 8192 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4
OpenAI-compatible API endpoint: http://localhost:8000/v1
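As a hedged sketch of how a client might call this endpoint, the example below builds a chat-completions request with only the standard library. The base URL and model name come from the `vllm serve` command above; the helper names and the `max_tokens` default are illustrative, and `chat()` only works while the server is running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # address from the text above
MODEL = "google/gemma-4-31B-it"        # model name passed to `vllm serve`


def build_chat_request(prompt, max_tokens=256):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt):
    """Send the request; requires the vLLM server to be running."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Separating request construction from sending keeps the payload easy to inspect and test without a live server.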
| Platform | Notes |
|---|---|
| Gemini API | gemma-4-31b-it |
| AI Studio | Browser playground |
| Vertex AI | Custom endpoints |
| Cloud Run | Serverless GPU |
| GKE + vLLM | Autoscaling |
| Runtime | Use case |
|---|---|
| AICore (Android) | System-level API |
| LiteRT-LM | IoT, Raspberry Pi |
| AI Edge Gallery | On-device evaluation |
| LM Studio | Desktop GUI |
| llama.cpp | Hybrid CPU/GPU |
A single 16 GB GPU (T4 / free Colab / Kaggle) is sufficient:
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
"google/gemma-4-E4B-it",
load_in_4bit=True,
max_seq_length=4096
)
model = FastModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=[
"q_proj", "k_proj",
"v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
)
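To give a feel for what `r=16` and `lora_alpha=16` mean, the sketch below uses the standard LoRA bookkeeping: each adapted weight matrix gains two low-rank factors, and the update is scaled by `lora_alpha / r`. The 4096 × 4096 projection size is hypothetical and chosen only for illustration; it is not a Gemma 4 dimension taken from the source.

```python
def lora_param_count(d_in, d_out, r=16):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(lora_alpha=16, r=16):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Hypothetical 4096 x 4096 projection, purely for illustration.
print(lora_param_count(4096, 4096))
print(lora_scale())
```

With `lora_alpha` equal to `r`, the scale is 1.0, so the low-rank update is applied without extra amplification.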
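from unsloth import FastVisionModel
model, tokenizer = FastVisionModel.from_pretrained(
"google/gemma-4-E4B-it",
load_in_4bit=True,
)
# In Unsloth, the layer-selection flags are arguments of get_peft_model
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers=False, # freeze the vision encoder
finetune_language_layers=True,
r=16,
lora_alpha=16,
)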
For 26B A4B, full fine-tuning disrupts expert routing:
Start from r=16, lora_alpha=16.
| Requirement | Recommended value |
|---|---|
| CoT data share | ≥ 75% of the training set |
| Thinking format | Includes the `<\|think\|>` trigger token |
| Multimodal order | Images/audio before text |
| Chat template | ShareGPT or OpenAI format |
| RL reward | Verifiable final answers |
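The CoT share requirement in the table above can be checked before training. The sketch below assumes a hypothetical dataset schema in which each example is a dict with an optional `"thinking"` field holding its reasoning trace; real datasets will need their own accessor.

```python
def cot_share(examples):
    """Fraction of examples that carry a non-empty reasoning trace."""
    if not examples:
        return 0.0
    with_cot = sum(1 for ex in examples if ex.get("thinking"))
    return with_cot / len(examples)


def meets_cot_target(examples, minimum=0.75):
    """True when the share of CoT examples reaches the recommended minimum."""
    return cot_share(examples) >= minimum


# Hypothetical toy dataset: three examples with reasoning, one without.
dataset = [
    {"thinking": "3x = 15, so x = 5", "response": "x = 5"},
    {"thinking": "2 + 2 = 4", "response": "4"},
    {"thinking": "Paris is the capital", "response": "Paris"},
    {"thinking": "", "response": "Hello!"},
]
print(cot_share(dataset), meets_cot_target(dataset))
```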
finetune_vision_layers=False
| Model | Domain | Description |
|---|---|---|
| MedGemma 4B | Medical imaging | Multimodal X-ray/MRI analysis |
| MedGemma 27B | Clinical text | Reasoning over EHRs and physician reports |
| CodeGemma | Coding | Code completion and refactoring |
| PaliGemma 2 | Vision-language | Fine-grained VLM and visual reasoning |
| ShieldGemma | Safety | Safety classifier for LLM outputs |
| DataGemma | Factual data | Grounded in Google Data Commons |
| FunctionGemma | Tool calling | Low-resource function-call parsing |
Libraries and frameworks: google-deepmind/gemma · google-gemma/gemma-cookbook · google/adk-samples (starter agents)
Community collaboration