Thinking Mode
A built-in reasoning capability in which a language model generates step-by-step internal reasoning before producing its final answer. The reasoning is visible in the output and controllable via special tokens in the system prompt and the output structure.
Mechanism
Thinking mode is triggered by including a special token (<|think|>) in the system prompt. When enabled, the model outputs:
```
<|channel>thought
[Internal reasoning — step-by-step work]
<channel|>
[Final answer — the response shown to the user]
```
The thought block contains the model’s reasoning process; the final answer is everything after the closing token. Libraries such as Hugging Face Transformers provide parse_response() functions to separate these components programmatically.
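As a concrete illustration, the separation can be done with plain string handling. This is a minimal sketch, not a library implementation: the delimiter tokens are taken from this note, and the `parse_response` name here merely mirrors the library utilities mentioned above.

```python
# Sketch: split raw model output into (thought, answer) using the
# delimiter tokens described in this note. Illustrative only.
THOUGHT_OPEN = "<|channel>thought"
THOUGHT_CLOSE = "<channel|>"

def parse_response(text: str) -> tuple[str, str]:
    """Return (thought, answer); thought is "" when no thought block exists."""
    if THOUGHT_OPEN not in text:
        return "", text.strip()
    after_open = text.split(THOUGHT_OPEN, 1)[1]
    thought, _, answer = after_open.partition(THOUGHT_CLOSE)
    return thought.strip(), answer.strip()

raw = "<|channel>thought\n2 + 2: add the operands.\n<channel|>\nThe answer is 4."
thought, answer = parse_response(raw)
# thought == "2 + 2: add the operands.", answer == "The answer is 4."
```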
Configuration
Thinking mode is controlled at inference time:
Enable: Set enable_thinking=True in the chat template. The model generates a populated thought block before the answer.
Disable: Set enable_thinking=False. Behavior varies by model size:
- E2B/E4B: No thinking generated, output is direct
- 26B A4B/31B: Thinking is still generated internally, but the output shows an empty thought block followed by the answer
The system prompt token (<|think|>) must match the setting: present when thinking is enabled, absent when it is disabled.
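The coupling between the flag and the system prompt token can be sketched as a tiny prompt-rendering helper. The `<|think|>` token is from this note; the `render_prompt` function and the role tags it emits are illustrative assumptions, not a real chat template.

```python
# Sketch: an enable_thinking flag that toggles the <|think|> token in the
# system prompt. The role markup below is hypothetical.
def render_prompt(system: str, user: str, enable_thinking: bool) -> str:
    # Prepend the thinking trigger token to the system prompt when enabled.
    if enable_thinking:
        system = "<|think|>" + system
    return f"<system>{system}</system>\n<user>{user}</user>\n"

with_thinking = render_prompt("You are helpful.", "Hi", enable_thinking=True)
without_thinking = render_prompt("You are helpful.", "Hi", enable_thinking=False)
```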
When to use thinking mode
Enable for:
- Mathematical reasoning
- Multi-step logic problems
- Complex coding tasks
- Tasks requiring explicit planning
- Debugging and analysis
Disable for:
- Simple factual questions
- Conversational exchanges
- Latency-sensitive applications
- Token budget constraints
Thinking adds tokens to every response. In long multi-turn conversations, this compounds rapidly.
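Rough arithmetic makes the compounding concrete. The per-turn token counts below are illustrative assumptions, not measured values for any model.

```python
# Back-of-the-envelope cost of carrying thinking blocks in conversation
# history. All token counts here are assumed for illustration.
answer_tokens = 150      # assumed average final-answer length
thinking_tokens = 400    # assumed average thought-block length
turns = 20

# Context consumed by prior assistant turns when the final turn is sent:
history_with_thinking = (answer_tokens + thinking_tokens) * (turns - 1)
history_answers_only = answer_tokens * (turns - 1)

print(history_with_thinking, history_answers_only)  # 10450 vs 2850
```

Under these assumptions, keeping thinking blocks in history inflates the carried context by more than 3x by turn 20.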
Best practice: exclude thinking from history
In multi-turn conversations, the conversation history should include only final answers, not thinking content. Including previous thinking blocks in the context wastes tokens and doesn’t improve performance — the model re-derives reasoning at each turn.
Standard chat template libraries handle this automatically via the parse_response() function.
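If a harness stores raw model output per turn, the pruning step can be sketched directly. This assumes assistant turns are stored with the delimiter tokens from this note; the `strip_thinking` helper is illustrative, not a library function.

```python
# Sketch: rebuild conversation history with only final answers, dropping
# any thought block from stored assistant turns.
def strip_thinking(raw: str) -> str:
    # Drop everything up to and including the closing thought token.
    close = "<channel|>"
    if close in raw:
        return raw.split(close, 1)[1].strip()
    return raw.strip()

history = [
    {"role": "user", "content": "What is 3 * 7?"},
    {"role": "assistant",
     "content": "<|channel>thought\n3 * 7 = 21.\n<channel|>\n21."},
]
clean = [
    {**turn, "content": strip_thinking(turn["content"])}
    if turn["role"] == "assistant" else turn
    for turn in history
]
# clean[1]["content"] == "21."
```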
Performance impact
Gemma 4 31B with thinking enabled:
- AIME 2026 (no tools): 89.2%
- GPQA Diamond: 84.3%
- Tau2 (average): 76.9%
These results are frontier-level for open models at this size. The thinking capability is trained into the model, not a post-hoc prompting technique.
Contrast with chain-of-thought prompting
| Approach | Mechanism | Visibility | Controllability |
|---|---|---|---|
| Chain-of-thought prompting | Prompt engineering | In-context examples | Implicit |
| Thinking mode (built-in) | Trained capability | Special tokens | Explicit flag |
Chain-of-thought (CoT) prompting asks the model to “think step by step” via natural-language instructions. Thinking mode is trained directly into the model’s generation process and controlled via structured tokens, which yields more consistent formatting and a cleaner separation between reasoning and output.
Structured reasoning in agents
For agentic workflows, thinking mode enables observable reasoning. The harness can log thinking blocks separately, compress them differently from answers, or surface them in debugging interfaces.
Thinking blocks can also inform tool-calling decisions — the model may reason about which tools to invoke before actually calling them.
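One way a harness could surface this is to route thought blocks onto a dedicated logger while passing only the answer onward. The logger name and the `handle_model_output` hook are assumptions for illustration; the delimiter tokens follow this note.

```python
# Sketch: log the thought block separately so an agent's reasoning is
# observable without mixing it into the answer stream.
import logging

reasoning_log = logging.getLogger("agent.reasoning")

def handle_model_output(raw: str) -> str:
    open_tok, close_tok = "<|channel>thought", "<channel|>"
    if open_tok in raw and close_tok in raw:
        thought = raw.split(open_tok, 1)[1].split(close_tok, 1)[0].strip()
        reasoning_log.debug("thought: %s", thought)  # surfaced for debugging only
        raw = raw.split(close_tok, 1)[1]
    return raw.strip()  # only the final answer flows onward
```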
Comparison to o1-style reasoning
OpenAI’s o1 model family uses a similar reasoning mechanism but with more extensive training on reasoning tasks and longer internal reasoning chains. Gemma 4’s thinking mode is a lighter-weight implementation available in open models at multiple scales.
Implementation notes
The <|channel>thought and <channel|> tokens are special vocabulary entries. The model is trained to use these tokens to delimit reasoning from output. Libraries must parse these tokens correctly to extract the final answer.
For Gemma 4 26B A4B and 31B, thinking is always computed (internal) but can be suppressed in output. This is an architectural property — the models rely on intermediate reasoning even when not surfaced.
Connections
- Gemma 4 Model Card: all Gemma 4 models support thinking mode
- Context window compression: thinking blocks add tokens, requiring compression in long sessions
- Hermes Agent: agent frameworks can leverage thinking mode for observable reasoning