Thinking Mode

concept
reasoning · chain-of-thought · model-capability

A built-in reasoning capability where language models generate step-by-step internal reasoning before producing their final answer. The thinking process is visible and controllable via special tokens in the system prompt and output structure.

Mechanism

Thinking mode is triggered by including a special token (<|think|>) in the system prompt. When enabled, the model outputs:

<|channel>thought
[Internal reasoning — step-by-step work]
<channel|>
[Final answer — the response shown to the user]

The thought block contains the model’s reasoning process. The final answer is what follows. Libraries like Hugging Face Transformers provide parse_response() functions to separate these components programmatically.
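The separation can be sketched with a minimal parser. This is an illustrative sketch, not the actual Transformers `parse_response()` implementation; the delimiter strings follow the output structure shown above, and `split_response` is a hypothetical helper name.

```python
# Minimal sketch of separating the thought block from the final answer.
# Delimiters follow the output structure above; split_response is an
# illustrative helper, not the real Transformers parse_response() API.
THOUGHT_OPEN = "<|channel>thought"
THOUGHT_CLOSE = "<channel|>"

def split_response(text: str) -> tuple[str, str]:
    """Return (thinking, final_answer) from a raw model output string."""
    if THOUGHT_OPEN in text and THOUGHT_CLOSE in text:
        _, rest = text.split(THOUGHT_OPEN, 1)
        thinking, answer = rest.split(THOUGHT_CLOSE, 1)
        return thinking.strip(), answer.strip()
    # No thought block: the whole output is the answer.
    return "", text.strip()

raw = "<|channel>thought\nFirst, compute 2 + 2.\n<channel|>\nThe answer is 4."
thinking, answer = split_response(raw)
```

Library-provided parsers do the same job robustly across templates, so prefer them when available.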

Configuration

Thinking mode is controlled at inference time:

Enable: Set enable_thinking=True in the chat template. The model generates a populated thought block before the answer.

Disable: Set enable_thinking=False. Behavior varies by model size:

  • E2B/E4B: No thinking is generated; the output is direct
  • 26B A4B/31B: Thinking is still generated internally, but the output shows an empty thought block followed by the answer

The system-prompt token (<|think|>) must be included or omitted to match the enable_thinking setting.
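The relationship between the flag and the token can be sketched as follows. `build_system_prompt` is a hypothetical helper that mirrors what a chat template's enable_thinking flag does; in practice the real logic lives inside the model's chat template.

```python
# Sketch: enable_thinking maps to the presence of the <|think|> token in the
# system prompt. build_system_prompt is a hypothetical illustration of what
# the chat template does with the flag.
THINK_TOKEN = "<|think|>"

def build_system_prompt(system_text: str, enable_thinking: bool) -> str:
    """Prepend the thinking trigger token when thinking mode is enabled."""
    if enable_thinking:
        return f"{THINK_TOKEN}\n{system_text}"
    return system_text

with_thinking = build_system_prompt("You are a helpful assistant.", True)
without_thinking = build_system_prompt("You are a helpful assistant.", False)
```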

When to use thinking mode

Enable for:

  • Mathematical reasoning
  • Multi-step logic problems
  • Complex coding tasks
  • Tasks requiring explicit planning
  • Debugging and analysis

Disable for:

  • Simple factual questions
  • Conversational exchanges
  • Latency-sensitive applications
  • Token budget constraints

Thinking adds tokens to every response. In long multi-turn conversations, this compounds rapidly.

Best practice: exclude thinking from history

In multi-turn conversations, the conversation history should include only final answers, not thinking content. Including previous thinking blocks in the context wastes tokens and doesn’t improve performance — the model re-derives reasoning at each turn.

Standard chat template libraries handle this automatically via the parse_response() function.
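The pruning step can be sketched directly. This is a minimal illustration of what `parse_response()`-style helpers automate, assuming the delimiter tokens shown earlier and a standard role/content message format.

```python
# Sketch: keep only final answers in multi-turn history, dropping thought
# blocks from assistant turns. Delimiters follow the structure shown earlier;
# final_answer is an illustrative helper, not a library API.
def final_answer(text: str) -> str:
    """Strip the thought block, keeping only the text after the delimiter."""
    close = "<channel|>"
    return text.split(close, 1)[1].strip() if close in text else text.strip()

history = [
    {"role": "user", "content": "What is 12 * 7?"},
    {"role": "assistant",
     "content": "<|channel>thought\n12 * 7 = 84\n<channel|>\n84."},
]

# Rebuild history with thinking removed from assistant turns.
pruned = [
    {**turn, "content": final_answer(turn["content"])}
    if turn["role"] == "assistant" else turn
    for turn in history
]
```

Only `pruned` is sent back to the model on the next turn, so the thought tokens never accumulate in context.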

Performance impact

Gemma 4 31B with thinking enabled:

  • AIME 2026 (no tools): 89.2%
  • GPQA Diamond: 84.3%
  • Tau2 (average): 76.9%

These results are frontier-level for open models at this size. The thinking capability is trained into the model, not a post-hoc prompting technique.

Contrast with chain-of-thought prompting

| Approach | Mechanism | Visibility | Controllability |
| --- | --- | --- | --- |
| Chain-of-thought prompting | Prompt engineering | In-context examples | Implicit |
| Thinking mode (built-in) | Trained capability | Special tokens | Explicit flag |

Chain-of-thought prompting (CoT) asks the model to “think step by step” via natural language instructions. Thinking mode is trained directly into the model’s generation process and controlled via structured tokens. Thinking mode produces more consistent formatting and clearer separation between reasoning and output.

Structured reasoning in agents

For agentic workflows, thinking mode enables observable reasoning. The harness can log thinking blocks separately, compress them differently from answers, or surface them in debugging interfaces.

Thinking blocks can also inform tool-calling decisions — the model may reason about which tools to invoke before actually calling them.
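A harness can route the two channels to separate sinks. This is a sketch under the assumption that the harness has already split a turn into thinking and answer strings; the logger names and `handle_turn` helper are illustrative.

```python
# Sketch: an agent harness logging thinking separately from answers so that
# reasoning is observable in debugging without cluttering the transcript.
# Logger names and handle_turn are hypothetical, not a real framework API.
import logging

reasoning_log = logging.getLogger("agent.reasoning")
answer_log = logging.getLogger("agent.answers")

def handle_turn(thinking: str, answer: str) -> str:
    """Record both channels, but only the answer joins the transcript."""
    reasoning_log.debug(thinking)  # surfaced only in debugging interfaces
    answer_log.info(answer)        # part of the visible conversation
    return answer

result = handle_turn("Plan: call the search tool first.", "Searching now.")
```

Because the channels are separate, the harness can also apply different retention or compression policies to each.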

Comparison to o1-style reasoning

OpenAI’s o1 model family uses a similar reasoning mechanism but with more extensive training on reasoning tasks and longer internal reasoning chains. Gemma 4’s thinking mode is a lighter-weight implementation available in open models at multiple scales.

Implementation notes

The <|channel>thought and <channel|> tokens are special vocabulary entries. The model is trained to use these tokens to delimit reasoning from output. Libraries must parse these tokens correctly to extract the final answer.

For Gemma 4 26B A4B and 31B, thinking is always computed (internal) but can be suppressed in output. This is an architectural property — the models rely on intermediate reasoning even when not surfaced.

Connections