Thinking Mode

concept
reasoning · chain-of-thought · model-capability

A built-in reasoning capability where language models generate step-by-step internal reasoning before producing their final answer. The thinking process is visible and controllable via special tokens in the system prompt and output structure.

Mechanism

Thinking mode is triggered by including a special token (<|think|>) in the system prompt. When enabled, the model outputs:

<|channel>thought
[Internal reasoning — step-by-step work]
<channel|>
[Final answer — the response shown to the user]

The thought block contains the model’s reasoning process. The final answer is what follows. Libraries like Hugging Face Transformers provide parse_response() functions to separate these components programmatically.
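The separation can be sketched with a minimal parser. This is an illustrative sketch, not the actual Transformers `parse_response()` implementation; the delimiter strings follow the output structure shown above, and `split_response` is a hypothetical helper name.

```python
# Minimal sketch of separating the thought block from the final answer.
# Delimiters follow the output structure above; split_response is an
# illustrative helper, not the real Transformers parse_response() API.
THOUGHT_OPEN = "<|channel>thought"
THOUGHT_CLOSE = "<channel|>"

def split_response(text: str) -> tuple[str, str]:
    """Return (thinking, final_answer) from a raw model output string."""
    if THOUGHT_OPEN in text and THOUGHT_CLOSE in text:
        _, rest = text.split(THOUGHT_OPEN, 1)
        thinking, answer = rest.split(THOUGHT_CLOSE, 1)
        return thinking.strip(), answer.strip()
    # No thought block: the whole output is the answer.
    return "", text.strip()

raw = "<|channel>thought\nFirst, compute 2 + 2.\n<channel|>\nThe answer is 4."
thinking, answer = split_response(raw)
```

Library-provided parsers do the same job robustly across templates, so prefer them when available.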

Configuration

Thinking mode is controlled at inference time:

Enable: Set enable_thinking=True in the chat template. The model generates a populated thought block before the answer.

Disable: Set enable_thinking=False. Behavior varies by model size:

  • E2B/E4B: No thinking is generated; the output is direct
  • 26B A4B/31B: Thinking is still generated internally, but the output shows an empty thought block followed by the answer

The system-prompt token (<|think|>) must be included or omitted to match the enable_thinking setting.
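The relationship between the flag and the token can be sketched as follows. `build_system_prompt` is a hypothetical helper that mirrors what a chat template's enable_thinking flag does; in practice the real logic lives inside the model's chat template.

```python
# Sketch: enable_thinking maps to the presence of the <|think|> token in the
# system prompt. build_system_prompt is a hypothetical illustration of what
# the chat template does with the flag.
THINK_TOKEN = "<|think|>"

def build_system_prompt(system_text: str, enable_thinking: bool) -> str:
    """Prepend the thinking trigger token when thinking mode is enabled."""
    if enable_thinking:
        return f"{THINK_TOKEN}\n{system_text}"
    return system_text

with_thinking = build_system_prompt("You are a helpful assistant.", True)
without_thinking = build_system_prompt("You are a helpful assistant.", False)
```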

When to use thinking mode

Enable for:

  • Mathematical reasoning
  • Multi-step logic problems
  • Complex coding tasks
  • Tasks requiring explicit planning
  • Debugging and analysis

Disable for:

  • Simple factual questions
  • Conversational exchanges
  • Latency-sensitive applications
  • Token budget constraints

Thinking adds tokens to every response. In long multi-turn conversations, this compounds rapidly.

Best practice: exclude thinking from history

In multi-turn conversations, the conversation history should include only final answers, not thinking content. Including previous thinking blocks in the context wastes tokens and doesn’t improve performance — the model re-derives reasoning at each turn.

Standard chat template libraries handle this automatically via the parse_response() function.
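The pruning step can be sketched directly. This is a minimal illustration of what `parse_response()`-style helpers automate, assuming the delimiter tokens shown earlier and a standard role/content message format.

```python
# Sketch: keep only final answers in multi-turn history, dropping thought
# blocks from assistant turns. Delimiters follow the structure shown earlier;
# final_answer is an illustrative helper, not a library API.
def final_answer(text: str) -> str:
    """Strip the thought block, keeping only the text after the delimiter."""
    close = "<channel|>"
    return text.split(close, 1)[1].strip() if close in text else text.strip()

history = [
    {"role": "user", "content": "What is 12 * 7?"},
    {"role": "assistant",
     "content": "<|channel>thought\n12 * 7 = 84\n<channel|>\n84."},
]

# Rebuild history with thinking removed from assistant turns.
pruned = [
    {**turn, "content": final_answer(turn["content"])}
    if turn["role"] == "assistant" else turn
    for turn in history
]
```

Only `pruned` is sent back to the model on the next turn, so the thought tokens never accumulate in context.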

Performance impact

Gemma 4 31B with thinking enabled:

  • AIME 2026 (no tools): 89.2%
  • GPQA Diamond: 84.3%
  • Tau2 (average): 76.9%

These results are frontier-level for open models at this size. The thinking capability is trained into the model, not a post-hoc prompting technique.

Contrast with chain-of-thought prompting

| Approach | Mechanism | Visibility | Controllability |
| --- | --- | --- | --- |
| Chain-of-thought prompting | Prompt engineering | In-context examples | Implicit |
| Thinking mode (built-in) | Trained capability | Special tokens | Explicit flag |

Chain-of-thought prompting (CoT) asks the model to “think step by step” via natural language instructions. Thinking mode is trained directly into the model’s generation process and controlled via structured tokens. Thinking mode produces more consistent formatting and clearer separation between reasoning and output.

Structured reasoning in agents

For agentic workflows, thinking mode enables observable reasoning. The harness can log thinking blocks separately, compress them differently from answers, or surface them in debugging interfaces.

Thinking blocks can also inform tool-calling decisions — the model may reason about which tools to invoke before actually calling them.
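A harness can route the two channels to separate sinks. This is a sketch under the assumption that the harness has already split a turn into thinking and answer strings; the logger names and `handle_turn` helper are illustrative.

```python
# Sketch: an agent harness logging thinking separately from answers so that
# reasoning is observable in debugging without cluttering the transcript.
# Logger names and handle_turn are hypothetical, not a real framework API.
import logging

reasoning_log = logging.getLogger("agent.reasoning")
answer_log = logging.getLogger("agent.answers")

def handle_turn(thinking: str, answer: str) -> str:
    """Record both channels, but only the answer joins the transcript."""
    reasoning_log.debug(thinking)  # surfaced only in debugging interfaces
    answer_log.info(answer)        # part of the visible conversation
    return answer

result = handle_turn("Plan: call the search tool first.", "Searching now.")
```

Because the channels are separate, the harness can also apply different retention or compression policies to each.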

Comparison to o1-style reasoning

OpenAI’s o1 model family uses a similar reasoning mechanism but with more extensive training on reasoning tasks and longer internal reasoning chains. Gemma 4’s thinking mode is a lighter-weight implementation available in open models at multiple scales.

Implementation notes

The <|channel>thought and <channel|> tokens are special vocabulary entries. The model is trained to use these tokens to delimit reasoning from output. Libraries must parse these tokens correctly to extract the final answer.

For Gemma 4 26B A4B and 31B, thinking is always computed (internal) but can be suppressed in output. This is an architectural property — the models rely on intermediate reasoning even when not surfaced.

Connections