Large language models (LLMs) with Chain-of-Thought (CoT) reasoning exhibit self-reflection: the ability to review and revise previous reasoning steps. In this work, we study self-reflection through the lens of representation engineering. We segment the model's reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that can control reflection frequency. We call our framework ReflCtrl. Our experiments show that this stepwise steering reliably modulates reflection and, compared with token-level suppression, better preserves accuracy while reducing reasoning tokens.
In this paper, we study self-reflection in modern reasoning LLMs (e.g., DeepSeek-R1, QwQ-32B, and distilled R1 variants) from a representation-engineering viewpoint. Instead of simply suppressing reflection tokens, we find a latent direction that aligns with reflection and then use that direction to steer the model.
Two research questions guide the work: (1) is there a latent direction in the model's activations that encodes self-reflection, and (2) can that direction be used to control how often the model reflects during generation?
We propose ReflCtrl, a framework that segments reasoning traces into steps, identifies reflection steps using keywords, extracts a latent reflection direction from model activations, and applies it to steer the model's reflection.
Figure 1: ReflCtrl segments a reasoning trace into steps, detects reflection steps via keywords, extracts a reflection direction as the mean difference of reflection vs. non-reflection activations, and injects this direction back during generation to control reflection frequency.
We observe that reasoning outputs are naturally separated by blank lines (\n\n). Each segment is treated as the
smallest unit for analysis and intervention, which matches how current R1-style models narrate
their thinking. (See Appendix A in the paper for an example trace.)
Reflection steps are detected using reflection-related keywords such as “Wait” and “Let me check”, which are widely used by reasoning models. These steps form the positive set \( R \); the remaining steps, excluding the final conclusion step, form the negative set \(NR\).
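As a concrete illustration, the segmentation and reflection labeling can be sketched in a few lines of Python. The keyword list and helper names below are illustrative, not the paper's exact implementation; the effective keyword set may differ across model families.

```python
# Illustrative keyword list: the paper names cues such as "Wait" and
# "Let me check"; the exact set may vary per model family.
REFLECTION_KEYWORDS = ("wait", "let me check", "let me verify")

def segment_steps(reasoning_trace: str) -> list[str]:
    """Split a reasoning trace into steps at blank-line delimiters."""
    return [s.strip() for s in reasoning_trace.split("\n\n") if s.strip()]

def label_steps(steps: list[str]) -> tuple[list[int], list[int]]:
    """Return indices of reflection steps (R) and non-reflection steps (NR).

    The final conclusion step is excluded from NR, as in the paper.
    """
    R, NR = [], []
    for i, step in enumerate(steps):
        head = step.lower()
        if any(kw in head[:60] for kw in REFLECTION_KEYWORDS):
            R.append(i)
        elif i != len(steps) - 1:  # skip the final conclusion step
            NR.append(i)
    return R, NR
```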
At each layer \(l\), for each step \(s\), we extract the attention and MLP outputs at the first token of the step (the point where reflection starts): \( z^{\text{attn}}_l(s), z^{\text{mlp}}_l(s) \). The reflection direction is then defined as the mean difference (Zou et al., 2023)
\[
d_l = \frac{1}{|R|} \sum_{s \in R} z_l(s) - \frac{1}{|NR|} \sum_{s \in NR} z_l(s),
\]
where \(R\) and \(NR\) are the reflection and non-reflection sets defined above. This is the core latent direction that encodes "reflection".
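Given the first-token activations for each step at one layer, the mean-difference extraction is a one-line computation. Collecting those activations (e.g., via forward hooks during a pass over the trace) is assumed to have happened upstream; this is only a sketch of the formula above.

```python
import torch

def reflection_direction(step_acts: torch.Tensor,
                         R: list[int], NR: list[int]) -> torch.Tensor:
    """Mean-difference reflection direction d_l for one layer and one site
    (attention or MLP output).

    step_acts: [num_steps, hidden_dim] tensor holding the activation at the
    first token of each step.
    """
    return step_acts[R].mean(dim=0) - step_acts[NR].mean(dim=0)
```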
Instead of steering every token, ReflCtrl adds the direction only when a new reasoning step starts (i.e., when the model generates the delimiter \n\n):
\[
z^{\text{intv}}_l = z_l + \lambda d_l,
\]
where \(\lambda\) controls how strongly reflection is suppressed or encouraged. Restricting the intervention to step boundaries avoids pushing the model too far off-distribution, which often happens with full-token steering.
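A minimal PyTorch sketch of this stepwise injection as a forward hook on a decoder layer is shown below. The gen_state bookkeeping, delimiter_ids, and the wiring in the trailing comments are assumptions about a HuggingFace-style model, not the paper's exact code; how the blank-line delimiter is tokenized depends on the tokenizer.

```python
import torch

def stepwise_steering_hook(d_l: torch.Tensor, lam: float,
                           gen_state: dict, delimiter_ids: set):
    """Forward hook for one decoder layer: add lam * d_l to the hidden state
    only when the most recently generated token is the step delimiter
    (blank line), i.e. at the start of a new reasoning step.

    With d_l = mean(R) - mean(NR), a positive lam should encourage
    reflection and a negative lam suppress it.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if gen_state.get("last_token_id") in delimiter_ids:
            hidden = hidden + lam * d_l.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical wiring for a HuggingFace-style model: steer all but the
# first and last six layers, matching the paper's evaluation setup.
# layers = model.model.layers
# for l in range(6, len(layers) - 6):
#     layers[l].register_forward_hook(
#         stepwise_steering_hook(directions[l], lam, gen_state, delimiter_ids))
```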
Figure 2: The extracted reflection direction effectively controls the model's reflection.
The paper evaluates on GSM8K, MATH-500, and several MMLU subsets (professional accounting, high-school computer science, and formal logic) using open reasoning models such as DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and QwQ-32B. Steering is applied to all but the first and last six layers, which was found to give the best robustness. As a baseline, we compare with NoWait, which directly suppresses reflection tokens at every decoding step.
Figure 3: Compared with NoWait, our method better preserves performance while still effectively reducing the number of reasoning tokens used.
Key findings: the extracted reflection direction reliably controls the model's reflection (Figure 2); compared with NoWait, ReflCtrl better preserves accuracy while reducing reasoning tokens (Figure 3); and stepwise steering preserves accuracy better than all-token steering (Figure 4).
Figure 4: Stepwise steering preserves accuracy better than all-token steering.
The method currently relies on keyword-based reflection detection, which may be model-specific; moreover, closed-source models (e.g., GPT-4, Claude) may not expose the internal activations needed for the same steering method.
Ge Yan, Chung-En Sun, and Tsui-Wei (Lily) Weng. “ReflCtrl: Controlling LLM Reflection via Representation Engineering.” NeurIPS Mechanistic Interpretability Workshop, 2025.
@inproceedings{
yan2025reflctrl,
title={ReflCtrl: Controlling {LLM} Reflection via Representation Engineering},
author={Ge Yan and Chung-En Sun and Tsui-Wei Weng},
booktitle={Mechanistic Interpretability Workshop at NeurIPS 2025},
year={2025},
url={https://openreview.net/forum?id=ungnJ4O0AD}
}