Large language models (LLMs) with Chain-of-Thought (CoT) reasoning exhibit self-reflection: the ability to review and revise previous reasoning steps. In this work, we study self-reflection through the lens of representation engineering. We segment the model's reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that can control reflection frequency. We call our framework ReflCtrl. Our experiments show that this stepwise steering reliably modulates reflection and, compared with token-level suppression, better preserves accuracy while reducing reasoning tokens.
In this paper, we study self-reflection in modern reasoning LLMs (e.g., DeepSeek-R1, QwQ-32B, and distilled R1 variants) from a representation-engineering viewpoint. Instead of simply suppressing reflection tokens, we find a latent direction that aligns with reflection and then use that direction to steer the model.
Two research questions guide the work: (1) is there a latent direction in the model's activations that encodes self-reflection, and (2) can that direction be used to control how often the model reflects during generation?
We propose ReflCtrl, a framework that segments reasoning traces into steps, identifies reflection steps using keywords, extracts a latent reflection direction from model activations, and applies it to steer the model's reflection.
Figure 1: ReflCtrl segments a reasoning trace into steps, detects reflection steps via keywords, extracts a reflection direction as the mean difference of reflection vs. non-reflection activations, and injects this direction back during generation to control reflection frequency.
We observe that reasoning outputs are naturally separated by blank lines (\n\n). Each segment is treated as the
smallest unit for analysis and intervention, which matches how current R1-style models narrate
their thinking. (See Appendix A in the paper for an example trace.)
Reflection steps are detected using reflection-related keywords such as “Wait” and “Let me check”, which are widely used by reasoning models. These steps form the positive set \( R \); the remaining steps, excluding the final conclusion step, form the negative set \(NR\).
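As a concrete illustration, the segmentation and reflection labeling can be sketched in a few lines of Python. The keyword list and helper names below are illustrative, not the paper's exact implementation; the effective keyword set may differ across model families.

```python
# Illustrative keyword list: the paper names cues such as "Wait" and
# "Let me check"; the exact set may vary per model family.
REFLECTION_KEYWORDS = ("wait", "let me check", "let me verify")

def segment_steps(reasoning_trace: str) -> list[str]:
    """Split a reasoning trace into steps at blank-line delimiters."""
    return [s.strip() for s in reasoning_trace.split("\n\n") if s.strip()]

def label_steps(steps: list[str]) -> tuple[list[int], list[int]]:
    """Return indices of reflection steps (R) and non-reflection steps (NR).

    The final conclusion step is excluded from NR, as in the paper.
    """
    R, NR = [], []
    for i, step in enumerate(steps):
        head = step.lower()
        if any(kw in head[:60] for kw in REFLECTION_KEYWORDS):
            R.append(i)
        elif i != len(steps) - 1:  # skip the final conclusion step
            NR.append(i)
    return R, NR
```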
At each layer \(l\), for each step \(s\), we extract the attention and MLP outputs at the first token of the step (the point where reflection starts): \( z^{\text{attn}}_l(s), z^{\text{mlp}}_l(s) \). The reflection direction is then defined as the mean difference (Zou et al., 2023)
\[
d_l = \frac{1}{|R|} \sum_{s \in R} z_l(s) - \frac{1}{|NR|} \sum_{s \in NR} z_l(s),
\]
where \(R\) and \(NR\) are the reflection and non-reflection sets defined above. This is the core latent direction that encodes "reflection".
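Given the first-token activations for each step at one layer, the mean-difference extraction is a one-line computation. Collecting those activations (e.g., via forward hooks during a pass over the trace) is assumed to have happened upstream; this is only a sketch of the formula above.

```python
import torch

def reflection_direction(step_acts: torch.Tensor,
                         R: list[int], NR: list[int]) -> torch.Tensor:
    """Mean-difference reflection direction d_l for one layer and one site
    (attention or MLP output).

    step_acts: [num_steps, hidden_dim] tensor holding the activation at the
    first token of each step.
    """
    return step_acts[R].mean(dim=0) - step_acts[NR].mean(dim=0)
```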
Instead of steering every token, ReflCtrl adds the direction only when a new reasoning step starts (i.e., when the model generates the delimiter \n\n):
\[
z^{\text{intv}}_l = z_l + \lambda d_l,
\]
where \(\lambda\) controls how strongly reflection is suppressed or encouraged. Restricting the intervention to step boundaries avoids pushing the model too far off-distribution, which often happens with full-token steering.
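A minimal PyTorch sketch of this stepwise injection as a forward hook on a decoder layer is shown below. The gen_state bookkeeping, delimiter_ids, and the wiring in the trailing comments are assumptions about a HuggingFace-style model, not the paper's exact code; how the blank-line delimiter is tokenized depends on the tokenizer.

```python
import torch

def stepwise_steering_hook(d_l: torch.Tensor, lam: float,
                           gen_state: dict, delimiter_ids: set):
    """Forward hook for one decoder layer: add lam * d_l to the hidden state
    only when the most recently generated token is the step delimiter
    (blank line), i.e. at the start of a new reasoning step.

    With d_l = mean(R) - mean(NR), a positive lam should encourage
    reflection and a negative lam suppress it.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if gen_state.get("last_token_id") in delimiter_ids:
            hidden = hidden + lam * d_l.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical wiring for a HuggingFace-style model: steer all but the
# first and last six layers, matching the paper's evaluation setup.
# layers = model.model.layers
# for l in range(6, len(layers) - 6):
#     layers[l].register_forward_hook(
#         stepwise_steering_hook(directions[l], lam, gen_state, delimiter_ids))
```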
Figure 2: The extracted reflection direction effectively controls the model's reflection.
The paper evaluates on GSM8K, MATH-500, and several MMLU subsets (professional accounting, high-school computer science, and formal logic) using open reasoning models such as DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and QwQ-32B. Steering is applied to all but the first and last six layers, which was found to give the best robustness. As a baseline, we compare with NoWait, which directly suppresses reflection tokens at every decoding step.
Figure 3: Compared with NoWait, our method better preserves performance while still effectively reducing the number of reasoning tokens used.
Key findings: the extracted reflection direction reliably controls the model's reflection (Figure 2); compared with NoWait, ReflCtrl better preserves accuracy while reducing reasoning tokens (Figure 3); and stepwise steering preserves accuracy better than all-token steering (Figure 4).
Figure 4: Stepwise steering preserves accuracy better than all-token steering.
The method currently relies on keyword-based reflection detection, which may be model-specific; moreover, closed-source models (e.g., GPT-4, Claude) may not expose the internal activations needed for the same steering method.
Ge Yan, Chung-En Sun, and Tsui-Wei (Lily) Weng. “ReflCtrl: Controlling LLM Reflection via Representation Engineering.” NeurIPS Mechanistic Interpretability Workshop, 2025.
@inproceedings{
yan2025reflctrl,
title={ReflCtrl: Controlling {LLM} Reflection via Representation Engineering},
author={Ge Yan and Chung-En Sun and Tsui-Wei Weng},
booktitle={Mechanistic Interpretability Workshop at NeurIPS 2025},
year={2025},
url={https://openreview.net/forum?id=ungnJ4O0AD}
}