ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

UCSD
arXiv 2025

Abstract

Reasoning models have focused heavily on accuracy and token efficiency, often overlooking trustworthiness. ReFIne is a training framework that makes large reasoning models more interpretable, faithful, and reliable without sacrificing utility. It uses a two-stage pipeline—supervised fine-tuning to learn a structured, tag-based reasoning format, followed by GRPO to reinforce structural compliance, explicit cross-section references, and calibrated self-assessed confidence. Across math benchmarks and model scales, ReFIne produces clearer traces, more transparent use of information, and more informative confidence estimates, while remaining competitive in accuracy and improving reasoning efficiency.


Figure 1: Overview comparing standard LRMs vs. ReFIne with gains in interpretability, faithfulness, and reliability.


Trustworthy Reasoning: Definition and Motivation

While prior work on LRMs has largely emphasized accuracy and efficiency, we argue that a reasoning model is trustworthy only if it satisfies the following three dimensions:

  • Interpretability: the model presents its reasoning in a clear, structured form that humans can follow.
  • Faithfulness: the stated reasoning transparently reflects the information and premises the model actually relies on.
  • Reliability: the model verbalizes a confidence estimate that aligns with the actual correctness of its answer.


The ReFIne Framework

ReFIne trains models in two stages, combining supervised fine-tuning for structured output with reinforcement learning to enhance trustworthy behavior.

Stage 1. SFT for Structural Reasoning Format

During fine-tuning, the model learns to produce structured outputs that explicitly follow the ReFIne reasoning format. The output is organized into the following hierarchical sections:

  • understanding: a restatement of what the problem is asking
  • facts: the given quantities and conditions
  • plan: a numbered outline of the solution steps
  • think: the detailed reasoning, explicitly referencing the preceding sections
  • final_answer: the worked solution and final result
  • self_assessment: a review of the reasoning with a verbalized confidence score

This structured format is taught via supervised fine-tuning on curated data where each example adheres to this template.
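Such a template can also be checked mechanically. The sketch below assumes angle-bracket tags named after the six sections shown in the output showcase; the tag names and the validator itself are illustrative, not taken from the paper's code:

```python
import re

# Hypothetical tag names, inferred from the model output showcase below.
SECTIONS = ["understanding", "facts", "plan", "think",
            "final_answer", "self_assessment"]

def follows_template(trace: str) -> bool:
    """Return True if every section appears exactly once, in order."""
    pattern = "".join(rf"<{s}>.*?</{s}>\s*" for s in SECTIONS)
    return re.fullmatch(pattern, trace.strip(), flags=re.DOTALL) is not None
```

A check of this kind can serve both as an SFT data filter and as the basis for a structure reward in the RL stage.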

Stage 2. GRPO for Enhancing Trustworthy Reasoning

In the second stage, Group Relative Policy Optimization (GRPO) refines the model using a multi-component reward function. For each prompt x and gold answer a, a generated trace y is scored on four dimensions:

  • Correctness (\(r_{\text{corr}}\)): whether the final answer matches the gold answer a.
  • Structure (\(r_{\text{struct}}\)): compliance with the ReFIne section format.
  • References (\(r_{\text{ref}}\)): explicit cross-section references within the reasoning.
  • Confidence (\(r_{\text{conf}}\)): whether the model reports a reliable self-assessed confidence.

The total reward is the weighted sum of these components:

$$R(y\mid x,a) = \alpha r_{\text{corr}} + \beta r_{\text{struct}} + \gamma r_{\text{ref}} + \zeta r_{\text{conf}},$$

where all weights are set equally (\(\alpha=\beta=\gamma=\zeta=0.25\)). This balanced objective reinforces correctness, structural compliance, cross-section coherence, and reliable confidence reporting.
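Concretely, the total reward can be sketched as a small function. Here the four component scores are assumed to be precomputed in [0, 1] by separate checkers; this is a sketch under those assumptions, not the paper's implementation:

```python
def refine_reward(r_corr, r_struct, r_ref, r_conf,
                  alpha=0.25, beta=0.25, gamma=0.25, zeta=0.25):
    """Weighted sum of the four reward components, R(y | x, a).

    Each component is assumed to lie in [0, 1]; with the paper's equal
    weights, the total reward is also in [0, 1].
    """
    return alpha * r_corr + beta * r_struct + gamma * r_ref + zeta * r_conf

# A trace that is correct, well-structured, cross-referenced, and
# confidence-annotated earns the maximum reward of 1.0.
```

With equal weights, no single behavior can dominate training: a trace that is correct but unstructured and a trace that is structured but wrong both leave reward on the table.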


ReFIne Model Output Showcase

Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?

<understanding>
The problem is asking about how much money Janet makes each day at the farmers' market from selling duck eggs. It provides information about the number of eggs Janet's ducks lay daily, how many eggs Janet uses for breakfast, how many she uses for baking muffins, and the price per egg when she sells the remaining eggs. The task is to determine the daily income from the sales at the farmers' market, based on the number of eggs sold and the price per egg.
</understanding>
<facts>
  • Janet’s ducks lay 16 eggs per day.
  • Janet eats 3 eggs for breakfast every morning.
  • Janet bakes muffins for her friends every day using 4 eggs.
  • Janet sells the remainder of the eggs at the farmers' market daily.
  • Janet sells each fresh duck egg for $2.
</facts>
<plan>
  1. Determine the total number of eggs laid by Janet's ducks each day, which is given as 16.
  2. Calculate the total number of eggs used by Janet for personal consumption, which includes 3 eggs for breakfast and 4 eggs for baking muffins, totaling 7 eggs.
  3. Subtract the number of eggs used for personal consumption from the total number of eggs laid to find out how many eggs are left for sale at the farmers' market.
  4. Multiply the number of eggs available for sale by the price per egg, which is $2, to calculate the daily income from the farmers' market sales.
</plan>
<think>
Okay, let's tackle this problem step by step. First, I need to make sure I understand what's being asked. The question is about how much money Janet makes each day at the farmers' market by selling her ducks' eggs. The key here is to figure out how many eggs she has left after she uses some for her own breakfast and baking, and then multiply that by the price per egg.

Looking at the understanding section, it confirms that the goal is to calculate her daily income from the market sales. The facts provide the numbers: 16 eggs laid daily, 3 eaten for breakfast, 4 used for muffins, and each egg sold for $2. The plan outlines the steps clearly, so I should follow that.

Starting with step 1 from the plan, the total eggs laid per day are 16. That's straightforward. Then step 2 says to calculate the eggs used for personal consumption. According to the facts, she eats 3 for breakfast and uses 4 for muffins. Adding those together, 3 + 4 equals 7 eggs. So, she uses 7 eggs each day.

Moving to step 3, I need to subtract the eggs used from the total laid. So 16 total eggs minus 7 used equals 9 eggs left for sale. Wait, let me check that again. 16 minus 7 is indeed 9. So, 9 eggs are available for the market.

Step 4 is to multiply the number of eggs sold by the price per egg. The price is $2 per egg. So 9 eggs times $2 gives $18. That seems right. Let me verify each step again to be sure.

Total eggs: 16. Eggs used: 3 + 4 = 7. Remaining eggs: 16 - 7 = 9. Sales revenue: 9 * 2 = $18. Yep, that adds up. I don't see any mistakes in the facts or the plan. The calculations are simple arithmetic, so as long as each step is followed correctly, the answer should be correct. Therefore, Janet makes $18 every day at the farmers' market.
</think>
<final_answer>
To determine how much money Janet makes every day at the farmers' market, we follow the steps outlined in the plan and use the information provided in the facts.

---

Step 1: Total Eggs Laid Per Day
From the facts, we know that Janet’s ducks lay 16 eggs per day.

---

Step 2: Eggs Used for Personal Consumption
Janet uses 3 eggs for breakfast and 4 eggs for baking muffins.
$$\text{Total eggs used} = 3 + 4 = 7$$

---

Step 3: Eggs Available for Sale
To find the number of eggs left for sale, subtract the eggs used from the total laid:
$$\text{Eggs available for sale} = 16 - 7 = 9$$

---

Step 4: Revenue from Farmers' Market Sales
Each egg is sold for $2, so the daily income from sales is:
$$\text{Daily income} = 9 \times 2 = 18$$

---

Final Answer
$$\boxed{18}$$
</final_answer>
<self_assessment>
I followed the plan step by step, verifying each calculation and ensuring that the logic was consistent with the given facts. I double-checked the arithmetic and confirmed that the total eggs laid, the eggs used, and the remaining eggs for sale all align with the information provided. The final calculation of $18 seems accurate based on the given data. I do not see any contradictions, unverified assumptions, or logical gaps in my reasoning. Every step was clearly justified and internally consistent. I believe my reasoning is solid and well-supported.
Confidence: 10/10
</self_assessment>

Experiments

We train three ReFIne models—ReFIne-Qwen3-1.7B, ReFIne-Qwen3-4B, and ReFIne-Qwen3-8B—using the two-stage pipeline described above. Each model undergoes supervised fine-tuning on 10k structured traces followed by GRPO reinforcement on 2k problems. For comparison, baseline models Plain-Qwen3-{1.7B, 4B, 8B} are trained with identical data and compute budgets but using normal reasoning traces and correctness-only rewards. All other training hyperparameters are held constant to isolate the effects of structured reasoning and multi-component rewards.

We evaluate performance on four mathematical reasoning benchmarks—AIME-2024, GPQA-Diamond, MATH-500, and GSM8K—spanning diverse difficulty levels. We assess models along five complementary dimensions: interpretability, faithfulness, reliability, accuracy, and efficiency.

1. Interpretability

ReFIne improves interpretability by producing structured reasoning that explicitly links sections and maintains coherence. We assess two metrics: Format & References and Readability.


Table 1 (Format & References): Percentage of think sections that explicitly reference understanding / facts / plan. GRPO substantially strengthens the cross-section referencing behavior.
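As a rough illustration of how cross-section referencing might be detected, a keyword heuristic over the think section suffices; the paper's actual judge may be model-based, so treat this as a sketch:

```python
import re

def references_prior_sections(think_text: str) -> bool:
    """Heuristic: does the reasoning explicitly name an earlier section?"""
    return re.search(r"\b(understanding|facts|plan)\b",
                     think_text, flags=re.IGNORECASE) is not None
```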


Figure 2 (Readability): Readability comparison across datasets, judged by QwQ-32B. ReFIne produces clearer, easier-to-follow traces.

2. Faithfulness

Faithfulness measures whether reasoning traces genuinely reflect the internal solving process and stay grounded in disclosed context. We evaluate Disclosure Faithfulness (transparency about decisive information) and Commitment Faithfulness (consistency with self-declared premises).


Table 2 (Disclosure Faithfulness): ReFIne discloses decisive premises (e.g., hints embedded in the question) more transparently than Plain models.


Table 3 (Commitment faithfulness): We report the fraction of traces where think strictly follows understanding / facts / plan. ReFIne models almost always follow their own prior commitments.

3. Reliability

Reliability quantifies the model’s self-awareness: its ability to verbalize a confidence estimate for its answer and to align those estimates with actual correctness. We therefore assess reliability along Confidence Verbalization, AUROC (discrimination), and ECE (calibration).


Table 4 (Confidence Verbalization): Confidence verbalization rate. ReFIne nearly always reports confidence, unlike Plain.


Table 5 (Discrimination): AUROC, showing how well confidence distinguishes correct from incorrect answers. ReFIne achieves high AUROC. Plain models' AIME-2024 entries are marked in red: they rarely output confidence, making their AUROC unreliable.


Table 6 (Calibration): Expected Calibration Error (ECE). ReFIne is well calibrated, with consistently lower ECE. Plain models' AIME-2024 entries are marked in red: they rarely output confidence, making their ECE unreliable.
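Both reliability metrics have standard definitions and can be computed from per-question confidences and correctness labels. The sketch below uses those textbook definitions (binned ECE, Mann-Whitney AUROC), not the paper's evaluation code:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap
    |accuracy - mean confidence| per bin, weighted by bin size."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)
        bins[b].append((c, ok))
    ece = 0.0
    for bucket in bins:
        if bucket:
            acc = sum(ok for _, ok in bucket) / len(bucket)
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            ece += len(bucket) / n * abs(acc - avg_conf)
    return ece

def auroc(confidences, correct):
    """Probability that a correct answer receives higher confidence than
    an incorrect one (ties count half) -- the Mann-Whitney formulation."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Higher AUROC means confidence separates correct from incorrect answers; lower ECE means the stated confidence matches empirical accuracy.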

4. Accuracy and Efficiency

We further assess task-level utility, comparing final-answer accuracy and reasoning-token efficiency across datasets.


Figure 3 (Accuracy): Accuracy across benchmarks and scales. ReFIne matches Plain models' performance.


Figure 4 (Efficiency): Average reasoning length (tokens). ReFIne generates more concise reasoning traces.


Conclusion

We introduce ReFIne, the first training framework for Large Reasoning Models (LRMs) explicitly optimized for interpretability, faithfulness, and reliability—the three pillars of trustworthy reasoning. ReFIne improves interpretability by 44.0%, faithfulness by 18.8%, and reliability by 42.4% across four benchmarks and three model sizes, while maintaining similar accuracy and achieving 1.16× better reasoning efficiency.


Cite this work

Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng. "ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability", arXiv 2025.


@article{refine,
  title   = {ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability},
  author  = {Sun, Chung-En and Yan, Ge and Kulkarni, Akshay and Weng, Tsui-Wei},
  journal = {arXiv preprint},
  year    = {2025}
}