ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

UCSD
arXiv 2025

Abstract

Reasoning models have focused heavily on accuracy and token efficiency, often overlooking trustworthiness. ReFIne is a training framework that makes large reasoning models more interpretable, faithful, and reliable without sacrificing utility. It uses a two-stage pipeline—supervised fine-tuning to learn a structured, tag-based reasoning format, followed by GRPO to reinforce structural compliance, explicit cross-section references, and calibrated self-assessed confidence. Across math benchmarks and model scales, ReFIne produces clearer traces, more transparent use of information, and more informative confidence estimates, while remaining competitive in accuracy and improving reasoning efficiency.


Figure 1: Overview comparing standard LRMs vs. ReFIne with gains in interpretability, faithfulness, and reliability.


Trustworthy Reasoning: Definition and Motivation

While prior work on LRMs has largely emphasized accuracy and efficiency, we argue that a reasoning model is trustworthy only if it satisfies the following three dimensions:

  • Interpretability: the model presents its reasoning in a clear, structured form that humans can follow.
  • Faithfulness: the stated reasoning transparently reflects the information and premises the model actually relies on.
  • Reliability: the model verbalizes a confidence estimate that aligns with the actual correctness of its answer.


The ReFIne Framework

ReFIne trains models in two stages, combining supervised fine-tuning for structured output with reinforcement learning to enhance trustworthy behavior.

Stage 1. SFT for Structural Reasoning Format

During fine-tuning, the model learns to produce structured outputs that explicitly follow the ReFIne reasoning format. The output is organized into the following hierarchical sections:

  • understanding: a restatement of what the problem is asking
  • facts: the given quantities and conditions
  • plan: a numbered outline of the solution steps
  • think: the detailed reasoning, explicitly referencing the preceding sections
  • final_answer: the worked solution and final result
  • self_assessment: a review of the reasoning with a verbalized confidence score

This structured format is taught via supervised fine-tuning on curated data where each example adheres to this template.
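Such a template can also be checked mechanically. The sketch below assumes angle-bracket tags named after the six sections shown in the output showcase; the tag names and the validator itself are illustrative, not taken from the paper's code:

```python
import re

# Hypothetical tag names, inferred from the model output showcase below.
SECTIONS = ["understanding", "facts", "plan", "think",
            "final_answer", "self_assessment"]

def follows_template(trace: str) -> bool:
    """Return True if every section appears exactly once, in order."""
    pattern = "".join(rf"<{s}>.*?</{s}>\s*" for s in SECTIONS)
    return re.fullmatch(pattern, trace.strip(), flags=re.DOTALL) is not None
```

A check of this kind can serve both as an SFT data filter and as the basis for a structure reward in the RL stage.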

Stage 2. GRPO for Enhancing Trustworthy Reasoning

In the second stage, Group Relative Policy Optimization (GRPO) refines the model using a multi-component reward function. For each prompt x and gold answer a, a generated trace y is scored on four dimensions:

  • Correctness (\(r_{\text{corr}}\)): whether the final answer matches the gold answer a.
  • Structure (\(r_{\text{struct}}\)): compliance with the ReFIne section format.
  • References (\(r_{\text{ref}}\)): explicit cross-section references within the reasoning.
  • Confidence (\(r_{\text{conf}}\)): whether the model reports a reliable self-assessed confidence.

The total reward is the weighted sum of these components:

$$R(y\mid x,a) = \alpha r_{\text{corr}} + \beta r_{\text{struct}} + \gamma r_{\text{ref}} + \zeta r_{\text{conf}},$$

where all weights are set equally (\(\alpha=\beta=\gamma=\zeta=0.25\)). This balanced objective reinforces correctness, structural compliance, cross-section coherence, and reliable confidence reporting.
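Concretely, the total reward can be sketched as a small function. Here the four component scores are assumed to be precomputed in [0, 1] by separate checkers; this is a sketch under those assumptions, not the paper's implementation:

```python
def refine_reward(r_corr, r_struct, r_ref, r_conf,
                  alpha=0.25, beta=0.25, gamma=0.25, zeta=0.25):
    """Weighted sum of the four reward components, R(y | x, a).

    Each component is assumed to lie in [0, 1]; with the paper's equal
    weights, the total reward is also in [0, 1].
    """
    return alpha * r_corr + beta * r_struct + gamma * r_ref + zeta * r_conf

# A trace that is correct, well-structured, cross-referenced, and
# confidence-annotated earns the maximum reward of 1.0.
```

With equal weights, no single behavior can dominate training: a trace that is correct but unstructured and a trace that is structured but wrong both leave reward on the table.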


ReFIne Model Output Showcase

Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?

<understanding>
The problem is asking about how much money Janet makes each day at the farmers' market from selling duck eggs. It provides information about the number of eggs Janet's ducks lay daily, how many eggs Janet uses for breakfast, how many she uses for baking muffins, and the price per egg when she sells the remaining eggs. The task is to determine the daily income from the sales at the farmers' market, based on the number of eggs sold and the price per egg.
</understanding>
<facts>
  • Janet’s ducks lay 16 eggs per day.
  • Janet eats 3 eggs for breakfast every morning.
  • Janet bakes muffins for her friends every day using 4 eggs.
  • Janet sells the remainder of the eggs at the farmers' market daily.
  • Janet sells each fresh duck egg for $2.
</facts>
<plan>
  1. Determine the total number of eggs laid by Janet's ducks each day, which is given as 16.
  2. Calculate the total number of eggs used by Janet for personal consumption, which includes 3 eggs for breakfast and 4 eggs for baking muffins, totaling 7 eggs.
  3. Subtract the number of eggs used for personal consumption from the total number of eggs laid to find out how many eggs are left for sale at the farmers' market.
  4. Multiply the number of eggs available for sale by the price per egg, which is $2, to calculate the daily income from the farmers' market sales.
</plan>
<think>
Okay, let's tackle this problem step by step. First, I need to make sure I understand what's being asked. The question is about how much money Janet makes each day at the farmers' market by selling her ducks' eggs. The key here is to figure out how many eggs she has left after she uses some for her own breakfast and baking, and then multiply that by the price per egg.

Looking at the understanding section, it confirms that the goal is to calculate her daily income from the market sales. The facts provide the numbers: 16 eggs laid daily, 3 eaten for breakfast, 4 used for muffins, and each egg sold for $2. The plan outlines the steps clearly, so I should follow that.

Starting with step 1 from the plan, the total eggs laid per day are 16. That's straightforward. Then step 2 says to calculate the eggs used for personal consumption. According to the facts, she eats 3 for breakfast and uses 4 for muffins. Adding those together, 3 + 4 equals 7 eggs. So, she uses 7 eggs each day.

Moving to step 3, I need to subtract the eggs used from the total laid. So 16 total eggs minus 7 used equals 9 eggs left for sale. Wait, let me check that again. 16 minus 7 is indeed 9. So, 9 eggs are available for the market.

Step 4 is to multiply the number of eggs sold by the price per egg. The price is $2 per egg. So 9 eggs times $2 gives $18. That seems right. Let me verify each step again to be sure.

Total eggs: 16. Eggs used: 3 + 4 = 7. Remaining eggs: 16 - 7 = 9. Sales revenue: 9 * 2 = $18. Yep, that adds up. I don't see any mistakes in the facts or the plan. The calculations are simple arithmetic, so as long as each step is followed correctly, the answer should be correct. Therefore, Janet makes $18 every day at the farmers' market.
</think>
<final_answer>
To determine how much money Janet makes every day at the farmers' market, we follow the steps outlined in the plan and use the information provided in the facts.

---

Step 1: Total Eggs Laid Per Day
From the facts, we know that Janet’s ducks lay 16 eggs per day.

---

Step 2: Eggs Used for Personal Consumption
Janet uses 3 eggs for breakfast and 4 eggs for baking muffins.
$$\text{Total eggs used} = 3 + 4 = 7$$

---

Step 3: Eggs Available for Sale
To find the number of eggs left for sale, subtract the eggs used from the total laid:
$$\text{Eggs available for sale} = 16 - 7 = 9$$

---

Step 4: Revenue from Farmers' Market Sales
Each egg is sold for $2, so the daily income from sales is:
$$\text{Daily income} = 9 \times 2 = 18$$

---

Final Answer
$$\boxed{18}$$
</final_answer>
<self_assessment>
I followed the plan step by step, verifying each calculation and ensuring that the logic was consistent with the given facts. I double-checked the arithmetic and confirmed that the total eggs laid, the eggs used, and the remaining eggs for sale all align with the information provided. The final calculation of $18 seems accurate based on the given data. I do not see any contradictions, unverified assumptions, or logical gaps in my reasoning. Every step was clearly justified and internally consistent. I believe my reasoning is solid and well-supported.
Confidence: 10/10
</self_assessment>

Experiments

We train three ReFIne models—ReFIne-Qwen3-1.7B, ReFIne-Qwen3-4B, and ReFIne-Qwen3-8B—using the two-stage pipeline described above. Each model undergoes supervised fine-tuning on 10k structured traces followed by GRPO reinforcement on 2k problems. For comparison, baseline models Plain-Qwen3-{1.7B, 4B, 8B} are trained with identical data and compute budgets but using normal reasoning traces and correctness-only rewards. All other training hyperparameters are held constant to isolate the effects of structured reasoning and multi-component rewards.

We evaluate performance on four mathematical reasoning benchmarks—AIME-2024, GPQA-Diamond, MATH-500, and GSM8K—spanning diverse difficulty levels. We assess models along five complementary dimensions: interpretability, faithfulness, reliability, accuracy, and efficiency.

1. Interpretability

ReFIne improves interpretability by producing structured reasoning that explicitly links sections and maintains coherence. We assess two metrics: Format & References and Readability.


Table 1 (Format & References): Percentage of think sections that explicitly reference understanding / facts / plan. GRPO substantially strengthens the cross-section referencing behavior.
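As a rough illustration of how cross-section referencing might be detected, a keyword heuristic over the think section suffices; the paper's actual judge may be model-based, so treat this as a sketch:

```python
import re

def references_prior_sections(think_text: str) -> bool:
    """Heuristic: does the reasoning explicitly name an earlier section?"""
    return re.search(r"\b(understanding|facts|plan)\b",
                     think_text, flags=re.IGNORECASE) is not None
```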


Figure 2 (Readability): Readability comparison across datasets, judged by QwQ-32B. ReFIne produces clearer, easier-to-follow traces.

2. Faithfulness

Faithfulness measures whether reasoning traces genuinely reflect the internal solving process and stay grounded in disclosed context. We evaluate Disclosure Faithfulness (transparency about decisive information) and Commitment Faithfulness (consistency with self-declared premises).


Table 2 (Disclosure Faithfulness): ReFIne discloses decisive premises (e.g., hints embedded in the question) more transparently than Plain models.


Table 3 (Commitment faithfulness): We report the fraction of traces where think strictly follows understanding / facts / plan. ReFIne models almost always follow their own prior commitments.

3. Reliability

Reliability quantifies the model’s self-awareness: its ability to verbalize a confidence estimate for its answer and to align those estimates with actual correctness. We therefore assess reliability along Confidence Verbalization, AUROC (discrimination), and ECE (calibration).


Table 4 (Confidence Verbalization): Confidence verbalization rate. ReFIne nearly always reports confidence, unlike Plain.


Table 5 (Discrimination): AUROC, showing how well confidence distinguishes correct from incorrect answers. ReFIne achieves high AUROC. Plain models' AIME-2024 entries are marked in red: they rarely output confidence, making their AUROC unreliable.


Table 6 (Calibration): Expected Calibration Error (ECE). ReFIne is well calibrated, with consistently lower ECE. Plain models' AIME-2024 entries are marked in red: they rarely output confidence, making their ECE unreliable.
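Both reliability metrics have standard definitions and can be computed from per-question confidences and correctness labels. The sketch below uses those textbook definitions (binned ECE, Mann-Whitney AUROC), not the paper's evaluation code:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap
    |accuracy - mean confidence| per bin, weighted by bin size."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)
        bins[b].append((c, ok))
    ece = 0.0
    for bucket in bins:
        if bucket:
            acc = sum(ok for _, ok in bucket) / len(bucket)
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            ece += len(bucket) / n * abs(acc - avg_conf)
    return ece

def auroc(confidences, correct):
    """Probability that a correct answer receives higher confidence than
    an incorrect one (ties count half) -- the Mann-Whitney formulation."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Higher AUROC means confidence separates correct from incorrect answers; lower ECE means the stated confidence matches empirical accuracy.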

4. Accuracy and Efficiency

We further assess task-level utility, comparing final-answer accuracy and reasoning-token efficiency across datasets.


Figure 3 (Accuracy): Accuracy across benchmarks and scales. ReFIne matches Plain models' performance.


Figure 4 (Efficiency): Average reasoning length (tokens). ReFIne generates more concise reasoning traces.


Conclusion

We introduce ReFIne, the first training framework for Large Reasoning Models (LRMs) explicitly optimized for interpretability, faithfulness, and reliability—the three pillars of trustworthy reasoning. ReFIne improves interpretability by 44.0%, faithfulness by 18.8%, and reliability by 42.4% across four benchmarks and three model sizes, while maintaining similar accuracy and achieving 1.16× better reasoning efficiency.


Cite this work

Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng. "ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability", arXiv 2025.


@article{refine,
  title   = {ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability},
  author  = {Sun, Chung-En and Yan, Ge and Kulkarni, Akshay and Weng, Tsui-Wei},
  journal = {arXiv preprint},
  year    = {2025}
}