Evaluating Neuron Explanations:
A Unified Framework with Sanity Checks

UC San Diego
ICML 2025

Abstract

  • Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability.
  • This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are.
  • In this work we unify many existing explanation evaluation methods under one mathematical framework.
  • This allows us to compare existing evaluation metrics, understand the evaluation pipeline with greater clarity, and apply existing statistical methods to the evaluation.
  • In addition, we propose two simple sanity checks for evaluation metrics and show that many commonly used metrics fail these tests, leaving their scores unchanged even after massive changes to the concept labels.
  • Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.

Contributions

(I) Unified NeuronEval framework

We argue that most evaluations of individual neuron or SAE latent explanations can be formalized as measuring the similarity between two vectors: the neuron activation vector a_k and the concept vector c_t, which records the presence of the explanation concept t on each input (a minimal code sketch follows the list below). Existing evaluations then differ in a few key ways:
  1. Evaluation Metric - Which metric is used to measure the similarity between the vectors? Examples include Recall, IoU or Correlation.
  2. Source of concept vector c_t - How are the concept labels generated? For example, labels can come from crowdsourced raters, be generated by a model, or be taken from a labeled dataset.
  3. Domain and granularity - Whether the inputs are from the vision or language domain, and whether they are labeled as a whole or at a finer granularity, such as per-pixel or per-token.
With our framework, we unify diverse evaluations from 20 previous studies [1-20], allowing for clearer analysis and comparison.
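A minimal sketch of the core similarity computation, assuming binarized activations and binary concept labels (the function names and toy vectors are our own illustration, not the paper's reference code):

import numpy as np

def recall(a, c):
    # Fraction of concept-positive inputs on which the neuron fires.
    return (a & c).sum() / c.sum()

def iou(a, c):
    # Intersection over union of the two binary vectors.
    return (a & c).sum() / (a | c).sum()

def correlation(a, c):
    # Pearson correlation; also applies to real-valued activations.
    return np.corrcoef(a, c)[0, 1]

a_k = np.array([1, 0, 1, 1, 0])  # neuron k fires on inputs 0, 2, 3
c_t = np.array([1, 0, 1, 0, 0])  # concept t present on inputs 0, 2

print(recall(a_k, c_t), iou(a_k, c_t), correlation(a_k, c_t))
# -> 1.0, 0.667, 0.667: the explanation covers everything it labels but
# misses input 3, yet recall alone still looks perfect.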

Fig 1: Overview of the NeuronEval framework.


(II) Sanity Checks for Evaluation Metrics

Another goal of our paper is meta-evaluation: understanding which evaluation metrics are reliable to use when evaluating neuron explanations. To do this, we propose two necessary sanity checks that a good evaluation metric should pass (a toy simulation of both follows the list below).

  1. Extra Labels Test - Can the metric differentiate a perfect explanation from an overly generic explanation, i.e. a random superclass of the perfect explanation?
  2. Missing Labels Test - Can the metric differentiate a perfect explanation from an overly specific explanation, i.e. a random subclass of the perfect explanation?
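A toy simulation of both tests, assuming binarized activations and random label flips as the perturbation (the paper's exact perturbation protocol may differ):

import numpy as np

rng = np.random.default_rng(0)

def iou(a, c):
    return (a & c).sum() / (a | c).sum()

def recall(a, c):
    return (a & c).sum() / c.sum()

def extra_labels(c, frac=0.5):
    # Overly generic explanation: flip a random fraction of negatives to positive.
    c = c.copy()
    neg = np.flatnonzero(c == 0)
    c[rng.choice(neg, size=int(frac * len(neg)), replace=False)] = 1
    return c

def missing_labels(c, frac=0.5):
    # Overly specific explanation: flip a random fraction of positives to negative.
    c = c.copy()
    pos = np.flatnonzero(c == 1)
    c[rng.choice(pos, size=int(frac * len(pos)), replace=False)] = 0
    return c

a = rng.integers(0, 2, size=1000)  # binarized activations; the perfect explanation is c == a
for name, metric in [("IoU", iou), ("Recall", recall)]:
    print(name, metric(a, a), metric(a, extra_labels(a)), metric(a, missing_labels(a)))

IoU drops under both perturbations, so it passes. Recall stays at 1.0 under missing labels, since every remaining concept-positive input is still covered by the neuron, so it cannot detect overly specific explanations.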


Fig 2: Overview of our missing and extra labels tests.



Results

Meta-Evaluation 1: Missing and Extra Labels Test

  • We test 18 evaluation metrics across diverse setups, and find that almost all commonly used metrics fail at least one of these simple sanity checks (marked in red in Table 1).
  • Only Correlation, F1-score, IoU, Cosine similarity and AUPRC pass both tests.


Table 1: The results of our Missing and Extra Labels tests. A metric passes if the label perturbation decreases its score at least 90% of the time in each setting.


Meta-Evaluation 2: Neurons with Known Concepts

  • We also compare metrics using neurons whose true concept is known, such as final-layer neurons.
  • In this evaluation, we test over 1 million (neuron, explanation) pairs and measure, via AUPRC, whether each metric consistently assigns higher scores to pairs with correct explanations than to pairs with incorrect ones (a toy version is sketched below).
  • The metrics that passed our sanity checks also perform best here, with Correlation scoring highest overall, followed by Cosine similarity, AUPRC, IoU and F1-score.
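As a rough sketch of how this meta-evaluation can be computed (our own construction on synthetic data; the paper uses real neurons with known concepts and real candidate explanations):

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

scores, is_correct = [], []
for _ in range(1000):
    a = rng.integers(0, 2, size=500)        # neuron with a known concept
    c_true = a.copy()                        # correct explanation
    c_wrong = rng.integers(0, 2, size=500)   # random incorrect explanation
    for c, label in [(c_true, 1), (c_wrong, 0)]:
        scores.append(np.corrcoef(a, c)[0, 1])  # metric under test: Correlation
        is_correct.append(label)

# High AUPRC means the metric ranks correct explanations above incorrect ones.
print("meta-eval AUPRC:", average_precision_score(is_correct, scores))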

Table 2: The results of our neurons with known concepts evaluation.



Cite this work


T. Oikarinen, G. Yan and T.-W. Weng, Evaluating Neuron Explanations: A Unified Framework with Sanity Checks, ICML 2025.
        
@inproceedings{oikarinen2025evaluating,
    title={Evaluating Neuron Explanations: A Unified Framework with Sanity Checks},
    author={Oikarinen, Tuomas and Yan, Ge and Weng, Tsui-Wei},
    booktitle={International Conference on Machine Learning},
    year={2025}
}
        
        
