Evaluating Neuron Explanations:
A Unified Framework with Sanity Checks

UC San Diego
ICML 2025

Abstract

  • Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability.
  • This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are.
  • In this work we unify many existing explanation evaluation methods under one mathematical framework.
  • This allows us to compare existing evaluation metrics, understand the evaluation pipeline with greater clarity, and apply existing statistical methods to the evaluation.
  • In addition, we propose two simple sanity checks for evaluation metrics and show that many commonly used metrics fail these tests, leaving their scores unchanged even after massive changes to the concept labels.
  • Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.

Contributions

(I) Unified NeuronEval framework

We argue that most evaluations of individual neuron or SAE latent explanations can be formalized as measuring the similarity between two vectors: the neuron activation vector a_k and the concept vector c_t, which records the presence of the explanation concept t on each input (a minimal code sketch follows the list below). Existing evaluations then differ in a few key ways:
  1. Evaluation Metric - Which metric is used to measure the similarity between the vectors? Examples include Recall, IoU or Correlation.
  2. Source of concept vector c_t - How are the concept labels generated? For example, labels can come from crowdsourced raters, be generated by a model, or be taken from a labeled dataset.
  3. Domain and granularity - Whether the inputs are from the vision or language domain, and whether they are labeled as a whole or at a finer granularity, such as per-pixel or per-token.
With our framework, we unify diverse evaluations from 20 previous studies [1-20], allowing for clearer analysis and comparison.
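A minimal sketch of the core similarity computation, assuming binarized activations and binary concept labels (the function names and toy vectors are our own illustration, not the paper's reference code):

import numpy as np

def recall(a, c):
    # Fraction of concept-positive inputs on which the neuron fires.
    return (a & c).sum() / c.sum()

def iou(a, c):
    # Intersection over union of the two binary vectors.
    return (a & c).sum() / (a | c).sum()

def correlation(a, c):
    # Pearson correlation; also applies to real-valued activations.
    return np.corrcoef(a, c)[0, 1]

a_k = np.array([1, 0, 1, 1, 0])  # neuron k fires on inputs 0, 2, 3
c_t = np.array([1, 0, 1, 0, 0])  # concept t present on inputs 0, 2

print(recall(a_k, c_t), iou(a_k, c_t), correlation(a_k, c_t))
# -> 1.0, 0.667, 0.667: the explanation covers everything it labels but
# misses input 3, yet recall alone still looks perfect.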

Fig 1: Overview of the NeuronEval framework.


(II) Sanity Checks for Evaluation Metrics

Another goal of our paper is meta-evaluation: understanding which evaluation metrics are reliable to use when evaluating neuron explanations. To do this, we propose two necessary sanity checks that a good evaluation metric should pass (a toy simulation of both follows the list below).

  1. Extra Labels Test - Can the metric differentiate a perfect explanation from an overly generic explanation, i.e. a random superclass of the perfect explanation?
  2. Missing Labels Test - Can the metric differentiate a perfect explanation from an overly specific explanation, i.e. a random subclass of the perfect explanation?
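A toy simulation of both tests, assuming binarized activations and random label flips as the perturbation (the paper's exact perturbation protocol may differ):

import numpy as np

rng = np.random.default_rng(0)

def iou(a, c):
    return (a & c).sum() / (a | c).sum()

def recall(a, c):
    return (a & c).sum() / c.sum()

def extra_labels(c, frac=0.5):
    # Overly generic explanation: flip a random fraction of negatives to positive.
    c = c.copy()
    neg = np.flatnonzero(c == 0)
    c[rng.choice(neg, size=int(frac * len(neg)), replace=False)] = 1
    return c

def missing_labels(c, frac=0.5):
    # Overly specific explanation: flip a random fraction of positives to negative.
    c = c.copy()
    pos = np.flatnonzero(c == 1)
    c[rng.choice(pos, size=int(frac * len(pos)), replace=False)] = 0
    return c

a = rng.integers(0, 2, size=1000)  # binarized activations; the perfect explanation is c == a
for name, metric in [("IoU", iou), ("Recall", recall)]:
    print(name, metric(a, a), metric(a, extra_labels(a)), metric(a, missing_labels(a)))

IoU drops under both perturbations, so it passes. Recall stays at 1.0 under missing labels, since every remaining concept-positive input is still covered by the neuron, so it cannot detect overly specific explanations.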


Fig 2: Overview of our missing and extra labels tests.



Results

Meta-Evaluation 1: Missing and Extra Labels Test

  • We test 18 evaluation metrics across diverse setups, and find that almost all commonly used metrics fail at least one of these simple sanity checks (marked in red in Table 1).
  • Only Correlation, F1-score, IoU, Cosine similarity and AUPRC pass both tests.


Table 1: The results of our Missing and Extra Labels tests. A metric passes if the label perturbation decreases its score at least 90% of the time in each setting.


Meta-Evaluation 2: Neurons with Known Concepts

  • We also compare metrics using neurons whose true concept is known, such as final-layer neurons.
  • In this evaluation, we test over 1 million (neuron, explanation) pairs and measure, via AUPRC, whether each metric consistently assigns higher scores to pairs with correct explanations than to pairs with incorrect ones (a toy version is sketched below).
  • The metrics that passed our sanity checks also perform best here, with Correlation scoring highest overall, followed by Cosine similarity, AUPRC, IoU and F1-score.
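As a rough sketch of how this meta-evaluation can be computed (our own construction on synthetic data; the paper uses real neurons with known concepts and real candidate explanations):

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

scores, is_correct = [], []
for _ in range(1000):
    a = rng.integers(0, 2, size=500)        # neuron with a known concept
    c_true = a.copy()                        # correct explanation
    c_wrong = rng.integers(0, 2, size=500)   # random incorrect explanation
    for c, label in [(c_true, 1), (c_wrong, 0)]:
        scores.append(np.corrcoef(a, c)[0, 1])  # metric under test: Correlation
        is_correct.append(label)

# High AUPRC means the metric ranks correct explanations above incorrect ones.
print("meta-eval AUPRC:", average_precision_score(is_correct, scores))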

Table 2: The results of our neurons with known concepts evaluation.



Cite this work


T. Oikarinen, G. Yan and T.-W. Weng, Evaluating Neuron Explanations: A Unified Framework with Sanity Checks, ICML 2025.
        
@inproceedings{oikarinen2025evaluating,
    title={Evaluating Neuron Explanations: A Unified Framework with Sanity Checks},
    author={Oikarinen, Tuomas and Yan, Ge and Weng, Tsui-Wei},
    booktitle={International Conference on Machine Learning},
    year={2025}
}
        
        
