Beyond Top Activations:
Efficient and Reliable Crowdsourced Evaluation
of Automated Interpretability

UC San Diego
CVPR 2026

Abstract

  • Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods generate text explanations of neurons, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions.
  • While crowd-sourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results.
  • We introduce two techniques to enable cost-effective and accurate crowdsourced evaluation beyond top activating inputs. First, Model-Guided Importance Sampling (MG-IS) selects the most informative inputs to show raters, reducing the number of inputs needed by ~13×. Second, Bayesian Rating Aggregation (BRAgg) addresses label noise, reducing the number of ratings per input required to overcome noise by ~3×.
  • Together, these techniques reduce evaluation cost by ~40×, making large-scale evaluation feasible. We use our methods to conduct a large scale crowd-sourced study and systematically compare recent automated interpretability methods for deep vision networks.

Overview

To assess the quality of neuron explanations, researchers often rely on crowd-sourced human studies (e.g. via Amazon Mechanical Turk). We aim to evaluate explanations using a principled metric — the correlation coefficient between neuron activations \(a_k\) and concept (explanation) presence labels \(c_t\) — rather than only checking whether an explanation matches a neuron's top activating inputs. However, computing this full-distribution metric reliably introduces two challenges that drive up cost:

  1. Challenge 1 — High Labeling Cost. Computing correlation requires concept presence labels across the entire probing dataset. Labeling a 50,000-image dataset for a single (neuron, explanation) pair costs ~$600, and evaluating a few thousand neurons can reach ~$1M.
  2. Challenge 2 — Rater Noise. Human concept judgments are noisy, and even small error rates can significantly affect correlation scores, particularly for rare concepts.

Fig 1: Overview of the explanation evaluation pipeline. We focus on two main challenges: 1: How to reduce high labeling cost? and 2: How to effectively handle noisy ratings? Our proposed solutions are MG-IS and BRAgg.


Methods

(I) Model-Guided Importance Sampling (MG-IS)

Concepts of interest are often rare in the probing dataset, so a uniform (Monte Carlo) sample is unlikely to contain enough positive examples to estimate correlation accurately. Instead, we use importance sampling, drawing inputs from a proposal distribution \(\mathcal{Q}\) that places higher probability on informative inputs. The variance-minimizing optimal proposal samples each input proportionally to its contribution to the correlation:
\[ q^{*}(x_i) \propto |\bar{a}_{ki} \cdot \bar{c}_{ti}| \]
Since the true concept labels \(c_t\) are unknown before the study, we approximate them with a cheap model (SigLIP) to guide sampling, mixing with the uniform distribution to keep the estimator unbiased:
\[ q^{\text{MG-IS}}(x) = (1-\gamma)\, q^{\text{siglip}}(x) + \gamma\, p(x), \qquad \gamma = 0.2 \]
We then apply the importance-sampling correction \(\frac{p(x)}{q(x)}\) to obtain an unbiased estimate of the correlation.

Fig 2: Comparing sampling strategies for estimating the correlation. Our MG-IS using SigLIP estimates significantly outperforms the baseline regardless of subset size, matching uniform-sampling accuracy with ~15× fewer samples.


(II) Bayes Rating Aggregation (BRAgg)

Crowd-sourced concept labels are noisy due to variability in rater attention, interpretation, and experience. To handle this, we collect \(m\) independent binary ratings per input and aggregate them. Rather than simple Averaging or Majority Vote, BRAgg estimates \([c_t]_i\) as the posterior probability that the concept is truly present given the ratings \(R_{ti}\):

\[ [c_t]_i = \mathbb{P}([c_t^{*}]_i = 1 \mid R_{ti}) = \frac{\mathbb{P}(R_{ti} \mid C_{ti})\,\mathbb{P}(C_{ti})}{\mathbb{P}(R_{ti}\mid C_{ti})\,\mathbb{P}(C_{ti}) + \mathbb{P}(R_{ti}\mid \lnot C_{ti})\,\mathbb{P}(\lnot C_{ti})} \]

The likelihood assumes each rater errs at rate \(\eta\) (estimated as ~23% on MTurk), and the prior \(\mathbb{P}(C_{ti})\) can be uniform (\(\beta=0.05\)) or initialized from a cheap SigLIP evaluator — yielding a hybrid that combines human ratings with model knowledge.


Fig 3: Comparison of rating aggregation strategies (simulated study, error rate \(\eta=23\%\)). BRAgg (SigLIP) achieves the lowest error and can reach a target accuracy with 2–10× fewer raters than the best baseline.


Validation: Combining MG-IS and BRAgg

We validate our techniques in two realistic settings with both label noise and limited samples: a simulation with 23% label noise, and a real crowd-sourced study on Amazon Mechanical Turk (10 ResNet-50 neurons, 600 inputs each, 9 raters per input, with ground-truth ImageNet labels for comparison).

(a) Simulation with 23% label noise.

(b) Mechanical Turk validation study.

Fig 4: Effect of MG-IS and BRAgg, individually and together. The best results come from using MG-IS + BRAgg together.

  • Using MG-IS + BRAgg together by far outperforms either technique alone. In the simulation, reaching the same error requires more evaluations with MG-IS only, 25× with BRAgg only, and 40× with the uniform-sampling + majority-vote baseline.
  • This ~40× cost reduction brings the cost of a systematic study down from ~$90,000 to $2,160 (just $2.16 per neuron-explanation pair).
  • SigLIP-only automated evaluation is a reliable cheap alternative (only ~15% higher error than our human study), but human evaluation remains the gold standard.

Large-Scale Crowdsourced Study

Using our methods, we conducted a large-scale study on neurons in ResNet-50 (Layer4) and ViT-B-16 (Layer11 MLP), both trained on ImageNet. We first used SigLIP-based automated evaluation to select the 5 best-performing methods, then ran a crowd-sourced study over 100 neurons per model, 180 inputs per neuron, 3 raters each — 1000 (neuron, explanation) pairs for a total cost of $2,160.

Automated Evaluation (SigLIP simulation, single-concept explanations)

Model NDMILANCDINVERTDnDMAIALE (label)LE (SigLIP)
RN-50 (Layer4) 0.1240.0920.1900.1870.1530.1400.179 0.241
ViT-B-16 (Layer11 MLP) 0.0340.0190.1850.1340.1050.1500.270 0.297

Table 1: SigLIP-based simulation with correlation scoring (\(l=1\)). Best in bold, second best underlined. Linear Explanations (LE) performs best even when restricted to single-concept explanations.


Crowdsourced Results

(a) ResNet-50 (Layer4).

(b) ViT-B-16 (Layer11 MLP).

Fig 5: Results of our MTurk study using BRAgg (SigLIP) for rating aggregation. Overall, Linear Explanations (LE) achieve the highest correlation coefficient.

  • Linear Explanations perform best overall, likely because it is the only tested method optimized to explain the entire range of activations rather than only the highest activations.
  • Recent LLM-based methods (MAIA, DnD) did not outperform simpler baselines, due to their focus on highly-activating inputs (overly specific explanations) and higher variance in description quality.
  • Overall correlation scores were quite low (~0.2 at best), highlighting the need for further improvement in explanation methods and/or more interpretable models.

Study Interface

In each task, raters are shown an explanation \(t\) and 15 inputs, and asked to select the inputs where the concept is present.

Fig 6: The crowd-sourced study interface used on Amazon Mechanical Turk.


Cite this work


T. Oikarinen, G. Yan, A. Kulkarni and T.-W. Weng, Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability, CVPR 2026.
        
@inproceedings{oikarinen2026beyond,
    title={Beyond Top Activations: Efficient and Reliable Crowdsourced
           Evaluation of Automated Interpretability},
    author={Oikarinen, Tuomas and Yan, Ge and Kulkarni, Akshay and Weng, Tsui-Wei},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
    year={2026}
    }
        
        

This webpage template was recycled from here.

Accessibility