In recent years, many methods have been developed to understand the internal workings of neural networks,
often by describing the function of individual neurons in the model. However, these methods typically only
focus on explaining the very highest activations of a neuron. In this paper we show this is not sufficient,
and that the highest activation range is responsible for only a very small percentage of the neuron's causal
effect. In addition, inputs causing lower activations are often very different and cannot be reliably
predicted from the high activations alone. We propose that neurons should instead be understood
as a linear combination of concepts, and develop an efficient method for producing these linear
explanations. In addition, we show how to automatically evaluate description quality via
simulation, i.e. predicting neuron activations on unseen inputs in the vision setting.
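To illustrate the core idea, the sketch below (not the paper's implementation) shows what a linear explanation looks like in code: a neuron's activation is approximated as a weighted sum of concept scores. The concept names, weights, and the `concept_scores` matrix are all hypothetical placeholders; in practice concept scores could come from a vision-language model such as CLIP.

```python
import numpy as np

# Minimal sketch, assuming concept presence scores are available per input.
# concept_scores[i, j] = presence of concept j in input i (illustrative values).
concepts = ["dog", "grass", "snout"]        # hypothetical concept set
weights = np.array([0.9, 0.4, 0.3])         # hypothetical learned coefficients

def predict_activation(concept_scores: np.ndarray) -> np.ndarray:
    """Predict neuron activations as a linear combination of concept scores."""
    return concept_scores @ weights

# Example: scores for two inputs over the three concepts above.
concept_scores = np.array([[0.8, 0.2, 0.7],
                           [0.1, 0.9, 0.0]])
print(predict_activation(concept_scores))   # approximate neuron activations
```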
Inspired by recent work on language models [4], we propose a new automated and efficient method to evaluate explanation quality for vision models, as shown in the chart below.
The idea is to use a Simulator (e.g. a vision-language model) to predict neuron activations on unseen inputs, based on the explanation produced by an Explainer (e.g. Linear Explanation, CLIP-Dissect [1], MILAN [2], Network Dissection [3]).
The explanations are then scored by how well the predicted activations match the actual neuron activations.
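The sketch below shows one way such scoring could be computed; it is not the official evaluation code. It assumes `simulated` holds the Simulator's predicted activations on unseen inputs (e.g. obtained by combining vision-language-model concept similarities according to the explanation) and `actual` holds the neuron's true activations on the same inputs; Pearson correlation is used here purely for illustration.

```python
import numpy as np

def simulation_score(simulated: np.ndarray, actual: np.ndarray) -> float:
    """Score an explanation by how well simulated activations track actual ones.

    Uses Pearson correlation between predicted and actual activations as an
    illustrative metric: 1.0 means the simulation tracks the neuron perfectly.
    """
    return float(np.corrcoef(simulated, actual)[0, 1])

# Hypothetical example: a good explanation yields predictions that rise and
# fall with the real activations, giving a score close to 1.
actual = np.array([0.1, 0.7, 0.3, 0.9])
simulated = np.array([0.2, 0.6, 0.35, 0.85])
print(simulation_score(simulated, actual))
```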
Cite this work
T. Oikarinen and T.-W. Weng, "Linear Explanations for Individual Neurons", ICML 2024.
@inproceedings{oikarinen2024linear,
  title={Linear Explanations for Individual Neurons},
  author={Oikarinen, Tuomas and Weng, Tsui-Wei},
  booktitle={International Conference on Machine Learning},
  year={2024}
}