In recent years, many methods have been developed to understand the internal workings of neural networks,
often by describing the function of individual neurons in the model. However, these methods typically only
focus on explaining the very highest activations of a neuron. In this paper we show this is not sufficient,
and that the highest activation range is responsible for only a very small percentage of the neuron's causal
effect. In addition, inputs causing lower activations are often very different and cannot be reliably
predicted from the high activations alone. We propose that neurons should instead be understood
as a linear combination of concepts, and develop an efficient method for producing these linear
explanations. In addition, we show how to automatically evaluate description quality via
simulation, i.e. predicting neuron activations on unseen inputs in the vision setting.
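To illustrate the core idea, the sketch below (not the paper's implementation) shows what a linear explanation looks like in code: a neuron's activation is approximated as a weighted sum of concept scores. The concept names, weights, and the `concept_scores` matrix are all hypothetical placeholders; in practice concept scores could come from a vision-language model such as CLIP.

```python
import numpy as np

# Minimal sketch, assuming concept presence scores are available per input.
# concept_scores[i, j] = presence of concept j in input i (illustrative values).
concepts = ["dog", "grass", "snout"]        # hypothetical concept set
weights = np.array([0.9, 0.4, 0.3])         # hypothetical learned coefficients

def predict_activation(concept_scores: np.ndarray) -> np.ndarray:
    """Predict neuron activations as a linear combination of concept scores."""
    return concept_scores @ weights

# Example: scores for two inputs over the three concepts above.
concept_scores = np.array([[0.8, 0.2, 0.7],
                           [0.1, 0.9, 0.0]])
print(predict_activation(concept_scores))   # approximate neuron activations
```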
Inspired by recent work on language models [4], we propose a new automated and efficient method to evaluate explanation quality for vision models, as shown in the chart below.
The idea is to use a Simulator (e.g. a vision-language model) to predict neuron activations on unseen inputs, based on the explanation produced by an Explainer (e.g. Linear Explanation, CLIP-Dissect [1], MILAN [2], Network Dissection [3]).
The explanations are then scored by how well the predicted activations match the actual neuron activations.
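The sketch below shows one way such scoring could be computed; it is not the official evaluation code. It assumes `simulated` holds the Simulator's predicted activations on unseen inputs (e.g. obtained by combining vision-language-model concept similarities according to the explanation) and `actual` holds the neuron's true activations on the same inputs; Pearson correlation is used here purely for illustration.

```python
import numpy as np

def simulation_score(simulated: np.ndarray, actual: np.ndarray) -> float:
    """Score an explanation by how well simulated activations track actual ones.

    Uses Pearson correlation between predicted and actual activations as an
    illustrative metric: 1.0 means the simulation tracks the neuron perfectly.
    """
    return float(np.corrcoef(simulated, actual)[0, 1])

# Hypothetical example: a good explanation yields predictions that rise and
# fall with the real activations, giving a score close to 1.
actual = np.array([0.1, 0.7, 0.3, 0.9])
simulated = np.array([0.2, 0.6, 0.35, 0.85])
print(simulation_score(simulated, actual))
```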
Cite this work
T. Oikarinen and T.-W. Weng, "Linear Explanations for Individual Neurons", ICML 2024.
@inproceedings{oikarinen2024linear,
  title={Linear Explanations for Individual Neurons},
  author={Oikarinen, Tuomas and Weng, Tsui-Wei},
  booktitle={International Conference on Machine Learning},
  year={2024}
}