Quantitative Evaluations
1. Final Layer Evaluation
We follow CLIP-Dissect [1] to quantitatively
analyze description quality on final-layer neurons, whose
ground-truth labels (i.e., the class names) are known, allowing
us to evaluate the quality of neuron descriptions automatically.
Our results show that DnD outperforms MILAN [2], with an
average CLIP cosine similarity higher by 0.0518, an average
mpnet cosine similarity higher by 0.18, and an average
BERTScore higher by 0.008.
Textual similarity between DnD/MILAN labels and ground truths on ResNet-50 (ImageNet).
We can see that DnD outperforms MILAN.
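To make the metric concrete, the sketch below computes the average mpnet cosine similarity between predicted neuron labels and ground-truth class names. It is a minimal illustration with placeholder label lists, not the exact evaluation script; the CLIP-similarity and BERTScore variants follow the same pattern, and the all-mpnet-base-v2 checkpoint is an assumption about the specific mpnet encoder.

    # Minimal sketch: average sentence-embedding similarity between predicted neuron
    # labels and ground-truth class names. The label lists below are placeholders.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-mpnet-base-v2")

    pred_labels = ["goldfish", "striped feline"]      # descriptions from DnD or MILAN
    gt_class_names = ["goldfish", "tiger cat"]        # ImageNet class names of the same neurons

    pred_emb = model.encode(pred_labels, convert_to_tensor=True, normalize_embeddings=True)
    gt_emb = model.encode(gt_class_names, convert_to_tensor=True, normalize_embeddings=True)

    # Cosine similarity of each label with its own ground truth, averaged over neurons.
    cos_sim = (pred_emb * gt_emb).sum(dim=1)
    print("average mpnet cosine similarity:", cos_sim.mean().item())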
2. MILANNOTATIONS
We also performed a quantitative evaluation by calculating the textual similarity between a method's label
and the corresponding MILANNOTATIONS. Our analysis found that
describing every neuron with the same constant concept,
'depictions', achieves better results than any
explanation method on this dataset, even though this is neither a useful
nor a meaningful description. The dataset is therefore unreliable as ground
truth and cannot be used to compare different methods.
Textual similarity between descriptions produced by each method and MILANNOTATIONS.
Simply labeling every neuron as "depictions" outperforms all other methods, demonstrating the unreliability
of MILANNOTATIONS as ground truth for evaluation.
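For illustration, the constant-label baseline can be scored the same way as any method. The sketch below uses the bert-score package with illustrative annotation and label strings; the actual MILANNOTATIONS loading and the specific strings are assumptions, not the evaluation code.

    # Minimal sketch: BERTScore of a constant "depictions" label vs. method labels,
    # both scored against (illustrative) MILANNOTATIONS reference strings.
    from bert_score import score

    annotations = ["dog faces and fur", "text on signs", "round green objects"]   # references (placeholders)
    method_labels = ["animal faces", "lettering", "tennis balls"]                 # labels from some method

    constant_labels = ["depictions"] * len(annotations)

    _, _, f1_constant = score(constant_labels, annotations, lang="en", rescale_with_baseline=True)
    _, _, f1_method = score(method_labels, annotations, lang="en", rescale_with_baseline=True)

    print("constant 'depictions' BERTScore F1:", f1_constant.mean().item())
    print("method BERTScore F1:", f1_method.mean().item())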
3. Crowdsourced Experiment
Our experiment compares the quality of labels produced by DnD against three
baselines: CLIP-Dissect, MILAN, and Network Dissection [3]. For both ResNet-50 and ResNet-18, we evaluated
four intermediate layers (the end of each residual block),
with 200 randomly chosen neurons per layer for ResNet-50
and 50 per layer for ResNet-18. Each neuron's description is evaluated by three different workers.
We outline specifics of the experiment below:
- Workers are presented with the top 10 highest activating images of a neuron.
- Four separate descriptions are given, each corresponding to a label produced by one of the four methods compared.
- Workers select the description that best represents the 10 highly activating images presented.
- Descriptions are rated on a 1-5 scale, where a rating of 1 means the worker "strongly disagrees" with the
given description and a rating of 5 means the worker "strongly agrees" with it.
Our results show that DnD performs over 2× better than all baseline methods when dissecting ResNet-50
and over 3× better when dissecting ResNet-18, with its description selected as the best of the four 63.21%
of the time.
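The reported numbers reduce to simple aggregation of the collected worker responses. The sketch below shows the idea on a hypothetical response table; the column names and values are illustrative and not the actual study data.

    # Minimal sketch: aggregate crowdsourced responses per method.
    # Each row is one worker's response for one neuron (values are placeholders).
    import pandas as pd

    responses = pd.DataFrame({
        "method":   ["DnD", "MILAN", "CLIP-Dissect", "Network Dissection", "DnD"],
        "rating":   [5, 3, 4, 2, 4],                       # 1-5 agreement with the description
        "selected": [True, False, False, False, True],     # chosen as best of the four descriptions
    })

    summary = responses.groupby("method").agg(
        mean_rating=("rating", "mean"),
        selected_fraction=("selected", "mean"),
    )
    print(summary)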
Results for individual layers of ResNet-50.
We observe that DnD is the best method across all layers of ResNet-50.
Results for individual layers of ResNet-18.
DnD performs significantly better than the baseline methods across every layer of ResNet-18.
4. Use Case
To showcase a potential use case for neuron descriptions,
we experimented with using them to find a good classifier for a class missing from
the training set. We search the neurons of Layer 4 of ResNet-50 (ImageNet) for those that
could serve as the best classifiers for an unseen class, specifically the classes of the CIFAR-10 and CIFAR-100 datasets.
Our setup is as follows:
- Explain all neurons in Layer 4 of ResNet-50 (ImageNet) using different methods.
- Find the neuron whose description is closest to the CIFAR class name in a text embedding space
(an ensemble of the CLIP ViT-B/16 and mpnet text encoders).
- Measure the neuron's average activation to determine how well it performs as a single-class
classifier on the CIFAR validation set, evaluated by the area under the ROC curve (AUC); a code sketch of this procedure is given after the list.
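A minimal sketch of this procedure is shown below. For brevity it uses only the mpnet text encoder rather than the CLIP/mpnet ensemble, and the descriptions, activations, and labels are placeholders rather than values actually computed from ResNet-50.

    # Minimal sketch: pick the Layer-4 neuron whose description best matches a CIFAR class
    # name, then score that neuron's average activation as a one-vs-rest classifier.
    import torch
    from sentence_transformers import SentenceTransformer, util
    from sklearn.metrics import roc_auc_score

    encoder = SentenceTransformer("all-mpnet-base-v2")

    descriptions = ["striped fur", "airplane wings", "ship hulls on water"]   # one label per neuron (placeholders)
    class_name = "airplane"                                                   # a CIFAR-10 class

    desc_emb = encoder.encode(descriptions, convert_to_tensor=True, normalize_embeddings=True)
    cls_emb = encoder.encode([class_name], convert_to_tensor=True, normalize_embeddings=True)
    best_neuron = int(util.cos_sim(cls_emb, desc_emb).argmax())

    # Spatially averaged activation of the chosen neuron on each validation image,
    # and a binary label marking whether the image belongs to `class_name` (placeholders).
    activations = torch.tensor([0.2, 1.7, 0.4, 2.1])
    labels = torch.tensor([0, 1, 0, 1])

    auc = roc_auc_score(labels.numpy(), activations.numpy())
    print(f"neuron {best_neuron} as a '{class_name}' classifier: AUC = {auc:.3f}")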
Average classification AUC on out-of-distribution datasets
when using neurons with similar descriptions as classifiers.
We can see that DnD clearly outperforms MILAN, the only
other generative description method.
5. Ablation Studies
For further insight, we conduct comprehensive ablation studies analyzing the importance of each step in the DnD pipeline.
We perform the following experiments with additional details shown here:
- Attention Cropping Ablation (Step 1)
- Image Captioning with Fixed Concept Sets (Step 2)
- Image-to-Text Model Ablation (Step 2)
- Effects of GPT Summarization (Step 2)
- Effects of Concept Selection (Step 3)