Interpretability-Guided Test-Time Adversarial Defense

Akshay Kulkarni, Tsui-Wei Weng
UC San Diego
ECCV 2024

Abstract

We propose a novel and low-cost test-time defense against adversarial examples by devising interpretability-guided neuron importance ranking methods to identify neurons important to the output classes. Our method is training-free and can significantly improve the robustness-accuracy tradeoff with minimal computational overhead. While being among the most efficient test-time defenses (4x faster), our method is also robust to a wide range of black-box, white-box, and adaptive attacks that break previous test-time defenses. We demonstrate the efficacy of our method for CIFAR10, CIFAR100, and ImageNet-1k on the standard RobustBench benchmark (with average gains of 2.6%, 4.9%, and 2.8% respectively). We also show improvements (average 1.5%) over the state-of-the-art test-time defenses even under strong adaptive attacks.


Analysis

We analyze adversarial attacks with neuron-level interpretability. Specifically, we observe the average change in activations for neurons important to each class before and after an adversarial attack. As shown in the figure below, we find that:

  • Successful adversarial attacks boost the activations of neurons important to the post-attack predicted class, while causing a drop for those important to the ground-truth (GT) class.
  • The important activations of the remaining classes, as well as the unimportant activations, are only marginally affected by successful attacks.
  • Unsuccessful adversarial attacks cause a drop in the activations of all neurons.

Figure 1. Analysis of adversarial attacks through the lens of neuron-interpretability

Based on these observations, we hypothesize that adversarial robustness can be improved by restricting the shift in activations of neurons important to the non-GT classes.
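
The measurement underlying this analysis can be sketched as follows. This is a minimal PyTorch sketch; the tensor names and shapes (clean_acts, adv_acts, importance_mask) are illustrative assumptions rather than our exact implementation.

    import torch

    def mean_activation_shift(clean_acts, adv_acts, importance_mask, cls):
        """Average activation change of the neurons important to class `cls`.

        clean_acts, adv_acts: (B, N) activations of the probed layer on clean and
            adversarial versions of the same inputs.
        importance_mask: (N, C) binary mask of top-k important neurons per class.
        """
        idx = importance_mask[:, cls].bool()           # neurons important to `cls`
        shift = adv_acts[:, idx] - clean_acts[:, idx]  # per-neuron activation change
        return shift.mean().item()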


Method

Interpretability-Guided Masking

Based on this hypothesis, we design a simple test-time adversarial defense that masks all neurons except those important to the correct class. Our approach consists of three steps:

Figure 2. Overview of IG-Defense


    Step 1: Neuron Importance Ranking
    For a given layer with \(N\) neurons in a given base model, we rank each neuron based on its importance to each of the \(C\) classes. A binary mask \(m\in \{0, 1\}^{N\times C}\) of the top-\(k\) important neurons per class is obtained as shown above. We propose two neuron importance ranking methods (Fig. 3), inspired by neuron-interpretability tools:

      A. Leave-one-Out Importance Ranking (LO-IR)

      • Following our analysis and NetDissect [2], we compute the importance of neuron \(j\) to class \(i\) as the change in the class-\(i\) logit before and after masking out neuron \(j\), averaged over the training data.
      • Intuitively, a larger logit change for a particular class implies that the network depends more heavily on that neuron, i.e., that the neuron is more important to that class.
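
      A minimal sketch of LO-IR scoring is given below; `masked_forward` is an assumed helper (not from our codebase) that runs the base model while multiplying the probed layer's activations by a given neuron mask.

      import torch

      @torch.no_grad()
      def lo_ir_scores(masked_forward, loader, num_neurons, num_classes, device="cuda"):
          """LO-IR sketch: change in class logits when a single neuron is masked
          out, averaged over the training data. Returns an (N, C) score matrix.

          masked_forward(x, neuron_mask): assumed helper that runs the base model
          while multiplying the probed layer's N activations by `neuron_mask`.
          """
          scores = torch.zeros(num_neurons, num_classes, device=device)
          keep_all = torch.ones(num_neurons, device=device)
          seen = 0
          for x, _ in loader:
              x = x.to(device)
              base_logits = masked_forward(x, keep_all)          # (B, C) unmasked logits
              for j in range(num_neurons):
                  mask = keep_all.clone()
                  mask[j] = 0.0                                  # knock out neuron j only
                  drop = base_logits - masked_forward(x, mask)   # (B, C) logit change
                  scores[j] += drop.sum(dim=0)
              seen += x.size(0)
          return scores / seen                                   # average over the training data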

      B. CLIP-Dissect Importance Ranking (CD-IR)

      • CLIP-Dissect [3] uses the multimodal CLIP model [4] to assign concept labels to individual neurons.
      • We extend this idea by replacing concept names with class names and obtain a class-wise neuron importance ranking.
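
      The following sketch illustrates the CD-IR idea and the construction of the top-\(k\) mask \(m\). It assumes the OpenAI CLIP package and uses plain correlation between a neuron's activation pattern and the CLIP image-text similarity of each class name as a simplified stand-in for CLIP-Dissect's similarity function; all tensor names are illustrative.

      import torch
      import torch.nn.functional as F
      import clip  # OpenAI CLIP package; an assumption of this sketch

      @torch.no_grad()
      def cd_ir_scores(neuron_acts, probe_images, class_names, device="cuda"):
          """CD-IR sketch: score each neuron by how well its activation pattern over
          a probing set matches the CLIP image-text similarity of each class name.

          neuron_acts:  (P, N) activations of the probed layer on P probing images.
          probe_images: (P, 3, H, W) CLIP-preprocessed probing images.
          Returns an (N, C) score matrix (higher = more important to that class).
          """
          model, _ = clip.load("ViT-B/32", device=device)
          txt = F.normalize(model.encode_text(clip.tokenize(class_names).to(device)).float(), dim=-1)
          img = F.normalize(model.encode_image(probe_images.to(device)).float(), dim=-1)
          sim = img @ txt.T                                 # (P, C) image-class similarity
          a = neuron_acts.to(device).float()
          a = (a - a.mean(0)) / (a.std(0) + 1e-8)           # standardize each neuron's profile
          s = (sim - sim.mean(0)) / (sim.std(0) + 1e-8)     # standardize each class's profile
          return (a.T @ s) / a.shape[0]                     # (N, C) correlation scores

      def topk_mask(scores, k):
          """Binary mask m in {0,1}^{N x C} keeping the top-k important neurons per class."""
          mask = torch.zeros_like(scores)
          mask.scatter_(0, scores.topk(k, dim=0).indices, 1.0)
          return mask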

    Step 2: Vanilla Forward Pass
    We obtain a soft pseudo-label \(\hat{y}\) (the softmax output of the base model) by performing a standard forward pass.

    Step 3: Masked Forward Pass
    We apply the soft-pseudo-label-weighted mask \(m\hat{y}\) to the activations of the masked layer and obtain the final prediction.
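
    Steps 2 and 3 can be sketched as below, assuming the base model is split into a `backbone` (up to the masked layer) and a `head` (the remaining layers); both names, and the reuse of the features across the two passes, are simplifying assumptions of this sketch.

    import torch

    @torch.no_grad()
    def ig_defense_predict(backbone, head, x, mask):
        """IG-Defense inference sketch (Steps 2-3).

        backbone(x): assumed to return the (B, N) activations of the masked layer.
        head(feats): assumed to map those activations to (B, C) logits.
        mask: (N, C) binary top-k importance mask from Step 1.
        """
        feats = backbone(x)
        y_hat = head(feats).softmax(dim=-1)   # Step 2: soft pseudo-label from the vanilla pass
        weights = y_hat @ mask.T              # (B, N) soft-pseudo-label-weighted mask m*y_hat
        return head(feats * weights)          # Step 3: masked forward pass -> final prediction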

Randomized Smoothing. If the soft pseudo-label is incorrect, masking yields no gains in robustness. To avoid this and obtain the soft pseudo-label in a more robust manner, we use randomized smoothing [1] in the first forward pass (Step 2 above).
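
A minimal sketch of this smoothed pseudo-label computation, which can stand in for Step 2 in the sketch above; the noise level `sigma` and the number of noisy copies `n_samples` are illustrative values, not the settings used in the paper.

    import torch

    @torch.no_grad()
    def smoothed_pseudo_label(model, x, sigma=0.25, n_samples=8):
        """Average the softmax outputs over Gaussian-perturbed copies of the input
        to obtain a more robust soft pseudo-label (randomized smoothing [1])."""
        probs = 0.0
        for _ in range(n_samples):
            probs = probs + model(x + sigma * torch.randn_like(x)).softmax(dim=-1)
        return probs / n_samples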

Figure 3. Neuron Importance Ranking Methods


Experiments

We compare against existing test-time adversarial defenses such as HedgeDefense (HD) [5], SODEF [6], and CAAA [7] under image-wise worst-case (IW-WC) adaptive attacks [8], including AutoAttack, RayS, and their transfer and EoT variants. Please refer to Sec. 5 of our paper for complete details and additional results. Overall, our proposed IG-Defense obtains consistent improvements in IW-WC robust accuracy, unlike existing test-time defenses.

Table 1. Comparison of our IG-Defense with existing test-time adversarial defenses. Numbers in green/red indicate the gain/drop in image-wise worst-case robust accuracy relative to the base model without any test-time defense.


Conclusion

We presented IG-Defense, a training-free and low-cost test-time defense that uses interpretability-guided neuron importance ranking (LO-IR and CD-IR) to mask neurons that are unimportant to the predicted class. IG-Defense consistently improves the robustness-accuracy tradeoff on CIFAR10, CIFAR100, and ImageNet-1k with minimal computational overhead, and it remains effective under a wide range of black-box, white-box, and adaptive attacks.

Cite this work

A. Kulkarni and T.-W. Weng, Interpretability-Guided Test-Time Adversarial Defense, ECCV 2024.
@inproceedings{kulkarni2024igdefense,
    title={Interpretability-Guided Test-Time Adversarial Defense},
    author={Kulkarni, Akshay and Weng, Tsui-Wei},
    booktitle={European Conference on Computer Vision},
    year={2024}
}

