AND: Audio Network Dissection for Interpreting Deep Acoustic Models

1 National Taiwan University  2 University of California San Diego

* indicates equal contribution.
ICML 2024

Abstract

Neuron-level interpretations aim to explain network behaviors and properties by investigating neurons responsive to specific perceptual or structural input patterns. Although such work is emerging in the vision and language domains, none has been explored for acoustic models. To bridge the gap, we introduce AND, the first Audio Network Dissection framework that automatically establishes natural language explanations of acoustic neurons based on highly-responsive audio. AND features the use of LLMs to summarize mutual acoustic features and identities among audio clips. Extensive experiments are conducted to verify AND's precise and informative descriptions. In addition, we demonstrate a potential use of AND for audio machine unlearning by conducting concept-specific pruning based on the generated descriptions. Finally, we highlight two acoustic model behaviors revealed by AND's analysis: (i) models discriminate audio with a combination of basic acoustic features rather than high-level abstract concepts; (ii) training strategies affect model behaviors and neuron interpretability -- supervised training guides neurons to gradually narrow their attention, while self-supervised learning encourages neurons to be polysemantic for exploring high-level features.


Method

Overview of the Audio Network Dissection pipeline. Our framework uses SALMONN to produce a caption for each audio clip in the probing dataset. For each neuron, we select the top-K highly-activated audio samples together with their descriptions as one of the inputs of AND. To identify the common characteristics among these top-K samples, we adopt Llama-2-chat-13B to summarize their descriptions. We highlight three dedicated modules in AND: (A) closed-concept identification, (B) summary calibration, and (C) open-concept identification.

Fig 1. Overall Pipeline of AND.


Module A: Closed-concept Identification
The module takes (1) a pre-defined concept set, (2) an audio description set, and (3) an activation vector as inputs. (1) is the vocabulary of candidate concepts that users can define. (2) is the set of descriptions generated by SALMONN, one per audio clip. (3) is a vector whose entries are the target neuron's activation values on each audio clip. We present three approaches to implement closed-concept identification:

  1. Description-based (DB): Identify the best-matched concept via the similarity between the concept embedding and the audio description embedding, both encoded by CLIP's text encoder (a minimal sketch follows this list).

  2. Text-audio-based (TAB): Similar to DB, but using CLAP to encode the audio features and text features. This variant is analogous to CLIP-Dissect [1].

  3. In-context learning (ICL): We prompt Llama-2-chat-13B with in-context examples to select, from the given concept set, the concept that best matches the descriptions of the top-K highly-activated audio samples.
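For concreteness, below is a minimal Python sketch of the description-based (DB) variant. It assumes a simple scoring scheme, namely activation-weighted similarity between concept embeddings and description embeddings from CLIP's text encoder; the exact scoring function used by AND may differ, and the checkpoint names are standard Hugging Face models chosen for illustration.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Illustrative sketch only: score each candidate concept by the
    # activation-weighted similarity between its CLIP text embedding and the
    # CLIP text embeddings of the audio descriptions, then pick the best one.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def encode_text(texts):
        inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            feats = model.get_text_features(**inputs)
        return torch.nn.functional.normalize(feats, dim=-1)

    def closed_concept_db(concepts, descriptions, activations):
        # concepts: list[str]; descriptions: one caption per probing clip;
        # activations: 1-D tensor of the target neuron's activation on each clip.
        c = encode_text(concepts)                 # (num_concepts, dim)
        d = encode_text(descriptions)             # (num_clips, dim)
        sim = c @ d.T                             # concept-description similarity
        a = torch.nn.functional.normalize(activations.float(), dim=0)
        scores = sim @ a                          # weight by neuron activations
        return concepts[int(scores.argmax())]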

Fig 2. Module A in AND

Module B: Summary Calibration
The module takes summaries of the top-K highly- and lowly-activated audio descriptions as inputs, and aims to remove spurious concepts from the highly-activated summary. We achieve this by computing the text similarity between the two summaries and removing concepts that appear in both.
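A minimal sketch of one way to implement this calibration, assuming sentence-level filtering with an off-the-shelf sentence encoder; the similarity model and threshold below are illustrative choices, not necessarily AND's.

    from sentence_transformers import SentenceTransformer, util

    # Sketch: drop any sentence in the highly-activated summary that is too
    # similar to some sentence in the lowly-activated summary, treating such
    # overlapping content as spurious.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def calibrate_summary(high_sentences, low_sentences, threshold=0.7):
        high_emb = encoder.encode(high_sentences, convert_to_tensor=True)
        low_emb = encoder.encode(low_sentences, convert_to_tensor=True)
        sim = util.cos_sim(high_emb, low_emb)        # (|high|, |low|)
        keep = sim.max(dim=1).values < threshold     # keep sentences unique to "high"
        return [s for s, k in zip(high_sentences, keep) if k]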

Module C: Open-concept Identification
The module takes the calibrated summary as input and extracts concepts from its sentences using POS tagging and non-acoustic word filtering.
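A possible implementation of this step with spaCy is sketched below; the non-acoustic stop list here is a small illustrative sample rather than AND's actual filter.

    import spacy

    # Sketch: extract noun phrases and adjectives from the calibrated summary
    # as candidate concepts, then filter out generic, non-acoustic words.
    nlp = spacy.load("en_core_web_sm")            # requires the small English model
    NON_ACOUSTIC = {"audio", "clip", "recording", "background", "sample"}

    def extract_open_concepts(calibrated_summary):
        doc = nlp(calibrated_summary)
        concepts = set()
        for chunk in doc.noun_chunks:             # noun phrases as candidates
            if chunk.root.lemma_.lower() not in NON_ACOUSTIC:
                concepts.add(chunk.text.lower())
        for token in doc:                         # adjectives describing acoustic properties
            if token.pos_ == "ADJ":
                concepts.add(token.lemma_.lower())
        return sorted(concepts)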

Fig 3. Module B and C in AND


Experiment Results

Experimental Settings

We adopt ESC50 as our probing dataset and consider the Audio Spectrogram Transformer (AST) and BEATs as the target networks. Note that BEATs is a self-supervised learning (SSL) audio model pre-trained with masked audio modeling, which allows us to use it either by fine-tuning the whole model (denoted as BEATs-finetuned) or by training only the last linear layer (denoted as BEATs-frozen). We adopt both versions in our experiments.

Table 1. Training settings of acoustic models used in our experiments.


Qualitative Evaluations

Below, we show an example output of AND's module B with a randomly selected neuron in AST. Additional results are available in Neuron Examples.


Fig 4. The pipeline of AND's module B. Text spans with the same color refer to the same concept or property. In this example, the property "no background noise" is removed during calibration because it appears in both summaries.


Quantitative Evaluations

1. Final Layer Evaluation

We follow CLIP-Dissect [1] to assess dissection accuracy on the last-layer neurons. Since neurons in the last layer have inherent labels, we can evaluate the descriptions produced by AND against them. As AND is the first work to describe the roles of hidden neurons in audio networks, we conduct the experiments within AND's module A (closed-concept identification), which contains three methods to explain a neuron within a given concept set (i.e., the class names of the probing dataset):

Table 2. Last layer network dissection accuracy of AST, BEATs-frozen, and BEATs-finetuned on the ESC50 dataset, with the highest performance marked in bold.
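For reference, a minimal sketch of the accuracy computation, under the assumption of exact string matching between the predicted concept and the neuron's class label (the actual matching criterion may be more lenient):

    def last_layer_accuracy(predicted_concepts, class_names):
        # predicted_concepts[i]: concept assigned to the i-th output neuron;
        # class_names[i]: that neuron's inherent label in the probing dataset.
        hits = sum(p.strip().lower() == c.strip().lower()
                   for p, c in zip(predicted_concepts, class_names))
        return hits / len(class_names)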



2. Human Evaluation

We also conduct two author-based evaluation experiments. First, we rate the quality of AND's descriptions on a scale of 1 to 5, with 1 being strongly disagree and 5 being strongly agree, in response to the question, “Does the given description accurately describe most of these audio clips?”. Second, following MILAN [2], we write down the shared properties of highly-activated audio samples and evaluate the descriptions generated by AND through semantic similarity. Results show that the calibrated summaries from module B achieve the highest preference and similarity.

Table 3. Results of human evaluation to measure AND’s capability of dissecting middle-layer neurons. Rating is the mean of scores (1-5) across neurons. Cos similarity and BERTScore are computed between AND’s descriptions and human’s written descriptions.
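The similarity metrics in Table 3 can be computed along the following lines. This sketch assumes one human-written and one AND-generated description per neuron; the sentence encoder is an illustrative choice.

    from sentence_transformers import SentenceTransformer, util
    from bert_score import score as bert_score

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def similarity_metrics(and_descriptions, human_descriptions):
        # Cosine similarity between paired sentence embeddings.
        a = encoder.encode(and_descriptions, convert_to_tensor=True)
        h = encoder.encode(human_descriptions, convert_to_tensor=True)
        cos = util.cos_sim(a, h).diagonal().mean().item()
        # BERTScore F1 between the paired descriptions.
        _, _, f1 = bert_score(and_descriptions, human_descriptions, lang="en")
        return {"cos_similarity": cos, "bertscore_f1": f1.mean().item()}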



Use Case: Audio Machine Unlearning

1. Macro Observation

After identifying the representative concepts for each neuron, we can observe how the model's confidence on samples in the ESC50 test set changes after neuron ablation. Ideally, if the predicted concepts are well aligned with the neurons' interests, the confidence in predicting related samples will decrease after pruning. We provide results using TAB, DB, and Open-concept Pruning (OCP) in the following table. Results show that OCP is the most effective for middle-layer concept pruning.

Table 4. Averaged change of confidence after neuron ablation, with each class being the target concept. Avg, ∆A, and ∆R refer to the averaged pruned numbers of neurons, confidence change on ablating class samples, and confidence change on remaining class samples, respectively.
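For intuition, below is a sketch of how such neuron ablation can be carried out with PyTorch forward hooks; the output layout and the assumption that the model directly returns class logits are illustrative, not AND's exact implementation.

    import torch

    # Sketch: zero out selected neurons of a transformer block at inference
    # time and measure how the mean confidence on a target class changes.
    def ablate_neurons(block, neuron_indices):
        def hook(module, inputs, output):
            output[..., neuron_indices] = 0.0     # silence the chosen neurons
            return output
        return block.register_forward_hook(hook)

    @torch.no_grad()
    def confidence_change(model, block, neuron_indices, batch, target_class):
        before = model(batch).softmax(dim=-1)[:, target_class].mean()
        handle = ablate_neurons(block, neuron_indices)
        after = model(batch).softmax(dim=-1)[:, target_class].mean()
        handle.remove()
        return (after - before).item()            # negative = confidence dropped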



2. Micro Observation

To delve into the use case, we provide an example in the following figure. When we prune neurons related to “water drops” in BEATs-finetuned with the OCP strategy, the classification abilities on water-related concepts (with class names labeled in red), such as “toilet flush” and “pouring water”, are significantly impacted, while unrelated concepts are much less affected.

Fig 5. Change of model confidence when neurons associated with “water drops” are ablated. Confidence in recognizing water-related audio (with class names labeled in red) decreases, while other sounds are not significantly affected.



Findings: Training Strategy Affects Neuron Interpretability

We say that a neuron is “uninterpretable” if there are no shared properties among its highly-activated audio descriptions. To quantify this, we train a K-means model to cluster audio descriptions. A description is related to a cluster if at least one of its sentences belongs to that cluster, and two descriptions are considered to share properties when they relate to the same cluster. Results are illustrated in the following figure. AST shows a decrease in the percentage of uninterpretable neurons from shallow to deep transformer blocks: neurons in shallow layers attend to more diverse content and gradually concentrate on certain concepts in deeper layers, with more overlapping information among their highly-activated audio, to conduct the classification task. In contrast, BEATs-frozen has a consistent percentage of uninterpretable neurons across all layers, which may be attributed to the effect of SSL pre-training.

Fig 6. Percentage of uninterpretable neurons in different transformer blocks of AST, BEATs-finetuned, and BEATs-frozen.
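A sketch of this interpretability test is given below, under the assumption that a neuron counts as uninterpretable when its top-K descriptions share no description cluster; the sentence encoder and the number of clusters are illustrative choices.

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def fit_description_clusters(all_sentences, n_clusters=50):
        # Cluster sentence embeddings from every description in the probing set.
        return KMeans(n_clusters=n_clusters, n_init=10).fit(encoder.encode(all_sentences))

    def is_uninterpretable(kmeans, descriptions):
        # descriptions: one list of sentences per top-K activated audio clip.
        clusters_per_desc = [
            set(kmeans.predict(encoder.encode(sentences))) for sentences in descriptions
        ]
        shared = set.intersection(*clusters_per_desc)
        return len(shared) == 0                   # no cluster common to all descriptions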


Conclusion

In sum, the key takeaways are:

  1. We introduce the first Audio Network Dissection framework named AND.

  2. AND provides both open-vocabulary concepts and generative natural language explanations of acoustic neurons based on LLMs, and can seamlessly adopt more advanced LLMs in the future.

  3. AND showcases the potential use-case for audio machine unlearning by conducting concept-specific pruning.


Cite this work

T.-Y. Wu, Y.-X. Lin, and T.-W. Weng, AND: Audio Network Dissection for Interpreting Deep Acoustic Models, ICML 2024.
            
    @inproceedings{AND,
        title={AND: Audio Network Dissection for Interpreting Deep Acoustic Models},
        author={Tung-Yu Wu and Yu-Xiang Lin and Tsui-Wei Weng},
        booktitle={Proceedings of International Conference on Machine Learning (ICML)},
        year={2024}
    }
            
            

This webpage template was recycled from LION and DnD.
