Effective Skill Unlearning through Intervention and Abstention
Abstract
In this work, we first observe that the pre-activation distribution of neurons in an LLM's feed-forward layers (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries by a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via intervention and abstention:
- Neuron Adjust unlearns a skill by shifting selected FFL neurons' pre-activation values from the forgetting-skill distribution to the retaining-skill distribution.
- Key Space Detection unlearns a skill by abstaining from generating output whenever the model's inference-time key vector falls inside the forgetting-skill hypercube.
We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages.
The results demonstrate their strong unlearning capabilities for the designated skills. Specifically,
Key Space Detection achieves
over 80% relative performance drop on the forgetting skill and less than 10% relative performance drop on other skills and the model's
general knowledge for most unlearning tasks.
Figure 1. An overview of the (a) Neuron Adjust and (b) Key Space Detection methods.
Motivation
Past works have shown that neurons represent different concepts, both in vision models and language models [1][2]. Pruning neurons (setting their values to 0) that represent unwanted concepts or skills could change the model's behavior [3][4].
However, recent works also show that neurons exhibit polysemanticity [5], with some being expressible as a linear combination of concepts [6], which means pruning those neurons can harm the model's overall capabilities.
In this work, we observe that neurons exhibit different value distributions when the model demonstrates various skills. Additionally, we find that key vectors in the model's feed-forward layers (FFL) tend to cluster when the model performs different skills, allowing them to be bounded within a hypercube.
Building on these observations, we propose two machine skill unlearning methods: Neuron Adjust and Key Space Detection, which achieve better performance in unlearning specific skills while preserving the model's overall capabilities.
Method 1: Neuron Adjust
Neuron Adjust consists of two steps:
- Distribution Modeling: We probe the model with the forgetting and retaining datasets, record each token's pre-activation value on each neuron, and model each neuron's value distribution as a Gaussian.
- Inference-time Neuron Adjust: At inference time, we adjust the top k% of neurons with the largest difference between the two distributions. If a neuron's pre-activation value is more likely to have been sampled from the forgetting distribution, we probabilistically shift its value toward the retaining distribution (see the sketch after Figure 2).
Figure 2. The overview of the Neuron Adjust method.
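To make the two steps concrete, here is a minimal NumPy sketch of Neuron Adjust. It assumes pre-activation values have already been recorded for every token and neuron while probing with the forgetting and retaining datasets; the specific shifting rule (resampling from the retaining Gaussian with probability given by the log-likelihood ratio) is our illustrative choice rather than the paper's exact formulation, and the class and parameter names are hypothetical.

import numpy as np

class NeuronAdjust:
    """Illustrative sketch of Neuron Adjust, not the authors' reference code."""

    def __init__(self, forget_preacts, retain_preacts, top_k_ratio=0.05):
        # *_preacts: (num_tokens, num_neurons) pre-activation values recorded
        # on one FFL while probing with the forgetting / retaining datasets.
        self.mu_f, self.sd_f = forget_preacts.mean(0), forget_preacts.std(0) + 1e-6
        self.mu_r, self.sd_r = retain_preacts.mean(0), retain_preacts.std(0) + 1e-6
        # Adjust only the top k% of neurons with the largest distributional gap.
        gap = np.abs(self.mu_f - self.mu_r) / (self.sd_f + self.sd_r)
        k = max(1, int(top_k_ratio * gap.size))
        self.selected = np.argsort(gap)[-k:]

    @staticmethod
    def _log_pdf(x, mu, sd):
        # Log-density of a Gaussian, up to an additive constant.
        return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)

    def adjust(self, preact, rng=np.random):
        # preact: (num_neurons,) pre-activation vector for the current token.
        out = preact.copy()
        i = self.selected
        lp_f = self._log_pdf(preact[i], self.mu_f[i], self.sd_f[i])
        lp_r = self._log_pdf(preact[i], self.mu_r[i], self.sd_r[i])
        # Where the value looks more like the forgetting distribution, resample
        # it from the retaining Gaussian with probability sigmoid(lp_f - lp_r).
        p_shift = 1.0 / (1.0 + np.exp(lp_r - lp_f))
        shift = (lp_f > lp_r) & (rng.random(i.size) < p_shift)
        out[i] = np.where(shift, rng.normal(self.mu_r[i], self.sd_r[i]), out[i])
        return out

In a real deployment this would be applied as a forward hook on each FFL's pre-activation output; storing only per-neuron means and standard deviations keeps the intervention training-free and cheap at inference time.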
Method 2: Key Space Detection
Key Space Detection consists of two steps:
- Hypercube creation: We probe the model with the forgetting dataset and record the element-wise mean and standard deviation of each FFL's key vectors. From these two statistics vectors, we create a forgetting hypercube that bounds the forgetting key vectors.
- Inference-time Abstention: At inference time, if a key vector falls within the forgetting hypercube, the model abstains from producing an answer (see the sketch after Figure 3).
Figure 3. The overview of the Key Space Detection method.
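Below is a minimal sketch of the hypercube check for a single feed-forward layer, assuming its key vectors for the forgetting dataset have already been recorded. The width multiplier alpha that turns the per-dimension standard deviation into a cube half-width is an assumed hyperparameter, and the class name and example abstention message are hypothetical rather than taken from the paper.

import numpy as np

class KeySpaceDetector:
    """Illustrative sketch of Key Space Detection, not the authors' reference code."""

    def __init__(self, forget_keys, alpha=3.0):
        # forget_keys: (num_tokens, d) key vectors recorded at one FFL while
        # probing the model with the forgetting dataset.
        self.mu = forget_keys.mean(axis=0)
        self.sd = forget_keys.std(axis=0)
        # Forgetting hypercube: element-wise mean +/- alpha * std
        # (alpha is an assumed width hyperparameter).
        self.lower = self.mu - alpha * self.sd
        self.upper = self.mu + alpha * self.sd

    def should_abstain(self, key):
        # key: (d,) key vector observed at inference time.
        # Abstain when every coordinate lies inside the forgetting hypercube.
        return bool(np.all((key >= self.lower) & (key <= self.upper)))

# Hypothetical inference-time use:
#   if detector.should_abstain(current_key):
#       return "Sorry, I cannot help with this request."

Since the check is a pair of element-wise comparisons, it costs O(d) per token and never modifies the model's weights.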
Experiments
We evaluate our methods on two tasks:
Forget Math/Code skill
- Math dataset: GSM8K
- Code dataset: MBPP
- General Knowledge dataset: MMLU
Figure 4 shows the performance of different methods:
- On the horizontal axis, NA, SP, and KSD stand for Neuron Adjust (ratio), Selective Pruning (ratio), and Key Space Detection, respectively.
- The vertical axis represents the model's performance relative to the original model after applying each unlearning method.
We tested gemma-2B, llama-2-7B, llama-3-8B, and llama-3-70B. The results show that our methods effectively unlearn the forgetting skill while retaining the model's overall knowledge. Specifically, KSD achieves >80% unlearning quality with minimal (<10%) performance degradation on other skills.
Figure 4. Performance of Neuron Adjust and Key Space Detection on the Math/Code skill unlearning task.
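For reference, the quantities plotted on the vertical axis and quoted above can be read as the following ratios (our interpretation of the figures, not an official metric definition):

def relative_performance(acc_after, acc_before):
    # Accuracy after unlearning as a fraction of the original accuracy.
    return acc_after / acc_before

def relative_performance_drop(acc_after, acc_before):
    # The ">80% unlearning quality" corresponds to a drop above 0.8 on the
    # forgetting skill; "<10% degradation" is a drop below 0.1 elsewhere.
    return 1.0 - relative_performance(acc_after, acc_before)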
Forget a specific language comprehension skill
- Language dataset: MLQA
- General Knowledge dataset: MMLU
Figure 5 shows the performance of different methods:
- The i-th row shows the model's performance drop on each language after unlearning the i-th language.
- On most languages, KSD achieves >60% unlearning quality with <12% performance degradation on the other language skills.

Figure 5. Results of unlearning one language in the MLQA dataset while retaining the others, using Neuron Adjust and Key Space Detection on llama-3-8B. (English (en), Spanish (es), Hindi (hi), German (de), Chinese (zh), Vietnamese (vi), and Arabic (ar).)
Related Works
[1] Bau et al., "Understanding the Role of Individual Units in a Deep Neural Network", PNAS 2020.
[2] Bills et al., "Language Models Can Explain Neurons in Language Models", 2023.
[3] Wu et al., "DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models", EMNLP 2023.
[4] Pochinkov et al., "Dissecting Language Models: Machine Unlearning via Selective Pruning", arXiv.
[5] Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", Transformer Circuits Thread, 2023.
[6] Oikarinen and Weng, "Linear Explanations for Individual Neurons", ICML 2024.
Cite this work
Y. Li, C. Sun, and T.-W. Weng, Effective Skill Unlearning through Intervention and Abstention, NAACL 2025
@inproceedings{Li2025effective,
  title={Effective Skill Unlearning through Intervention and Abstention},
  author={Li, Yongce and Sun, Chung-En and Weng, Tsui-Wei},
  booktitle={NAACL},
  year={2025},
}