Concept Bottleneck Large Language Models

Chung-En, Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei (Lily) Weng

UCSD

ICLR 2025

Abstract

We introduce Concept Bottleneck Large Language Model (CB-LLM), a novel framework for building *inherently interpretable* LLMs to ensure saftey, reliability, and trustworthiness of LLMs. Unlike traditional black-box models, CB-LLM offers built-in interpretability, scalability, and clear explanations.

We applied CB-LLM to two key NLP tasks:

Text Classification: CB-LLM is competitive with (and sometimes outperforms) traditional black-box models while providing explicit and interpretable reasoning.
Text Generation: CB-LLM is the 1st framework to incorporate intrinsically-interpretable neurons, enabling accurate concept detection, controllable model behavior, and ultimately contributing to more transparent and safer LLM outputs.

Motivation

Recent works have successfully applied concept bottlenecks to create interpretable models for image classification[1][2][3]. However, the development of interpretable models for NLP remains limited, with few studies exploring inherently interpretable language models, mostly restricted to small-scale text classification tasks[4][5].

Our study aims to address these limitations by improving the interpretability of LLMs in text classification and, more challengingly, developing a generative LLM with interpretable features. By integrating interpretability, we can enhance the reliability of LLMs, enabling humans to understand the underlying reasoning behind outputs and tracing how individual neurons contribute to their generation.

Our method contain two parts: CB-LLM (classification) and CB-LLM (generation)

Part I: CB-LLM (classification)

Training pipeline

The overview of CB-LLM for text classification. The pipeline consists of five steps: (1) Generate concept set via querying ChatGPT. (2) Automatically label the samples with sentence embedding models. (3) Fix the incorrect concept labels. (4) Train backbone LLM and CBL with the concept labels. (5) Train a linear layer on top of the CBL to make the class predictions.

Our approach consists of five steps:

Concept Generation: We use ChatGPT to generate a concept set for a given text classification dataset.
Automatic Concept Scoring (ACS): We utilize sentence embedding models to provide pseudo concept labels for each text sample.
Automatic Concept Correction (ACC) We refine the concept scores by filtering out negative scores and ensuring concept-class alignment.
Training the Concept Bottleneck Layer (CBL): The CBL is trained to learn interpretable embeddings by maximizing similarity between neuron activations and concept scores, effectively mapping each neuron to a specific concept.
Training the Linear Layer: A final linear layer is trained on top of the interpretable neurons with sparsity constraints, enabling transparent predictions.

Accuracy

Test accuracy of CB-LLM. CB-LLMs are competitive with the black-box model after applying ACC. Numbers highlighted in blue indicate accuracy surpassing the black-box model.

CB-LLMs w/ ACC achieve high accuracy across various datasets, often nearing or exceeding the standard black-box model's performance. The ACC strategy in Step 3, effectively corrects concept scores and enhances learning.

Human studies

We evaluate the faithfulness of CB-LLMs' interpretability through two human study tasks via MTurk:

Task1 --- Activation Faithfulness: assessing neuron activations' alignment with learned concepts.

Human evaluation results for Task 1. The higher rating of CB-LLM suggests that the neurons in CB-LLMs are reasonably interpretable to humans.

Task2 --- Contribution Faithfulness: comparing explanation quality based on neuron contributions to predictions.

Human evaluation results for Task 2. Results show that CB-LLMs provide good explanations to the predictions.

Use case: Concept Unlearning

CB-LLMs enable concept unlearning, which can enhance prediction fairness by allowing users to remove biased or unfair elements. Concept Unlearning involves forcing the model to forget a specific concept, which can be achieved by deactivating a neuron in the CBL or removing its weights from the final linear layer. For example, the below figure illustrates unlearning the concept of "overpriced", which may be subjective or geographically influenced in Yelp reviews (as the standard of overpricing varies across individuals and locations). This adjustment encourages CB-LLM to focus more on product quality.

An example of concept unlearning. This example is initially classified as negative due to the customer complaining about the high price, despite the lobster tails being great. After unlearning the concept "Overpriced", the concepts "Amazing flavors" and "Generous portion sizes" dominate the prediction, resulting in a positive prediction.

Part II: CB-LLM (generation)

Training pipeline

The overview of CB-LLM for text generation. The training consists of two modules: (1) The main module, which makes CB-LLM learn the concepts and next-token predictions. (2) The adversarial training module, which prevents the unsupervised layer from learning concept-related information to enhance steerability.

Our CB-LLM for text generation introduces interpretable neurons through a hybrid CBL structure, combining a concept bottleneck layer (CBL) and an unsupervised layer to enable both interpretability and token prediction. A key challenge is that the final prediction might rely entirely on the unsupervised layer (black-box module), and therefore losing interpretability. We solved this problem by including a novel adversarial training module to remove the concepts from the unsupervised layer. The training structure contains two modules:

Module 1 --- CB-LLM training: training the CB-LLM by learning concept-specific neurons through the CBL while ensuring effective token predictions using unsupervised neurons.
Module 2 --- Adversarial training: disentangling (removing) concept-related information from the unsupervised layer, ensuring the CBL remains effective in providing concept information for token predictions.

Combining these two modules, CB-LLM (generation) enables enhanced interpretability and controllability.

Evaluation

we evaluate our CB-LLM (generation) based on three crucial aspects: Concept detection, Steerability, and Generation quality.

Concept Detection: measuring the ability of CB-LLM to identify the correct concept based on interpretable neuron activations, with accuracy computed as the proportion of correctly aligned cases. CB-LLM achieves near-parity performance with the Llama3-8B finetuned for concept detection.
Steerability: evaluating the model's ability to generate text aligned with specific target concepts by setting the corresponding neuron activations in the CBL. Measured using a RoBERTa classifier to calculate the success rate of steering interventions. CB-LLM generates text aligned with target concepts, achieving significantly higher steerability than models without ADV training (Module 2). Results confirm the critical role of adversarial training in controllability.
Generation Quality: assessing through the perplexity of generated sentences, calculated using a pretrained Llama3-8B model. Lower perplexity indicates grammatically correct and fluent generations. CB-LLM maintains grammatical fluency and quality, with perplexity comparable to fine-tuned Llama3-8B. This indicates that improved interpretability and steerability come without sacrificing generation quality.

The accuracy, steerability, and perplexity of CB-LLMs (generation). CB-LLMs perform well on accuracy (↑) and perplexity (↓) while providing higher steerability (↑).

Use case: Toxicity Reduction

A useful application of CB-LLM (generation) is to build a chatbot, which can detect the harmful prompts and reduce the toxicity in the response. We incorporate four interpretable neurons in this chatbot: two for detecting benign and harmful prompts, and two for generating benign or toxic responses. This design allows users to control the chatbot's generation by adjusting neuron activations.

An example of toxicity detection and successful steering the generation via CB-LLM. CB-LLM identifies the harmful query token by token (marked in red), and users can steer the response to be benign (green) or toxic (red) through intervention on CBL.

Related Works

[1] Koh etal. "Concept Bottleneck Models", ICML 2020

[2] Yuksekgonul etal. "Post-hoc Concept Bottleneck Models", ICLR 2023

[3] Oikarinen etal. "Label-Free Concept Bottleneck Models", ICLR 2023

[4] Ludan etal. "Interpretable-by-Design Text Understanding with Iteratively Generated Concept Bottleneck", arXiv 2023

[5] Tan etal "Interpreting Pretrained Language Models via Concept Bottlenecks", arXiv 2023

Cite this work

Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng. "Concept Bottleneck Large Language Models", ICLR 2025

            
@article{cbllm,
    title={Concept Bottleneck Large Language Models},
    author={Chung-En Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei Weng},
    journal={ICLR 2025},
    year={2025}
}

This webpage template was recycled from here.

Accessibility