In this work, we propose a novel framework called Vision-Language-Guided Concept Bottleneck Model (VLG-CBM) to enable faithful interpretability while also boosting performance. Our method leverages off-the-shelf open-domain grounded object detectors to provide visually grounded concept annotations, which largely enhance the faithfulness of concept prediction while further improving model performance. In addition, we propose a new metric called Number of Effective Concepts (NEC) to control information leakage and provide better interpretability. Extensive evaluations across five standard benchmarks show that VLG-CBM outperforms existing methods by at least 4.27% and up to 51.09% in accuracy at NEC=5, and by at least 0.45% and up to 29.78% in average accuracy across different NECs, while preserving both the faithfulness and interpretability of the learned concepts, as demonstrated in extensive experiments.
Figure 1: We compare the decision explanations of VLG-CBM with existing methods by listing the top-5 concept contributions to each decision. Our observations: (1) VLG-CBM provides concise and accurate concept attributions for the decision; (2) LF-CBM [2] frequently uses negative concepts in its explanations, which is less informative; (3) LM4CV [3] attributes the decision to concepts that do not match the image; one reason is that LM4CV uses a limited number of concepts, which hurts the CBM's ability to explain diverse images; (4) both LF-CBM and LM4CV have a significant portion of the contribution coming from non-top concepts, making their decisions less transparent.
As deep neural networks become popular in real-world applications, it is crucial to understand the decisions of these black-box models. One approach to providing interpretable decisions is the Concept Bottleneck Model (CBM) [1], which introduces an intermediate concept layer that encodes human-understandable concepts and makes the final prediction based on these concept predictions.
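To make this architecture concrete, below is a minimal sketch of a vanilla CBM, assuming a PyTorch-style implementation; the ResNet-18 backbone and the linear concept and prediction layers are illustrative choices, not the specific configuration of any method discussed here.

```python
# Minimal sketch of a Concept Bottleneck Model (illustrative, PyTorch-style).
import torch.nn as nn
import torchvision.models as models

class ConceptBottleneckModel(nn.Module):
    def __init__(self, num_concepts: int, num_classes: int):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()              # keep only the feature extractor
        self.backbone = backbone
        # Concept Bottleneck Layer (CBL): maps image features to concept scores.
        self.cbl = nn.Linear(feat_dim, num_concepts)
        # Final predictor: class logits are computed *only* from concept scores.
        self.head = nn.Linear(num_concepts, num_classes)

    def forward(self, x):
        concepts = self.cbl(self.backbone(x))    # human-interpretable bottleneck
        logits = self.head(concepts)
        return concepts, logits
```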
However, current CBMs still face two major limitations:
Figure 2: Challenge #1 (inaccurate concept prediction): the concepts predicted by LM4CV [3] do not match the input image, leading to inaccurate explanations.
To address Challenge #1 (inaccurate concept prediction), we propose VLG-CBM: a faithful CBM trained with vision-language guidance.
Figure 3: VLG-CBM pipeline: we design an automated vision- and language-guided approach to train Concept Bottleneck Models.
Training VLG-CBM involves the following major steps:
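A hedged, self-contained sketch of how these steps might fit together is shown below. The stage interfaces (concept proposal, grounded detection, CBL and sparse-head training) are supplied as callables, and the prompts, detection threshold, and training details are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged outline of the major VLG-CBM training steps (illustrative only).
from typing import Callable, Dict, List, Sequence
import numpy as np

def train_vlg_cbm(
    images: Sequence,                                    # training images
    labels: Sequence[int],                               # class labels
    class_names: List[str],
    propose_concepts: Callable[[str], List[str]],        # e.g., query a language model per class
    detect: Callable[[object, List[str]], Dict[str, float]],  # grounded detector: concept -> score
    fit_cbl: Callable[[Sequence, np.ndarray], object],   # trains backbone + concept layer
    fit_sparse_head: Callable[[object, Sequence, Sequence[int], int], object],
    score_thresh: float = 0.35,                           # assumed detection threshold
    target_nec: int = 5,
):
    # Step 1: collect candidate concepts for every class.
    concepts = sorted({c for name in class_names for c in propose_concepts(name)})

    # Step 2: ground concepts visually -- a concept is labeled positive for an
    # image only if the open-domain grounded detector finds it with sufficient
    # confidence, giving per-image rather than per-class concept annotations.
    targets = np.zeros((len(images), len(concepts)), dtype=np.float32)
    for i, img in enumerate(images):
        scores = detect(img, concepts)
        for j, c in enumerate(concepts):
            targets[i, j] = float(scores.get(c, 0.0) >= score_thresh)

    # Step 3: train the concept bottleneck layer to predict the grounded labels.
    cbl = fit_cbl(images, targets)

    # Step 4: train a sparse final layer from concepts to classes, controlling NEC.
    head = fit_sparse_head(cbl, images, labels, target_nec)
    return concepts, cbl, head
```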
As mentioned in Challenge #2 (information leakage), existing CBMs suffer from information leakage: concept predictions can encode unintended information that is then used for the final class-label prediction. A recent work [3] showed empirically that, as the number of concepts increases, a randomly selected concept set approaches the accuracy of a carefully chosen one, supporting the existence of information leakage. Here, we give a theoretical result that explains this phenomenon:
Theorem 4.1: We consider a 1-D regression problem and show that a random CBL can approximate any linear function, with the expected error decreasing linearly as the number of concepts increases. The multi-class classification result can be derived similarly; see Corollary A.1 in Appendix A of the manuscript.
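The small numerical experiment below illustrates the phenomenon (it is not the theorem's construction or proof): a bottleneck of k random binary "concepts" of a scalar input, followed by a linear readout fitted on top, approximates a linear target increasingly well as k grows. The thresholded random-feature form of the concepts is an assumption made purely for this illustration.

```python
# Illustration: random concepts + fitted linear readout can fit a linear target.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 512)          # 1-D inputs
y = 3.0 * x + 1.0                        # an arbitrary linear target

for k in [2, 8, 32, 128]:
    w, b = rng.normal(size=k), rng.normal(size=k)
    concepts = (np.outer(x, w) + b > 0).astype(float)    # random binary CBL, shape (512, k)
    feats = np.hstack([concepts, np.ones((len(x), 1))])  # add a bias column
    coef, *_ = np.linalg.lstsq(feats, y, rcond=None)     # fit the final linear layer
    mse = np.mean((feats @ coef - y) ** 2)
    print(f"k = {k:4d}   MSE of random-concept model = {mse:.3e}")

# Even though the concepts carry no semantic meaning, the fitted readout
# recovers the target increasingly well as k grows -- so downstream accuracy
# alone cannot certify that the bottleneck learned meaningful concepts.
```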
Given this, we find that the traditional metric, which measures accuracy on the final class-label prediction, may not be a good indicator of the semantic information learned in the CBL, as even random CBLs can achieve high accuracy. Thus, we introduce the Number of Effective Concepts (NEC), which measures how many concepts contribute to the prediction of a single class (k in the theorem), and Accuracy at NEC (ANEC), which measures the accuracy of CBMs at different NECs.
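A hedged sketch of the metric is given below, taking NEC as the average number of non-zero weights per class row of the final linear layer; the pruning rule (keep the top-k weights per class by magnitude) is one simple way to reach a target NEC and is an illustrative assumption rather than the paper's exact procedure.

```python
# Sketch of NEC and a simple per-class magnitude pruning rule (illustrative).
import numpy as np

def nec(W: np.ndarray, tol: float = 1e-8) -> float:
    """NEC of a final layer W with shape (num_classes, num_concepts):
    the average number of concepts with non-zero weight per class."""
    return float(np.mean(np.sum(np.abs(W) > tol, axis=1)))

def prune_to_nec(W: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude weights in each class row."""
    W_pruned = np.zeros_like(W)
    for c in range(W.shape[0]):
        keep = np.argsort(-np.abs(W[c]))[:k]
        W_pruned[c, keep] = W[c, keep]
    return W_pruned

# Example: prune a dense 200-class x 400-concept head down to NEC = 5.
W = np.random.default_rng(0).normal(size=(200, 400))
print(nec(W), nec(prune_to_nec(W, 5)))   # 400.0 -> 5.0
```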
Controlling NEC provides the following benefits:
In this section, we show quantitative comparisons and qualitative visualizations of our method. The key metrics we measure are:
Table 1: Performance comparison with the CLIP-RN50 backbone. We compare our method against a random baseline, LF-CBM [2], LM4CV [3], and LaBo [4]. The random baseline has 1024 concept neurons for CIFAR10 and CIFAR100, 512 for CUB, 2048 for Places365, and 4096 for ImageNet.
Table 2: Performance comparison with non-CLIP backbones. We compare against LF-CBM [2] and a random baseline, as LM4CV [3] and LaBo [4] do not support non-CLIP backbones. The random baseline has 1024 concept neurons for CIFAR10 and CIFAR100, 512 for CUB, 2048 for Places365, and 4096 for ImageNet.
Figure 4: Visualization of Top-5 activated images of randomly selected neurons with VLG-CBM on the CUB dataset. VLG-CBM faithfully captures concepts that are aligned with human perception.
In sum, the key takeaways are:
[1] Koh et al. "Concept bottleneck models." In ICML, 2020.
[2] Oikarinen et al. "Label-free concept bottleneck models." In ICLR, 2023.
[3] Yan et al. "Learning concise and descriptive attributes for visual recognition." In ICCV, 2023.
[4] Yang et al. "Language in a bottle: Language model guided concept bottlenecks for interpretable image classification." In CVPR, 2023.
[5] Alvarez-Melis and Jaakkola. "Towards robust interpretability with self-explaining neural networks." In NeurIPS, 2018.
[6] Yuksekgonul et al. "Post-hoc concept bottleneck models." In ICLR, 2023.
[7] Kim et al. "Concept bottleneck with visual concept filtering for explainable medical image classification." arXiv preprint, 2023.
[8] Sun et al. "Eliminating information leakage in hard concept bottleneck models with supervised, hierarchical concept learning." arXiv preprint, 2024.
[9] Pham et al. "PEEB: Part-based image classifiers with an explainable and editable language bottleneck." arXiv preprint, 2024.
D. Srivastava*, G. Yan*, and T.-W. Weng, VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance, NeurIPS 2024.
@inproceedings{srivastava2024vlg,
  title     = {VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance},
  author    = {Srivastava, Divyansh and Yan, Ge and Weng, Tsui-Wei},
  booktitle = {NeurIPS},
  year      = {2024}
}