In this work, we propose a novel framework called Vision-Language-Guided Concept Bottleneck Model (VLG-CBM) to enable faithful interpretability while also boosting performance. Our method leverages off-the-shelf open-domain grounded object detectors to provide visually grounded concept annotations, which largely enhance the faithfulness of concept prediction while further improving model performance. In addition, we propose a new metric called Number of Effective Concepts (NEC) to control information leakage and provide better interpretability. Extensive evaluations across five standard benchmarks show that VLG-CBM outperforms existing methods by at least 4.27% and up to 51.09% in accuracy at NEC=5, and by at least 0.45% and up to 29.78% in average accuracy across different NECs, while preserving both the faithfulness and interpretability of the learned concepts, as demonstrated in extensive experiments.
Figure 1: We compare the decision explanations of VLG-CBM with existing methods by listing the top-5 concept contributions to each decision. Our observations: (1) VLG-CBM provides concise and accurate concept attributions for the decision; (2) LF-CBM [2] frequently uses negative concepts in its explanations, which is less informative; (3) LM4CV [3] attributes the decision to concepts that do not match the image; one reason is that LM4CV uses a limited number of concepts, which hurts the CBM's ability to explain diverse images; (4) both LF-CBM and LM4CV have a significant portion of the contribution coming from non-top concepts, making their decisions less transparent.
As deep neural networks become popular in real-world applications, it is crucial to understand the decisions of these black-box models. One approach to providing interpretable decisions is the Concept Bottleneck Model (CBM) [1], which introduces an intermediate concept layer that encodes human-understandable concepts and makes the final prediction based on these concept predictions.
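To make this architecture concrete, below is a minimal sketch of a vanilla CBM, assuming a PyTorch-style implementation; the ResNet-18 backbone and the linear concept and prediction layers are illustrative choices, not the specific configuration of any method discussed here.

```python
# Minimal sketch of a Concept Bottleneck Model (illustrative, PyTorch-style).
import torch.nn as nn
import torchvision.models as models

class ConceptBottleneckModel(nn.Module):
    def __init__(self, num_concepts: int, num_classes: int):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()              # keep only the feature extractor
        self.backbone = backbone
        # Concept Bottleneck Layer (CBL): maps image features to concept scores.
        self.cbl = nn.Linear(feat_dim, num_concepts)
        # Final predictor: class logits are computed *only* from concept scores.
        self.head = nn.Linear(num_concepts, num_classes)

    def forward(self, x):
        concepts = self.cbl(self.backbone(x))    # human-interpretable bottleneck
        logits = self.head(concepts)
        return concepts, logits
```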
However, current CBMs still face two major limitations:
Figure 2: Challenge #1 (inaccurate concept prediction): the concepts predicted by LM4CV [3] do not match the input image, leading to inaccurate explanations.
To address Challenge #1 (inaccurate concept prediction), we propose VLG-CBM: a faithful CBM trained with vision-language guidance.
Figure 3: VLG-CBM pipeline: we design an automated vision- and language-guided approach to train Concept Bottleneck Models.
Training VLG-CBM involves the following major steps:
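A hedged, self-contained sketch of how these steps might fit together is shown below. The stage interfaces (concept proposal, grounded detection, CBL and sparse-head training) are supplied as callables, and the prompts, detection threshold, and training details are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged outline of the major VLG-CBM training steps (illustrative only).
from typing import Callable, Dict, List, Sequence
import numpy as np

def train_vlg_cbm(
    images: Sequence,                                    # training images
    labels: Sequence[int],                               # class labels
    class_names: List[str],
    propose_concepts: Callable[[str], List[str]],        # e.g., query a language model per class
    detect: Callable[[object, List[str]], Dict[str, float]],  # grounded detector: concept -> score
    fit_cbl: Callable[[Sequence, np.ndarray], object],   # trains backbone + concept layer
    fit_sparse_head: Callable[[object, Sequence, Sequence[int], int], object],
    score_thresh: float = 0.35,                           # assumed detection threshold
    target_nec: int = 5,
):
    # Step 1: collect candidate concepts for every class.
    concepts = sorted({c for name in class_names for c in propose_concepts(name)})

    # Step 2: ground concepts visually -- a concept is labeled positive for an
    # image only if the open-domain grounded detector finds it with sufficient
    # confidence, giving per-image rather than per-class concept annotations.
    targets = np.zeros((len(images), len(concepts)), dtype=np.float32)
    for i, img in enumerate(images):
        scores = detect(img, concepts)
        for j, c in enumerate(concepts):
            targets[i, j] = float(scores.get(c, 0.0) >= score_thresh)

    # Step 3: train the concept bottleneck layer to predict the grounded labels.
    cbl = fit_cbl(images, targets)

    # Step 4: train a sparse final layer from concepts to classes, controlling NEC.
    head = fit_sparse_head(cbl, images, labels, target_nec)
    return concepts, cbl, head
```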
As mentioned in Challenge #2 (information leakage), existing CBMs suffer from information leakage: concept predictions can encode unintended information that is then used for the final class-label prediction. A recent work [3] showed empirically that, as the number of concepts increases, a randomly selected concept set approaches the accuracy of a carefully chosen one, supporting the existence of information leakage. Here, we give a theoretical result that explains this phenomenon:
Theorem 4.1: We consider a 1-D regression problem and show that a random CBL can approximate any linear function, with the expected error decreasing linearly as the number of concepts increases. The multi-class classification result can be derived similarly; see Corollary A.1 in Appendix A of the manuscript.
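The small numerical experiment below illustrates the phenomenon (it is not the theorem's construction or proof): a bottleneck of k random binary "concepts" of a scalar input, followed by a linear readout fitted on top, approximates a linear target increasingly well as k grows. The thresholded random-feature form of the concepts is an assumption made purely for this illustration.

```python
# Illustration: random concepts + fitted linear readout can fit a linear target.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 512)          # 1-D inputs
y = 3.0 * x + 1.0                        # an arbitrary linear target

for k in [2, 8, 32, 128]:
    w, b = rng.normal(size=k), rng.normal(size=k)
    concepts = (np.outer(x, w) + b > 0).astype(float)    # random binary CBL, shape (512, k)
    feats = np.hstack([concepts, np.ones((len(x), 1))])  # add a bias column
    coef, *_ = np.linalg.lstsq(feats, y, rcond=None)     # fit the final linear layer
    mse = np.mean((feats @ coef - y) ** 2)
    print(f"k = {k:4d}   MSE of random-concept model = {mse:.3e}")

# Even though the concepts carry no semantic meaning, the fitted readout
# recovers the target increasingly well as k grows -- so downstream accuracy
# alone cannot certify that the bottleneck learned meaningful concepts.
```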
Given this, we find that the traditional metric, which measures accuracy on the final class-label prediction, may not be a good indicator of the semantic information learned in the CBL, as even random CBLs can achieve high accuracy. Thus, we introduce the Number of Effective Concepts (NEC), which measures how many concepts contribute to the prediction of a single class (k in the theorem), and Accuracy at NEC (ANEC), which measures the accuracy of CBMs at different NECs.
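A hedged sketch of the metric is given below, taking NEC as the average number of non-zero weights per class row of the final linear layer; the pruning rule (keep the top-k weights per class by magnitude) is one simple way to reach a target NEC and is an illustrative assumption rather than the paper's exact procedure.

```python
# Sketch of NEC and a simple per-class magnitude pruning rule (illustrative).
import numpy as np

def nec(W: np.ndarray, tol: float = 1e-8) -> float:
    """NEC of a final layer W with shape (num_classes, num_concepts):
    the average number of concepts with non-zero weight per class."""
    return float(np.mean(np.sum(np.abs(W) > tol, axis=1)))

def prune_to_nec(W: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude weights in each class row."""
    W_pruned = np.zeros_like(W)
    for c in range(W.shape[0]):
        keep = np.argsort(-np.abs(W[c]))[:k]
        W_pruned[c, keep] = W[c, keep]
    return W_pruned

# Example: prune a dense 200-class x 400-concept head down to NEC = 5.
W = np.random.default_rng(0).normal(size=(200, 400))
print(nec(W), nec(prune_to_nec(W, 5)))   # 400.0 -> 5.0
```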
Controlling NEC provides the following benefits:
In this section, we show quantitative comparisons and qualitative visualizations of our method. The key metrics we measure are:
Table 1: Performance comparison with the CLIP-RN50 backbone. We compare our method against a random baseline, LF-CBM [2], LM4CV [3], and LaBo [4]. The random baseline has 1024 concept neurons for CIFAR10 and CIFAR100, 512 for CUB, 2048 for Places365, and 4096 for ImageNet.
Table 2: Performance comparison with non-CLIP backbones. We compare against LF-CBM [2] and a random baseline, as LM4CV [3] and LaBo [4] do not support non-CLIP backbones. The random baseline has 1024 concept neurons for CIFAR10 and CIFAR100, 512 for CUB, 2048 for Places365, and 4096 for ImageNet.
Figure 4: Visualization of Top-5 activated images of randomly selected neurons with VLG-CBM on the CUB dataset. VLG-CBM faithfully captures concepts that are aligned with human perception.
In sum, the key takeaways are:
[1] Koh et al. "Concept bottleneck models." In ICML, 2020.
[2] Oikarinen et al. "Label-free concept bottleneck models." In ICLR, 2023.
[3] Yan et al. "Learning concise and descriptive attributes for visual recognition." In ICCV, 2023.
[4] Yang et al. "Language in a bottle: Language model guided concept bottlenecks for interpretable image classification." In CVPR, 2023.
[5] Alvarez-Melis and Jaakkola. "Towards robust interpretability with self-explaining neural networks." In NeurIPS, 2018.
[6] Yuksekgonul et al. "Post-hoc concept bottleneck models." In ICLR, 2023.
[7] Kim et al. "Concept bottleneck with visual concept filtering for explainable medical image classification." arXiv preprint, 2023.
[8] Sun et al. "Eliminating information leakage in hard concept bottleneck models with supervised, hierarchical concept learning." arXiv preprint, 2024.
[9] Pham et al. "PEEB: Part-based image classifiers with an explainable and editable language bottleneck." arXiv preprint, 2024.
D. Srivastava*, G. Yan*, and T.-W. Weng, VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance, NeurIPS 2024.
@inproceedings{srivastava2024vlg,
  title     = {VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance},
  author    = {Srivastava, Divyansh and Yan, Ge and Weng, Tsui-Wei},
  booktitle = {NeurIPS},
  year      = {2024}
}