Interpretable Generative Models through
Post-hoc Concept Bottlenecks

Akshay Kulkarni, Ge Yan*, Chung-En Sun*, Tuomas Oikarinen, Tsui-Wei (Lily) Weng

UC San Diego

* indicates equal contribution

CVPR 2025

article Paper code Code

Abstract

Recent work on Concept Bottleneck Models (CBMs) [1-5] focus solely on the image classification task, while we focus on the image generation task.
The existing approach, CBGM [6], to design interpretable generative models based on CBMs is not efficient and scalable, as it requires expensive generative model training from scratch as well as real images with labor-intensive concept supervision.

To address this, we propose efficient post-hoc concept bottleneck training for frozen pretrained generative models with our novel concept bottleneck autoencoder (CB-AE) and concept controller (CC), providing interpretability at minimal training cost.
Our approach enables efficient and scalable training by using generated images and our method can work with minimal to no concept supervision.
We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average ~25%) over the prior work, while being 4-15x faster to train.

Figure 1. Comparison with prior work on concept bottleneck generative models [6]

Characteristic Comparison

Our CB-AE enables more efficient training than CBGM [6] and transforms a frozen pretrained generative model into a CBM.
On the other hand, our CC trades-off inherent interpretability for better steerability, image quality, and faster training

Table 1. Characteristic comparisons of our CB-AE and CC with prior work CBGM [6]

Method

1. Post-hoc Concept Bottleneck Autoencoder (CB-AE)

Our training involves three objectives:

Objective 1. Reconstruction losses for auto-encoding at latent and image level.

Figure 2A. Reconstruction losses for CB-AE training (only \(E, D\) are trainable)

Objective 2. Concept alignment loss with pseudo-label supervision from zero-shot CLIP or off-the-shelf classifiers.

Figure 2B. Concept alignment loss for CB-AE training (only \(E\) is trainable)

Objective 3. Intervention losses simulate interventions at training-time and encourage intervened concept alignment.

Figure 2C. Intervention losses for CB-AE training (only \(E, D\) are trainable)

2. CB-AE Interventions

Concept interventions can be performed by changing the concept prediction during generation.
Specifically, we swap the logits of desired concepts for intervention (e.g. see Fig. 2C).

3. Optimization-based Interventions

We propose a novel intervention method inspired from adversarial attacks [7].

Consider a generated image \(x=g_2(w)\) with concept prediction \(c=E(w)\). We can intervene to obtain desired concepts \(c^*\) using gradient ascent for the following optimization:
Here, \(\Delta=\{\delta: \Vert\delta\Vert_\infty \leq \epsilon\}\) restricts \(\delta\) to the \(\ell_\infty\) \(\epsilon\)-ball.
The intervened image can be obtained using the CB-AE encoder or CC like \(x^*=E(w^*)\).

4. Post-hoc Concept Controller (CC) for Steering

The post-hoc concept controller (CC) \(\Omega\) (Fig. 1B, bottom) is trained efficiently using only concept loss (Objective 2).
CC can steer image generation with optimization-based interventions.

Experiments

1. Concept Prediction

Our CB-AE and CC can provide interpretability with concept predictions for generated images (Fig. 3).

Figure 3. Qualitative evaluation of concept prediction

2. Steerability

Our CB-AE enables concept interventions for controllable image generation (Fig. 4).
Our optimization-based interventions further improve image quality and orthogonality of concept interventions (Fig. 5).

Figure 4. Qualitative evaluation of concept steerability with CB-AE

Figure 5. Qualitative evaluation of concept steerability with optimization-based interventions

Steerability (%) is the success rate of concept interventions averaged over all concepts and computed over 5k generated images.

The success rate is calculated w.r.t. a pretrained concept classifier (ViT-L-16-based, not used in CB-AE/CC training).

From Table 2, we find improved steerability of CB-AE and CC over CBGM as well as reduced training time.

Table 2. Quantitative evaluation of concept steerability i.e. concept interventions (train time on 1 V100 GPU)

3. Generation Quality

CBGM produces better image quality than CB-AE w.r.t. the base model generations since CBGM directly optimizes image generation objectives.
However, our optimization-based interventions can obtain almost the same image quality as the base model.

Table 3. Generation quality evaluation using FID (train time in V100 GPU-hours)

Conclusion

We are the first to propose a post-hoc concept bottleneck autoencoder (CB-AE) for interpretable generative models. CB-AE can be trained efficiently with a frozen pretrained generative model, without real concept-labeled images.
We propose a novel optimization-based concept intervention method with improved steerability (average +19%) and higher image quality (average +32% better).
We validate the effectiveness of our methods for GANs and diffusion models (avg.+31% and +28% steerability w.r.t. prior state-of-the-art) across varying image resolutions, while being 4-15x faster to train on average.

Related Works

[1] P. W. Koh et al., Concept Bottleneck Models, ICML 2020
[2] E. Marconato et al., GlanceNets: Interpretable, Leak-Proof Concept-based Models, NeurIPS 2022
[3] M. Yuksekgonul et al., Post-hoc Concept Bottleneck Models, ICLR 2023
[4] T. Oikarinen et al., Label-Free Concept Bottleneck Models, ICLR 2023
[5] D. Srivastava et al., VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance, NeurIPS 2024
[6] A. A. Ismail et al., Concept Bottleneck Generative Models, ICLR 2024
[7] E. Wong et al., Fast is better than Free: Revisiting Adversarial Training, ICLR 2020

Cite this work

A. Kulkarni, G. Yan, C. Sun, T. Oikarinen, and T.-W. Weng, Interpretable Generative Models through Post-hoc Concept Bottlenecks, CVPR 2025.

@inproceedings{kulkarni2025interpretable
    title={Interpretable Generative Models through Post-hoc Concept Bottlenecks},
    author={Kulkarni, Akshay and Yan, Ge and Sun, Chung-En and Oikarinen, Tuomas and Weng, Tsui-Wei},
    booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2025},
}

This webpage template was recycled from here.

Accessibility