[1] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Object detectors emerge in deep scene CNNs. In ICLR, 2015.
[2] Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, 2017.
[3] Oikarinen, T. and Weng, T.-W. CLIP-Dissect: Automatic description of neuron representations in deep vision networks. In ICLR, 2023.
[4] Bai, N., Iyer, R. A., Oikarinen, T., Kulkarni, A., and Weng, T.-W. Interpreting neurons in deep vision networks with language models. TMLR, 2025.
[5] Srinivas, A. A., Oikarinen, T., Srivastava, D., Weng, W.-H., and Weng, T.-W. SAND: Enhancing open-set neuron descriptions through spatial awareness. In WACV, pp. 2993–3002, 2025.
[6] Oikarinen, T., Das, S., Nguyen, L. M., and Weng, T.-W. Label-free concept bottleneck models. In ICLR, 2023.
[7] Huang, J., Geiger, A., D’Oosterlinck, K., Wu, Z., and Potts, C. Rigorously assessing natural language explanations of neurons. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 317–331, 2023.
[8] Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing. TMLR, 2023.
[9] Koh, P. W., Nguyen, T., Tang, Y. S., Mussmann, S., Pierson, E., Kim, B., and Liang, P. Concept bottleneck models. In ICML, pp. 5338–5348, 2020.
[10] Mu, J. and Andreas, J. Compositional explanations of neurons. In NeurIPS, 2020.
[11] La Rosa, B., Gilpin, L., and Capobianco, R. Towards a fuller understanding of neurons with clustered compositional explanations. In NeurIPS, 2023.
[12] Zimmermann, R. S., Klein, T., and Brendel, W. Scale alone does not improve mechanistic interpretability in vision models. In NeurIPS, 2023.
[13] Bykov, K., Kopf, L., Nakajima, S., Kloft, M., and Höhne, M. M. Labeling neural representations with inverse recognition. In NeurIPS, 2023.
[14] Kopf, L., Bommer, P. L., Hedström, A., Lapuschkin, S., Höhne, M. M.-C., and Bykov, K. CoSy: Evaluating textual explanations of neurons. In NeurIPS, 2024.
[15] Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. OpenAI blog, 2023.
[16] Oikarinen, T. and Weng, T.-W. Linear explanations for individual neurons. In ICML, 2024.
[17] Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.
[18] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
[19] Shaham, T. R., Schwettmann, S., Wang, F., Rajaram, A., Hernandez, E., Andreas, J., and Torralba, A. A multimodal automated interpretability agent. In ICML, 2024.
[20] Singh, C., Hsu, A. R., Antonello, R., Jain, S., Huth, A. G., Yu, B., and Gao, J. Explaining black box text modules in natural language with language models, 2023.