Provable and Efficient Dataset Distillation
for Kernel Ridge Regression

1UCSD  2RIKEN AIP
NeurIPS 2024

Abstract

Deep learning models are now trained on increasingly larger datasets, making it crucial to reduce computational costs and improve data quality. Dataset distillation aims to distill a large dataset into a small synthesized dataset such that models trained on it can achieve similar performance to those trained on the original dataset. While there have been many empirical efforts to improve dataset distillation algorithms, a thorough theoretical analysis and provable, efficient algorithms are still lacking.

In this paper, by focusing on dataset distillation for kernel ridge regression (KRR), we show that one data point per class is already necessary and sufficient to recover the original model’s performance in many settings.

  • For linear ridge regression and KRR with surjective feature mappings, we provide necessary and sufficient conditions for the distilled dataset to recover the original model’s parameters.

  • For KRR with injective feature mappings of deep neural networks, we show that while one data point per class is not sufficient in general, k+1 data points can be sufficient for deep linear neural networks, where k is the number of classes.

  • Our theoretical results enable directly constructing analytical solutions for distilled datasets, resulting in a provable and efficient dataset distillation algorithm for KRR. We verify our theory experimentally and show that our algorithm outperforms previous work such as KIP while being significantly more efficient, e.g. 15840× faster on CIFAR-100.


What is Dataset Distillation

An illustration of dataset distillation from [1]

Given an original dataset $(X, Y) \in \mathbb{R}^{d \times n} \times \mathbb{R}^{k \times n}$, where $d$ is the dimension of the data, $n$ is the number of original data points, and $k$ is the dimension of the label, dataset distillation aims to distill the large original dataset into a small synthesized dataset $(X_S, Y_S) \in \mathbb{R}^{d \times m} \times \mathbb{R}^{k \times m}$, such that models trained on it achieve similar performance to those trained on the original dataset, where $m$ is the number of distilled data points and $m \ll n$.


Dataset Distillation for Kernel Ridge Regression (KRR)

A kernel ridge regression (KRR) model is $f(x) = W\phi(x)$, with feature mapping $\phi: \mathbb{R}^d \to \mathbb{R}^p$ and weights $W \in \mathbb{R}^{k \times p}$ that minimize the loss $\min_W \|Y - W\phi(X)\|_F^2 + \lambda\|W\|_F^2$, where $\lambda \ge 0$ is the regularization parameter. The solution can be computed analytically as $W = Y\phi_\lambda(X)^+$, where

$$\phi_\lambda(X)^+ = \begin{cases} (K(X,X) + \lambda I_n)^{-1}\phi(X)^\top = \phi(X)^\top\left(\phi(X)\phi(X)^\top + \lambda I_p\right)^{-1}, & \text{if } \lambda > 0,\\ \phi(X)^+, & \text{if } \lambda = 0, \end{cases}$$

and $K(X,X) = \phi(X)^\top\phi(X) \in \mathbb{R}^{n \times n}$ is the kernel matrix. $\phi_\lambda(X)$ can be regarded as regularized features. Linear ridge regression is a special case of kernel ridge regression with $\phi(X) = X$.
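As a quick numerical check of this closed-form solution, the NumPy sketch below computes $W$ via both the kernel form $(K(X,X)+\lambda I_n)^{-1}\phi(X)^\top$ and the feature form $\phi(X)^\top(\phi(X)\phi(X)^\top+\lambda I_p)^{-1}$ and confirms they coincide. The toy dimensions and the stand-in feature map are illustrative assumptions; the identity holds for any $\phi$.

    import numpy as np

    rng = np.random.default_rng(0)
    d, p, n, k, lam = 5, 8, 50, 3, 0.1      # data dim, feature dim, #samples, #classes, lambda

    X = rng.normal(size=(d, n))             # original data, one column per sample
    Y = rng.normal(size=(k, n))             # labels / regression targets
    A = rng.normal(size=(p, d))
    phi = lambda X: np.tanh(A @ X)          # stand-in feature map phi: R^d -> R^p

    Phi = phi(X)                            # p x n feature matrix phi(X)
    K = Phi.T @ Phi                         # n x n kernel matrix K(X, X)

    # Kernel form:  W = Y (K(X,X) + lam I_n)^{-1} phi(X)^T
    W_kernel = Y @ np.linalg.solve(K + lam * np.eye(n), Phi.T)
    # Feature form: W = Y phi(X)^T (phi(X) phi(X)^T + lam I_p)^{-1}
    W_feature = Y @ Phi.T @ np.linalg.inv(Phi @ Phi.T + lam * np.eye(p))

    print(np.allclose(W_kernel, W_feature))  # True: the two forms coincide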

Similarly, a KRR model trained on the distilled dataset with regularization $\lambda_S \ge 0$ is $f_S(x) = W_S\phi(x)$ with $W_S = Y_S\phi_{\lambda_S}(X_S)^+$. The goal of dataset distillation is to find $(X_S, Y_S)$ such that $W_S = W$.


Dataset Distillation for Linear Ridge Regression (LRR)

For an LRR model, we show that $k$ distilled data points (one per class) are necessary and sufficient to guarantee $W_S = W$. We provide analytical solutions for such $(X_S, Y_S)$, allowing the distilled dataset to be computed analytically instead of having to be learned heuristically as in existing works.


Intuitively, the original dataset $(X, Y)$ is compressed into $(X_S, Y_S)$ through the original model's parameter $W$.
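To make this compression concrete, here is a minimal NumPy sketch of one analytical construction that achieves $W_S = W$ for LRR, assuming $m = k$, $\lambda_S = 0$, and $Y_S = I_k$. It is a simplified instance consistent with the statement above, not necessarily the paper's exact closed-form solution: take $X_S = W^+$, so the distilled model recovers $W_S = Y_S X_S^+ = (W^+)^+ = W$.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, k, lam = 10, 200, 3, 0.5

    X = rng.normal(size=(d, n))             # original data
    Y = rng.normal(size=(k, n))             # original labels

    # Original linear ridge regression: W = Y X^T (X X^T + lam I_d)^{-1}
    W = Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

    # Distilled dataset with m = k points and lambda_S = 0:
    # choose Y_S = I_k and X_S = W^+, so W_S = Y_S X_S^+ = (W^+)^+ = W.
    Y_S = np.eye(k)
    X_S = np.linalg.pinv(W)                 # d x k distilled inputs

    W_S = Y_S @ np.linalg.pinv(X_S)         # LRR solution on the distilled data (lambda_S = 0)
    print(np.allclose(W_S, W))              # True: the distilled model recovers W exactly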

Below is an illustration of the distilled data for MNIST and CIFAR-100 when $m = k$.

In addition, we can also:

  1. Find realistic distilled data close to the original data by solving a closed-form solution.
  2. Generate distilled data from random noise.

Below is an illustration on MNIST and CIFAR-100.


Kernel Ridge Regression (KRR)

Surjective Feature Mapping

The results for LRR can be extended to KRR by replacing $X_S$ with $\phi(X_S)$. When $\phi$ is surjective or bijective, we can always find an $X_S$ for a desired $\phi(X_S)$. Examples of surjective $\phi: \mathbb{R}^d \to \mathbb{R}^p$ include (a minimal sketch follows the list below):

  1. Invertible NNs
  2. Fully-connected NNs (FCNNs)
  3. Convolutional Neural Networks (CNNs)
  4. Random Fourier Features (RFF)
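As referenced above, here is a minimal NumPy sketch of the surjective case, using a toy surjective feature map $\phi(x) = \mathrm{LeakyReLU}(Ax)$ with a full-row-rank $A \in \mathbb{R}^{p \times d}$, $p \le d$ (an illustrative assumption, not one of the paper's architectures). The target features $\Phi_S$ are chosen exactly as $X_S$ was in the linear case, and then $\phi$ is inverted to obtain distilled inputs $X_S$ with $\phi(X_S) = \Phi_S$:

    import numpy as np

    rng = np.random.default_rng(0)
    d, p, n, k, lam = 20, 8, 300, 3, 0.5
    alpha = 0.1                                   # leaky-ReLU slope

    A = rng.normal(size=(p, d))                   # full row rank (p <= d)  =>  phi is surjective
    leaky = lambda z: np.where(z > 0, z, alpha * z)
    leaky_inv = lambda z: np.where(z > 0, z, z / alpha)
    phi = lambda X: leaky(A @ X)                  # phi: R^d -> R^p

    # Original KRR model on the full dataset
    X = rng.normal(size=(d, n))
    Y = rng.normal(size=(k, n))
    Phi = phi(X)
    W = Y @ Phi.T @ np.linalg.inv(Phi @ Phi.T + lam * np.eye(p))

    # Step 1: target distilled features, chosen exactly as X_S in the linear case
    #         (m = k, Y_S = I_k, lambda_S = 0)
    Y_S = np.eye(k)
    Phi_S = np.linalg.pinv(W)                     # p x k

    # Step 2: invert the surjective phi to get inputs X_S with phi(X_S) = Phi_S
    X_S = np.linalg.pinv(A) @ leaky_inv(Phi_S)    # d x k distilled inputs

    W_S = Y_S @ np.linalg.pinv(phi(X_S))          # KRR on the distilled data (lambda_S = 0)
    print(np.allclose(phi(X_S), Phi_S), np.allclose(W_S, W))   # True True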

Non-surjective Feature Mapping

For non-surjective $\phi$, such as deep nonlinear NNs, one data point per class is generally not sufficient when $(Y_S^+ W)^+$ is not in the range space of $\phi$. For deep linear NNs, we show that $m = k+1$ data points can be sufficient under certain conditions.


Comparison with Existing Theoretical Results

Below is a comparison with existing theoretical analyses of dataset distillation.

Experiments

For surjective $\phi$, our algorithm outperforms previous work such as KIP [2] while being significantly more efficient.


Related works

[1] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros. "Dataset Distillation." 2020.
[2] Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. "Dataset Distillation with Infinitely Wide Convolutional Networks." Advances in Neural Information Processing Systems, 2021.


Cite this work

Yilan Chen, Wei Huang, and Tsui-Wei Weng. "Provable and Efficient Dataset Distillation for Kernel Ridge Regression." NeurIPS 2024.
            
@inproceedings{chen2024provable,
    title={Provable and Efficient Dataset Distillation for Kernel Ridge Regression},
    author={Yilan Chen and Wei Huang and Tsui-Wei Weng},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024},
    url={https://openreview.net/forum?id=WI2VpcBdnd}
}
