Provable and Efficient Dataset Distillation
for Kernel Ridge Regression
Abstract
Deep learning models are now trained on increasingly larger datasets, making it
crucial to reduce computational costs and improve data quality. Dataset distillation
aims to distill a large dataset into a small synthesized dataset such that models
trained on it can achieve similar performance to those trained on the original
dataset. While there have been many empirical efforts to improve dataset distillation
algorithms, a thorough theoretical analysis and provable, efficient algorithms are
still lacking.
In this paper, by focusing on dataset distillation for kernel ridge
regression (KRR), we show that one data point per class is already necessary and
sufficient to recover the original model’s performance in many settings.
- For linear ridge regression and KRR with surjective feature mappings, we provide necessary
and sufficient conditions for the distilled dataset to recover the original model’s
parameters.
- For KRR with injective feature mappings of deep neural networks, we
show that while one data point per class is not sufficient in general, k+1 data points
can be sufficient for deep linear neural networks, where k is the number of classes.
- Our theoretical results enable directly constructing analytical solutions for distilled
datasets, resulting in a provable and efficient dataset distillation algorithm for KRR.
We verify our theory experimentally and show that our algorithm outperforms
previous work such as KIP while being significantly more efficient, e.g. 15840×
faster on CIFAR-100.
What is Dataset Distillation
An illustration of dataset distillation from [1]
Given an original dataset $(X_t, Y_t)$ with $X_t \in \mathbb{R}^{n_t \times d}$ and $Y_t \in \mathbb{R}^{n_t \times k}$, where $d$ is the dimension of the data, $n_t$ is the number of original data points, and $k$ is the dimension of the label, dataset distillation aims to distill the large original dataset
into a small synthesized dataset $(X_s, Y_s)$ with $X_s \in \mathbb{R}^{n_s \times d}$ and $Y_s \in \mathbb{R}^{n_s \times k}$, such that models trained on it
can achieve similar performance to those trained on the original dataset, where $n_s$ is the number of distilled data points and $n_s \ll n_t$.
Dataset Distillation for Kernel Ridge Regression (KRR)
A kernel ridge regression (KRR) model is $f(x) = \phi(x)^\top W_t$, where $\phi: \mathbb{R}^d \to \mathbb{R}^p$ is the feature mapping and $W_t \in \mathbb{R}^{p \times k}$,
which minimizes the following loss:
$$\min_{W}\; \|\phi(X_t) W - Y_t\|_F^2 + \lambda_t \|W\|_F^2,$$
where $\lambda_t > 0$ is the regularization parameter.
The solution can be computed analytically as $W_t = \phi_{\lambda_t}(X_t)^\top Y_t$, where
$\phi_{\lambda_t}(X_t) = (K_t + \lambda_t I)^{-1}\phi(X_t)$ and $K_t = \phi(X_t)\phi(X_t)^\top$. $\phi_{\lambda_t}(X_t)$ can be considered as regularized features.
Linear ridge regression (LRR) is a special case of kernel ridge regression with $\phi(x) = x$.
Similarly, a KRR model trained on the distilled dataset $(X_s, Y_s)$ with regularization $\lambda_s$ is $W_s = \phi_{\lambda_s}(X_s)^\top Y_s$.
The goal of dataset distillation is to find $(X_s, Y_s)$ such that $W_s = W_t$.
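To make the closed form concrete, below is a minimal NumPy sketch (not the paper's code; the variable names, sizes, and the random-Fourier-feature choice of $\phi$ are illustrative assumptions). It computes $W_t = \phi(X_t)^\top (K_t + \lambda_t I)^{-1} Y_t$ and checks it against the equivalent primal ridge solution $(\phi(X_t)^\top \phi(X_t) + \lambda_t I)^{-1} \phi(X_t)^\top Y_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, d, p, k, lam_t = 200, 20, 100, 10, 1e-2    # illustrative sizes

X_t = rng.normal(size=(n_t, d))                 # original data
Y_t = np.eye(k)[rng.integers(k, size=n_t)]      # one-hot labels

# An example explicit feature map phi: random Fourier features (any explicit phi works here)
Omega = rng.normal(size=(d, p))
b = rng.uniform(0.0, 2.0 * np.pi, size=p)
phi = lambda X: np.sqrt(2.0 / p) * np.cos(X @ Omega + b)

F_t = phi(X_t)                                  # (n_t, p) feature matrix phi(X_t)
K_t = F_t @ F_t.T                               # kernel matrix K_t = phi(X_t) phi(X_t)^T

# Analytical KRR solution: W_t = phi(X_t)^T (K_t + lam_t I)^{-1} Y_t
W_t = F_t.T @ np.linalg.solve(K_t + lam_t * np.eye(n_t), Y_t)

# Sanity check against the primal ridge solution (identical for lam_t > 0)
W_primal = np.linalg.solve(F_t.T @ F_t + lam_t * np.eye(p), F_t.T @ Y_t)
print(np.allclose(W_t, W_primal))               # True

# Prediction on a new point: f(x) = phi(x) W_t
x_new = rng.normal(size=(1, d))
print(phi(x_new) @ W_t)
```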
Dataset Distillation for Linear Ridge Regression (LRR)
For an LRR model, we show that $k$ distilled data points (one per class) are necessary and
sufficient to guarantee $W_s = W_t$. We provide analytical solutions for such $(X_s, Y_s)$, allowing us to
compute the distilled dataset analytically instead of having to learn it heuristically as in existing works.
Intuitively, the original dataset $(X_t, Y_t)$ is compressed into $(X_s, Y_s)$ through the original model's parameters $W_t$ (a numerical sketch follows the list below).
- When $n_s = k$, there is only one solution (for a given $Y_s$).
- When $n_s > k$, there are infinitely many distilled datasets, since $Y_s$ is a free variable to choose.
- When $\lambda_s = \lambda_t$, the original dataset $(X_t, Y_t)$ is a distilled dataset for itself.
- When $n_s > n_t$, we can even generate more data than the original dataset.
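Below is a minimal NumPy sketch of the $n_s = k$ case (a simplified instance for illustration, not the paper's general closed form): assuming $\lambda_s = 0$ and choosing $Y_s = I_k$, setting $X_s = Y_s W_t^{+}$, with $W_t^{+}$ the pseudoinverse of the original LRR solution, makes retraining on the $k$ distilled points recover $W_t$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_t, lam_t = 50, 10, 1000, 1e-3           # illustrative sizes

X_t = rng.normal(size=(n_t, d))                 # original data
Y_t = np.eye(k)[rng.integers(k, size=n_t)]      # one-hot labels

# Original LRR solution: W_t = X_t^T (X_t X_t^T + lam_t I)^{-1} Y_t
W_t = X_t.T @ np.linalg.solve(X_t @ X_t.T + lam_t * np.eye(n_t), Y_t)   # (d, k)

# Distilled dataset with n_s = k points (one per class):
# choose Y_s = I_k and compress the dataset through W_t via its pseudoinverse
Y_s = np.eye(k)
X_s = Y_s @ np.linalg.pinv(W_t)                 # (k, d)

# Train on the k distilled points with lam_s = 0
W_s = np.linalg.pinv(X_s) @ Y_s                 # (d, k)

print(np.allclose(W_s, W_t))                    # True: the distilled model recovers W_t
```

With $\lambda_s > 0$ or $n_s > k$ the construction changes (see the analytical solutions in the paper); the sketch above only covers the simplest setting.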
Below is an illustration of the distilled data for MNIST and CIFAR-100 when $n_s = k$ (one data point per class).
In addition, we can also:
- Find realistic distilled data that is close to the original data by solving a closed-form solution.
- Generate distilled data from random noise.
Below is an illustration on MNIST and CIFAR-100.
Kernel Ridge Regression (KRR)
Surjective Feature Mapping
The results for LRR can be extended to KRR by replacing the data $X$ with the features $\phi(X)$.
When $\phi$ is surjective or bijective, we can always find an $X_s$ that produces a desired $\phi(X_s)$ (a toy example with an invertible feature map is sketched after the list below).
Examples of surjective $\phi$:
- Invertible NNs
- Fully-connected NN (FCNN)
- Convolutional Neural Network (CNN)
- Random Fourier Features (RFF)
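As a toy sketch of the surjective case (an illustrative example, not the paper's construction): take a bijective feature map $\phi(x) = \mathrm{LeakyReLU}(Ax)$ with an invertible matrix $A$, compute the required distilled features exactly as in the LRR case, and pull them back through $\phi^{-1}$. All names, the choice $Y_s = I_k$, and the $\lambda_s = 0$ simplification are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_t, lam_t, alpha = 32, 10, 500, 1e-3, 0.2   # illustrative sizes

# A bijective feature map: phi(x) = LeakyReLU(A x) with invertible A (a toy "invertible NN")
A = rng.normal(size=(d, d)) + d * np.eye(d)        # well-conditioned, invertible
leaky     = lambda z: np.where(z >= 0, z, alpha * z)
leaky_inv = lambda z: np.where(z >= 0, z, z / alpha)
phi     = lambda X: leaky(X @ A.T)
phi_inv = lambda F: leaky_inv(F) @ np.linalg.inv(A).T

X_t = rng.normal(size=(n_t, d))
Y_t = np.eye(k)[rng.integers(k, size=n_t)]         # one-hot labels

# KRR solution on the original data
F_t = phi(X_t)
W_t = F_t.T @ np.linalg.solve(F_t @ F_t.T + lam_t * np.eye(n_t), Y_t)

# Distill: compute the k required feature vectors, then invert phi to get inputs
Y_s = np.eye(k)
F_s = Y_s @ np.linalg.pinv(W_t)                    # required features (lam_s = 0 case)
X_s = phi_inv(F_s)                                 # exists because phi is surjective

# Retrain on the k distilled points and compare with the original model
W_s = np.linalg.pinv(phi(X_s)) @ Y_s
print(np.allclose(W_s, W_t))                       # True
```

This is what makes the surjective case easy: the required features always have a preimage under $\phi$, so the LRR construction carries over.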
Non-surjective Feature Mapping
For non-surjective such as deep nonlinear NNs, one data per class is generally not sufficient as long as is not in the range space of .
For deep linear NNs, we show can be sufficient under certain conditions.
Comparison with Existing Theoretical Results
Below is a comparison with existing theoretical analysis of dataset distillation.
Experiments
For surjective $\phi$, our algorithm outperforms previous work such as KIP [2] while being significantly more efficient.
Cite this work
Yilan Chen, Wei Huang, Tsui-Wei Weng, "Provable and Efficient Dataset Distillation for Kernel Ridge Regression", NeurIPS 2024.
@inproceedings{
chen2024provable,
title={Provable and Efficient Dataset Distillation for Kernel Ridge Regression},
author={Yilan Chen and Wei Huang and Tsui-Wei Weng},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=WI2VpcBdnd}
}