Spotlight Poster

On the Provable Advantage of Unsupervised Pretraining

Jiawei Ge ⋅ Shange Tang ⋅ Jianqing Fan ⋅ Chi Jin

2024 Spotlight Poster

[ Poster] [ OpenReview]

Abstract

Unsupervised pretraining, which learns a useful representation using a large amount of unlabeled data to facilitate the learning of downstream tasks, is a critical component of modern large-scale machine learning systems. Despite its tremendous empirical success, the rigorous theoretical understanding of why unsupervised pretraining generally helps remains rather limited---most existing results are restricted to particular methods or approaches for unsupervised pretraining with specialized structural assumptions. This paper studies a generic framework,where the unsupervised representation learning task is specified by an abstract class of latent variable models $\Phi$ and the downstream task is specified by a class of prediction functions $\Psi$. We consider a natural approach of using Maximum Likelihood Estimation (MLE) for unsupervised pretraining and Empirical Risk Minimization (ERM) for learning downstream tasks. We prove that, under a mild ``informative'' condition, our algorithm achieves an excess risk of $\\tilde{\\mathcal{O}}(\sqrt{\mathcal{C}\_\Phi/m} + \sqrt{\mathcal{C}\_\Psi/n})$ for downstream tasks, where $\mathcal{C}\_\Phi, \mathcal{C}\_\Psi$ are complexity measures of function classes $\Phi, \Psi$, and $m, n$ are the number of unlabeled and labeled data respectively. Comparing to the baseline of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}\_{\Phi \circ \Psi}/n})$ achieved by performing supervised learning using only the labeled data, our result rigorously shows the benefit of unsupervised pretraining when $m \gg n$ and $\mathcal{C}\_{\Phi\circ \Psi} > \mathcal{C}\_\Psi$. This paper further shows that our generic framework covers a wide range of approaches for unsupervised pretraining, including factor models, Gaussian mixture models, and contrastive learning.

Video

Chat is not available.