A Theoretical Analysis of Contrastive Unsupervised Representation Learning
Nikunj Umesh Saunshi · Orestis Plevrakis · Sanjeev Arora · Mikhail Khodak · Hrishikesh Khandeparkar

Thu Jun 13 04:40 PM -- 05:00 PM (PDT) @ Room 102

Recent empirical works successfully use unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several are reminiscent of the well-known word2vec embedding algorithm: leveraging the availability of pairs of "semantically similar" data points and "negative samples", the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for understanding them, by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This conceptual framework allows us to show provable guarantees on the performance of the learnt representation on downstream classification tasks, whose classes are assumed to be random samples from the same set of latent classes. Our generalization bound also shows that learnt representations can reduce (labeled) sample complexity on downstream tasks. Controlled experiments are performed in NLP and image domains to support the theory.
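
The objective described in the abstract, pushing the inner product of a similar pair above the inner products with negative samples, can be illustrated with a small sketch. This is a minimal example under stated assumptions, not the authors' implementation: the function name `contrastive_logistic_loss`, the use of NumPy, and the specific logistic form log(1 + Σ exp(-margin)) are choices made here for illustration, since the abstract does not name a particular loss.

```python
import numpy as np

def contrastive_logistic_loss(anchor, positive, negatives):
    """Illustrative logistic contrastive loss for a single anchor point.

    anchor, positive: representation vectors of shape (d,)
    negatives: representations of k negative samples, shape (k, d)
    (All names and the loss form are assumptions for this sketch.)
    """
    pos_score = anchor @ positive        # inner product with the similar point
    neg_scores = negatives @ anchor      # inner products with negatives, shape (k,)
    margins = pos_score - neg_scores     # large margins mean a well-separated pair
    # log(1 + sum_i exp(-margin_i)), computed in a numerically stable way
    return float(np.logaddexp(0.0, np.logaddexp.reduce(-margins)))

# A similar pair that is aligned, with an orthogonal negative, gives a low loss.
anchor = np.array([1.0, 0.0])
aligned = np.array([1.0, 0.0])
orthogonal = np.array([[0.0, 1.0]])
low = contrastive_logistic_loss(anchor, aligned, orthogonal)

# Swapping the roles (the "similar" point is orthogonal, the negative is aligned)
# gives a higher loss.
high = contrastive_logistic_loss(anchor, orthogonal[0], aligned[None, :])
```

Minimizing such a loss over a class of representation functions encourages similar pairs to score higher than negatives on average, which is the behavior the abstract describes.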

Author Information

Nikunj Umesh Saunshi (Princeton University)
Orestis Plevrakis (Princeton University)
Sanjeev Arora (Princeton University and Institute for Advanced Study)
Mikhail Khodak (CMU)
Hrishikesh Khandeparkar (Princeton University)
