The pre-training representation learning paradigm is a recently popular approach to addressing distribution shift and dataset limitations. This approach first pre-trains a representation function on large unlabeled datasets via self-supervised learning (e.g., contrastive learning), and then learns a classifier on top of the representation using small labeled datasets for downstream target tasks. The representation should have two key properties: label efficiency (i.e., an accurate classifier can be learned from a small amount of labeled data) and universality (i.e., it is useful for a wide range of downstream tasks). In this paper, we focus on contrastive learning and systematically study the trade-off between label efficiency and universality, both empirically and theoretically. We empirically show that the trade-off exists across different models and datasets. Theoretically, we propose a data model with a hidden representation and provide an analysis in a simplified setting with linear models. The analysis shows that, compared with pre-training on the target data directly, pre-training on diverse tasks can lead to a larger sample complexity for learning the classifier and thus worse prediction performance.