Workshop: The First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward

Boosting Monolingual Sentence Representation with Large-scale Parallel Translation Datasets

Jue Wang · Haofan Wang · Xing Wu · Chaochen Gao · Debing Zhang


Although contrastive learning greatly improves sentence representations, its performance is still limited by the size of existing monolingual datasets. Can massively parallel translation pairs, which are semantically highly correlated, be used to pre-train monolingual models? This paper explores that question. We leverage parallel translated sentence pairs to learn monolingual sentence embeddings and demonstrate superior performance in balancing alignment and consistency. We achieve a new state-of-the-art mean score on the standard Semantic Textual Similarity (STS) tasks, outperforming both SimCSE and Sentence-T5.
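
As an illustration of how translation pairs could serve as positives in contrastive learning, the sketch below computes an in-batch InfoNCE loss over toy embeddings. It is not taken from the paper: the function names, the cosine-similarity scoring, the temperature of 0.05, and the toy vectors are all assumptions made for the example, and the paper's actual objective may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(src_embs, tgt_embs, temperature=0.05):
    """Mean in-batch InfoNCE loss (hypothetical helper, not the paper's code).

    src_embs[i] and tgt_embs[i] are embeddings of a sentence and its
    translation (the positive pair). Every other tgt_embs[j] in the batch
    acts as a negative for src_embs[i].
    """
    assert len(src_embs) == len(tgt_embs)
    total = 0.0
    for i, src in enumerate(src_embs):
        logits = [cosine(src, tgt) / temperature for tgt in tgt_embs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true translation.
        total += log_denom - logits[i]
    return total / len(src_embs)


# Toy 2-d embeddings (assumed values): each source is close to its own translation.
aligned_src = [[1.0, 0.0], [0.0, 1.0]]
aligned_tgt = [[0.9, 0.1], [0.1, 0.9]]
# Mismatched pairing: each source is paired with the other sentence's translation.
shuffled_tgt = [aligned_tgt[1], aligned_tgt[0]]

loss_aligned = info_nce_loss(aligned_src, aligned_tgt)
loss_shuffled = info_nce_loss(aligned_src, shuffled_tgt)
```

With correctly paired translations the loss is close to zero, while the mismatched pairing yields a much larger loss, which is the signal that pulls a sentence and its translation together in embedding space.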
