Separating representation from reconstruction enables scalable text encoders
Megi Dervishi ⋅ Mathurin Videau ⋅ Yann LeCun
Abstract
While decoders have rapidly scaled, encoders have remained largely unchanged since BERT. We examine this disparity by revisiting evaluation through the lens of fine-tuning with a frozen backbone and linear probing. As models scale, their representations become increasingly difficult for frozen probes to exploit, despite improved perplexity. This suggests a misalignment between direct token prediction and the learning of rich, versatile, easily extractable representations. Hence, we propose CrossBERT, a two-part architecture that separates the learning of high-quality encoded representations from the rigid grounding of token reconstruction. This design further enables high masking ratios ($\ge 50\%$) and gradient collection over all tokens via a Complementary Masking Strategy, increasing throughput by $1.5$-$2\times$ and sample efficiency by $2\times$, respectively. Overall, CrossBERT demonstrates monotonic scaling and superior performance on MTEB(eng, v2) and frozen GLUE benchmarks.
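To make the Complementary Masking Strategy concrete, the following is a minimal sketch of how complementary masks could be drawn so that every token position is masked in exactly one of two views and thus receives a reconstruction loss; the function name and a Bernoulli masking scheme are assumptions for illustration, not the paper's implementation.

```python
import torch

def complementary_masks(batch_size: int, seq_len: int, mask_ratio: float = 0.5):
    """Draw a random token mask and its complement (illustrative sketch only).

    Every position is masked in exactly one of the two returned views, so a
    reconstruction loss computed on both views covers all tokens.
    """
    # View A masks roughly `mask_ratio` of the positions at random.
    mask_a = torch.rand(batch_size, seq_len) < mask_ratio
    # View B masks exactly the remaining positions.
    mask_b = ~mask_a
    return mask_a, mask_b

# Example: two complementary 50% masks over a batch of 8 sequences of length 128.
mask_a, mask_b = complementary_masks(batch_size=8, seq_len=128, mask_ratio=0.5)
```

Under this sketch, each view is reconstructed in a separate forward pass; together the two passes place a prediction loss on every token, which is one way to obtain the reported gain in sample efficiency while each individual pass reconstructs only a subset of tokens.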