Poster in Workshop: Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and Simulators
Symbolic Autoencoding for Self-Supervised Sequence Learning
Mohammad Hossein Amani · Nicolas Baldwin · Amin Mansouri · Martin Josifoski · Maxime Peyrard · Robert West
Keywords: [ self-supervised learning ] [ straight-through gradient estimation ] [ discrete representation learning ] [ discrete auto-encoding ] [ symbolic autoencoding ]
Abstract:
Traditional language models (LMs) excel at next-token prediction in text sequences but often struggle with transduction tasks involving distinct symbolic systems, particularly when parallel data is scarce or nonexistent. This issue is even more pronounced in domains dealing with complex, non-natural-language sequences, such as audio signals, protein structures, or biological sequences, where the strengths of LMs in natural language do not directly translate. To address this challenge, we introduce symbolic autoencoding ($\Sigma$AE), a self-supervised framework designed to exploit the wealth of non-parallel data alongside limited parallel data. $\Sigma$AE connects two generative models via a discrete bottleneck layer and optimizes the system end-to-end with two objectives: an unsupervised reconstruction loss on all data, which forces the sequence generated at the discrete bottleneck to be readable as the transduced input sequence, and a supervised loss that trains the two models separately on the subset of labeled parallel data. To optimize the models through the discrete bottleneck, we use a family of straight-through gradient estimators. We demonstrate the effectiveness of $\Sigma$AE on four sequence-to-sequence transduction tasks, showing that it significantly outperforms strong baselines in weakly supervised settings.
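The sketch below illustrates the combined objective described in the abstract: an unsupervised reconstruction loss through a discrete bottleneck plus a supervised loss on a small parallel subset. It is a minimal toy example assuming PyTorch; the per-token linear "models", vocabulary sizes, random data, and the plain argmax straight-through estimator are illustrative placeholders, not the authors' implementation or their specific estimator family.

```python
import torch
import torch.nn.functional as F

X_VOCAB, Z_VOCAB, SEQ_LEN, BATCH = 16, 10, 8, 32  # hypothetical sizes

def straight_through(logits):
    """Emit hard one-hot symbols at the bottleneck on the forward pass;
    let gradients flow through the softmax (straight-through estimator)."""
    soft = logits.softmax(dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), logits.size(-1)).float()
    return hard + soft - soft.detach()

# Toy stand-ins for the two generative models: x -> z logits and z -> x logits.
x_to_z = torch.nn.Linear(X_VOCAB, Z_VOCAB)
z_to_x = torch.nn.Linear(Z_VOCAB, X_VOCAB)
opt = torch.optim.Adam(list(x_to_z.parameters()) + list(z_to_x.parameters()), lr=1e-3)

# Plentiful non-parallel x data and a scarce parallel (x, z) subset (random here).
x_unsup = F.one_hot(torch.randint(0, X_VOCAB, (BATCH, SEQ_LEN)), X_VOCAB).float()
x_sup = F.one_hot(torch.randint(0, X_VOCAB, (4, SEQ_LEN)), X_VOCAB).float()
z_sup = torch.randint(0, Z_VOCAB, (4, SEQ_LEN))

for step in range(100):
    opt.zero_grad()

    # Unsupervised reconstruction: x -> discrete z -> x, trained end-to-end.
    z_symbols = straight_through(x_to_z(x_unsup))      # discrete bottleneck
    x_recon_logits = z_to_x(z_symbols)
    recon_loss = F.cross_entropy(
        x_recon_logits.reshape(-1, X_VOCAB), x_unsup.argmax(-1).reshape(-1))

    # Supervised loss on parallel pairs: the bottleneck sequence is read out
    # directly as the transduced target.
    sup_loss = F.cross_entropy(
        x_to_z(x_sup).reshape(-1, Z_VOCAB), z_sup.reshape(-1))

    (recon_loss + sup_loss).backward()
    opt.step()
```

In practice the two linear maps would be sequence models (e.g., autoregressive decoders), and the estimator would be drawn from the family of straight-through estimators referenced in the abstract; the structure of the two losses is the part this sketch aims to convey.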