

Session

Deep Sequence Models


Thu 13 June 11:00 - 11:20 PDT

Stochastic Beams and Where To Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement

Wouter Kool · Herke van Hoof · Max Welling

The well-known Gumbel-Max trick for sampling from a categorical distribution can be extended to sample k elements without replacement. We show how to implicitly apply this `Gumbel-Top-k' trick to a factorized distribution over sequences, allowing exact samples to be drawn without replacement using a Stochastic Beam Search. Even for exponentially large domains, the number of model evaluations grows only linearly in k and the maximum sampled sequence length. The algorithm creates a theoretical connection between sampling and beam search and can be used as a principled intermediate alternative. On a translation task, we show that the proposed method compares favourably against alternatives for obtaining diverse yet high-quality translations. We also show that sequences sampled without replacement can be used to construct low-variance estimators for the expected sentence-level BLEU score and model entropy.
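
As a rough illustration of the core trick (not of the full Stochastic Beam Search over sequences), the sketch below perturbs log-probabilities with independent Gumbel noise and keeps the top k indices, which yields an exact sample of k categories without replacement. The NumPy-based setup and names are ours, not the authors'.

```python
import numpy as np

def gumbel_top_k(log_probs, k, rng=None):
    """Sample k indices without replacement from a categorical
    distribution given by (possibly unnormalized) log-probabilities.

    Perturb each log-probability with independent Gumbel(0, 1) noise
    and keep the k largest perturbed values; the resulting index set
    is an exact sample without replacement (the Gumbel-Top-k trick).
    """
    rng = rng or np.random.default_rng()
    perturbed = log_probs + rng.gumbel(size=log_probs.shape)
    # argsort descending and keep the top k indices
    return np.argsort(-perturbed)[:k]

# Example: draw 3 of 5 categories without replacement
log_probs = np.log(np.array([0.05, 0.1, 0.2, 0.25, 0.4]))
print(gumbel_top_k(log_probs, k=3))
```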

Thu 13 June 11:20 - 11:25 PDT

Learning to Exploit Long-term Relational Dependencies in Knowledge Graphs

Lingbing Guo · Zequn Sun · Wei Hu

We study the problem of knowledge graph (KG) embedding. A widely adopted assumption in this problem is that similar entities are likely to have similar relational roles. However, most existing methods derive KG embeddings from triple-level learning, which lacks the ability to capture long-term relational dependencies between entities. Moreover, triple-level learning is insufficient for propagating semantic information among entities, especially in the cross-KG embedding setting. In this paper, we propose recurrent skipping networks (RSNs), which employ a skipping mechanism to bridge the gaps between entities. RSNs integrate recurrent neural networks (RNNs) with residual learning to efficiently capture long-term relational dependencies within and between KGs. We design an end-to-end framework to support RSNs on different tasks. Our experimental results show that RSNs outperform state-of-the-art embedding-based methods for entity alignment and achieve competitive performance for KG completion.
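
The sketch below is one plausible reading of the skipping mechanism, assumed rather than taken from the paper: over a path alternating entities and relations, the output used to predict the next entity at a relation step mixes the RNN hidden state with the preceding entity's embedding through a residual connection. The module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class RecurrentSkippingNet(nn.Module):
    """Hedged sketch of a recurrent skipping network over relational
    paths (e1, r1, e2, r2, ...). At relation positions, the output
    used to predict the next entity combines the RNN hidden state
    with the preceding entity's embedding (a residual "skip"), so
    subject entities directly influence object prediction."""

    def __init__(self, num_items, dim):
        super().__init__()
        self.embed = nn.Embedding(num_items, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.w_hidden = nn.Linear(dim, dim, bias=False)
        self.w_skip = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, num_items)

    def forward(self, path):                 # path: (batch, seq) of item ids
        x = self.embed(path)                 # (batch, seq, dim)
        h, _ = self.rnn(x)                   # (batch, seq, dim)
        outputs = []
        for t in range(path.size(1)):
            if t % 2 == 1:                   # relation position: skip from entity at t-1
                mixed = self.w_hidden(h[:, t]) + self.w_skip(x[:, t - 1])
            else:                            # entity position: plain hidden state
                mixed = h[:, t]
            outputs.append(self.out(mixed))
        return torch.stack(outputs, dim=1)   # next-item logits per step
```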

Thu 13 June 11:25 - 11:30 PDT

Meta-Learning Neural Bloom Filters

Jack Rae · Sergey Bartunov · Timothy Lillicrap

There has been a recent trend of training neural networks to replace hand-crafted data structures, with the aim of faster execution, better accuracy, or greater compression. In this setting, a neural data structure is instantiated by training a network over many epochs of its inputs until convergence. In applications where inputs arrive at high throughput, or are ephemeral, training a network from scratch is not practical. This motivates the need for few-shot neural data structures. In this paper we explore learning approximate set membership over a set of data in one shot via meta-learning. We propose a novel memory architecture, the Neural Bloom Filter, which achieves significant compression gains over classical Bloom Filters and existing memory-augmented neural networks.

Thu 13 June 11:30 - 11:35 PDT

CoT: Cooperative Training for Generative Modeling of Discrete Data

Sidi Lu · Lantao Yu · Siyuan Feng · Yaoming Zhu · Weinan Zhang

To tackle the distribution-shift problem inherent in Maximum Likelihood Estimation, a.k.a. exposure bias, researchers have mainly focused on introducing auxiliary adversarial training to penalize unrealistic generated samples. To exploit the supervision signal from the discriminator, most previous models, typically language GANs, leverage REINFORCE to handle the non-differentiability of discrete sequential data. In this paper, we propose a novel approach called Cooperative Training to improve the training of sequence generative models. Our algorithm transforms the minimax game of GANs into a joint maximization problem and explicitly estimates and optimizes the Jensen-Shannon divergence. In our experiments, compared to existing state-of-the-art methods, our model shows superior sample quality and diversity, as well as better training stability. Moreover, our approach does not require pre-training via Maximum Likelihood Estimation, which is crucial to the success of previous methods.

Thu 13 June 11:35 - 11:40 PDT

Non-Monotonic Sequential Text Generation

Sean Welleck · Kiante Brantley · Hal Daumé III · Kyunghyun Cho

Standard sequential generation methods assume a pre-specified generation order, such as text generation methods which generate words from left to right. In this work, we propose a framework for training models of text generation that operate in non-monotonic orders; the model directly learns good orders, without any additional annotation. Our framework operates by generating a word at an arbitrary position, and then recursively generating words to its left and then words to its right, yielding a binary tree. Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy's own preferences. Experimental results demonstrate that using the proposed method, it is possible to learn policies which generate text without pre-specifying a generation order, while achieving competitive performance with conventional left-to-right generation.
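
The sketch below illustrates the binary-tree generation order described in the abstract, with a hypothetical `policy` callable standing in for the learned model; the way context is passed to the recursive calls is our simplification. An in-order traversal of the resulting tree recovers the final left-to-right sentence.

```python
from dataclasses import dataclass
from typing import List, Optional

END = "<end>"  # special token that terminates a branch

@dataclass
class Node:
    word: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def generate(policy, left_ctx: List[str], right_ctx: List[str],
             depth: int = 0, max_depth: int = 8) -> Optional[Node]:
    """Recursively generate a binary tree of words. `policy` is a
    hypothetical callable that returns a word (or END) given the words
    already placed to the left and right of the current slot."""
    if depth >= max_depth:
        return None
    word = policy(left_ctx, right_ctx)
    if word == END:
        return None
    node = Node(word)
    # first fill in everything to the left of this word, then to the right
    node.left = generate(policy, left_ctx, [word] + right_ctx, depth + 1, max_depth)
    node.right = generate(policy, left_ctx + [word], right_ctx, depth + 1, max_depth)
    return node

def flatten(node: Optional[Node]) -> List[str]:
    """In-order traversal recovers the final left-to-right sentence."""
    if node is None:
        return []
    return flatten(node.left) + [node.word] + flatten(node.right)
```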

Thu 13 June 11:40 - 12:00 PDT

Insertion Transformer: Flexible Sequence Generation via Insertion Operations

Mitchell Stern · William Chan · Jamie Kiros · Jakob Uszkoreit

We present the Insertion Transformer, an iterative, partially autoregressive model for sequence generation based on insertion operations. Unlike typical autoregressive models which rely on a fixed left-to-right ordering of the output, our approach accommodates arbitrary orderings by allowing for tokens to be inserted anywhere in the sequence during decoding. This flexibility confers a number of advantages: for instance, not only can our model be trained to follow specific orderings such as left-to-right generation or a binary tree traversal, but it can also be trained to maximize entropy over all valid insertions for robustness. In addition, our model seamlessly accommodates both fully autoregressive generation (one insertion at a time) and partially autoregressive generation (simultaneous insertions at multiple locations). We validate our approach by analyzing its performance on the WMT 2014 English-German machine translation task under various settings for training and decoding. We find that the Insertion Transformer outperforms many prior non-autoregressive approaches to translation at comparable or better levels of parallelism, and successfully recovers the performance of the original Transformer while requiring significantly fewer iterations during decoding.
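
A minimal sketch of the partially autoregressive decoding loop suggested by the abstract, assuming a hypothetical `model(hypothesis)` that returns a best token (or a no-insert marker) for each of the `len(hypothesis) + 1` insertion slots; the actual scoring and termination rules are in the paper.

```python
# Hedged sketch of parallel (partially autoregressive) insertion decoding.
NO_INSERT = "<no-insert>"

def parallel_insertion_decode(model, max_steps=64):
    hypothesis = []                              # start from the empty sequence
    for _ in range(max_steps):
        slot_predictions = model(hypothesis)     # list of (token, score), one per slot
        insertions = [(i, tok) for i, (tok, _) in enumerate(slot_predictions)
                      if tok != NO_INSERT]
        if not insertions:                       # every slot chose no-insert: done
            break
        # insert right-to-left so earlier slot indices stay valid
        for slot, tok in sorted(insertions, reverse=True):
            hypothesis.insert(slot, tok)
    return hypothesis
```

Fully autoregressive decoding is the special case where only the single highest-scoring insertion is applied per step instead of all of them.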

Thu 13 June 12:00 - 12:05 PDT

Empirical Analysis of Beam Search Performance Degradation in Neural Sequence Models

Eldan Cohen · Christopher Beck

Beam search is the most popular inference algorithm for decoding neural sequence models. Unlike greedy search, beam search allows for non-greedy local decisions that can potentially lead to a sequence with a higher overall probability. However, previous work on a number of applications found that the quality of the highest-probability hypothesis found by beam search degrades with large beam widths. We perform an empirical study of the behavior of beam search across three sequence synthesis tasks. We find that increasing the beam width leads to sequences that are disproportionately based on early, very-low-probability tokens followed by a run of tokens with higher (conditional) probability. We show that, empirically, such sequences are more likely to have a lower evaluation score than lower-probability sequences without this pattern. Using the notion of search discrepancies from heuristic search, we hypothesize that large discrepancies are the cause of the performance degradation. We show that this hypothesis generalizes previous explanations proposed for machine translation and image captioning. To validate our hypothesis, we show that constraining beam search to avoid large discrepancies eliminates the performance degradation.
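
One way to read "constraining beam search to avoid large discrepancies" is to prune, at each step, candidate tokens whose log-probability falls too far below the locally best token. The sketch below implements that reading with a hypothetical `step_log_probs(prefix)` callable; the paper's exact constraint may differ.

```python
import numpy as np

def constrained_beam_search(step_log_probs, beam_width, max_len, max_gap, eos_id):
    """Hedged sketch of beam search that prunes candidate tokens whose
    log-probability falls more than `max_gap` below the locally best
    token at that step (one way to bound search discrepancies).
    `step_log_probs(prefix)` is a hypothetical callable returning a
    NumPy vector of next-token log-probabilities for a given prefix."""
    beams = [([], 0.0)]                           # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos_id:   # finished hypothesis: carry over
                candidates.append((prefix, score))
                continue
            lp = step_log_probs(prefix)
            best = lp.max()
            for tok in np.argsort(-lp)[:beam_width]:
                if best - lp[tok] > max_gap:      # prune large local discrepancies
                    continue
                candidates.append((prefix + [int(tok)], score + float(lp[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(p and p[-1] == eos_id for p, _ in beams):
            break
    return beams
```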

Thu 13 June 12:05 - 12:10 PDT

Trainable Decoding of Sets of Sequences for Neural Sequence Models

Ashwin Kalyan · Peter Anderson · Stefan Lee · Dhruv Batra

Many structured prediction tasks admit multiple correct outputs, and so it is often useful to decode a set of outputs that maximizes some task-specific set-level metric. However, retooling standard sequence prediction procedures tailored towards predicting the single best output leads to sets of very similar sequences that fail to capture the variation in the output space. To address this, we propose $\nabla$BS, a trainable decoding procedure that outputs a set of sequences highly valued according to the metric. Our method tightly integrates the training and decoding phases and further allows optimization of the task-specific metric, addressing the shortcomings of standard sequence prediction. Further, we discuss the trade-offs of commonly used set-level metrics and motivate a new set-level metric that naturally evaluates the notion of ``capturing the variation in the output space''. Finally, we show results on the image captioning task and find that our model outperforms standard techniques and natural ablations.

Thu 13 June 12:10 - 12:15 PDT

Learning to Generalize from Sparse and Underspecified Rewards

Rishabh Agarwal · Chen Liang · Dale Schuurmans · Mohammad Norouzi

We consider the problem of learning from sparse and underspecified rewards. This task structure arises in interpretation problems where an agent receives a complex input, such as a natural language command, and needs to generate a complex response, such as an action sequence, but only receives binary success-failure feedback. Rewards of this kind are usually underspecified because they do not distinguish between purposeful and accidental success. To learn in these scenarios, effective exploration is critical to find successful trajectories, but generalization also depends on discounting spurious trajectories that achieve accidental success. We address exploration by using a mode-covering direction of KL divergence to collect a diverse set of successful trajectories, followed by a mode-seeking KL divergence to train a robust policy. We address reward underspecification by using Meta-Learning and Bayesian Optimization to construct an auxiliary reward function, which provides more accurate feedback for learning. The parameters of the auxiliary reward function are optimized with respect to the validation performance of the trained policy. Without using expert demonstrations or ground-truth programs, our Meta Reward-Learning (MeRL) approach achieves state-of-the-art results on weakly-supervised semantic parsing, improving upon prior work by 1.3% and 2.6% on WikiTableQuestions and WikiSQL, respectively.
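
The abstract implies a bi-level loop: an inner loop trains the policy against the auxiliary reward, and an outer loop adjusts the auxiliary-reward parameters based on validation performance. The sketch below shows only that structure; every callable is hypothetical, and the actual MeRL optimizers (meta-gradients, Bayesian Optimization) are not reproduced here.

```python
def meta_reward_learning(init_policy, init_reward_params, train_tasks, val_tasks,
                         train_policy, evaluate, propose_reward_params,
                         outer_steps=20):
    """Hedged sketch of the bi-level structure: outer loop searches over
    auxiliary-reward parameters, inner loop trains the policy with them,
    and the validation score of the trained policy guides the search."""
    reward_params, policy = init_reward_params, init_policy
    best_score = float("-inf")
    for _ in range(outer_steps):
        candidate = propose_reward_params(reward_params)       # e.g. BayesOpt / meta-gradient step
        trained = train_policy(policy, train_tasks, candidate)  # inner loop: RL with auxiliary reward
        score = evaluate(trained, val_tasks)                    # sparse but trusted validation signal
        if score > best_score:
            best_score, reward_params, policy = score, candidate, trained
    return policy, reward_params
```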

Thu 13 June 12:15 - 12:20 PDT

Efficient Training of BERT by Progressively Stacking

Linyuan Gong · Di He · Zhuohan Li · Tao Qin · Liwei Wang · Tie-Yan Liu

Unsupervised pre-training is widely used in natural language processing. By designing suitable unsupervised prediction tasks, a deep neural network can be trained and shown to be effective on many downstream tasks. Since training data is usually abundant, models for pre-training are generally huge, with millions of parameters, so training efficiency becomes a critical issue even on high-performance hardware. In this paper, we explore an efficient training method for the state-of-the-art bidirectional Transformer (BERT) model. By visualizing the self-attention distributions of different layers at different positions in a well-trained BERT model, we find that in most layers the self-attention distribution concentrates locally around the token's own position and the start-of-sentence token. Motivated by this, we propose a stacking algorithm to transfer knowledge from a shallow model to a deep model, and we apply stacking progressively to accelerate BERT training. Experimental results show that models trained with our strategy achieve performance similar to models trained from scratch, while our algorithm is much faster.
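
A minimal sketch of the stacking step as we read it from the abstract: a model of twice the depth is initialized by copying the trained shallow encoder stack on top of itself, and pre-training then continues on the deeper model. The helper below assumes a PyTorch encoder stored as an `nn.ModuleList`.

```python
import copy
import torch.nn as nn

def progressively_stack(shallow_layers: nn.ModuleList) -> nn.ModuleList:
    """Hedged sketch of one stacking step: build a stack of twice the
    depth by duplicating the trained shallow layers on top of
    themselves, preserving their weights as the initialization."""
    doubled = ([copy.deepcopy(layer) for layer in shallow_layers] +
               [copy.deepcopy(layer) for layer in shallow_layers])
    return nn.ModuleList(doubled)

# Assumed usage: train a 3-layer encoder, stack to 6 layers and continue
# training, stack again to 12 layers, reaching the target depth with far
# less compute than training the deep model from scratch.
```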