As models increase in size and training budget, they not only systematically improve in upstream quality, but also exhibit novel emergent capabilities. This increase in scale raises proportionate difficulties for practitioners: foundation model training and inference lie at a unique interdisciplinary crossroad, combining open problems in algorithms, system design, and software engineering.
Machine learning practitioners are key stakeholders here: on the one hand, researchers may contribute algorithmic insights and novel methods to improving training and inference of large models; on the other hand, novel research findings may be best demonstrated at scale—which may require training models as efficiently as possible to make the best use of available resources.
The goal of this workshop is to bring together interdisciplinary experts working on the emerging research questions and challenges associated with foundation model training and inference. We welcome submissions around training and inference systems/algorithms for foundation models, focusing on scaling up or on reducing compute, time, memory, bandwidth, and energy requirements. Notably, we encourage submissions concerning the entire spectrum of foundation models: from BERT-sized Transformers to large models with 100B+ parameters. Topics include but are not limited to:
* Training and inference systems, either distributed at large scale or in resource-constrained scenarios;
* Algorithms for improved training and inference efficiency;
* Systems for foundation models, such as novel programming languages or compilers.
Sat 11:55 a.m. - 12:00 p.m. | 🤗 Welcome and opening remarks (Opening)
Sat 12:00 p.m. - 12:01 p.m. | 🔥 Session I: Large-Scale Distributed Pretraining (Invited Talks)
Sat 12:01 p.m. - 12:20 p.m. | Using Megatron to Train Large Language Models (Deepak Narayanan, Microsoft Research) (Invited Talk)
Sat 12:20 p.m. - 12:40 p.m. | Distributed Systems for Decentralized AI (Ce Zhang, ETH/Together) (Invited Talk)
Sat 12:40 p.m. - 1:00 p.m. | Training Large Language Models on Cerebras Wafer-Scale Clusters (Natalia Vassilieva, Cerebras) (Invited Talk)
Sat 1:10 p.m. - 1:25 p.m. | ☕️ Coffee break
Sat 1:25 p.m. - 1:40 p.m. | 🎤 SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores (Oral)
The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system to train intelligent agents by efficiently producing and processing a massive amount of data. In this paper, we propose a more comprehensive computational abstraction for RL training tasks and introduce a general, scalable, and efficient RL system called Really Scalable RL (SRL), featuring a novel architecture that separates three major computation components in RL training. Our evaluation demonstrates that SRL outperforms the popular open-source RL system RLlib (Liang et al., 2017) in training throughput. Moreover, to assess the learning performance of SRL, we have conducted a benchmark on a large-scale cluster with 32 Nvidia A100 GPUs, 64 Nvidia RTX 3090 GPUs, and more than 10000 CPU cores, reproducing the results of Rapid (Berner et al., 2019), an industrial production system from OpenAI, in the hide-and-seek environment (Baker et al., 2019). The results show that SRL is capable of achieving up to a 5x training speedup compared to the published results in Baker et al. (2019). |
Zhiyu Mei · Wei Fu · Guangju Wang · Huanchen Zhang · Yi Wu 🔗 |
Sat 1:40 p.m. - 1:55 p.m. | 🎤 Fine-Tuning Language Models with Just Forward Passes (Oral)
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise. |
Sadhika Malladi · Tianyu Gao · Eshaan Nichani · Alex Damian · Jason Lee · Danqi Chen · Sanjeev Arora 🔗 |
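To make the two-forward-pass idea above concrete, here is a minimal NumPy sketch of an in-place zeroth-order (SPSA-style) SGD step in the spirit of MeZO; the function signature, hyperparameters, and loss interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mezo_step(params, loss_fn, eps=1e-3, lr=1e-6, seed=0):
    """One in-place zeroth-order update in the spirit of the MeZO abstract above.

    `params` is a list of float numpy arrays updated in place; `loss_fn()` evaluates
    the current parameters with a forward pass only. The perturbation z is regenerated
    from `seed` each time it is needed, so it never has to be stored. Pass a fresh
    seed every step in real use.
    """
    def perturb(scale):
        rng = np.random.default_rng(seed)
        for p in params:
            p += scale * eps * rng.standard_normal(p.shape)

    perturb(+1.0); loss_plus = loss_fn()    # evaluate at theta + eps * z
    perturb(-2.0); loss_minus = loss_fn()   # evaluate at theta - eps * z
    perturb(+1.0)                           # restore theta in place
    grad_scalar = (loss_plus - loss_minus) / (2 * eps)  # projected gradient estimate

    rng = np.random.default_rng(seed)
    for p in params:
        p -= lr * grad_scalar * rng.standard_normal(p.shape)  # SGD step along z
    return loss_plus
```

Because the perturbation is regenerated from the seed, only a scalar loss difference has to be kept between the two forward passes, which is what keeps the memory footprint at inference level.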
Sat 1:55 p.m. - 1:56 p.m. | 🚀 Session II: Efficient Inference (Invited Talks)
Sat 1:56 p.m. - 2:25 p.m. | The Case for 4-bit Inference (Tim Dettmers, University of Washington) (Invited Talk)
Sat 2:25 p.m. - 2:55 p.m. | Efficiently Scaling Transformer Inference (Aakanksha Chowdhery, Google Research) (Invited Talk)
Sat 2:55 p.m. - 3:10 p.m. | 🎤 Memory-Efficient Selective Fine-Tuning (Oral)
We propose an approach for reducing the memory required to fine-tune transformer-based models. During the backward pass, our approach only propagates the gradient through a small number of input positions, while freezing the others. Thus, during the forward pass we only save the subset of intermediate activations for which the computed gradient will not be zero. We show that our approach leads to performance on par with full fine-tuning, while requiring only up to a third of the GPU memory. Our approach is particularly efficient for fine-tuning language models with parameter counts in the hundreds of millions. It allows fine-tuning such models on consumer hardware while maintaining a large batch size. |
Antoine Simoulin · Namyong Park · Xiaoyi Liu · Grey Yang 🔗 |
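As a rough illustration of the position-freezing idea described above, the PyTorch-style snippet below detaches the hidden states of frozen positions so that no gradient flows through them; the helper name and the place where it is applied are assumptions, not the authors' code.

```python
import torch

def freeze_positions(hidden, keep_mask):
    """Let gradients flow only through selected input positions.

    hidden:    (batch, seq_len, dim) activations of a transformer block
    keep_mask: (batch, seq_len) bool, True for the few positions to fine-tune through
    Frozen positions are detached from the autograd graph, so their upstream
    intermediate activations are not needed for the backward pass. Toy sketch only.
    """
    m = keep_mask.unsqueeze(-1).to(hidden.dtype)
    return hidden * m + hidden.detach() * (1 - m)
```

A full implementation would combine such masking with activation checkpointing so that activations for frozen positions are genuinely never stored.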
Sat 3:10 p.m. - 4:00 p.m. | 🍱 Lunch break
Sat 4:00 p.m. - 5:15 p.m. | 🧑🎓 Poster Session (Poster Session)
See below for details of the posters! |
Sat 5:15 p.m. - 6:15 p.m. | 💬 Panel: Large Language Models Tooling Across Industry and Academia (Panel)
Panelists: Anna Goldie (Anthropic), Rishi Bommasani (Stanford University), Susan Zhang (Meta), Emily Webber (AWS), James Bradbury (Google) |
Sat 6:15 p.m. - 6:30 p.m. | ☕️ Coffee break
Sat 6:30 p.m. - 6:45 p.m. | 🎤 Fast Causal Attention with Dynamic Sparsity (Oral)
Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention---which is the only component scaling quadratically w.r.t. the sequence length---becomes a central concern. While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead of self-attention, those are often limited by implementation concerns and end up imposing a simple and static structure over the attention matrix. Conversely, implementing more dynamic sparse attention often results in runtimes significantly slower than computing the full attention using the Flash implementation. We extend FlashAttention to accommodate a large class of attention sparsity patterns that, in particular, encompass key/query dropping and hashing-based attention. This leads to implementations with no computational complexity overhead and a multi-fold runtime speedup on top of FlashAttention. Even with relatively low degrees of sparsity, our method improves visibly upon FlashAttention as the sequence length increases. Without sacrificing perplexity, we increase the training speed of a transformer language model by $2.0\times$ for sequences of $8k$ tokens.
Daniele Paliotta · Matteo Pagliardini · Martin Jaggi · François Fleuret 🔗 |
Sat 6:45 p.m. - 6:46 p.m. | ⚙️ Session III: Deep Optimisation (Invited Talks)
Sat 6:46 p.m. - 7:15 p.m. | PyTorch 2.x: Faster, More Pythonic, and as Dynamic as Ever (Natalia Gimelshein, OpenAI) (Invited Talk)
Sat 7:15 p.m. - 7:45 p.m. | High-Performance Kernel Programming with Triton (Philippe Tillet, OpenAI) (Invited Talk)
Sat 7:45 p.m. - 8:00 p.m. | 🏅 Best Paper Award (Awards)
Sat 9:00 p.m. - 12:00 a.m. | 🎉 Post-Workshop Happy Hour (sponsored by Together) (Party)
- | Mental Calibration: Discovering and Adjusting for Latent Factors Improves Zero-Shot Inference of CLIP (Poster)
The CLIP model demonstrates remarkable zero-shot inference capability that can be understood by humans through natural language. However, interpreting this zero-shot inference process and designing suitable methods, including crafting text description templates, remains an open problem. In this paper, we develop an understanding of the zero-shot inference process of CLIP by explicitly considering the latent factors in the data generation process along with their corresponding text descriptions. Building on this, we first find that conditioning on the correct latent factors improves inference, meaning that CLIP can adjust for them. Then, we find that CLIP can infer latent factors from images, meaning it can discover them. With these two findings, we propose an inference method that automatically discovers and adjusts for latent factors as long as we provide CLIP with a comprehensive set of potential latent factors. We empirically verify that this inference method improves both generalization and interpretability of the zero-shot inference of CLIP. |
Bang An · Sicheng Zhu · Michael-Andrei Panaitescu-Liess · Chaithanya Kumar Mummadi · Furong Huang 🔗 |
- | Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding (Poster)
This paper presents “Predictive Pipelined Decoding (PPD),” a novel approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding. Breaking from conventional strategies, PPD strategically employs additional compute resources to parallelize the initiation of subsequent token decoding during the ongoing verification of the current token decoding. This innovative method reduces decoding latency and reshapes the understanding of trade-offs in LLM decoding strategies. We have developed a theoretical framework that allows us to analyze the trade-off between computation and latency. Using this framework, we can estimate the potential reduction in latency associated with our proposed method, achieved through the assessment of the match rate, represented as $p_{correct}$. Our results demonstrate that the use of extra computational resources has the potential to significantly accelerate LLM greedy decoding.
Seongjun Yang · Gibbeum Lee · Jaewoong Cho · Dimitris Papailiopoulos · Kangwook Lee 🔗 |
- | Generating Efficient Kernels for Quantized Inference on Large Language Models (Poster)
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. |
Tommaso Pegolotti · Elias Frantar · Dan Alistarh · Markus Püschel 🔗 |
- | SpeedLimit: Neural Architecture Search for Quantized Transformer Models (Poster)
While prevailing research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments. |
Luke Bailey · Yuji Chai · Yunho Jin · Glenn Ko · Matthew Karle 🔗 |
- | A Comprehensive Analysis of Adapter Efficiency (Poster)
Adapters have been positioned as a parameter-efficient fine-tuning (PEFT) approach. However, adapters have not been sufficiently analyzed to understand if PEFT translates to benefits in training/deployment efficiency and maintainability/extensibility. Through extensive experiments on many adapters, tasks, and languages in supervised and cross-lingual zero-shot settings, we clearly show that for Natural Language Understanding (NLU) tasks, the parameter efficiency in adapters does not translate to efficiency gains compared to full fine-tuning of models. More precisely, adapters are relatively expensive to train and have slightly higher deployment latency. Furthermore, the maintainability/extensibility benefits of adapters can be achieved with simpler approaches like multi-task training via full fine-tuning, which also provide relatively faster training times. We, therefore, recommend that for moderately sized models for NLU tasks, practitioners should rely on full fine-tuning or multi-task training rather than using adapters. |
Nandini Mundra · Sumanth Doddapaneni · Raj Dabre · Anoop Kunchukuttan · Ratish Puduppully · Mitesh Khapra 🔗 |
- | Less is More: Using Multiple LLMs for Applications with Lower Costs (Poster)
Large language models (LLMs) are increasingly used for querying purposes, but their associated costs vary significantly. This study investigates the pricing structures of popular LLM APIs, such as GPT-4, ChatGPT, and J1-Jumbo, revealing substantial fee differences. To mitigate the expense of using LLMs on extensive queries and text, we propose three strategies: prompt adaptation, LLM approximation, and LLM cascade. We present FrugalGPT, an adaptable LLM cascade that intelligently selects LLM combinations to reduce costs by up to 98% while matching or improving the accuracy of individual LLMs. This work establishes a foundation for sustainable and efficient LLM utilization, offering valuable insights and practical techniques for users. |
Lingjiao Chen · Matei Zaharia · James Zou 🔗 |
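A minimal sketch of the LLM-cascade strategy mentioned in the abstract above: cheap models are queried first and the query escalates only when a confidence scorer is unsatisfied. The model list, scorer, and threshold are hypothetical placeholders, not the FrugalGPT API.

```python
def llm_cascade(prompt, models, scorer, threshold=0.9):
    """Toy LLM cascade in the spirit of FrugalGPT (abstract above).

    models: list of (name, query_fn, cost) tuples, ordered from cheapest to most
            expensive; query_fn(prompt) returns an answer string.
    scorer: scorer(prompt, answer) -> confidence in [0, 1].
    All interfaces here are illustrative assumptions, not a real API.
    """
    total_cost = 0.0
    answer = None
    for name, query_fn, cost in models:
        answer = query_fn(prompt)
        total_cost += cost
        if scorer(prompt, answer) >= threshold:  # confident enough: stop early
            break                                # otherwise escalate to a larger model
    return answer, total_cost
```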
- | Blockwise Parallel Transformer for Long Context Large Models (Poster)
Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance. |
Hao Liu · Pieter Abbeel 🔗 |
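A simplified sketch of the blockwise idea above: processing queries in chunks so that attention scores are never materialized for the full sequence at once. BPT additionally chunks the key/value side and fuses the feedforward network, which this toy NumPy version omits.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    """Query-chunked attention, a simplification of the blockwise computation above.

    Scores are materialized only for one (block, seq_len) tile at a time instead of
    the full (seq_len, seq_len) matrix, reducing peak memory for long sequences.
    """
    n, d = Q.shape
    out = np.empty_like(Q)
    scale = 1.0 / np.sqrt(d)
    for start in range(0, n, block):
        q = Q[start:start + block]                     # (b, d) query tile
        scores = q @ K.T * scale                       # (b, n) score tile
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + block] = weights @ V
    return out
```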
- | SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes (Poster)
Task-agnostic pre-training followed by task-specific fine-tuning is a default approach to train NLU models which need to be deployed on devices with varying resource and accuracy constraints. However, repeating pre-training and fine-tuning across tens of devices is prohibitively expensive. To address this, we propose SuperShaper, a task-agnostic approach wherein we pre-train a single model which subsumes a large number of Transformer models via linear bottleneck matrices around each Transformer layer which are sliced to generate differently shaped sub-networks. Despite its simplicity, SuperShaper radically simplifies NAS for language models and discovers networks, via an evolutionary algorithm, that effectively trade off accuracy and model size. Discovered networks are more accurate than a range of hand-crafted and automatically searched networks on GLUE benchmarks. Further, a critical advantage of shape as a design variable for NAS is that networks built from heuristics derived for good shapes match and even improve on carefully searched networks across a range of parameter counts. |
Vinod Ganesan · Gowtham Ramesh · Pratyush Kumar · Raj Dabre 🔗 |
- | Continual Pre-Training of Large Language Models: How to re-warm your model? (Poster)
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. Since the sizes of available datasets and models have drastically increased, retraining models from scratch has become increasingly costly. A much cheaper and more efficient solution is to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards continual pre-training, we examine the effect of different warm-up strategies (e.g. varying the number of linear warm-up steps and the maximum learning rate) on upstream (Pile) and downstream (RedPajama) dataset performance. We conduct all experiments on the Pythia $410$M language model pre-trained on $300$B tokens from the Pile. Our results show that, under a limited compute budget, re-warming the learning rate leads to a decrease in performance. Consequently, when stopping at $50$B tokens, the best strategy is to avoid re-warming the learning rate altogether, keeping it constant. |
Kshitij Gupta · Benjamin Thérien · Adam Ibrahim · Mats Richter · Quentin Anthony · Eugene Belilovsky · Timothée Lesort · Irina Rish 🔗 |
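The schedule choice studied above can be pictured with a tiny helper: warmup_steps > 0 linearly re-warms the learning rate before holding it, while warmup_steps = 0 corresponds to the constant-rate strategy the abstract reports as best under the studied budget. The shape after warm-up and the argument names are illustrative assumptions, not the paper's exact configuration.

```python
def rewarm_lr(step, warmup_steps, max_lr):
    """Toy learning-rate schedule for the continual-pretraining setting above."""
    if warmup_steps == 0:
        return max_lr                               # constant rate, no re-warming
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps   # linear re-warming phase
    return max_lr                                   # held constant afterwards
```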
- | Implementing block-sparse matrix multiplication kernels using Triton (Poster)
MegaBlocks is the state-of-the-art system for efficient training of MoE models based on block-sparse matrix multiplication kernels. The library is currently restricted to a specific block size in the sparse matrices, data type, and GPU architecture. This is due to the CUDA kernels used for the block-sparse matrix products in the MoE layers. These kernels have been hand-tuned and manually optimized to obtain the highest performance for a specific choice of parameters. In this work, we evaluate re-writing these kernels in Triton, a Python-embedded domain-specific language (DSL) for writing high-performance GPU kernels. We show that it is possible to achieve the same level of performance as the hand-tuned CUDA kernels, while maintaining portability across GPU architectures and easily supporting different block sizes and data types without any code changes. We identify the challenges and advantages of using Triton in implementing these block-sparse matrix multiplication kernels. |
Priya Mishra · Trevor Gale · Matei Zaharia · Cliff Young · Deepak Narayanan 🔗 |
- | Looped Transformers are Better at Learning Learning Algorithms (Poster)
Transformers can “learn” to solve data-fitting problems generated by a variety of (latent) models, including linear models, sparse linear models, decision trees, and neural networks, as demonstrated by Garg et al. (2022). These tasks, which fall under well-defined function class learning problems, can be solved using iterative algorithms that involve repeatedly applying the same function to the input, potentially an infinite number of times. In this work, we aim to train a transformer to emulate this iterative behavior by utilizing a looped transformer architecture (Giannou et al., 2023). Our experimental results reveal that the looped transformer performs equally well as the unlooped transformer in solving these numerical tasks, while also offering the advantage of having far fewer parameters. |
Liu Yang · Kangwook Lee · Robert Nowak · Dimitris Papailiopoulos 🔗 |
- | Accelerating LLM Inference with Staged Speculative Decoding (Poster)
Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality. |
Benjamin F Spector · Christopher Re 🔗 |
- | Test-Time Training for Speech (Poster)
In this paper, we study the application of Test-Time Training (TTT) as a solution to handling distribution shifts in speech applications. In particular, we introduce distribution-shifts to the test datasets of standard speech-classification tasks---for example, speaker-identification and emotion-detection---and explore how Test-Time Training (TTT) can help adjust to the distribution-shift. In our experiments that include distribution shifts due to background noise and natural variations in speech such as gender and age, we identify some key-challenges with TTT including sensitivity to optimization hyperparameters (e.g., number of optimization steps and subset of parameters chosen for TTT) and scalability (e.g., as each example gets its own set of parameters, TTT is not scalable). Finally, we propose using BitFit -- a parameter-efficient fine-tuning algorithm proposed for text applications that only considers the bias parameters for fine-tuning -- as a solution to the aforementioned challenges and demonstrate that it is consistently more stable than fine-tuning all the parameters of the model. |
Sri Harsha Dumpala · Chandramouli Shama Sastry · Sageev Oore 🔗 |
- | Towards Efficient World Models (Poster)
Scaling up deep Reinforcement Learning (RL) agents beyond traditional benchmarks, without abundant computational resources, presents a significant challenge. Following recent developments in generative modelling, model-based RL positions itself as a strong contender to bring autonomous agents to new heights. In fact, the recently introduced IRIS agent provides evidence that advances in sequence modelling can be leveraged to build powerful world models. In the present work, we propose delta-IRIS, a new agent with a world model architecture that is amenable to scaling up to visually complex environments with longer time horizons. In the Crafter benchmark, delta-IRIS solves 16 out of 21 tasks after 10M frames of training, matching the current best method, DreamerV3. To facilitate research on efficient world models, we release our code at X. |
Eloi Alonso · Vincent Micheli · François Fleuret 🔗 |
- | The Framework Tax: Disparities Between Inference Efficiency in Research and Deployment (Poster)
Increased focus on the efficiency of machine learning systems has led to rapid improvements in hardware accelerator performance and model efficiency. However, the resulting increases in computational throughput and reductions in floating point operations have not directly translated to improvements in wall-clock inference latency. We demonstrate that these discrepancies can be largely attributed to bottlenecks introduced by deep learning frameworks. We denote this phenomenon as the framework tax, and observe that the disparity is growing as hardware speed increases over time. In this work, we examine this phenomenon through a series of case studies analyzing the effects of model design decisions, framework paradigms, and hardware platforms on total model latency. Based on our findings, we provide actionable recommendations to researchers and practitioners aimed at narrowing the gap between efficient ML model research and practice. |
Jared Fernandez · Jacob Kahn · Clara Na · Yonatan Bisk · Emma Strubell 🔗 |
- | Towards Structured Sparsity in Transformers for Efficient Inference (Poster)
Transformer models have been critical in accelerating progress in numerous fields, yet scaling these models comes at a high computational cost. In this paper, we explore sparsity properties in transformers and manipulate existing sparsity in transformers to be more structured for efficient training and inference. In particular, we create sparse structures that have inter-layer similarity and are block-sparse, which have the potential to bypass a significant amount of model loading and computation. We present preliminary results and ideas using a small transformer which we hope to extend to more complex models. |
Harry Dong · Beidi Chen · Yuejie Chi 🔗 |
- | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Poster)
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the $\mathsf{KV}$ $\mathsf{cache}$, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the $\mathsf{KV}$ $\mathsf{cache}$ which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters ($\mathsf{H_2}$). Through a comprehensive investigation, we find that ($i$) the emergence of $\mathsf{H_2}$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and ($ii$) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle ($\mathsf{H_2O}$), a $\mathsf{KV}$ $\mathsf{cache}$ eviction policy that dynamically retains a balance of recent and $\mathsf{H_2}$ tokens. We formulate the $\mathsf{KV}$ $\mathsf{cache}$ eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of $\mathsf{H_2O}$ with $20$\% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to $29\times$, $29\times$, and $3\times$ on OPT-6.7B and OPT-30B. With the same batch size, $\mathsf{H_2O}$ can reduce the latency by up to $1.9\times$.
Zhenyu Zhang · Ying Sheng · Tianyi Zhou · Tianlong Chen · Lianmin Zheng · Ruisi Cai · Zhao Song · Yuandong Tian · Christopher Re · Clark Barrett · Zhangyang “Atlas” Wang · Beidi Chen
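A toy selection rule in the spirit of the H2O eviction policy described above: retain the most recent cache entries plus the positions with the largest accumulated attention mass. The array interface and the one-shot (rather than per-step greedy) selection are simplifications for illustration only.

```python
import numpy as np

def h2o_keep_indices(attn_scores, num_recent, num_heavy):
    """Toy KV-cache eviction rule in the spirit of H2O (abstract above).

    attn_scores: (num_queries, num_keys) attention weights observed so far.
    Keeps the `num_recent` most recent key positions plus the `num_heavy` positions
    with the largest accumulated attention mass (the "heavy hitters").
    """
    num_keys = attn_scores.shape[1]
    recent = set(range(max(0, num_keys - num_recent), num_keys))
    mass = attn_scores.sum(axis=0)            # accumulated attention per key position
    heavy = np.argsort(-mass)[:num_heavy]     # positions with the largest mass
    keep = sorted(recent.union(heavy.tolist()))
    return keep                               # indices of KV entries to retain
```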
- | Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs (Poster)
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs. For example, an AI writing assistant is required to update its suggestions in real time as a document is edited. Re-running the model each time is expensive, even with compression techniques like knowledge distillation, pruning, or quantization. Instead, we take an \emph{incremental computing} approach, looking to reuse calculations as the inputs change. However, the dense connectivity of conventional architectures poses a major obstacle to incremental computation, as even minor input changes cascade through the network and restrict information reuse. To address this, we use Vector Quantization to discretize intermediate values in the network, which filters out noisy and unnecessary modifications to hidden neurons, facilitating the reuse of their values. We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of the modified inputs. Our experiments with adapting the OPT-125M pre-trained language model demonstrate comparable accuracy on document classification, while achieving 6.8X better efficiency in processing sequences of atomic edits. |
Or Sharir · Anima Anandkumar 🔗 |
- | Compositional Interfaces for Compositional Generalization (Poster)
In this work, we study the effectiveness of a modular architecture for compositional generalization and transfer learning in the embodied agent setting. We develop an environment that allows us to independently vary perceptual modalities and action and task specifications, and use it to carefully analyze the agent’s performance in these compositions. We show that we can compose the agent’s perceptual suite, its task specifications, and its action spaces. Our experiments demonstrate zero-shot performance on held-out combinations of perception/instruction/action space and demonstration of fast adaptation (requiring fewer samples) to new perceptual or action spaces without the loss of performance. |
Jelena Luketina · Jack Lanchantin · Sainbayar Sukhbaatar · Arthur Szlam 🔗 |
- | ZipLM: Inference-Aware Structured Pruning of Language Models (Poster)
In this paper, we propose a novel structured compression approach for LLMs, called ZipLM, which achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, as well as a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the *post-training/one-shot* or the *gradual compression* setting, and only for specific families of models such as BERT (*encoder*) or GPT (*decoder*), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Of note, on the analyzed GLUE tasks, ZipLM compresses BERT-base into a model up to 15x faster while recovering $\geq 95$% accuracy. The resulting models have their encoder size reduced from 85M to only 3M parameters, and on average $\leq 10$ attention heads compared to 144 heads in the uncompressed model. Moreover, ZipLM matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large architecture. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60\% smaller and 30\% faster. |
Eldar Kurtic · Elias Frantar · Dan Alistarh 🔗 |
- | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training (Poster)
Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time. |
Hong Liu · Zhiyuan Li · David Hall · Percy Liang · Tengyu Ma 🔗 |
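The update rule described in the abstract above can be sketched in a few lines: an exponential moving average of gradients is divided by a moving average of a diagonal Hessian estimate (refreshed only every few steps) and then clipped element-wise. The constants and the choice of Hessian estimator below are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def sophia_update(theta, grad, hess_diag_est, state, lr=1e-4, b1=0.96, b2=0.99,
                  rho=0.03, eps=1e-12):
    """One simplified Sophia-style update following the description above.

    `state` holds moving averages 'm' (gradient) and 'h' (diagonal Hessian estimate);
    `hess_diag_est` may be None on steps where the Hessian is not refreshed.
    Treat this as an illustrative sketch, not the reference implementation.
    """
    state['m'] = b1 * state['m'] + (1 - b1) * grad
    if hess_diag_est is not None:                       # refreshed every few steps only
        state['h'] = b2 * state['h'] + (1 - b2) * hess_diag_est
    precond = state['m'] / np.maximum(rho * state['h'], eps)
    return theta - lr * np.clip(precond, -1.0, 1.0)     # element-wise clipped step
```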
- | RapidBERT: How to Train BERT with a Lunch Money Budget (Poster)
Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce RapidBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30\% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves an average downstream GLUE score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that RapidBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of fine-tuning existing generic models. We open source our model weights, benchmarking data, and code. |
Alexander Trott · Jacob Portes · Sam Havens · DANIEL KING · Abhinav Venigalla · Moin Nadeem · Nikhil Sardana · Daya Khudia · Jonathan Frankle 🔗 |
- | UOTA: Unsupervised Open-Set Task Adaptation Using a Vision-Language Foundation Model (Poster)
Human-labeled data is essential for deep learning models, but annotation costs hinder their use in real-world applications. Recently, however, models such as CLIP have shown remarkable zero-shot capabilities through vision-language pre-training. Although fine-tuning with human-labeled data can further improve the performance of zero-shot models, it is often impractical in low-budget real-world scenarios. In this paper, we propose an alternative algorithm, dubbed Unsupervised Open-Set Task Adaptation (UOTA), which fully leverages the large amounts of open-set unlabeled data collected in the wild to improve pre-trained zero-shot models in real-world scenarios. |
Youngjo Min · Kwangrok Ryoo · Bumsoo Kim · Taesup Kim 🔗 |
- | Cramming: Training a Language Model on a single GPU in one day (Poster)
Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting. |
Jonas Geiping · Tom Goldstein 🔗 |
- | SpecTr: Fast Speculative Decoding via Optimal Transport (Poster)
Autoregressive sampling from large language models has been shown to achieve state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow, and even prohibitive in certain tasks. One way to speed up decoding is *speculative decoding*: use a smaller model to sample a *draft* (block or sequence of tokens), and then score all tokens in the draft by the desired large language model in parallel. The tokens in the draft are either accepted or rejected based on a statistical method to guarantee that the final output is a valid sample from the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with *membership cost*. This framework can be viewed as an extension of the well-known *maximal-coupling* problem. This new formulation enables us to generalize the sampling method to allow for a set of $k$ candidates at the token-level, leading to an improved optimal membership cost. The optimal solution can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose an approximate solution whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the token vocabulary. Using this new OT algorithm, we develop a new autoregressive sampling algorithm called *SpecTr*, which creates multiple drafts of the next few tokens from the small language model, and scores all of them in parallel by the large language model. We accept one or reject all of them based on their respective scores. We experimentally demonstrate that the proposed approach achieves a speedup of 3X, a further 1.36X speedup over speculative decoding on standard benchmarks. |
Ziteng Sun · Ananda Suresh · Jae Ro · Ahmad Beirami · Himanshu Jain · Felix Xinnan Yu · Michael Riley · Sanjiv Kumar 🔗 |
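For reference, the token-level accept/reject rule of standard speculative decoding, which SpecTr generalizes to multiple draft candidates, looks roughly as follows; this sketch covers only the single-draft baseline, not the optimal-transport extension proposed in the paper.

```python
import numpy as np

def accept_or_resample(draft_token, p_large, q_small, rng):
    """Acceptance rule of standard (single-draft) speculative decoding.

    p_large, q_small: next-token distributions of the large and draft models.
    The draft token is accepted with probability min(1, p/q); on rejection a token is
    resampled from the renormalized residual max(p - q, 0), which keeps the output an
    exact sample from the large model.
    """
    p, q = p_large[draft_token], q_small[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    residual = np.maximum(p_large - q_small, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_large), p=residual), False
```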
- | Landmark Attention: Random-Access Infinite Context Length for Transformers (Poster)
While transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. To demonstrate the capabilities of our method, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity beyond 32k tokens, allowing for inference at the context lengths of GPT-4. |
Amirkeivan Mohtashami · Martin Jaggi 🔗 |
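A rough sketch of the retrieval step implied by the abstract above: the query scores one landmark vector per block, the top-k blocks are fetched, and ordinary attention is restricted to their tokens. Training of the landmark tokens and the grouped softmax used in the paper are omitted; names and shapes are assumptions.

```python
import numpy as np

def landmark_retrieve(q, landmarks, blocks, top_k=2):
    """Toy block retrieval via landmark tokens (see the abstract above).

    q:         (d,) current query vector
    landmarks: (num_blocks, d) one representative key per block
    blocks:    list of (block_len, d) key matrices, one per block
    """
    block_scores = landmarks @ q                           # affinity to each block
    chosen = np.argsort(-block_scores)[:top_k]             # most relevant blocks
    keys = np.concatenate([blocks[i] for i in chosen], axis=0)
    scores = keys @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    return chosen, weights / weights.sum()                 # attention over retrieved tokens
```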
- | Dissecting Efficient Architectures for Wake-Word Detection (Poster)
Wake-word detection models running on edge devices have stringent efficiency requirements. We observe that the over-the-air test accuracy of trained models on parallel devices (GPU/TPU) usually degrades when deployed on edge devices using a CPU for over-the-air, real-time evaluation. Further, the differing inference time when migrating between GPU and CPU varies across models. This drop is due to hardware latency and acoustic impulse response, while the non-uniform expansion of inference time results from varying exploitation of hardware acceleration by architectures. Although many neural architectures have been applied to wake-word detection tasks, such latency or accuracy drops have not been studied at granular, layer matrix multiplication levels. In this paper, we compare five Convolutional Neural Network (CNN) architectures and one pure Transformer architecture optimized for edge deployment, train them for wake-word detection on the Speech Commands dataset, and quantize two representative models. We seek to quantify their accuracy-efficiency tradeoffs to inform researchers and practitioners about the key components in models that influence this tradeoff. |
Cody Berger · Juncheng Li · Yiyuan Li · Aaron Berger · Dmitri Berger · Karthik Ganesan · Emma Strubell · Florian Metze 🔗 |
- | MRMP: Multi-Rate Magnitude Pruning of Graph Convolutional Networks (Poster)
In this paper, we devise a novel lightweight Graph Convolutional Network (GCN) design dubbed Multi-Rate Magnitude Pruning (MRMP) that jointly trains network topology and weights. Our method is variational and proceeds by aligning the weight distribution of the learned networks with an a priori distribution. On the one hand, this allows implementing any fixed pruning rate and also enhances the generalization performance of the designed lightweight GCNs. On the other hand, MRMP achieves a joint training of multiple GCNs, on top of shared weights, in order to extrapolate accurate networks at any targeted pruning rate without retraining their weights. Extensive experiments conducted on the challenging task of skeleton-based recognition show a substantial gain for our lightweight GCNs, particularly at very high pruning regimes. |
Hichem Sahbi 🔗 |
- | Progressive Knowledge Distillation: Balancing Inference Latency and Accuracy at Runtime (Poster)
We study the problem of progressive distillation: Given a large, pretrained teacher model $g$, we seek to decompose the model into smaller, low-inference cost student models $f_i$, such that progressively evaluating additional models in this ensemble results in strict improvements over previous predictions. For user-facing inference applications, this allows us to flexibly trade accuracy for inference latency at runtime. We develop a boosting based algorithm, B-DISTIL, for progressive distillation, and demonstrate its effectiveness on standard datasets.
Don Kurian Dennis · Abhishek Shetty · Anish Sevekari · Kazuhito Koishida · Virginia Smith 🔗 |
- | Language Models are Weak Learners (Poster)
A central notion in practical and theoretical machine learning is that of a weak learner, classifiers that achieve better-than-random performance (on any given distribution over data), even by a small margin. Such weak learners form the practical basis for canonical machine learning methods such as boosting. In this work, we illustrate that prompt-based large language models can operate effectively as said weak learners. Specifically, we illustrate the use of a large language model (LLM) as a weak learner in a boosting algorithm applied to tabular data. We show that by providing (properly sampled according to the distribution of interest) text descriptions of tabular data samples, LLMs can produce a summary of the samples that serves as a template for classification, and achieves the aim of acting as a weak learner on this task. We incorporate these models into a boosting approach, which in many settings can leverage the knowledge within the LLM to outperform traditional tree-based boosting. The model outperforms both few-shot learning and occasionally even more involved fine-tuning procedures, particularly for some tasks involving small numbers of data points. The results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning models. |
Hariharan Manikandan · Yiding Jiang · Zico Kolter 🔗 |
- | On Robustness-Accuracy Characterization of Large Language Models using Synthetic Datasets (Poster)
Despite the impressive capability of large language models (LLMs) in solving different downstream tasks, new concerns about proper performance evaluation have been raised, especially for test-data leakage caused by accidentally including them during pretraining, or by indirectly exposing them through API calls for evaluation. Motivated by these, in this paper, we propose a new evaluation workflow that generates steerable synthetic language datasets and proxy tasks for benchmarking the performance of pretrained LLMs on sentence classification tasks. This approach allows for better characterization of the joint analysis on the robustness and accuracy of LLMs without risking sensitive information leakage. Verified on various pretrained LLMs, the proposed approach demonstrates a promisingly high correlation with real downstream performance. |
Ching-Yun (Irene) Ko · Pin-Yu Chen · Payel Das · Yung-Sung Chuang · Luca Daniel 🔗 |
- | Training Diffusion Models with Reinforcement Learning (Poster)
Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation. |
Kevin Black · Michael Janner · Yilun Du · Ilya Kostrikov · Sergey Levine 🔗 |
- | GPT-Zip: Deep Compression of Finetuned Large Language Models (Poster)
Storage is increasingly a practical bottleneck to scaling large language model (LLM) systems with personalization, co-location, and other use cases that require storing the pretrained base model plus multiple finetuned models. To this end, we propose GPT-Zip for post-finetuning compression. GPT-Zip uses quantization and sparsification to efficiently compress finetuned models by exploiting their closeness to the pretrained base model. Specifically, we demonstrate that the \emph{difference} between the finetuned models and the pretrained base model can efficiently be quantized into $2$ bits and pruned with $95 \%$ sparsity together -- providing up to $52$ times overall size reduction. Thus, GPT-Zip avoids the linear growth in memory costs required for naive storage. We show that this compression can be achieved without performance degradation, as measured by evaluations on several tasks from the Natural Instructions dataset. Surprisingly, GPT-Zip sometimes improves accuracy over uncompressed models. We demonstrate the efficacy of GPT-Zip on four finetuned OPT-1.3B models and show that GPT-Zip reduces the storage cost by $16$ times more than existing LLM compression techniques while attaining significantly better performance.
Berivan Isik · Hermann Kumbong · Wanyi Ning · Xiaozhe Yao · Sanmi Koyejo · Ce Zhang 🔗 |
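To illustrate why storing finetuned models this way is cheap, here is a toy sketch that sparsifies and symmetrically quantizes the weight delta relative to the base model; the exact quantizer, sparsity pattern, and bookkeeping used by GPT-Zip may differ from this simplification.

```python
import numpy as np

def compress_delta(w_finetuned, w_base, sparsity=0.95, bits=2):
    """Toy sketch of the GPT-Zip idea above: store only a sparse, low-bit delta.

    The smallest-magnitude `sparsity` fraction of (finetuned - base) entries is dropped
    and the survivors are quantized symmetrically to `bits` bits.
    Returns the reconstructed (approximate) finetuned weights.
    """
    delta = w_finetuned - w_base
    k = int(delta.size * sparsity)                       # number of entries to drop
    cutoff = np.partition(np.abs(delta).ravel(), k - 1)[k - 1]
    mask = np.abs(delta) > cutoff                        # keep largest-magnitude entries
    qmax = 2 ** (bits - 1) - 1                           # e.g. 1 for 2-bit symmetric
    scale = np.abs(delta[mask]).max() / qmax
    q = np.clip(np.round(delta / scale), -qmax - 1, qmax)
    delta_hat = np.where(mask, q * scale, 0.0)           # sparse, quantized delta
    return w_base + delta_hat
```

Only the nonzero quantized entries (plus their positions and a scale) would need to be stored per finetuned model, which is where the claimed storage reduction comes from.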
- | Reverse Distillation: Training Billion Parameter Models For CTR Prediction (Poster)
Pre-training and fine-tuning large transformer models has shown promising results across various ML applications. Large model training brings with it a host of challenges including slow convergence, training instabilities, and increased cost and resources for hyperparameter sweeps, among other difficulties. These challenges are further exacerbated when training large models on real-world internet-scale datasets as these are noisy. One such real-world internet-scale application is Click-Through Rate (CTR) prediction of product advertisements in e-commerce. In this work, we propose a method of training large models (up to 50 billion parameters) on the CTR prediction task by making use of knowledge from smaller models for initialization through Reverse Distillation (RD). We show that our method improves over vanilla finetuning of large language models on a downstream CTR task at Amazon. We also study the effectiveness of this method at different model sizes and label noise levels in the training data. Using the proposed method we train and deploy a 50 billion parameter model which shows a lift of 6.52% in CTR during online A/B experiments. |
Aditya Anantharaman · Aashiq Muhamed · Hemant Pugaliya · Chong Wang · Sujan Perera · Zhen Ge · qingjun cui · Belinda Zeng · Trishul Chilimbi 🔗 |
- | A Simple and Effective Pruning Approach for Large Language Models (Poster)
As their size increases, Large Language Models (LLMs) are natural candidates for network pruning. Existing methods require either retraining or solving a weight reconstruction problem, which may be computationally expensive for billion-scale LLMs. In this paper, we introduce a novel, simple yet effective pruning method, termed Wanda (Pruning by Weights and activations), to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large-magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method on LLaMA, one of the best performing LLMs available. Wanda significantly outperforms the established baseline of magnitude pruning and competes favorably against recent methods involving intensive weight updates. |
Mingjie Sun · Zhuang Liu · Anna Bair · Zico Kolter 🔗 |
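The pruning criterion described above is simple enough to sketch directly: each weight is scored by its magnitude times the norm of the corresponding input activation, and the lowest-scoring weights are removed per output row. A minimal NumPy illustration, not the authors' code:

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Sketch of the Wanda-style pruning criterion described above.

    W: (out_features, in_features) weight matrix.
    X: (tokens, in_features) calibration activations.
    Each weight is scored by |W_ij| * ||X_j||_2 and, per output row, the lowest-scoring
    `sparsity` fraction is zeroed out. No retraining or weight update is involved.
    """
    feature_norm = np.linalg.norm(X, axis=0)              # ||X_j||_2 per input feature
    score = np.abs(W) * feature_norm                       # broadcast over output rows
    k = int(W.shape[1] * sparsity)                         # weights to remove per row
    cutoff = np.partition(score, k - 1, axis=1)[:, k - 1:k]
    keep = score > cutoff                                  # per-row comparison
    return W * keep                                        # pruned weights, used as-is
```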
- | Incremental Low-Rank Learning (Poster)
The theory of greedy low-rank learning (GLRL) aims to explain the impressive generalization capabilities of deep learning. It proves that stochastic gradient-based training implicitly regularizes neural networks towards low-rank solutions through a gradual increase of the rank during training. However, there is a gap between theory and practice since GLRL requires an infinitesimal initialization of the weights, which is not practical due to the fact that it is a saddle point. In this work, we remove the assumption of infinitesimal initialization by focusing on cumulative weight updates. We prove the cumulative weight updates follow an incremental low-rank trajectory for arbitrary orthogonal initialization of weights in a three-layer linear network. Empirically, we demonstrate that our theory holds on a broad range of neural networks (e.g., transformers) and standard training algorithms (e.g., SGD, Adam). However, existing training algorithms do not exploit the low-rank property to improve computational efficiency as the networks are not parameterized in low-rank. To remedy this, we design a new training algorithm Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices while incrementally augmenting their ranks during training. We evaluate InRank on GPT-2, and our results indicate that InRank achieves comparable prediction performance as the full-rank counterpart while requiring at most 33% of the total ranks throughout training. We also propose an efficient version of InRank that achieves a reduction of 20% in total training time and 37% in memory usage when training GPT-medium on WikiText-103 from scratch. |
Jiawei Zhao · Yifei Zhang · Beidi Chen · Florian Schaefer · Anima Anandkumar 🔗 |
- | Deep Fusion: Efficient Network Training via Pre-trained Initializations (Poster)
In recent years, deep learning has made remarkable progress in a wide range of domains, with a particularly notable impact on natural language processing tasks. One of the challenges associated with training deep neural networks is the need for large amounts of computational resources and time. In this paper, we present Deep Fusion, an efficient approach to network training that leverages pretrained initializations of smaller networks. We show that Deep Fusion accelerates the training process, reduces computational requirements, and leads to improved generalization performance on a variety of NLP tasks and T5 model sizes. Our experiments demonstrate that Deep Fusion is a practical and effective approach to reduce the training time and resource consumption while maintaining, or even surpassing, the performance of traditional training methods. |
Hanna Mazzawi · Xavi Gonzalvo · Michael Wunder 🔗 |
- | ROSA: Random Orthogonal Subspace Adaptation (Poster)
Model training requires significantly more memory compared with inference. Parameter-efficient fine-tuning (PEFT) methods provide a means of adapting large models to downstream tasks using less memory. However, existing methods either introduce latency overhead at inference time or achieve subpar downstream performance compared with full fine-tuning. In this work we propose Random Orthogonal Subspace Adaptation (ROSA), a method that exceeds the performance of previous PEFT methods by a significant margin, while maintaining a zero latency overhead during inference time. In contrast to previous methods, ROSA is able to adapt subspaces of larger size, without consuming additional memory during runtime. As PEFT methods are especially useful in the natural language processing domain, we evaluate ROSA by fine-tuning GPT2 on various Natural Language Generation (NLG) tasks. We will make our code publicly available upon acceptance. |
Marawan Gamal · Guillaume Rabusseau 🔗 |
- | Towards Fair Knowledge Distillation using Student Feedback (Poster)
With the advent of large-scale models and their success in diverse fields, Knowledge Distillation (KD) techniques are increasingly used to deploy them to edge devices with limited memory and computation constraints. However, most distillation works focus on improving the prediction performance of the student model with little to no work in studying the effect of distillation on key fairness properties, ensuring trustworthy distillation. In this work, we propose a fairness-driven distillation framework, BIRD (BIas-awaRe Distillation), which introduces a FAIRDISTILL operator to collect feedback from the student through a meta-learning-based approach and selectively distill teacher knowledge. We demonstrate that BIRD can be augmented with different KD methods to increase the performance of foundation models and convolutional neural networks. Extensive experiments across three fairness datasets show the efficacy of our framework over existing state-of-the-art KD methods, opening up new directions to develop trustworthy distillation techniques. |
Abhinav Java · Surgan Jandial · Chirag Agarwal 🔗 |
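BIRD's FAIRDISTILL operator and its meta-learning loop are not reproduced here; purely as an illustration of "selective" distillation, the sketch below shows a per-sample-weighted, temperature-scaled KD loss in which the weights would be supplied by such a feedback mechanism. The function name and signature are hypothetical.

```python
# Illustration only: per-sample-weighted knowledge distillation.
import torch
import torch.nn.functional as F

def selective_kd_loss(student_logits, teacher_logits, sample_weights, T=2.0):
    # Standard temperature-scaled KL distillation, weighted per example.
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    per_sample = F.kl_div(s, t, reduction="none").sum(dim=-1) * (T * T)
    return (sample_weights * per_sample).mean()

loss = selective_kd_loss(torch.randn(4, 10), torch.randn(4, 10), torch.rand(4))
```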
-
|
Audio-Journey: Efficient Visual+LLM-aided Audio Encodec Diffusion
(
Poster
)
link »
Despite recent progress, machine learning for the audio domain is limited by the availability of high-quality data. Visual information already present in a video should complement the information in its audio. In this paper, we leverage state-of-the-art (SOTA) Large Language Models (LLMs) to augment the existing weak labels of the audio dataset into enriched captions; we adopt a SOTA video-captioning model to automatically generate video captions, and we again use LLMs to merge the audio and visual captions into a rich, large-scale dataset. Using this dataset, we train a latent diffusion model on the Encodec embeddings. Furthermore, we leverage the trained diffusion model to generate even more audio data in the same format. In our experiments, we first verify that our audio+visual captions are of high quality against baselines and ground truth (a 12.5\% gain in semantic score over baselines). Moreover, we demonstrate that we can train a classifier from scratch using the diffusion-generated data, or use diffusion to enhance classification models on the AudioSet test set, working in conjunction with mixup and other augmentation methods for impressive performance gains. Our approach exemplifies a promising method for augmenting low-resource audio datasets. The samples, models, and implementation will be at \url{https://audiojourney.github.io}. |
Juncheng Li · Jackson Michaels · Laura Yao · Lijun Yu · Zach Wood-Doughty · Florian Metze 🔗 |
-
|
Semi-supervised Tabular Classification via In-context Learning of Large Language Models
(
Poster
)
link »
Learning with limited labeled tabular samples is an important problem for industrial machine learning applications, as acquiring annotations for tabular data is often too costly. On the other hand, recent remarkable progress in natural language processing has shown that this issue can be circumvented by using pre-trained large language models (LLMs). Motivated by this, we ask whether LLMs can help handle limited labeled data in the tabular domain as well. As a positive answer, we propose a novel semi-supervised tabular learning framework, coined Self-generated PROmpts from Unlabeled Tables (SPROUT), which utilizes unlabeled data in conjunction with LLMs. Our main idea is to exploit the in-context learning capabilities of LLMs to effectively extract transferable knowledge from unlabeled tabular samples. Specifically, SPROUT generates in-context prompts from unlabeled tables by identifying a column feature that exhibits a strong correlation with the actual target label, thereby creating examples that pertain to the true target tasks. In addition, we demonstrate how a language prior can facilitate knowledge transfer from heterogeneous data sources, enhancing performance on target datasets and mitigating the challenges posed by varying input formats. Experimental results show that SPROUT yields substantial performance improvements over previous methods across various tabular benchmarks. |
Jaehyun Nam · Woomin Song · Seong Hyeon Park · Jihoon Tack · Sukmin Yun · Jaehyung Kim · Jinwoo Shin 🔗 |
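The exact prompt-construction procedure is not given in the abstract; the following is a speculative, toy reading of it: use the few labeled rows to find the feature most correlated with the target, then treat that feature as a pseudo-label for serializing unlabeled rows into in-context demonstrations. The function name, thresholding rule, and prompt format are all invented for illustration.

```python
# Toy sketch: build in-context demonstrations from unlabeled tabular rows
# using the column most correlated with the (few) available labels as a proxy.
import numpy as np

def build_demonstrations(X_lab, y_lab, X_unlab, feature_names):
    # Pick the feature column with the strongest |correlation| to the true target,
    # measured on the few labeled rows.
    corrs = [abs(np.corrcoef(X_lab[:, j], y_lab)[0, 1]) for j in range(X_lab.shape[1])]
    j_star = int(np.argmax(corrs))
    threshold = np.median(X_unlab[:, j_star])
    demos = []
    for row in X_unlab:
        pseudo = int(row[j_star] > threshold)          # pseudo-label from the proxy column
        features = ", ".join(f"{n}={v:.2f}" for n, v in zip(feature_names, row))
        demos.append(f"{features} -> label={pseudo}")
    return demos

X_lab = np.random.rand(8, 4)
y_lab = X_lab[:, 2]                                    # toy continuous target for the demo
demos = build_demonstrations(X_lab, y_lab, np.random.rand(16, 4), ["f0", "f1", "f2", "f3"])
print(demos[0])
```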
-
|
BK-SDM: Architecturally Compressed Stable Diffusion for Efficient Text-to-Image Generation
(
Poster
)
link »
Exceptional text-to-image (T2I) generation results of Stable Diffusion models (SDMs) come with substantial computational demands. To resolve this issue, recent research on efficient SDMs has prioritized enabling fewer sampling steps and utilizing network quantization. Orthogonal to these directions, this study highlights the power of classical architectural compression for general-purpose T2I synthesis by introducing block-removed knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of parameters, MACs per sampling step, and latency. We conduct distillation-based pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training pairs) on a single A100 GPU. Despite being trained with limited resources, our compact models can imitate the original SDM by benefiting from transferred knowledge and achieve competitive results against larger multi-billion parameter models on the zero-shot MS-COCO benchmark. Moreover, we show the applicability of our lightweight pretrained models in personalized generation with DreamBooth finetuning. |
Bo-Kyeong Kim · Hyoung-Kyu Song · Thibault Castells · Shinkook Choi 🔗 |
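As a rough sketch of distillation-based pretraining for a block-removed student, the loss below combines the usual denoising objective with output-level and feature-level matching against the original U-Net; the weighting scheme and the choice of matched features are assumptions rather than the BK-SDM recipe.

```python
# Hedged sketch: denoising loss + output-level KD + feature-level KD.
import torch
import torch.nn.functional as F

def distill_loss(student_out, teacher_out, student_feats, teacher_feats,
                 noise_target, w_out=1.0, w_feat=1.0):
    task = F.mse_loss(student_out, noise_target)            # usual denoising objective
    out_kd = F.mse_loss(student_out, teacher_out.detach())  # match the teacher's prediction
    feat_kd = sum(F.mse_loss(s, t.detach())                 # match intermediate features
                  for s, t in zip(student_feats, teacher_feats))
    return task + w_out * out_kd + w_feat * feat_kd

B, C, H, W = 2, 4, 32, 32
loss = distill_loss(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                    [torch.randn(B, 64, 16, 16)], [torch.randn(B, 64, 16, 16)],
                    torch.randn(B, C, H, W))
```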
-
|
Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection
(
Poster
)
link »
This work advances the understanding of the remarkable \emph{in-context learning} (ICL) abilities of transformers---the ability to perform new tasks when prompted with training and test examples, without any parameter update to the model. We begin by showing that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, convex risk minimization for generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Our transformer constructions admit mild bounds on the number of layers and heads, and can be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, intriguingly, we show that transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life---a \emph{single} transformer can adaptively select different base ICL algorithms---or even perform qualitatively different tasks---on different input sequences, without any explicit prompting of the right algorithm or task. In theory, we construct two general mechanisms for algorithm selection with concrete examples: (1) pre-ICL testing, where the transformer determines the right task for the given sequence by examining certain summary statistics of the input sequence; and (2) post-ICL validation, where the transformer selects---among multiple base ICL algorithms---a near-optimal one for the given sequence using a train-validation split. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures. |
Yu Bai · Fan Chen · Huan Wang · Caiming Xiong · Song Mei 🔗 |
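In our own notation (not the paper's), the post-ICL validation mechanism can be summarized as fitting each base ICL algorithm $\mathcal{A}_k$ on a train split of the in-context examples and selecting the one with the lowest loss on the held-out validation split:

```latex
% Post-ICL validation, written in illustrative notation:
\[
\hat{k} \;=\; \arg\min_{k \in [K]} \;
\frac{1}{|\mathcal{D}_{\mathrm{val}}|}
\sum_{(x_i, y_i) \in \mathcal{D}_{\mathrm{val}}}
\ell\bigl(\mathcal{A}_k(\mathcal{D}_{\mathrm{train}})(x_i),\, y_i\bigr),
\qquad
\hat{y}_{\mathrm{query}} \;=\; \mathcal{A}_{\hat{k}}(\mathcal{D}_{\mathrm{train}})(x_{\mathrm{query}}).
\]
```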
-
|
Can Public Large Language Models Help Private Cross-device Federated Learning?
(
Poster
)
link »
We study (differentially) private federated learning (FL) of language models. The language models in cross-device FL are relatively small, and can be trained with meaningful formal user-level differential privacy (DP) guarantees when massive parallelism in training is enabled by the participation of a moderate number of users. Recently, public data has been used to improve privacy-utility trade-offs for both large and small language models. In this work, we provide a systematic study of using large-scale public data and LLMs to help differentially private training of on-device FL models, and further improve the privacy-utility tradeoff through distillation techniques. Moreover, we propose a novel distribution matching algorithm with theoretical grounding to sample public data close to the private data distribution, which significantly improves the sample efficiency of (pre)training on public data. The proposed method is efficient and effective for training private models by taking advantage of public data, especially for customized on-device architectures that do not have ready-to-use pre-trained models. |
Boxin Wang · Yibo J. Zhang · Yuan Cao · Bo Li · Hugh B McMahan · Sewoong Oh · Zheng Xu · Manzil Zaheer 🔗 |
-
|
Reasoning Ability Emerges in Large Language Models as Aggregation of Reasoning Paths
(
Poster
)
link »
This study focuses on the emergence of reasoning abilities in large language models (LLMs). While LLMs have shown remarkable capabilities in complex reasoning tasks, the exact origin of this ability and its relationship to pre-training and fine-tuning stages remain unclear. Previous research has explored in-context learning but has not fully addressed reasoning abilities such as logical reasoning or math deduction. The paper proposes investigating reasoning in LLMs through reasoning over knowledge graphs. The experiments demonstrate the importance of the pre-training sequence in enabling effective reasoning. The findings suggest that LLMs acquire reasoning abilities during pre-training rather than fine-tuning. Furthermore, training LLMs with next-token prediction enables them to aggregate relevant reasoning paths and derive new conclusions. The empirical results support the explanation of LLMs predicting unseen facts using a path ranking algorithm. |
Xinyi Wang · William Wang 🔗 |
-
|
Learned Thresholds Token Merging and Pruning for Vision Transformers
(
Poster
)
link »
Vision transformers have demonstrated remarkable success in a wide range of computer vision tasks in recent years; however, their high computational cost remains a significant barrier to their practical deployment. In particular, the complexity of transformer models is quadratic with respect to the number of input tokens. Therefore, techniques that reduce the number of input tokens that need to be processed have been proposed. This paper introduces Learned Thresholds token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning. LTMP uses learned threshold masking modules that dynamically determine which tokens to merge and which to prune. Our results demonstrate that LTMP achieves state-of-the-art accuracy on ImageNet across various reduction rates while requiring only a single fine-tuning epoch, which is an order of magnitude faster than previous methods. |
Maxim Bonnaerens · Joni Dambre 🔗 |
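A much-simplified, hypothetical sketch of the learned-threshold idea follows: a per-token importance score is compared against a learnable threshold, masking softly during training and pruning hard at inference. LTMP additionally merges similar tokens and learns thresholds per module, which is omitted here.

```python
# Simplified sketch of learned-threshold token pruning (merging omitted).
import torch
import torch.nn as nn

class LearnedThresholdPrune(nn.Module):
    def __init__(self, init_threshold=0.0, temperature=10.0):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, tokens, importance):
        # tokens: (B, N, D); importance: (B, N), e.g. mean attention received per token.
        soft_mask = torch.sigmoid((importance - self.threshold) * self.temperature)
        if self.training:
            return tokens * soft_mask.unsqueeze(-1)     # differentiable soft masking
        keep = soft_mask > 0.5                          # hard pruning at inference
        return [t[k] for t, k in zip(tokens, keep)]     # ragged per-sample token sets

prune = LearnedThresholdPrune()
out = prune(torch.randn(2, 197, 384), torch.rand(2, 197))
```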
-
|
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
(
Poster
)
link »
We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits of contrastively training the image tower. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining. |
Jannik Kossen · Mark Collier · Basil Mustafa · Xiao Wang · Xiaohua Zhai · Lucas Beyer · Andreas Steiner · Jesse Berent · Rodolphe Jenatton · Efi Kokiopoulou 🔗 |
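Purely as an illustration of the described objective, the sketch below adds contrastive alignment terms between each main tower and a frozen third tower of pretrained image embeddings; the loss weights and the exact form of the alignment terms are assumptions, not the paper's configuration.

```python
# Illustrative sketch: CLIP-style contrastive loss plus alignment to a frozen third tower.
import torch
import torch.nn.functional as F

def clip_loss(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def three_towers_loss(img_emb, txt_emb, frozen_emb, w_align=1.0):
    main = clip_loss(img_emb, txt_emb)                       # usual image-text contrastive loss
    align = clip_loss(img_emb, frozen_emb.detach()) \
          + clip_loss(txt_emb, frozen_emb.detach())          # pull both towers towards the frozen one
    return main + w_align * align

B, D = 32, 256
loss = three_towers_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```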
-
|
ViT Graph Head Attention for Small Sized Datasets
(
Poster
)
link »
In this paper, we propose a new type of vision transformer (ViT) based on graph head attention (GHA). The GHA creates a graph structure from an attention map generated from the input patches. Because the attention map represents the degree of concentration between image patches, it can be regarded as a type of relationship between patches, which can be converted into a graph structure. To maintain multi-head attention (MHA)-like performance with fewer GHAs, we apply a graph attention network to the GHA to ensure attention diversity and emphasize the correlations between graph nodes. The proposed GHA maintains both the locality and globality of the input patches and guarantees the diversity of attention. The proposed GHA-ViT commonly outperforms pure ViT-based models on small-sized datasets and the medium-sized ImageNet-1K dataset when trained from scratch. A top-1 accuracy of 81.7\% was achieved on ImageNet-1K with GHA-B, a base model with approximately 29M parameters. |
HyeongJin Kim · GyungHyun Lee · Byoung Chul Ko 🔗 |
-
|
A Closer Look at In-Context Learning under Distribution Shifts
(
Poster
)
link »
In-context learning, a capability that enables a model to learn from input examples on the fly without necessitating weight updates, is a defining characteristic of large language models. In this work, we follow the setting proposed in Garg et al. to better understand the generality and limitations of in-context learning through the lens of the simple yet fundamental task of linear regression. The key question we aim to address is: are transformers more adept than some natural and simpler architectures at performing in-context learning under varying distribution shifts? To compare against transformers, we propose a simple architecture based on set-based multi-layer perceptrons (MLPs). We find that both transformers and set-based MLPs exhibit in-context learning under in-distribution evaluations, but transformers more closely emulate the performance of ordinary least squares (OLS). Transformers also display better resilience to mild distribution shifts, where set-based MLPs falter. However, under severe distribution shifts, both models' in-context learning abilities diminish. |
Kartik Ahuja · David Lopez-Paz 🔗 |
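A minimal sketch of a set-based MLP baseline for in-context regression is given below (DeepSets-style: embed each (x, y) pair, mean-pool, then predict for the query conditioned on the pooled summary); the sizes and layer choices are arbitrary and not taken from the paper.

```python
# DeepSets-style in-context regressor: permutation-invariant by construction.
import torch
import torch.nn as nn

class SetMLPRegressor(nn.Module):
    def __init__(self, d_in, d_hidden=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(d_in + 1, d_hidden), nn.ReLU(),
                                    nn.Linear(d_hidden, d_hidden))
        self.decode = nn.Sequential(nn.Linear(d_hidden + d_in, d_hidden), nn.ReLU(),
                                    nn.Linear(d_hidden, 1))

    def forward(self, xs, ys, x_query):
        # xs: (B, N, d_in), ys: (B, N), x_query: (B, d_in)
        context = self.encode(torch.cat([xs, ys.unsqueeze(-1)], dim=-1)).mean(dim=1)
        return self.decode(torch.cat([context, x_query], dim=-1)).squeeze(-1)

model = SetMLPRegressor(d_in=8)
pred = model(torch.randn(4, 16, 8), torch.randn(4, 16), torch.randn(4, 8))
```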
-
|
Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning
(
Poster
)
link »
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understanding of the underlying mechanisms by which this capability arises from ordinary language model pretraining objectives remains disconnected from real-world LLMs. This study examines the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as implicit topic models. On this premise, we propose an algorithm that selects optimal demonstrations from a set of annotated data with a small LLM and then directly generalizes the selected demonstrations to larger LLMs. We demonstrate a significant 12.5\% improvement relative to the random selection baseline, averaged over eight GPT models on eight real-world text classification datasets. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information. |
Xinyi Wang · Wanrong Zhu · Michael Saxon · Mark Steyvers · William Wang 🔗 |
-
|
Constant Memory Attention Block
(
Poster
)
link »
Modern foundation model architectures rely on attention mechanisms to effectively capture context. However, these methods require linear or quadratic memory in the number of inputs/datapoints, limiting their applicability in low-compute domains. In this work, we propose the Constant Memory Attention Block (CMAB), a novel general-purpose attention block that computes its output in constant memory and performs updates in constant computation. Highlighting CMAB's efficacy, we introduce CMAB-based methods for Neural Processes and Temporal Point Processes. Empirically, we show that our proposed methods achieve results competitive with the state of the art while being significantly more memory efficient. |
Leo Feng · Frederick Tung · Hossein Hajimirsadeghi · Yoshua Bengio · Mohamed Osama Ahmed 🔗 |
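The CMAB construction itself is not described in the abstract; as one illustration of how attention over a growing set of inputs can use constant memory, the sketch below keeps a fixed set of latent queries and folds incoming keys/values into running, numerically stabilized softmax statistics, so memory does not grow with the number of datapoints seen. All names are ours, and the value dimension is assumed to equal the latent dimension.

```python
# Illustrative constant-memory cross-attention accumulator (online softmax statistics).
import torch

class ConstantMemoryCrossAttention:
    def __init__(self, latents, d_k):
        self.q = latents                                   # (L, d_k) fixed latent queries
        self.scale = d_k ** -0.5
        L = latents.shape[0]
        self.m = torch.full((L, 1), float("-inf"))         # running max (numerical stability)
        self.num = torch.zeros(L, latents.shape[1])        # running sum of exp(score) * v
        self.den = torch.zeros(L, 1)                       # running sum of exp(score)

    def update(self, k, v):
        # k, v: (N, d_k) new keys/values; they are folded in, then can be discarded.
        s = self.q @ k.t() * self.scale                    # (L, N)
        m_new = torch.maximum(self.m, s.max(dim=1, keepdim=True).values)
        correction = torch.exp(self.m - m_new)
        w = torch.exp(s - m_new)
        self.num = self.num * correction + w @ v
        self.den = self.den * correction + w.sum(dim=1, keepdim=True)
        self.m = m_new

    def output(self):
        return self.num / self.den                         # (L, d_v) attention output

cmab = ConstantMemoryCrossAttention(torch.randn(8, 32), d_k=32)
cmab.update(torch.randn(100, 32), torch.randn(100, 32))    # stream in a batch of datapoints
out = cmab.output()
```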
-
|
Sequence Parallelism: Long Sequence Training from System Perspective
(
Poster
)
link »
Transformers achieve promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithmic perspective. In this work, we propose sequence parallelism, a memory-efficient parallelism that tackles this issue from a system perspective instead. With sequence parallelism, we no longer require a single device to hold the whole sequence. Moreover, when combined with efficient attention of linear complexity, sequence parallelism enables us to train transformers with effectively unbounded sequence lengths. Experiments show that sequence parallelism performs well when scaling with batch size and sequence length. Compared with tensor parallelism, our approach achieves a $13.7\times$ larger maximum batch size and a $3.0\times$ longer sequence length when scaling up to 64 NVIDIA P100 GPUs. With efficient attention, sequence parallelism can handle sequences with over 114K tokens, which is over $27\times$ longer than existing efficient-attention works that hold the whole sequence on a single device.
|
Shenggui Li · Fuzhao Xue · Chaitanya Baranwal · Yongbin Li · Yang You 🔗 |
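The following single-process simulation (our sketch, not the paper's implementation) illustrates the partitioning behind sequence parallelism: each "device" owns one chunk of queries and processes key/value chunks one block at a time, as it would when they arrive from neighbouring devices over a ring, so no device ever needs to hold the queries, keys, and values of the entire sequence at once.

```python
# Single-process simulation of ring-style, chunked self-attention.
import torch

def ring_attention_sim(q_chunks, k_chunks, v_chunks, scale):
    outputs = []
    for q in q_chunks:                                       # the queries one device owns
        # Each element of `blocks` stands in for the scores against one K chunk
        # received from a neighbour over the ring.
        blocks = [q @ k.t() * scale for k in k_chunks]
        attn = torch.cat(blocks, dim=-1).softmax(dim=-1)     # normalize over the full length
        attn_blocks = attn.split([k.shape[0] for k in k_chunks], dim=-1)
        outputs.append(sum(a @ v for a, v in zip(attn_blocks, v_chunks)))
    return torch.cat(outputs, dim=0)

d, n_chunks = 64, 4
qs = [torch.randn(128, d) for _ in range(n_chunks)]
ks = [torch.randn(128, d) for _ in range(n_chunks)]
vs = [torch.randn(128, d) for _ in range(n_chunks)]
out = ring_attention_sim(qs, ks, vs, scale=d ** -0.5)        # matches full attention exactly
```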
-
|
Scaling In-Context Demonstrations with Structured Attention
(
Poster
)
link »
The recent surge of large language models (LLMs) highlights their ability to perform in-context learning, i.e., “learning” to perform a task from a few demonstrations in the context without any parameter updates. However, their in-context learning capabilities are limited by the model architecture: 1) the use of demonstrations is constrained by a maximum sentence length due to positional embeddings; 2) the quadratic complexity of attention hinders users from using more demonstrations efficiently; 3) LLMs are shown to be sensitive to the order of demonstrations. In this work, we tackle these challenges by proposing a better architectural design for in-context learning. We propose SAICL (Structured Attention for In-Context Learning), which replaces full attention with a structured attention mechanism designed for in-context learning, removing unnecessary dependencies between individual demonstrations while making the model invariant to their permutation. We evaluate SAICL in a meta-training framework and show that SAICL achieves comparable or better performance than full attention while obtaining up to a 3.4x inference speed-up. SAICL also consistently outperforms a strong Fusion-in-Decoder (FiD) baseline that processes each demonstration independently. Finally, thanks to its linear complexity, we demonstrate that SAICL can easily scale to hundreds of demonstrations, with continued performance gains as the number of demonstrations grows. |
Tianle Cai · Kaixuan Huang · Jason Lee · Mengdi Wang · Danqi Chen 🔗 |
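Based only on the description above, one plausible way to realize such a structured attention pattern is with a block mask like the sketch below: tokens in each demonstration attend only within that demonstration, while query tokens attend to everything, so the computation over demonstrations is order-independent. The function name and mask convention are our assumptions.

```python
# Illustrative block-structured attention mask for in-context demonstrations.
import torch

def saicl_style_mask(demo_lengths, query_length):
    n = sum(demo_lengths) + query_length
    mask = torch.zeros(n, n, dtype=torch.bool)              # True = allowed to attend
    start = 0
    for ln in demo_lengths:
        mask[start:start + ln, start:start + ln] = True     # block-diagonal demo blocks
        start += ln
    mask[start:, :] = True                                  # query attends to everything
    return mask

mask = saicl_style_mask([5, 7, 4], query_length=3)
# Usable as `attn_mask` in torch.nn.functional.scaled_dot_product_attention,
# where True marks positions that may be attended to.
```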
Author Information
Julien Launay (HuggingFace)
Daniel Y Fu (Stanford University)
Tri Dao (Stanford)
Daniel Hesslow (Huggingface)
Beidi Chen (CMU / FAIR)
Azalia Mirhoseini
Percy Liang (Stanford University)
More from the Same Authors
-
2021 : ROPUST: Improving Robustness through Fine-tuning with Photonic Processors and Synthetic Gradients »
Alessandro Cappelli · Ruben Ohana · Julien Launay · Laurent Meunier · Iacopo Poli -
2022 : LinkBERT: Language Model Pretraining with Document Link Knowledge »
Michihiro Yasunaga · Jure Leskovec · Percy Liang -
2022 : Transform Once: Efficient Operator Learning in Frequency Domain »
Michael Poli · Stefano Massaroli · Federico Berto · Jinkyoo Park · Tri Dao · Christopher Re · Stefano Ermon -
2023 : Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer »
Yuandong Tian · Yiping Wang · Beidi Chen · Simon Du -
2023 : DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining »
Sang Michael Xie · Hieu Pham · Xuanyi Dong · Nan Du · Hanxiao Liu · Yifeng Lu · Percy Liang · Quoc Le · Tengyu Ma · Adams Wei Yu -
2023 : Retrieval-Augmented Multimodal Language Modeling »
Michihiro Yasunaga · Armen Aghajanyan · Weijia Shi · Rich James · Jure Leskovec · Percy Liang · Mike Lewis · Luke Zettlemoyer · Wen-tau Yih -
2023 : Lexinvariant Language Models »
Qian Huang · Eric Zelikman · Sarah Chen · Yuhuai Wu · Greg Valiant · Percy Liang -
2023 : PRODIGY: Enabling In-context Learning Over Graphs »
Qian Huang · Hongyu Ren · Peng Chen · Gregor Kržmanc · Daniel Zeng · Percy Liang · Jure Leskovec -
2023 : Towards Structured Sparsity in Transformers for Efficient Inference »
Harry Dong · Beidi Chen · Yuejie Chi -
2023 : H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models »
Zhenyu Zhang · Ying Sheng · Tianyi Zhou · Tianlong Chen · Lianmin Zheng · Ruisi Cai · Zhao Song · Yuandong Tian · Christopher Re · Clark Barrett · Zhangyang “Atlas” Wang · Beidi Chen -
2023 : Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training »
Hong Liu · Zhiyuan Li · David Hall · Percy Liang · Tengyu Ma -
2023 : Incremental Low-Rank Learning »
Jiawei Zhao · Yifei Zhang · Beidi Chen · Florian Schaefer · Anima Anandkumar -
2023 Oral: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time »
Zichang Liu · Jue Wang · Tri Dao · Tianyi Zhou · Binhang Yuan · Zhao Song · Anshumali Shrivastava · Ce Zhang · Yuandong Tian · Christopher Re · Beidi Chen -
2023 Oral: Hyena Hierarchy: Towards Larger Convolutional Language Models »
Michael Poli · Stefano Massaroli · Eric Nguyen · Daniel Y Fu · Tri Dao · Stephen Baccus · Yoshua Bengio · Stefano Ermon · Christopher Re -
2023 Poster: Simple Hardware-Efficient Long Convolutions for Sequence Modeling »
Daniel Y Fu · Elliot L Epstein · Eric Nguyen · Armin Thomas · Michael Zhang · Tri Dao · Atri Rudra · Christopher Re -
2023 Poster: Whose Opinions Do Language Models Reflect? »
Shibani Santurkar · Esin Durmus · Faisal Ladhak · Cinoo Lee · Percy Liang · Tatsunori Hashimoto -
2023 Poster: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU »
Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang -
2023 Oral: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU »
Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang -
2023 Oral: Whose Opinions Do Language Models Reflect? »
Shibani Santurkar · Esin Durmus · Faisal Ladhak · Cinoo Lee · Percy Liang · Tatsunori Hashimoto -
2023 Oral: Evaluating Self-Supervised Learning via Risk Decomposition »
Yann Dubois · Tatsunori Hashimoto · Percy Liang -
2023 Poster: Evaluating Self-Supervised Learning via Risk Decomposition »
Yann Dubois · Tatsunori Hashimoto · Percy Liang -
2023 Poster: Hyena Hierarchy: Towards Larger Convolutional Language Models »
Michael Poli · Stefano Massaroli · Eric Nguyen · Daniel Y Fu · Tri Dao · Stephen Baccus · Yoshua Bengio · Stefano Ermon · Christopher Re -
2023 Poster: CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks »
Jue Wang · Yucheng Lu · Binhang Yuan · Beidi Chen · Percy Liang · Chris De Sa · Christopher Re · Ce Zhang -
2023 Poster: Out-of-Domain Robustness via Targeted Augmentations »
Irena Gao · Shiori Sagawa · Pang Wei Koh · Tatsunori Hashimoto · Percy Liang -
2023 Poster: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time »
Zichang Liu · Jue Wang · Tri Dao · Tianyi Zhou · Binhang Yuan · Zhao Song · Anshumali Shrivastava · Ce Zhang · Yuandong Tian · Christopher Re · Beidi Chen -
2023 Poster: One-sided Matrix Completion from Two Observations Per Row »
Steven Cao · Percy Liang · Greg Valiant -
2023 Poster: Retrieval-Augmented Multimodal Language Modeling »
Michihiro Yasunaga · Armen Aghajanyan · Weijia Shi · Richard James · Jure Leskovec · Percy Liang · Mike Lewis · Luke Zettlemoyer · Scott Yih -
2022 : FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness »
Tri Dao · Daniel Y Fu · Stefano Ermon · Atri Rudra · Christopher Re -
2022 : Discussion Panel »
Percy Liang · Léon Bottou · Jayashree Kalpathy-Cramer · Alex Smola -
2022 Workshop: The First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward »
Huaxiu Yao · Hugo Larochelle · Percy Liang · Colin Raffel · Jian Tang · Ying WEI · Saining Xie · Eric Xing · Chelsea Finn -
2022 : RITA: a Study on Scaling Up Generative Protein Sequence Models »
Daniel Hesslow -
2022 Poster: Connect, Not Collapse: Explaining Contrastive Learning for Unsupervised Domain Adaptation »
Kendrick Shen · Robbie Jones · Ananya Kumar · Sang Michael Xie · Jeff Z. HaoChen · Tengyu Ma · Percy Liang -
2022 Oral: Connect, Not Collapse: Explaining Contrastive Learning for Unsupervised Domain Adaptation »
Kendrick Shen · Robbie Jones · Ananya Kumar · Sang Michael Xie · Jeff Z. HaoChen · Tengyu Ma · Percy Liang -
2022 Poster: Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning »
Mayee Chen · Daniel Y Fu · Avanika Narayan · Michael Zhang · Zhao Song · Kayvon Fatahalian · Christopher Re -
2022 Spotlight: Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning »
Mayee Chen · Daniel Y Fu · Avanika Narayan · Michael Zhang · Zhao Song · Kayvon Fatahalian · Christopher Re -
2022 Poster: ButterflyFlow: Building Invertible Layers with Butterfly Matrices »
Chenlin Meng · Linqi Zhou · Kristy Choi · Tri Dao · Stefano Ermon -
2022 Poster: Monarch: Expressive Structured Matrices for Efficient and Accurate Training »
Tri Dao · Beidi Chen · Nimit Sohoni · Arjun Desai · Michael Poli · Jessica Grogan · Alexander Liu · Aniruddh Rao · Atri Rudra · Christopher Re -
2022 Poster: What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization? »
Thomas Wang · Adam Roberts · Daniel Hesslow · Teven Le Scao · Hyung Won Chung · Iz Beltagy · Julien Launay · Colin Raffel -
2022 Oral: Monarch: Expressive Structured Matrices for Efficient and Accurate Training »
Tri Dao · Beidi Chen · Nimit Sohoni · Arjun Desai · Michael Poli · Jessica Grogan · Alexander Liu · Aniruddh Rao · Atri Rudra · Christopher Re -
2022 Spotlight: What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization? »
Thomas Wang · Adam Roberts · Daniel Hesslow · Teven Le Scao · Hyung Won Chung · Iz Beltagy · Julien Launay · Colin Raffel -
2022 Spotlight: ButterflyFlow: Building Invertible Layers with Butterfly Matrices »
Chenlin Meng · Linqi Zhou · Kristy Choi · Tri Dao · Stefano Ermon -
2021 Poster: WILDS: A Benchmark of in-the-Wild Distribution Shifts »
Pang Wei Koh · Shiori Sagawa · Henrik Marklund · Sang Michael Xie · Marvin Zhang · Akshay Balsubramani · Weihua Hu · Michihiro Yasunaga · Richard Lanas Phillips · Irena Gao · Tony Lee · Etienne David · Ian Stavness · Wei Guo · Berton Earnshaw · Imran Haque · Sara Beery · Jure Leskovec · Anshul Kundaje · Emma Pierson · Sergey Levine · Chelsea Finn · Percy Liang -
2021 Poster: Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for Improved Generalization »
Sang Michael Xie · Tengyu Ma · Percy Liang -
2021 Oral: WILDS: A Benchmark of in-the-Wild Distribution Shifts »
Pang Wei Koh · Shiori Sagawa · Henrik Marklund · Sang Michael Xie · Marvin Zhang · Akshay Balsubramani · Weihua Hu · Michihiro Yasunaga · Richard Lanas Phillips · Irena Gao · Tony Lee · Etienne David · Ian Stavness · Wei Guo · Berton Earnshaw · Imran Haque · Sara Beery · Jure Leskovec · Anshul Kundaje · Emma Pierson · Sergey Levine · Chelsea Finn · Percy Liang -
2021 Oral: Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for Improved Generalization »
Sang Michael Xie · Tengyu Ma · Percy Liang -
2021 Poster: Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization »
John Miller · Rohan Taori · Aditi Raghunathan · Shiori Sagawa · Pang Wei Koh · Vaishaal Shankar · Percy Liang · Yair Carmon · Ludwig Schmidt -
2021 Poster: Break-It-Fix-It: Unsupervised Learning for Program Repair »
Michihiro Yasunaga · Percy Liang -
2021 Oral: Break-It-Fix-It: Unsupervised Learning for Program Repair »
Michihiro Yasunaga · Percy Liang -
2021 Spotlight: Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization »
John Miller · Rohan Taori · Aditi Raghunathan · Shiori Sagawa · Pang Wei Koh · Vaishaal Shankar · Percy Liang · Yair Carmon · Ludwig Schmidt -
2021 Poster: Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices »
Evan Liu · Aditi Raghunathan · Percy Liang · Chelsea Finn -
2021 Spotlight: Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices »
Evan Liu · Aditi Raghunathan · Percy Liang · Chelsea Finn -
2021 Poster: Catformer: Designing Stable Transformers via Sensitivity Analysis »
Jared Quincy Davis · Albert Gu · Krzysztof Choromanski · Tri Dao · Christopher Re · Chelsea Finn · Percy Liang -
2021 Poster: Just Train Twice: Improving Group Robustness without Training Group Information »
Evan Liu · Behzad Haghgoo · Annie Chen · Aditi Raghunathan · Pang Wei Koh · Shiori Sagawa · Percy Liang · Chelsea Finn -
2021 Poster: A Tale of Two Efficient and Informative Negative Sampling Distributions »
Shabnam Daghaghi · Tharun Medini · Nicholas Meisburger · Beidi Chen · Mengnan Zhao · Anshumali Shrivastava -
2021 Spotlight: Catformer: Designing Stable Transformers via Sensitivity Analysis »
Jared Quincy Davis · Albert Gu · Krzysztof Choromanski · Tri Dao · Christopher Re · Chelsea Finn · Percy Liang -
2021 Oral: Just Train Twice: Improving Group Robustness without Training Group Information »
Evan Liu · Behzad Haghgoo · Annie Chen · Aditi Raghunathan · Pang Wei Koh · Shiori Sagawa · Percy Liang · Chelsea Finn -
2021 Oral: A Tale of Two Efficient and Informative Negative Sampling Distributions »
Shabnam Daghaghi · Tharun Medini · Nicholas Meisburger · Beidi Chen · Mengnan Zhao · Anshumali Shrivastava -
2020 : Keynote #3 Percy Liang »
Percy Liang -
2020 Poster: Concept Bottleneck Models »
Pang Wei Koh · Thao Nguyen · Yew Siang Tang · Stephen Mussmann · Emma Pierson · Been Kim · Percy Liang -
2020 Poster: Graph-based, Self-Supervised Program Repair from Diagnostic Feedback »
Michihiro Yasunaga · Percy Liang -
2020 Poster: Understanding Self-Training for Gradual Domain Adaptation »
Ananya Kumar · Tengyu Ma · Percy Liang -
2020 Poster: Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods »
Daniel Y Fu · Mayee Chen · Frederic Sala · Sarah Hooper · Kayvon Fatahalian · Christopher Re -
2020 Poster: Understanding and Mitigating the Tradeoff between Robustness and Accuracy »
Aditi Raghunathan · Sang Michael Xie · Fanny Yang · John Duchi · Percy Liang -
2020 Poster: An Investigation of Why Overparameterization Exacerbates Spurious Correlations »
Shiori Sagawa · aditi raghunathan · Pang Wei Koh · Percy Liang -
2020 Poster: Robustness to Spurious Correlations via Human Annotations »
Megha Srivastava · Tatsunori Hashimoto · Percy Liang -
2020 Poster: Feature Noise Induces Loss Discrepancy Across Groups »
Fereshte Khani · Percy Liang -
2020 Poster: Angular Visual Hardness »
Beidi Chen · Weiyang Liu · Zhiding Yu · Jan Kautz · Anshumali Shrivastava · Animesh Garg · Anima Anandkumar -
2019 Workshop: Workshop on the Security and Privacy of Machine Learning »
Nicolas Papernot · Florian Tramer · Bo Li · Dan Boneh · David Evans · Somesh Jha · Percy Liang · Patrick McDaniel · Jacob Steinhardt · Dawn Song -
2019 Poster: Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations »
Tri Dao · Albert Gu · Matthew Eichhorn · Atri Rudra · Christopher Re -
2019 Oral: Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations »
Tri Dao · Albert Gu · Matthew Eichhorn · Atri Rudra · Christopher Re -
2019 Poster: A Kernel Theory of Modern Data Augmentation »
Tri Dao · Albert Gu · Alexander J Ratner · Virginia Smith · Christopher De Sa · Christopher Re -
2019 Oral: A Kernel Theory of Modern Data Augmentation »
Tri Dao · Albert Gu · Alexander J Ratner · Virginia Smith · Christopher De Sa · Christopher Re -
2018 Poster: On the Relationship between Data Efficiency and Error for Uncertainty Sampling »
Stephen Mussmann · Percy Liang -
2018 Poster: Fairness Without Demographics in Repeated Loss Minimization »
Tatsunori Hashimoto · Megha Srivastava · Hongseok Namkoong · Percy Liang -
2018 Oral: Fairness Without Demographics in Repeated Loss Minimization »
Tatsunori Hashimoto · Megha Srivastava · Hongseok Namkoong · Percy Liang -
2018 Oral: On the Relationship between Data Efficiency and Error for Uncertainty Sampling »
Stephen Mussmann · Percy Liang -
2017 Poster: World of Bits: An Open-Domain Platform for Web-Based Agents »
Tim Shi · Andrej Karpathy · Jim Fan · Jonathan Hernandez · Percy Liang -
2017 Talk: World of Bits: An Open-Domain Platform for Web-Based Agents »
Tim Shi · Andrej Karpathy · Jim Fan · Jonathan Hernandez · Percy Liang -
2017 Poster: Developing Bug-Free Machine Learning Systems With Formal Mathematics »
Daniel Selsam · Percy Liang · David L Dill -
2017 Talk: Developing Bug-Free Machine Learning Systems With Formal Mathematics »
Daniel Selsam · Percy Liang · David L Dill -
2017 Poster: Convexified Convolutional Neural Networks »
Yuchen Zhang · Percy Liang · Martin Wainwright -
2017 Poster: Understanding Black-box Predictions via Influence Functions »
Pang Wei Koh · Percy Liang -
2017 Talk: Convexified Convolutional Neural Networks »
Yuchen Zhang · Percy Liang · Martin Wainwright -
2017 Talk: Understanding Black-box Predictions via Influence Functions »
Pang Wei Koh · Percy Liang