Workshop
ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
Julien Launay · Tri Dao · Daniel Y Fu · Max Ryabinin · Daniel Hesslow · Beidi Chen · Percy Liang
Lehar 2
Fri 26 Jul, midnight PDT
As models increase in size and training budget, they not only systematically improve in upstream quality, but also exhibit novel emergent capabilities, unlocking new AI applications. These new capabilities have led to a paradigm shift: large foundation models have become predominant in natural language processing and are growing increasingly common in computer vision, audio processing, and even robotics. This increase in scale raises proportionate difficulties for practitioners: foundation model training and inference lie at a unique interdisciplinary crossroad, combining open problems in algorithms, system design, and software engineering.

In response to these challenges, diverse research directions have spawned promising works: (1) training and inference either at large scale or in resource-constrained scenarios (e.g., with higher network latency and lower bandwidth, in a collaborative manner across a fleet of contributed devices, or with a single GPU); (2) large-scale distributed training approaches, such as 3D parallelism and sharding; and (3) deep system optimizations, with custom languages such as TVM and Triton. These novel interdisciplinary research directions directly shape and impact the trajectory of research across machine learning.

Accordingly, these emerging lines of research are increasingly relevant to machine learning researchers. Indeed, researchers are key stakeholders: on the one hand, researchers may contribute algorithmic insights and novel methods to improving training and inference of large models (e.g., recent award-winning papers at ICML and NeurIPS); on the other hand, novel research findings may be best demonstrated at scale, which may require training models as efficiently as possible to make the best use of available resources.

The goal of this workshop is to bring together interdisciplinary experts working on the emerging research questions and challenges associated with foundation model training and inference.

This is the second installment of the ES-FoMo workshop at ICML. This year, we are bringing further focus on three trends observed in 2023: (1) the emergence of novel architectures, popularized by Mixtral (mixture-of-experts) and Mamba (state-space models); (2) efficient open implementations, such as vLLM and gpt-fast; and (3) open questions on novel hardware and data tooling. We look forward to continuing to grow this community at ICML 2024.
Timezone: America/Los_Angeles
Schedule
Fri 12:00 a.m. - 12:00 a.m. | Opening Remarks
Fri 12:01 a.m. - 12:30 a.m. | Efficient Quantization Methods and Marlin, a Fast 4-Bit Inference Kernel (Invited Talk) | Elias Frantar
Fri 12:30 a.m. - 12:45 a.m. | Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation (Oral) | Harry Dong · Beidi Chen · Yuejie Chi
Fri 12:45 a.m. - 1:00 a.m. | Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs (Oral) | Ashwinee Panda · Berivan Isik · Xiangyu Qi · Sanmi Koyejo · Tsachy Weissman · Prateek Mittal
Fri 1:00 a.m. - 1:15 a.m. | Coffee Break
Fri 1:15 a.m. - 1:30 a.m. | Simple linear attention language models balance the recall-throughput tradeoff (Oral) | Simran Arora · Sabri Eyuboglu · Michael Zhang · Aman Timalsina · Silas Alberti · Dylan Zinsley · James Zou · Atri Rudra · Christopher Re
Fri 1:30 a.m. - 1:45 a.m. | xLSTM: Extended Long Short-Term Memory (Oral) | Maximilian Beck · Korbinian Pöppel · Markus Spanring · Andreas Auer · Oleksandra Prudnikova · Michael Kopp · Günter Klambauer · Johannes Brandstetter · Sepp Hochreiter
Fri 1:45 a.m. - 2:15 a.m. | A Deep Dive into State-Space Models (Invited Talk) | Albert Gu
Fri 2:15 a.m. - 2:45 a.m. | Scaling Mixture-of-Experts: Lessons from DBRX (Invited Talk) | Vitaliy Chiley
Fri 2:45 a.m. - 3:00 a.m. | Characterizing Prompt Compression Methods for Long Context Inference (Oral) | Siddharth Jha · Lutfi Erdogan · Sehoon Kim · Kurt Keutzer · Amir Gholaminejad
Fri 3:00 a.m. - 4:00 a.m. | Lunch Break
Fri 4:00 a.m. - 5:15 a.m. | Poster Session
Fri 5:15 a.m. - 5:30 a.m. | Awards
Fri 5:30 a.m. - 6:00 a.m. | Scaling Intelligence (Invited Talk) | Azalia Mirhoseini
Fri 6:00 a.m. - 6:30 a.m. | Frontier Clusters for Frontier Models: Scaling to 100,000 GPUs and Beyond (Invited Talk)
Fri 6:30 a.m. - 6:45 a.m. | Coffee Break
Fri 6:45 a.m. - 7:30 a.m. | Panel: Data and Architecture Trends Across Industry and Open Communities (Panel)
Fri 7:30 a.m. - 7:59 a.m. | Open Tooling for Large Data Pipelines (Invited Talk)
Fri 7:59 a.m. - 8:00 a.m. | Closing Remarks
- Fast and Memory-Efficient Multi-Sequence Generation via Structured Masking (Poster) | Daniel Israel · Siyan Zhao · Guy Van den Broeck · Aditya Grover
- Robust Federated Finetuning of Foundation Models via Alternating Minimization of LoRA (Poster) | Shuangyi Chen · Yue Ju · Hardik Dalal · Zhongwen Zhu · Ashish Khisti
- Implicit Optimization Bias of Next-token Prediction in Linear Models (Poster) | Christos Thrampoulidis
- Janus: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences (Poster) | Krithik Ramesh · Sameed Siddiqui · Michael Mitzenmacher · Pardis Sabeti
- Training-Free Acceleration of ViTs with Delayed Spatial Merging (Poster) | Jung Hwan Heo · Seyedarmin Azizi · Arash Fayyazi · Massoud Pedram
- MoRe Fine-Tuning with 10x Fewer Parameters (Poster) | Wenxuan Tan · Nicholas Roberts · Tzu-Heng Huang · Jitian Zhao · John Cooper · Samuel Guo · Chengyu Duan · Frederic Sala
- HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis (Poster) | Darren Key · Andy He · Mason Bulling · Andrew Chang · Skyler Shapiro · Everett Lee
- Task Addition and Weight Disentanglement in Closed-Vocabulary Models (Poster) | Adam Hazimeh · Alessandro Favero · Pascal Frossard
- DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation (Poster) | Ahmad Mohammadshirazi · Ali Nosratifiroozsalari · Mengxi Zhou · Dheeraj Kulshrestha · Rajiv Ramnath
- GPTVQ: The Blessing of Dimensionality for LLM Quantization (Poster) | Marinus van Baalen · Andrey Kuzmin · Markus Nagel · Peter Couperus · Artem Bolshakov · Cedric Bastoul · Eric Mahurin · Tijmen Blankevoort · Paul Whatmough
- Scavenging Hyena: Distilling Transformers into Long Convolution Models (Poster) | Tokiniaina Ralambomihanta · Shahrad Mohammadzadeh · Mohammad Sami Nur Islam · Wassim Jabbour · Laurence Liang
- NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming (Poster) | Guray Ozen
- Learned Best-Effort LLM Serving (Poster) | Siddharth Jha · Coleman Hooper · Xiaoxuan Liu · Sehoon Kim · Kurt Keutzer
- Revealing the Utilized Rank of Subspaces of Learning in Neural Networks (Poster) | Isha Garg · Christian Koguchi · Eshan Verma · Daniel Ulbricht
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations (Poster) | Alexander Hägele · Elie Bakouch · Atli Kosson · Loubna Ben allal · Leandro Von Werra · Martin Jaggi
- TinyAgent: Quantization-aware Model Compression and Adaptation for On-device LLM Agent Deployment (Poster) | Jason Kong · Lanxiang Hu · Flavio Ponzina · Tajana Rosing
- AdaNF: Quantization Group Adaptive NormalFloat for Low Bit Fine-tuning of LLMs (Poster) | Yeojoon Youn · Sehoon Kim · Suhong Moon · Sang Keun Choe · Ce Zhang
- Fast Adaptation and Robust Quantization of Multi-Modal Foundation Models from Associative Memory: A Case Study in SpeechLM (Poster) | Shang Wu · Yen-Ju Lu · Haozheng Luo · Jerry Yao-Chieh Hu · Jiayi Wang · Jing Liu · Najim Dehak · Jesus Villalba · Han Liu
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models (Poster) | Siyan Zhao · Daniel Israel · Guy Van den Broeck · Aditya Grover
- Quantum-PEFT: Ultra parameter-efficient fine-tuning (Poster) | Toshiaki Koike-Akino · Francesco Tonin · Yongtao Wu · Leyla Candogan · Volkan Cevher
- LAuReL: Learned Augmented Residual Layer (Poster) | Gaurav Menghani · Ravi Kumar · Sanjiv Kumar
- Fast yet Safe: Early-Exiting with Risk Control (Poster) | Metod Jazbec · Alexander Timans · Tin Hadži Veljković · Johann Sakmann · Dan Zhang · Christian Andersson Naesseth · Eric Nalisnick
- Mamba-PTQ: Outlier Channels in Recurrent Large Language Models (Poster) | Alessandro Pierro · Steven Abreu
- Can Transformers Solve Least Squares to High Precision? (Poster) | Jerry Liu · Jessica Grogan · Owen Dugan · Simran Arora · Atri Rudra · Christopher Re
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference (Poster) | Qichen Fu · Minsik Cho · Thomas Merth · Sachin Mehta · Mohammad Rastegari · Mahyar Najibi
- Does your data spark joy? Performance gains from domain upsampling at the end of training (Poster) | Cody Blakeney · Mansheej Paul · Brett Larsen · Sean Owen · Jonathan Frankle
- Pretrained Hybrids with MAD Skills (Poster) | Nicholas Roberts · Samuel Guo · Zhiqi Gao · Satya Sai Srinath Namburi GNVV · Sonia Cromp · Chengjun Wu · Chengyu Duan · Frederic Sala
- Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference (Poster) | Oshin Dutta · Ritvik Gupta · Sumeet Agarwal
- Enhancing Stability for Large Models Training in Constrained Bandwidth Networks (Poster) | Yun Dai · Tejas Dharamsi · Pin-Lun Hsu · Tao Song · Hamed Firooz
- Mobile and Edge Evaluation of Large Language Models (Poster) | Stefanos Laskaridis · Kleomenis Katevas · Lorenzo Minto · Hamed Haddadi
- SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths (Poster) | Kaixuan Huang · Xudong Guo · Mengdi Wang
- OutEffHop: A Principled Outlier-Efficient Attention Layer from Dense Associative Memory Models (Poster) | Haozheng Luo · Jerry Yao-Chieh Hu · Pei-Hsuan Chang · Hong-Yu Chen · Weijian Li · Wei-Po Wang · Han Liu
- Hydragen: High-Throughput LLM Inference with Shared Prefixes (Poster) | Jordan Juravsky · Bradley Brown · Ryan Ehrlich · Daniel Y Fu · Christopher Re · Azalia Mirhoseini
- Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead (Poster) | Rickard Gabrielsson · Jiacheng Zhu · Onkar Bhardwaj · Leshem Choshen · Kristjan Greenewald · Mikhail Yurochkin · Justin Solomon
- In Defense of Structural Sparse Adapters for Concurrent LLM Serving (Poster) | Junda Su · Zirui Liu · Zeju Qiu · Weiyang Liu · Zhaozhuo Xu
- OpenELM: An Efficient Language Model Family with Open Training and Inference Framework (Poster) | Sachin Mehta · Mohammad Sekhavat · Qingqing Cao · Maxwell Horton · Yanzi Jin · Chenfan Sun · Seyed Iman Mirzadeh · Mahyar Najibi · Dmitry Belenko · Peter Zatloukal · Mohammad Rastegari
- ExpoMamba: Exploiting Frequency SSM Blocks for Efficient and Effective Image Enhancement (Poster) | Eashan Adhikarla · Kai Zhang · John Nicholson · Brian D Davison
- Why Transformers Need Adam: A Hessian Perspective (Poster) | Yushun Zhang · Congliang Chen · Tian Ding · Ziniu Li · Ruoyu Sun · Zhi-Quan Luo
- Low-rank Linearization of Large Language Models (Poster) | Michael Zhang · Aaryan Singhal · Benjamin F Spector · Simran Arora · Christopher Re
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models (Poster) | Junxiong Wang · Daniele Paliotta · Avner May · Alexander Rush · Tri Dao
- SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors (Poster) | Vijay Lingam · Atula Tejaswi · Aditya Vavre · Aneesh Shetty · Gautham Krishna Gudur · Joydeep Ghosh · Alexandros Dimakis · Eunsol Choi · Aleksandar Bojchevski · Sujay Sanghavi
- Efficient multi-prompt evaluation of LLMs (Poster) | Felipe Maia Polo · Ronald Xu · Lucas Weber · MÍRIAN FRANCIELLE DA SILVA · Onkar Bhardwaj · Leshem Choshen · Allysson de Oliveira · Yuekai Sun · Mikhail Yurochkin
- Adam-mini: Use Fewer Learning Rates To Gain More (Poster) | Yushun Zhang · Congliang Chen · Ziniu Li · Tian Ding · Chenwei Wu · Yinyu Ye · Zhi-Quan Luo · Ruoyu Sun
- Efficient Training of Language Models with Compact and Consistent Next Token Distributions (Poster) | Ashutosh Sathe · Sunita Sarawagi
- Revisiting Cascaded Ensembles for Efficient Inference (Poster) | Steven Kolawole · Don Kurian Dennis · Ameet Talwalkar · Virginia Smith
- Just read twice: closing the recall gap for recurrent language models (Poster) | Simran Arora · Aman Timalsina · Aaryan Singhal · Sabri Eyuboglu · Xinyi Zhao · Ashish Rao · Atri Rudra · Christopher Re
- Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion (Poster) | Filip Szatkowski · Bartosz Wójcik · Mikołaj Piórczyński · Simone Scardapane
- Low Rank Quantization-Aware Training for LLMs (Poster) | Yelysei Bondarenko · Riccardo Del Chiaro · Markus Nagel
- PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications (Poster) | Kshitij Bhardwaj
- Optimistic Verifiable Training by Controlling Hardware Nondeterminism (Poster) | Megha Srivastava · Simran Arora · Dan Boneh
- Block Verification Accelerates Speculative Decoding (Poster) | Ziteng Sun · Uri Mendlovic · Yaniv Leviathan · Asaf Aharoni · Ahmad Beirami · Jae Ro · Ananda Suresh
- Exponential Quantum Communication Advantage in Distributed Inference and Learning (Poster) | Hagay Michaeli · Dar Gilboa · Daniel Soudry · Jarrod McClean
- Hardware-Efficient Quantization for Green Custom Foundation Models (Poster) | Toshiaki Koike-Akino · Chang Meng · Volkan Cevher · Giovanni De Micheli
- Unlocking the Global Synergies in Low-Rank Adapters (Poster) | Zixi Zhang · Cheng Zhang · Xitong Gao · Robert Mullins · George Constantinides · Yiren Zhao
- Exploring and Improving Drafts in Blockwise Parallel Decoding (Poster) | Taehyeon Kim · Ananda Suresh · Kishore Papineni · Michael Riley · Sanjiv Kumar · Adrian Benton
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding (Poster) | Benjamin Bergner · Andrii Skliar · Amelie Royer · Tijmen Blankevoort · Yuki Asano · Babak Ehteshami Bejnordi
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts (Poster) | Qizhen Zhang · Nikolas Gritsch · Dwaraknath Gnaneshwar · Simon Guo · David Cairuz · Bharat Venkitesh · Jakob Foerster · Phil Blunsom · Sebastian Ruder · Ahmet Üstün · Acyr Locatelli
- GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients (Poster) | Aashiq Muhamed · Oscar Li · David Woodruff · Mona Diab · Virginia Smith
- Towards Efficient Large-Scale Language-3D Representation Learning (Poster) | Shentong Mo · Xiaogang Xu · Tongzhou Wang · Antonio Torralba · Shuang Li
- Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs (Poster) | Davide Paglieri · Saurabh Dash · Tim Rocktäschel · Jack Parker-Holder
- Train your cake and eat it too! Repurposing collaborative training to tailor LLMs to private data without sharing (Poster) | Boris Radovic · Mohammed Aljahdali · Marco Canini · Veljko Pejovic · Zuhair Khayyat
- Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones (Poster) | Andrey Zhmoginov · Jihwan Lee · Mark Sandler
- Fewer Truncations Improve Language Modeling (Poster) | Hantian Ding · Zijian Wang · Giovanni Paolini · Varun Kumar · Anoop Deoras · Dan Roth · Stefano Soatto
- Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity (Poster) | Wentao Guo · Jikai Long · YIMENG ZENG · Zirui Liu · Xinyu Yang · Yide Ran · Jacob Gardner · Osbert Bastani · Chris De Sa · Xiaodong Yu · Beidi Chen · Zhaozhuo Xu
- MInference: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (Poster) | Huiqiang Jiang · Yucheng Li · Chengruidong Zhang · Qianhui Wu · Xufang Luo · Surin Ahn · Zhenhua Han · Amir Abdi · Dongsheng Li · Chin-Yew Lin · Yuqing Yang · Lili Qiu
- Exploring Monotonicity in Early-Exiting Language Models (Poster) | Filipe Laitenberger · Max Belitsky · Denys Sheremet
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training (Poster) | Sami Jaghouar · Johannes Hagemann
- AdaInf: Adaptive Inference for Resource-Constrained Foundation Models (Poster) | Zhuoyan Xu · Khoi Nguyen · Preeti Mukherjee · Somali Chaterji · Yingyiu Liang · Yin Li
- Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones (Poster) | Mehrnaz Mofakhami · Reza Bayat · Ioannis Mitliagkas · Joao Monteiro · Valentina Zantedeschi
- Seeded LoRA: Collaborative Fine-Tuning Through Seed Initialization of Adapters (Poster) | Alejandro Rodriguez Salamanca · Ahmet Üstün · Nicki Skafte Detlefsen · Tim Dettmers
- Understanding and Minimising Outlier Features in Neural Network Training (Poster) | Bobby He · Lorenzo Noci · Daniele Paliotta · Imanol Schlag · Thomas Hofmann
- Towards smaller language models via layer looping (Poster) | Sabri Eyuboglu · Dylan Zinsley · Jon Saad-Falcon · Simran Arora · Atri Rudra · James Zou · Christopher Re
- CLAM: Unifying Finetuning, Quantization, and Pruning by Chaining LLM Adapter Modules (Poster) | Neelay Velingker · Jason Liu · Amish Sethi · William Dodds · Zhiqiu (Oscar) Xu · Saikat Dutta · Mayur Naik · Eric Wong
- Optimised Grouped-Query Attention Mechanism for Transformers (Poster) | Yuang Chen · Cheng Zhang · Xitong Gao · Robert Mullins · George Constantinides · Yiren Zhao
- CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Model (Poster) | Meguru Yamazaki · Shivaram Venkataraman