Workshop
ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
Julien Launay · Tri Dao · Daniel Y Fu · Max Ryabinin · Daniel Hesslow · Beidi Chen · Percy Liang
Lehar 2
Fri 26 Jul, midnight PDT
As models increase in size and training budget, they not only systematically improve in upstream quality, but also exhibit novel emergent capabilities, unlocking new AI applications. These new capabilities have led to a paradigm shift: large foundation models have become predominant in natural language processing and are growing increasingly common in computer vision, audio processing, and even robotics. This increase in scale raises proportionate difficulties for practitioners: foundation model training and inference lie at a unique interdisciplinary crossroad, combining open problems in algorithms, system design, and software engineering.

In response to these challenges, diverse research directions have spawned promising works: (1) training and inference either at large scale or in resource-constrained scenarios (e.g., with higher network latency and lower bandwidth, in a collaborative manner across a fleet of contributed devices, or with a single GPU); (2) large-scale distributed training approaches, such as 3D parallelism and sharding; and (3) deep system optimizations, with custom languages such as TVM and Triton. These novel interdisciplinary research directions directly shape and impact the trajectory of research across machine learning.

Accordingly, these emerging lines of research are increasingly relevant to machine learning researchers. Indeed, researchers are key stakeholders: on the one hand, researchers may contribute algorithmic insights and novel methods to improving training and inference of large models (e.g., recent award-winning papers at ICML and NeurIPS); on the other hand, novel research findings may be best demonstrated at scale, which may require training models as efficiently as possible to make the best use of available resources.

The goal of this workshop is to bring together interdisciplinary experts working on the emerging research questions and challenges associated with foundation model training and inference.

This is the second installment of the ES-FoMo workshop at ICML. This year, we are bringing further focus on three trends observed in 2023: (1) the emergence of novel architectures, popularized by Mixtral (mixture-of-experts) and Mamba (state-space models); (2) efficient open implementations, such as vLLM and gpt-fast; and (3) open questions on novel hardware and data tooling. We look forward to continuing to grow this community at ICML 2024.
Timezone: America/Los_Angeles
Schedule
Fri 12:00 a.m. - 12:00 a.m. | Opening Remarks
Fri 12:01 a.m. - 12:30 a.m. | Efficient Quantization Methods and Marlin, a Fast 4-Bit Inference Kernel (Invited Talk) | Elias Frantar
Fri 12:30 a.m. - 12:45 a.m. | Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation (Oral) | Harry Dong · Beidi Chen · Yuejie Chi
Fri 12:45 a.m. - 1:00 a.m. | Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs (Oral) | Ashwinee Panda · Berivan Isik · Xiangyu Qi · Sanmi Koyejo · Tsachy Weissman · Prateek Mittal
Fri 1:00 a.m. - 1:15 a.m. | Coffee Break
Fri 1:15 a.m. - 1:30 a.m. | Simple linear attention language models balance the recall-throughput tradeoff (Oral) | Simran Arora · Sabri Eyuboglu · Michael Zhang · Aman Timalsina · Silas Alberti · Dylan Zinsley · James Zou · Atri Rudra · Christopher Re
Fri 1:30 a.m. - 1:45 a.m. | xLSTM: Extended Long Short-Term Memory (Oral) | Maximilian Beck · Korbinian Pöppel · Markus Spanring · Andreas Auer · Oleksandra Prudnikova · Michael Kopp · Günter Klambauer · Johannes Brandstetter · Sepp Hochreiter
Fri 1:45 a.m. - 2:15 a.m. | A Deep Dive into State-Space Models (Invited Talk) | Albert Gu
Fri 2:15 a.m. - 2:45 a.m. | Scaling Mixture-of-Experts: Lessons from DBRX (Invited Talk) | Vitaliy Chiley
Fri 2:45 a.m. - 3:00 a.m. | Characterizing Prompt Compression Methods for Long Context Inference (Oral) | Siddharth Jha · Lutfi Erdogan · Sehoon Kim · Kurt Keutzer · Amir Gholaminejad
Fri 3:00 a.m. - 4:00 a.m. | Lunch Break
Fri 4:00 a.m. - 5:15 a.m. | Poster Session
Fri 5:15 a.m. - 5:30 a.m. | Awards
Fri 5:30 a.m. - 6:00 a.m. | Scaling Intelligence (Invited Talk) | Azalia Mirhoseini
Fri 6:00 a.m. - 6:30 a.m. | Frontier Clusters for Frontier Models: Scaling to 100,000 GPUs and Beyond (Invited Talk)
Fri 6:30 a.m. - 6:45 a.m. | Coffee Break
Fri 6:45 a.m. - 7:30 a.m. | Panel: Data and Architecture Trends Across Industry and Open Communities (Panel)
Fri 7:30 a.m. - 7:59 a.m. | Open Tooling for Large Data Pipelines (Invited Talk)
Fri 7:59 a.m. - 8:00 a.m. | Closing Remarks
- Fast and Memory-Efficient Multi-Sequence Generation via Structured Masking (Poster) | Daniel Israel · Siyan Zhao · Guy Van den Broeck · Aditya Grover
- Robust Federated Finetuning of Foundation Models via Alternating Minimization of LoRA (Poster) | Shuangyi Chen · Yue Ju · Hardik Dalal · Zhongwen Zhu · Ashish Khisti
- Implicit Optimization Bias of Next-token Prediction in Linear Models (Poster) | Christos Thrampoulidis
- Janus: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences (Poster) | Krithik Ramesh · Sameed Siddiqui · Michael Mitzenmacher · Pardis Sabeti
- Training-Free Acceleration of ViTs with Delayed Spatial Merging (Poster) | Jung Hwan Heo · Seyedarmin Azizi · Arash Fayyazi · Massoud Pedram
- MoRe Fine-Tuning with 10x Fewer Parameters (Poster) | Wenxuan Tan · Nicholas Roberts · Tzu-Heng Huang · Jitian Zhao · John Cooper · Samuel Guo · Chengyu Duan · Frederic Sala
- HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis (Poster) | Darren Key · Andy He · Mason Bulling · Andrew Chang · Skyler Shapiro · Everett Lee
- Task Addition and Weight Disentanglement in Closed-Vocabulary Models (Poster) | Adam Hazimeh · Alessandro Favero · Pascal Frossard
- DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation (Poster) | Ahmad Mohammadshirazi · Ali Nosratifiroozsalari · Mengxi Zhou · Dheeraj Kulshrestha · Rajiv Ramnath
- GPTVQ: The Blessing of Dimensionality for LLM Quantization (Poster) | Marinus van Baalen · Andrey Kuzmin · Markus Nagel · Peter Couperus · Artem Bolshakov · Cedric Bastoul · Eric Mahurin · Tijmen Blankevoort · Paul Whatmough
- Scavenging Hyena: Distilling Transformers into Long Convolution Models (Poster) | Tokiniaina Ralambomihanta · Shahrad Mohammadzadeh · Mohammad Sami Nur Islam · Wassim Jabbour · Laurence Liang
- NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming (Poster) | Guray Ozen
- Learned Best-Effort LLM Serving (Poster) | Siddharth Jha · Coleman Hooper · Xiaoxuan Liu · Sehoon Kim · Kurt Keutzer
- Revealing the Utilized Rank of Subspaces of Learning in Neural Networks (Poster) | Isha Garg · Christian Koguchi · Eshan Verma · Daniel Ulbricht
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations (Poster) | Alexander Hägele · Elie Bakouch · Atli Kosson · Loubna Ben allal · Leandro Von Werra · Martin Jaggi
- TinyAgent: Quantization-aware Model Compression and Adaptation for On-device LLM Agent Deployment (Poster) | Jason Kong · Lanxiang Hu · Flavio Ponzina · Tajana Rosing
- AdaNF: Quantization Group Adaptive NormalFloat for Low Bit Fine-tuning of LLMs (Poster) | Yeojoon Youn · Sehoon Kim · Suhong Moon · Sang Keun Choe · Ce Zhang
- Fast Adaptation and Robust Quantization of Multi-Modal Foundation Models from Associative Memory: A Case Study in SpeechLM (Poster) | Shang Wu · Yen-Ju Lu · Haozheng Luo · Jerry Yao-Chieh Hu · Jiayi Wang · Jing Liu · Najim Dehak · Jesus Villalba · Han Liu
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models (Poster) | Siyan Zhao · Daniel Israel · Guy Van den Broeck · Aditya Grover
- Quantum-PEFT: Ultra parameter-efficient fine-tuning (Poster) | Toshiaki Koike-Akino · Francesco Tonin · Yongtao Wu · Leyla Candogan · Volkan Cevher
- LAuReL: Learned Augmented Residual Layer (Poster) | Gaurav Menghani · Ravi Kumar · Sanjiv Kumar
- Fast yet Safe: Early-Exiting with Risk Control (Poster) | Metod Jazbec · Alexander Timans · Tin Hadži Veljković · Johann Sakmann · Dan Zhang · Christian Andersson Naesseth · Eric Nalisnick
- Mamba-PTQ: Outlier Channels in Recurrent Large Language Models (Poster) | Alessandro Pierro · Steven Abreu
- Can Transformers Solve Least Squares to High Precision? (Poster) | Jerry Liu · Jessica Grogan · Owen Dugan · Simran Arora · Atri Rudra · Christopher Re
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference (Poster) | Qichen Fu · Minsik Cho · Thomas Merth · Sachin Mehta · Mohammad Rastegari · Mahyar Najibi
- Does your data spark joy? Performance gains from domain upsampling at the end of training (Poster) | Cody Blakeney · Mansheej Paul · Brett Larsen · Sean Owen · Jonathan Frankle
- Pretrained Hybrids with MAD Skills (Poster) | Nicholas Roberts · Samuel Guo · Zhiqi Gao · Satya Sai Srinath Namburi GNVV · Sonia Cromp · Chengjun Wu · Chengyu Duan · Frederic Sala
- Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference (Poster) | Oshin Dutta · Ritvik Gupta · Sumeet Agarwal
- Enhancing Stability for Large Models Training in Constrained Bandwidth Networks (Poster) | Yun Dai · Tejas Dharamsi · Pin-Lun Hsu · Tao Song · Hamed Firooz
- Mobile and Edge Evaluation of Large Language Models (Poster) | Stefanos Laskaridis · Kleomenis Katevas · Lorenzo Minto · Hamed Haddadi
- SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths (Poster) | Kaixuan Huang · Xudong Guo · Mengdi Wang
- OutEffHop: A Principled Outlier-Efficient Attention Layer from Dense Associative Memory Models (Poster) | Haozheng Luo · Jerry Yao-Chieh Hu · Pei-Hsuan Chang · Hong-Yu Chen · Weijian Li · Wei-Po Wang · Han Liu
- Hydragen: High-Throughput LLM Inference with Shared Prefixes (Poster) | Jordan Juravsky · Bradley Brown · Ryan Ehrlich · Daniel Y Fu · Christopher Re · Azalia Mirhoseini
- Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead (Poster) | Rickard Gabrielsson · Jiacheng Zhu · Onkar Bhardwaj · Leshem Choshen · Kristjan Greenewald · Mikhail Yurochkin · Justin Solomon
- In Defense of Structural Sparse Adapters for Concurrent LLM Serving (Poster) | Junda Su · Zirui Liu · Zeju Qiu · Weiyang Liu · Zhaozhuo Xu
- OpenELM: An Efficient Language Model Family with Open Training and Inference Framework (Poster) | Sachin Mehta · Mohammad Sekhavat · Qingqing Cao · Maxwell Horton · Yanzi Jin · Chenfan Sun · Seyed Iman Mirzadeh · Mahyar Najibi · Dmitry Belenko · Peter Zatloukal · Mohammad Rastegari
- ExpoMamba: Exploiting Frequency SSM Blocks for Efficient and Effective Image Enhancement (Poster) | Eashan Adhikarla · Kai Zhang · John Nicholson · Brian D Davison
- Why Transformers Need Adam: A Hessian Perspective (Poster) | Yushun Zhang · Congliang Chen · Tian Ding · Ziniu Li · Ruoyu Sun · Zhi-Quan Luo
- Low-rank Linearization of Large Language Models (Poster) | Michael Zhang · Aaryan Singhal · Benjamin F Spector · Simran Arora · Christopher Re
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models (Poster) | Junxiong Wang · Daniele Paliotta · Avner May · Alexander Rush · Tri Dao
- SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors (Poster) | Vijay Lingam · Atula Tejaswi · Aditya Vavre · Aneesh Shetty · Gautham Krishna Gudur · Joydeep Ghosh · Alexandros Dimakis · Eunsol Choi · Aleksandar Bojchevski · Sujay Sanghavi
- Efficient multi-prompt evaluation of LLMs (Poster) | Felipe Maia Polo · Ronald Xu · Lucas Weber · MÍRIAN FRANCIELLE DA SILVA · Onkar Bhardwaj · Leshem Choshen · Allysson de Oliveira · Yuekai Sun · Mikhail Yurochkin
- Adam-mini: Use Fewer Learning Rates To Gain More (Poster) | Yushun Zhang · Congliang Chen · Ziniu Li · Tian Ding · Chenwei Wu · Yinyu Ye · Zhi-Quan Luo · Ruoyu Sun
- Efficient Training of Language Models with Compact and Consistent Next Token Distributions (Poster) | Ashutosh Sathe · Sunita Sarawagi
- Revisiting Cascaded Ensembles for Efficient Inference (Poster) | Steven Kolawole · Don Kurian Dennis · Ameet Talwalkar · Virginia Smith
- Just read twice: closing the recall gap for recurrent language models (Poster) | Simran Arora · Aman Timalsina · Aaryan Singhal · Sabri Eyuboglu · Xinyi Zhao · Ashish Rao · Atri Rudra · Christopher Re
- Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion (Poster) | Filip Szatkowski · Bartosz Wójcik · Mikołaj Piórczyński · Simone Scardapane
- Low Rank Quantization-Aware Training for LLMs (Poster) | Yelysei Bondarenko · Riccardo Del Chiaro · Markus Nagel
- PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications (Poster) | Kshitij Bhardwaj
- Optimistic Verifiable Training by Controlling Hardware Nondeterminism (Poster) | Megha Srivastava · Simran Arora · Dan Boneh
- Block Verification Accelerates Speculative Decoding (Poster) | Ziteng Sun · Uri Mendlovic · Yaniv Leviathan · Asaf Aharoni · Ahmad Beirami · Jae Ro · Ananda Suresh
- Exponential Quantum Communication Advantage in Distributed Inference and Learning (Poster) | Hagay Michaeli · Dar Gilboa · Daniel Soudry · Jarrod McClean
- Hardware-Efficient Quantization for Green Custom Foundation Models (Poster) | Toshiaki Koike-Akino · Chang Meng · Volkan Cevher · Giovanni De Micheli
- Unlocking the Global Synergies in Low-Rank Adapters (Poster) | Zixi Zhang · Cheng Zhang · Xitong Gao · Robert Mullins · George Constantinides · Yiren Zhao
- Exploring and Improving Drafts in Blockwise Parallel Decoding (Poster) | Taehyeon Kim · Ananda Suresh · Kishore Papineni · Michael Riley · Sanjiv Kumar · Adrian Benton
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding (Poster) | Benjamin Bergner · Andrii Skliar · Amelie Royer · Tijmen Blankevoort · Yuki Asano · Babak Ehteshami Bejnordi
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts (Poster) | Qizhen Zhang · Nikolas Gritsch · Dwaraknath Gnaneshwar · Simon Guo · David Cairuz · Bharat Venkitesh · Jakob Foerster · Phil Blunsom · Sebastian Ruder · Ahmet Üstün · Acyr Locatelli
- GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients (Poster) | Aashiq Muhamed · Oscar Li · David Woodruff · Mona Diab · Virginia Smith
- Towards Efficient Large-Scale Language-3D Representation Learning (Poster) | Shentong Mo · Xiaogang Xu · Tongzhou Wang · Antonio Torralba · Shuang Li
- Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs (Poster) | Davide Paglieri · Saurabh Dash · Tim Rocktäschel · Jack Parker-Holder
- Train your cake and eat it too! Repurposing collaborative training to tailor LLMs to private data without sharing (Poster) | Boris Radovic · Mohammed Aljahdali · Marco Canini · Veljko Pejovic · Zuhair Khayyat
- Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones (Poster) | Andrey Zhmoginov · Jihwan Lee · Mark Sandler
- Fewer Truncations Improve Language Modeling (Poster) | Hantian Ding · Zijian Wang · Giovanni Paolini · Varun Kumar · Anoop Deoras · Dan Roth · Stefano Soatto
- Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity (Poster) | Wentao Guo · Jikai Long · YIMENG ZENG · Zirui Liu · Xinyu Yang · Yide Ran · Jacob Gardner · Osbert Bastani · Chris De Sa · Xiaodong Yu · Beidi Chen · Zhaozhuo Xu
- MInference: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (Poster) | Huiqiang Jiang · Yucheng Li · Chengruidong Zhang · Qianhui Wu · Xufang Luo · Surin Ahn · Zhenhua Han · Amir Abdi · Dongsheng Li · Chin-Yew Lin · Yuqing Yang · Lili Qiu
- Exploring Monotonicity in Early-Exiting Language Models (Poster) | Filipe Laitenberger · Max Belitsky · Denys Sheremet
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training (Poster) | Sami Jaghouar · Johannes Hagemann
- AdaInf: Adaptive Inference for Resource-Constrained Foundation Models (Poster) | Zhuoyan Xu · Khoi Nguyen · Preeti Mukherjee · Somali Chaterji · Yingyiu Liang · Yin Li
- Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones (Poster) | Mehrnaz Mofakhami · Reza Bayat · Ioannis Mitliagkas · Joao Monteiro · Valentina Zantedeschi
- Seeded LoRA: Collaborative Fine-Tuning Through Seed Initialization of Adapters (Poster) | Alejandro Rodriguez Salamanca · Ahmet Üstün · Nicki Skafte Detlefsen · Tim Dettmers
- Understanding and Minimising Outlier Features in Neural Network Training (Poster) | Bobby He · Lorenzo Noci · Daniele Paliotta · Imanol Schlag · Thomas Hofmann
- Towards smaller language models via layer looping (Poster) | Sabri Eyuboglu · Dylan Zinsley · Jon Saad-Falcon · Simran Arora · Atri Rudra · James Zou · Christopher Re
- CLAM: Unifying Finetuning, Quantization, and Pruning by Chaining LLM Adapter Modules (Poster) | Neelay Velingker · Jason Liu · Amish Sethi · William Dodds · Zhiqiu (Oscar) Xu · Saikat Dutta · Mayur Naik · Eric Wong
- Optimised Grouped-Query Attention Mechanism for Transformers (Poster) | Yuang Chen · Cheng Zhang · Xitong Gao · Robert Mullins · George Constantinides · Yiren Zhao
- CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Model (Poster) | Meguru Yamazaki · Shivaram Venkataraman