Spotlight Poster
AdaSplash: Adaptive Sparse Flash Attention
Nuno Gonçalves · Marcos V. Treviso · Andre Martins
East Exhibition Hall A-B #E-3305
                        
                        
Oral presentation: Oral 2D Efficient ML
Tue 15 Jul 3:30 p.m. PDT to 4:30 p.m. PDT
Tue 15 Jul 4:30 p.m. PDT to 7 p.m. PDT
Abstract:
The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches, and in some cases surpasses, the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.
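To make the root-finding step concrete, here is a minimal pure-Python sketch. It is not the authors' Triton implementation. It assumes the standard form of $\alpha$-entmax, $p_i = [(\alpha-1) z_i - \tau]_+^{1/(\alpha-1)}$, with the threshold $\tau$ chosen so that the outputs sum to 1, and it assumes $\alpha > 1$. The safeguarded update shown here (try a Halley step, fall back to bisection whenever the step leaves the current bracket) is one plausible way to combine the two methods and may differ from the hybrid algorithm in the paper. The names `entmax_threshold` and `entmax` are hypothetical.

```python
def entmax_threshold(z, alpha=1.5, use_halley=True, tol=1e-10, max_iter=100):
    """Find tau with sum_i [(alpha-1)*z_i - tau]_+^(1/(alpha-1)) = 1.

    Returns (tau, iterations_used). Illustrative sketch, assumes alpha > 1.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch assumes alpha > 1")
    expo = 1.0 / (alpha - 1.0)
    scaled = [(alpha - 1.0) * v for v in z]
    top = max(scaled)
    # Bracket for the root: f(lo) >= 0 and f(hi) <= 0, with f decreasing in tau.
    lo = top - 1.0
    hi = top - len(z) ** (-(alpha - 1.0))
    tau = 0.5 * (lo + hi)
    for it in range(1, max_iter + 1):
        # Only entries above the threshold contribute (the current support).
        support = [s - tau for s in scaled if s > tau]
        f = sum(u ** expo for u in support) - 1.0
        if abs(f) < tol:
            return tau, it
        # Shrink the bracket using the sign of f.
        if f > 0.0:
            lo = tau
        else:
            hi = tau
        candidate = None
        if use_halley:
            try:
                # First and second derivatives of f with respect to tau.
                d1 = -expo * sum(u ** (expo - 1.0) for u in support)
                d2 = expo * (expo - 1.0) * sum(u ** (expo - 2.0) for u in support)
                denom = 2.0 * d1 * d1 - f * d2
                candidate = tau - 2.0 * f * d1 / denom  # Halley update
            except (OverflowError, ZeroDivisionError):
                candidate = None
        # Safeguard: reject steps outside the bracket and bisect instead.
        if candidate is None or not (lo < candidate < hi):
            candidate = 0.5 * (lo + hi)
        tau = candidate
    return tau, max_iter


def entmax(z, alpha=1.5, use_halley=True):
    """Map scores z to a sparse probability vector (hypothetical helper)."""
    tau, _ = entmax_threshold(z, alpha, use_halley=use_halley)
    expo = 1.0 / (alpha - 1.0)
    p = [max((alpha - 1.0) * v - tau, 0.0) ** expo for v in z]
    total = sum(p)  # renormalize to absorb the solver tolerance
    return [x / total for x in p]
```

Passing `use_halley=False` runs plain bisection over the same bracket, so the iteration counts returned by the two modes can be compared on the same input.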
                        
                    
                    
                
Lay Summary:
Transformers, the backbone of modern language models, use softmax attention to decide how much focus each token (e.g., a word) gives to others. While effective, softmax always assigns some importance to all tokens, even irrelevant ones, making it harder for models to focus sharply on important tokens. A promising alternative is adaptively sparse $\alpha$-entmax attention, which learns to ignore irrelevant words by assigning them exactly zero weight, allowing models to focus more selectively. However, prior implementations of $\alpha$-entmax have been too slow and memory-intensive for practical use. To solve this, we introduce AdaSplash, a fast and GPU-friendly implementation of $\alpha$-entmax attention. With this, we significantly cut down both computation and memory usage compared to previous methods. As a result, AdaSplash closes the longstanding gap between the theoretical appeal of $\alpha$-entmax attention and its practical usability in large-scale, long-context applications. All code is open-source to support broader adoption and further research.
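The contrast between softmax and sparse attention weights can be seen on a toy example. The sketch below is illustrative only and is not taken from AdaSplash. It compares softmax with sparsemax, the $\alpha = 2$ special case of $\alpha$-entmax, which has a simple sort-based closed form. The scores are made-up values and the helper names are hypothetical.

```python
import math


def softmax(z):
    """Standard softmax: every output is strictly positive."""
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]


def sparsemax(z):
    """alpha-entmax with alpha = 2, computed via the sort-based closed form."""
    tau = 0.0
    cumsum = 0.0
    for k, v in enumerate(sorted(z, reverse=True), start=1):
        cumsum += v
        # The k largest scores stay in the support while this holds.
        if 1.0 + k * v > cumsum:
            tau = (cumsum - 1.0) / k
    # Scores at or below the threshold get exactly zero weight.
    return [max(v - tau, 0.0) for v in z]


scores = [2.0, 1.5, -1.0, -2.0]  # made-up attention scores for four tokens
dense = softmax(scores)          # all four weights are nonzero
sparse = sparsemax(scores)       # -> [0.75, 0.25, 0.0, 0.0]
```

Here the two low-scoring tokens receive a small but nonzero weight under softmax, while sparsemax assigns them exactly zero.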
                        
                        