Deep and wide neural networks successfully fit very complex functions today, but dense models are becoming prohibitively expensive for inference. To mitigate this, one promising research direction is networks that activate only a sparse subgraph of the full network. The subgraph is chosen by a data-dependent routing function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture of Experts (MoE) paradigm). However, there is little theoretical grounding for these sparsely activated models. As our first contribution, we present a formal model of such sparse networks that captures the salient aspects of popular MoE architectures. We then show how to construct sparse networks that provably match the approximation power and total size of dense networks on Lipschitz functions. The sparse networks use exponentially fewer inference operations than dense networks, leading to a faster forward pass. This offers a theoretical insight into why sparse networks work well in practice. Finally, we present empirical findings that support our theory; compared to dense networks, sparse networks give a favorable trade-off between the number of active units and approximation quality.
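
To make the routing idea concrete, the following is a minimal illustrative sketch, not the paper's construction: a sparsely activated layer with data-dependent top-1 routing in the MoE style, where each input selects a single expert subnetwork so the forward pass touches only about a 1/num_experts fraction of the layer's parameters. All names here (Expert, MoELayer, num_experts) are our own assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

class Expert:
    """A small dense subnetwork; only the selected expert runs for a given input."""
    def __init__(self, d_in, d_hidden):
        self.w1 = rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in)
        self.w2 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_hidden)

    def __call__(self, x):
        return np.maximum(x @ self.w1, 0.0) @ self.w2  # two-layer ReLU MLP

class MoELayer:
    """Data-dependent top-1 routing: each input activates exactly one expert,
    so only ~1/num_experts of the layer's parameters are used at inference."""
    def __init__(self, d_in, d_hidden, num_experts):
        self.router = rng.normal(size=(d_in, num_experts))  # routing function
        self.experts = [Expert(d_in, d_hidden) for _ in range(num_experts)]

    def __call__(self, x):
        scores = x @ self.router         # routing scores for this input
        chosen = int(np.argmax(scores))  # top-1: pick one subnetwork
        return self.experts[chosen](x)   # only the chosen expert is evaluated

layer = MoELayer(d_in=8, d_hidden=32, num_experts=4)
out = layer(rng.normal(size=8))
print(out.shape)  # (8,)

With top-1 routing, increasing num_experts grows the total parameter count while keeping the per-input compute fixed, which is the trade-off the abstract refers to between active units and approximation quality.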
Author Information
Cenk Baykal (Google Research)
Nishanth Dikkala (Google Research)
Rina Panigrahy (Google)
Cyrus Rashtchian (Google Research)
Xin Wang (Google)
More from the Same Authors
- 2022: For Manifold Learning, Deep Neural Networks can be Locality Sensitive Hash Functions »
  Nishanth Dikkala · Gal Kaplun · Rina Panigrahy
- 2022: Provable Hierarchical Lifelong Learning with a Sketch-based Modular Architecture »
  ZIHAO DENG · Zee Fryer · Brendan Juba · Rina Panigrahy · Xin Wang
- 2022 Poster: Do More Negative Samples Necessarily Hurt In Contrastive Learning? »
  Pranjal Awasthi · Nishanth Dikkala · Pritish Kamath
- 2022 Oral: Do More Negative Samples Necessarily Hurt In Contrastive Learning? »
  Pranjal Awasthi · Nishanth Dikkala · Pritish Kamath
- 2021 Poster: Statistical Estimation from Dependent Data »
  Vardis Kandiros · Yuval Dagan · Nishanth Dikkala · Surbhi Goel · Constantinos Daskalakis
- 2021 Spotlight: Statistical Estimation from Dependent Data »
  Vardis Kandiros · Yuval Dagan · Nishanth Dikkala · Surbhi Goel · Constantinos Daskalakis