Workshop: Information-Theoretic Methods for Rigorous, Responsible, and Reliable Machine Learning (ITR3)
Soft BIBD and Product Gradient Codes: Coding Theoretic Constructions to Mitigate Stragglers in Distributed Learning
Animesh Sakorikar · Lele Wang
Conventional approaches to mitigate stragglers in distributed synchronous gradient descent include predicting and avoiding stragglers or replicating tasks across workers. However, predicting stragglers require substantial data about the computing nodes in the distributed system, and coding theoretic approaches provide more efficient use of redundant computations to mitigate stragglers. In this paper, we consider a coding theoretic framework, called gradient coding, to mitigate stragglers. Recently, Kadhe et al. proposed a gradient code based on a combinatorial design, called balanced incomplete block design (BIBD), which is shown to outperform many existing gradient codes, but exist for a very limited set of parameters. In this paper, we aim to overcome such limitations and construct gradient codes which exist for a wide range of parameters while retaining the superior performance of BIBD gradient codes.Two such constructions are proposed, one based on a probabilistic construction that relax the stringent BIBD gradient code constraints, and the other based on taking the Kronecker product of existing gradient codes.Theoretical error bounds for worst-case adversarial straggling scenarios are derived. Simulations show that the proposed constructions outperform existing gradient codes with similar redundancy per data piece.