Session
Parallel and Distributed Learning 2
Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks
Zhihao Jia · Sina Lin · Charles Qi · Alex Aiken
The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g., data or model parallelism) to all layers in a network. Although easy to reason about, these approaches result in suboptimal runtime performance in large-scale distributed training, since different layers in a network may prefer different parallelization strategies. In this paper, we propose layer-wise parallelism, which allows each layer in a network to use an individual parallelization strategy. We jointly optimize how each layer is parallelized by solving a graph search problem. Our evaluation shows that layer-wise parallelism outperforms state-of-the-art approaches by increasing training throughput, reducing communication costs, and achieving better scalability to multiple GPUs, all while maintaining the original network accuracy.
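To give a concrete feel for the optimization, the sketch below picks a parallelization strategy per layer with a simple dynamic program over a chain of layers, trading per-layer execution cost against the cost of switching strategies between consecutive layers. The two-strategy set, the cost numbers, and the chain-structured search are illustrative assumptions, standing in for the paper's richer configuration space and graph-search algorithm.

# Minimal sketch: per-layer strategy choice via dynamic programming.
# Costs and strategy set are hypothetical, for illustration only.

STRATEGIES = ["data", "model"]               # candidate per-layer strategies

# Hypothetical per-layer execution cost (ms) under each strategy.
compute_cost = [
    {"data": 4.0, "model": 6.0},             # conv layer: cheaper with data parallelism
    {"data": 4.5, "model": 6.5},             # conv layer
    {"data": 9.0, "model": 3.0},             # large fully-connected layer: cheaper with model parallelism
]

def transfer_cost(prev, cur):
    # Hypothetical cost of re-distributing activations when strategies differ.
    return 0.0 if prev == cur else 2.0

def best_layerwise_plan(compute_cost):
    # dp[s] = (total cost, plan) of the best assignment whose last layer uses strategy s.
    dp = {s: (compute_cost[0][s], [s]) for s in STRATEGIES}
    for layer in compute_cost[1:]:
        new_dp = {}
        for s in STRATEGIES:
            new_dp[s] = min(
                (dp[p][0] + transfer_cost(p, s) + layer[s], dp[p][1] + [s])
                for p in STRATEGIES
            )
        dp = new_dp
    return min(dp.values())

cost, plan = best_layerwise_plan(compute_cost)
print(plan, cost)                            # ['data', 'data', 'model'], 13.5 on this toy model

On this toy cost model the program assigns data parallelism to the convolutional layers and model parallelism to the fully-connected layer, the kind of per-layer preference the abstract alludes to.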
Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization
Jiaxiang Wu · Weidong Huang · Junzhou Huang · Tong Zhang
Large-scale distributed optimization is of great importance in various applications. In data-parallel distributed learning, inter-node gradient communication often becomes the performance bottleneck. In this paper, we propose the error-compensated quantized stochastic gradient descent algorithm to improve training efficiency. Local gradients are quantized to reduce the communication overhead, and the accumulated quantization error is used to speed up convergence. Furthermore, we present a theoretical analysis of the convergence behaviour and demonstrate its advantage over competitors. Extensive experiments indicate that our algorithm can compress gradients by up to two orders of magnitude without performance degradation.
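As a concrete illustration of the error-compensation idea, the following sketch shows a worker that quantizes its gradient and carries the quantization error forward into the next round. The stochastic uniform quantizer and the class name are assumptions made for illustration, not the paper's exact scheme.

import numpy as np

def stochastic_quantize(v, levels=4):
    # Simple stochastic uniform quantizer; a stand-in for the paper's scheme.
    scale = np.max(np.abs(v)) + 1e-12
    normed = np.abs(v) / scale * levels
    lower = np.floor(normed)
    q = lower + (np.random.rand(*v.shape) < (normed - lower))   # stochastic rounding
    return np.sign(v) * q * (scale / levels)

class ErrorCompensatedWorker:
    # One worker: sends a quantized gradient and carries the quantization
    # error forward so that nothing is dropped permanently.
    def __init__(self, dim):
        self.error = np.zeros(dim)

    def compress(self, grad):
        corrected = grad + self.error        # add back what previous rounds lost
        message = stochastic_quantize(corrected)
        self.error = corrected - message     # remember what this round lost
        return message                       # low-precision vector sent to the master

The master simply averages the received messages and takes an ordinary SGD step, so the only change relative to vanilla data-parallel SGD is the per-worker compress call.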
DICOD: Distributed Convolutional Coordinate Descent for Convolutional Sparse Coding
Thomas Moreau · Laurent Oudre · Nicolas Vayatis
In this paper, we introduce DICOD, a convolutional sparse coding algorithm that builds shift-invariant representations for long signals. The algorithm is designed to run in a distributed setting with local message passing, making it communication efficient. It is based on coordinate descent and uses locally greedy updates, which accelerate the optimization compared to fully greedy coordinate selection. We prove the convergence of this algorithm and highlight its computational speed-up, which is super-linear in the number of cores used. We also provide empirical evidence for the acceleration properties of our algorithm compared to state-of-the-art methods.
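The core update DICOD builds on is greedy coordinate descent for the convolutional sparse coding objective 0.5·||x − Σ_k z_k ∗ D_k||² + λ·||z||₁. The sketch below runs that update on a single worker; the distributed algorithm partitions the time axis across workers, lets each apply its locally greedy update in parallel, and only exchanges messages when an update touches a neighbouring segment. This is a simplified, single-process illustration, not the paper's implementation.

import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def greedy_cd_csc(x, D, lam, n_iter=500, tol=1e-8):
    # Greedy coordinate descent for 1-D convolutional sparse coding:
    #   minimise 0.5 * ||x - sum_k conv(z_k, D_k)||^2 + lam * ||z||_1
    K, L = D.shape
    T = len(x) - L + 1                       # length of each activation map z_k
    z = np.zeros((K, T))
    norms = (D ** 2).sum(axis=1)             # ||D_k||^2
    residual = x.astype(float).copy()        # x minus the current reconstruction
    for _ in range(n_iter):
        best = (0.0, 0, 0, 0.0)              # (gain, k, t, new value)
        for k in range(K):
            # Optimal value of every coordinate (k, t) with the others held fixed.
            corr = np.correlate(residual, D[k], mode="valid") + norms[k] * z[k]
            z_new = soft_threshold(corr, lam) / norms[k]
            t = int(np.argmax(np.abs(z_new - z[k])))
            gain = abs(z_new[t] - z[k, t])
            if gain > best[0]:
                best = (gain, k, t, z_new[t])
        gain, k, t, val = best
        if gain < tol:
            break                            # no coordinate changes: converged
        # Apply the single largest update; only a window of length L of the
        # residual changes, which is what keeps inter-worker messages local.
        residual[t:t + L] -= (val - z[k, t]) * D[k]
        z[k, t] = val
    return z

For a dictionary D of shape (n_atoms, atom_length), greedy_cd_csc(x, D, lam=0.1) returns sparse activations z of shape (n_atoms, len(x) − atom_length + 1).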
Distributed Asynchronous Optimization with Unbounded Delays: How Slow Can You Go?
Zhengyuan Zhou · Panayotis Mertikopoulos · Nicholas Bambos · Peter Glynn · Yinyu Ye · Li-Jia Li · Li Fei-Fei
One of the most widely used optimization methods for large-scale machine learning problems is distributed asynchronous stochastic gradient descent (DASGD). However, a key issue that arises here is that of delayed gradients: when a “worker” node asynchronously contributes a gradient update to the “master”, the global model parameter may have changed, rendering this information stale. In massively parallel computing grids, these delays can quickly add up if the computational throughput of a node is saturated, so the convergence of DASGD is uncertain under these conditions. Nevertheless, by using a judiciously chosen quasilinear step-size sequence, we show that it is possible to amortize these delays and achieve global convergence with probability 1, even when the delays grow at a polynomial rate. In this way, our results help reaffirm the successful application of DASGD to large-scale optimization problems.
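The sketch below simulates the delayed-gradient setting on a toy quadratic: each update is computed from a parameter snapshot taken several steps in the past, and the step size decays slowly over time. The n**-0.7 schedule is a hypothetical stand-in for the quasilinear step-size sequence the abstract refers to, and the quadratic objective and delay model are illustrative only.

import numpy as np
from collections import deque

rng = np.random.default_rng(0)

def stochastic_grad(w):
    # Noisy gradient of the toy objective f(w) = 0.5 * ||w||^2.
    return w + 0.1 * rng.standard_normal(w.shape)

w = np.ones(10)
snapshots = deque([w.copy()], maxlen=256)    # past parameter values workers may have read
for n in range(1, 5001):
    # Delays grow (roughly linearly here) as the run proceeds.
    delay = int(min(rng.integers(0, 1 + n // 100), len(snapshots) - 1))
    stale_w = snapshots[-1 - delay]          # the parameters a slow worker actually used
    step = 0.5 / n ** 0.7                    # slowly decaying, a stand-in for a quasilinear schedule
    w = w - step * stochastic_grad(stale_w)  # master applies the stale update
    snapshots.append(w.copy())

print("distance to the optimum after 5000 delayed updates:", np.linalg.norm(w))

Despite the growing staleness, the iterate still approaches the optimum in this toy run, illustrating the kind of robustness the paper's analysis establishes in far greater generality.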