

Session

Deep Learning (Neural Network Architectures) 13


Fri 13 July 8:00 - 8:20 PDT

WSNet: Compact and Efficient Networks Through Weight Sampling

Xiaojie Jin · Yingzhen Yang · Ning Xu · Jianchao Yang · Nebojsa Jojic · Jiashi Feng · Shuicheng Yan

We present a new approach and a novel architecture, termed WSNet, for learning compact and efficient deep neural networks. Existing approaches conventionally learn full model parameters independently and then compress them via ad hoc processing such as model pruning or filter factorization. In contrast, WSNet learns model parameters by sampling from a compact set of learnable parameters, which naturally enforces parameter sharing throughout the learning process. We demonstrate that this weight sampling approach (and the induced WSNet) promotes both weight and computation sharing. With this method, we can efficiently learn much smaller networks that achieve competitive performance relative to baseline networks with the same number of convolution filters. Specifically, we consider learning compact and efficient 1D convolutional neural networks for audio classification. Extensive experiments on multiple audio classification datasets verify the effectiveness of WSNet. Combined with weight quantization, the resulting models are up to 180x smaller and theoretically up to 16x faster than the well-established baselines, without noticeable performance drop.
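The weight-sampling idea lends itself to a compact illustration: every 1D convolution filter is read out as an overlapping window of one small learnable vector, so parameter sharing holds by construction. The PyTorch sketch below is an illustrative reconstruction, not the authors' implementation; the class name `SampledConv1d` and the `compact_len`/`sample_stride` parameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampledConv1d(nn.Module):
    """Sketch of weight sampling: every filter of a 1D conv layer is an
    overlapping slice of one small shared parameter vector, so weights are
    shared by construction. Assumes compact_len >= in_ch * kernel_size."""

    def __init__(self, in_ch, out_ch, kernel_size, compact_len, sample_stride=1):
        super().__init__()
        # Compact learnable parameter set from which all filters are sampled.
        self.compact = nn.Parameter(torch.randn(compact_len))
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, kernel_size
        self.sample_stride = sample_stride  # shift of the sampling window per filter

    def forward(self, x):
        filt_len = self.in_ch * self.k
        filters = []
        for i in range(self.out_ch):
            # Overlapping windows over the compact vector (wrap around at the end).
            start = (i * self.sample_stride) % (self.compact.numel() - filt_len + 1)
            filters.append(self.compact[start:start + filt_len])
        weight = torch.stack(filters).view(self.out_ch, self.in_ch, self.k)
        return F.conv1d(x, weight)
```

With a small `sample_stride`, consecutive filters reuse most of their entries, which is what yields the storage (and, with overlapping inputs, computation) savings described above.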

Fri 13 July 8:20 - 8:40 PDT

StrassenNets: Deep Learning with a Multiplication Budget

Michael Tschannen · Aran Khanna · Animashree Anandkumar

A large fraction of the arithmetic operations required to evaluate deep neural networks (DNNs) consists of matrix multiplications, in both convolution and fully connected layers. We perform end-to-end learning of low-cost approximations of matrix multiplications in DNN layers by casting matrix multiplications as 2-layer sum-product networks (SPNs) (arithmetic circuits) and learning their (ternary) edge weights from data. The SPNs disentangle multiplication and addition operations and enable us to impose a budget on the number of multiplication operations. Combining our method with knowledge distillation and applying it to image classification DNNs (trained on ImageNet) and language modeling DNNs (using LSTMs), we obtain a first-of-its-kind reduction in the number of multiplications (over 99.5%) while maintaining the predictive performance of the full-precision models. Finally, we demonstrate that the proposed framework is able to rediscover Strassen's matrix multiplication algorithm, learning to multiply $2 \times 2$ matrices using only 7 multiplications instead of 8.
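To make the sum-product formulation concrete, the NumPy sketch below (not the authors' code; the function name `spn_matmul` and the column-major vectorization convention are assumptions) evaluates vec(C) ≈ Wc((Wa vec(A)) ⊙ (Wb vec(B))) and instantiates the ternary matrices that encode Strassen's 2×2 algorithm, i.e. r = 7 multiplications instead of 8.

```python
import numpy as np

def spn_matmul(vecA, vecB, Wa, Wb, Wc):
    """2-layer sum-product network form: vec(C) ~= Wc @ ((Wa @ vec(A)) * (Wb @ vec(B))).
    The hidden width r (rows of Wa/Wb) is the multiplication budget; Wa, Wb, Wc
    are ternary {-1, 0, +1} matrices learned from data."""
    h = (Wa @ vecA) * (Wb @ vecB)   # exactly r true multiplications
    return Wc @ h

# Strassen's 2x2 algorithm written as such an SPN with r = 7.
# Column-major vectorization: vec(A) = [A11, A21, A12, A22], likewise for B and C.
Wa = np.array([[ 1, 0, 0, 1],    # M1 uses A11 + A22
               [ 0, 1, 0, 1],    # M2 uses A21 + A22
               [ 1, 0, 0, 0],    # M3 uses A11
               [ 0, 0, 0, 1],    # M4 uses A22
               [ 1, 0, 1, 0],    # M5 uses A11 + A12
               [-1, 1, 0, 0],    # M6 uses A21 - A11
               [ 0, 0, 1, -1]])  # M7 uses A12 - A22
Wb = np.array([[ 1, 0, 0, 1],    # M1 uses B11 + B22
               [ 1, 0, 0, 0],    # M2 uses B11
               [ 0, 0, 1, -1],   # M3 uses B12 - B22
               [-1, 1, 0, 0],    # M4 uses B21 - B11
               [ 0, 0, 0, 1],    # M5 uses B22
               [ 1, 0, 1, 0],    # M6 uses B11 + B12
               [ 0, 1, 0, 1]])   # M7 uses B21 + B22
Wc = np.array([[1,  0, 0, 1, -1, 0, 1],   # C11 = M1 + M4 - M5 + M7
               [0,  1, 0, 1,  0, 0, 0],   # C21 = M2 + M4
               [0,  0, 1, 0,  1, 0, 0],   # C12 = M3 + M5
               [1, -1, 1, 0,  0, 1, 0]])  # C22 = M1 - M2 + M3 + M6

A, B = np.random.randn(2, 2), np.random.randn(2, 2)
C = spn_matmul(A.flatten("F"), B.flatten("F"), Wa, Wb, Wc).reshape(2, 2, order="F")
assert np.allclose(C, A @ B)   # 7 multiplications instead of 8
```

In the paper these ternary matrices are not hand-coded but learned end-to-end under a budget on r, which is how the framework can rediscover the Strassen-style factorization.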

Fri 13 July 8:40 - 8:50 PDT

Deep k-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions

Junru Wu · Yue Wang · Zhenyu Wu · Zhangyang Wang · Ashok Veeraraghavan · Yingyan Lin

Many existing compression approaches have focused on, and been evaluated on, convolutional neural networks (CNNs) whose fully-connected layers contain most of the parameters (e.g., LeNet and AlexNet). However, the current trend of pushing CNNs deeper with convolutions has created a pressing demand for higher compression gains on CNNs in which convolutions dominate the parameter count (e.g., GoogLeNet, ResNet and Wide ResNet). Further, convolutional layers account for most of the energy consumption at run time. To this end, this paper investigates the relatively less-explored direction of compressing convolutional layers in deep CNNs. We introduce a novel spectrally relaxed k-means regularization that encourages approximately hard assignments of convolutional layer weights to K learned cluster centers during re-training. Compression is then achieved through weight sharing, by recording only the K cluster centers and the weight assignment indexes. Our proposed pipeline, termed Deep k-Means, aligns the goals of the re-training and compression stages. We further propose an improved set of metrics to estimate the energy consumption of CNN hardware implementations, whose estimates are verified to be consistent with a previously proposed energy estimation tool extrapolated from actual hardware measurements. We evaluate Deep k-Means on compressing several CNN models in terms of both compression ratio and energy consumption reduction, observing promising results without incurring accuracy loss.
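The storage side of the pipeline reduces to ordinary k-means over the scalar weights of a layer: only the K centers plus the per-weight cluster indexes need to be kept. The sketch below is a simplification under that assumption; it omits the spectrally relaxed regularizer used during re-training, and the helper name `kmeans_share` is hypothetical.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_share(conv_weight, K):
    """Sketch of the weight-sharing step: cluster all scalar weights of a
    convolutional layer into K centers and store only the centers plus the
    per-weight cluster indexes (the re-training regularizer is not shown)."""
    w = conv_weight.detach().cpu().numpy().reshape(-1, 1)
    km = KMeans(n_clusters=K, n_init=10).fit(w)

    centers = torch.tensor(km.cluster_centers_.squeeze(1), dtype=conv_weight.dtype)
    idx = torch.tensor(km.labels_, dtype=torch.long)   # what actually gets stored
    shared = centers[idx].view_as(conv_weight)         # reconstructed shared weights
    return centers, idx, shared
```

With K centers, each weight costs only log2(K) bits for its index, which is where the compression ratio comes from; the re-training regularization described above exists to keep accuracy intact once these hard assignments are imposed.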

Fri 13 July 8:50 - 9:00 PDT

Born Again Neural Networks

Tommaso Furlanello · Zachary Lipton · Michael Tschannen · Laurent Itti · Anima Anandkumar

Knowledge Distillation (KD) consists of transferring "knowledge" from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness without sacrificing too much performance. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, on both computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes.
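A born-again training step is ordinary distillation in which the student shares the teacher's architecture. The sketch below uses a generic soft-target distillation loss as a stand-in (it does not reproduce the paper's exact objective or its CWTM/DKPP variants); `ban_loss`, the temperature `T`, and the mixing weight `alpha` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ban_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5):
    """Generic distillation loss for a born-again step: the student (same
    architecture as the teacher) is fit to a mix of the ground-truth labels
    and the frozen teacher's softened predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits.detach() / T, dim=1),
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kd
```

During training the teacher is kept frozen and only supplies `teacher_logits`; once the student converges, it matches the teacher parameter-for-parameter in size, so any gain comes from the distillation signal rather than added capacity.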