Paper ID: 1215
Title: Training Neural Networks Without Gradients: A Scalable ADMM Approach

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this paper the authors propose and benchmark an alternative optimisation method for neural networks. The method alternates between optimizing the (latent) unit activations and the model parameters.

Clarity - Justification:
The paper is well written and well structured; the introductory sections motivate why alternatives to gradient-based methods might be beneficial for scalability in distributed, highly parallel implementations, and connect the proposed method to previously published approaches for optimization without (global) gradients.

Significance - Justification:
Robustly working alternatives to gradient-based methods could have a huge impact on the machine learning community: e.g. for increased scalability, but potentially also for reinforcement learning or for biologically more plausible neural network models where gradients are not readily available.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The results seem encouraging -- but the authors unfortunately only evaluate shallow neural network models (1 and 2 hidden layers). It would be very interesting to see whether this method can also optimize deeper architectures. I'm also wondering how to interpret the (rather) long plateau in Figure 1b, where ADMM does not seem to make any progress. Would this plateau extend with more hidden layers?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an ADMM-style parameter update scheme for estimating the parameters of (deep) neural networks. The key strategy of the proposed method is to first decompose the optimization problem into many small sub-problems, and then solve these sub-problems sequentially and iteratively, each of whose (global) optimal solutions can be obtained in closed form. In addition, the paper proposes a modification for applying the ADMM framework to neural networks: it incorporates only a small number of dual variables (multipliers), corresponding to the primal variables that appear in the objective function, and does not introduce dual variables for constraints that do not appear in the objective. Experiments showed that the proposed method outperformed the conventional optimization algorithms often used for training neural networks, such as SGD.

Clarity - Justification:
Judging from the experiments, I wonder whether the comparison with standard gradient-based methods, such as SGD, CG, and L-BFGS, is really under fair conditions. The hardware and implementations seem largely different. It is hard for me to judge whether the proposed method is truly effective for parameter estimation of (deep) neural networks. Moreover, it seems hard to reproduce the experimental results shown in this paper.

Significance - Justification:
This paper tackles an important issue: faster and more effective estimation of the (especially large) parameter sets of neural networks, which often exceed ten million parameters. The paper proposes an interesting (and reasonable) decomposition method. The approach seems to be novel, and may be able to find a better solution than the conventional standard gradient-based methods, such as SGD.
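For reference, the decomposition described in the two summaries above amounts to an equality-constrained reformulation of the training objective, in which the pre-activations and activations of each layer become variables of the optimization. The following is a generic sketch in the reviewers' notation (with $h_l$ the layer-$l$ activation function and $a_0 = x$ the input); it is not copied from the paper, whose exact formulation is presumably the Eqs. (3)-(5) that the reviewers reference:

$$ \min_{\{W_l\},\,\{a_l\},\,\{z_l\}} \;\ell(z_L, y) \quad \text{subject to} \quad z_l = W_l\, a_{l-1} \;\; (l = 1,\dots,L), \qquad a_l = h_l(z_l) \;\; (l = 1,\dots,L-1). $$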
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
There are several points that are unclear to me. I would like to clarify these points in order to provide a better review of this paper. I am ready to increase my rating if I receive clear and understandable answers to the questions and comments listed below.

Q1: In Eq. (5), I think the second term, which was added relative to Eq. (4), should be $\langle z_L - W_L a_{L-1}, \lambda \rangle$ instead of $\langle z_L, \lambda \rangle$. If I am misunderstanding something, please give the explicit derivation, or an intuitive explanation of why it is so.

Q2: In Algorithm 1, the line $z_L \leftarrow \argmin_z \ell(z, y) + \beta_L \| z - W_L a_{L-1} \|^2$ should be $z_L \leftarrow \argmin_z \ell(z, y) + \beta_L \| z - W_L a_{L-1} + \frac{\lambda}{\beta_L} \|^2$ (if I am right about Q1). Otherwise, $\lambda$ never appears on any right-hand side, which means that the update $\lambda \leftarrow \lambda + \beta_L (z_L - W_L a_{L-1})$ has no effect. Please clearly explain this point, and revise the algorithm if it has errors. (A sketch of this reading appears after this review.)

Q3: I think Algorithm 1 looks slightly strange. For the update of $a_l$, it uses $W_{l+1}$, which is going to be updated in the next step. Therefore, $a_l$ is updated using the $W_{l+1}$ estimated in the previous iteration. Is this interpretation correct?

Q4: I do not fully understand the validity of the derivation of Eq. (8). I understand that the right-hand side of Eq. (8) matches the sub-gradient of the objective function in Eq. (3). However, I do not see why this value can be used for updating the multiplier. The discussions in Sections 4.1 and 4.2 explaining the derivation of Eq. (8) do not make sense to me. Please explicitly and clearly show the validity (not just the interpretation).

-- The experiments seem very weak as an evaluation of the proposed method. Current state-of-the-art neural systems for speech recognition, image recognition, and natural language processing tasks such as machine translation often require a few days or more than a week to estimate their parameters. In contrast, the experiments shown in this paper only took a few seconds. Please also show learning curves of the objective function against elapsed time, to confirm that the proposed method successfully finds a better solution from the optimization point of view.

I am not fully convinced by the statement in lines 434 to 439, "In fact, when ADMM is applied to (3) in a ... the method is highly unstable because of the de-stabilizing effect of ...". I agree that it is important to remove parameters that are essentially unnecessary for the optimization. However, there is no theoretical or empirical evidence for the instability that arises when dual variables are incorporated for every constraint. I suspect the authors' claim may be correct, but the authors should show the evidence.
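For reference, the reading proposed in Q1 and Q2 above corresponds to the standard augmented-Lagrangian form. The sketch below follows that reading (it is not claimed to be the paper's Eq. (5) or Algorithm 1) and uses the common $\beta_L/2$ penalty convention, under which completing the square produces exactly the $\frac{\lambda}{\beta_L}$ shift mentioned in Q2:

$$ \mathcal{L} \;=\; \ell(z_L, y) \;+\; \langle \lambda,\; z_L - W_L a_{L-1} \rangle \;+\; \frac{\beta_L}{2}\, \| z_L - W_L a_{L-1} \|^2 \;+\; \cdots $$

$$ z_L \;\leftarrow\; \argmin_z \; \ell(z, y) \;+\; \frac{\beta_L}{2}\, \Big\| z - W_L a_{L-1} + \frac{\lambda}{\beta_L} \Big\|^2, \qquad \lambda \;\leftarrow\; \lambda + \beta_L \big( z_L - W_L a_{L-1} \big). $$

With the penalty written as $\beta_L \| \cdot \|^2$ (no factor 1/2), as in Q2, the shift inside the norm would be $\frac{\lambda}{2\beta_L}$ instead.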
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a new optimization method for (deep) neural networks that can be easily implemented on clusters of computers (on CPUs). The NN architectures targeted here are classical networks, i.e. sequences of fully connected layers with non-linear activation functions. The underlying idea is to transform the loss minimization problem into a constrained problem, the activation and output of each layer being replaced by parameters to learn under constraints. By using Lagrange multipliers, it is shown that the resulting optimization problem can be solved in closed form by alternating optimization, the different minimizers being computed on separate cores. Experiments are made on two classification datasets and aim at showing the effectiveness of the parallelization scheme. The performance is compared to classical gradient-based algorithms, with and without GPU.

Clarity - Justification:
The paper is well written and easy to understand. My only concern is about the presentation of the model: Equation 5 is hard to follow since not all the Lagrange multipliers are written, and the explanation of this equation only appears in Section 4. The authors could better explain why not all the constraints appear in Equation 5 and add a reference to Section 4. The results can be reproduced, but the MPI implementation of the algorithm is not easy, so open-source code would be nice.

Significance - Justification:
The idea of decomposing a multi-layer architecture into a set of optimization problems under constraints is interesting, and close variants have already been proposed in the context of energy-based models (see work by Yann LeCun) and also for recurrent network architectures. What is interesting here is the inclusion of Lagrange multipliers in order to enforce the constraints exactly and to obtain the equivalence between the original MLP problem and the one proposed here. As far as I know, this is new, and I find the idea interesting. The fact that each term of the objective can be minimized in closed form is clearly interesting, but the advantage of this w.r.t. gradient approaches could be reinforced. If one used a gradient-descent approach for each sub-problem (on a GPU), what would the speedup be? And on a cluster of GPUs, with each sub-problem solved on a different machine? The experimental part is interesting but not entirely convincing -- the paper only proposes experiments on two datasets. Moreover, since we are in a world where one can have a single computer with a GPU, a cluster of CPUs, and also (smaller) clusters of GPUs, it would be interesting to compare the approach in terms of hardware cost (instead of comparing in terms of number of cores and time). Perhaps a simple synchronous gradient-descent method on a small cluster of GPUs would be faster than this method with two thousand CPU cores… Finally, the way this method can be applied to more popular architectures (such as CNNs) could be discussed more, as it does not seem so simple... But it is proposed here as a research perspective, which could greatly improve the quality of this work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Pros:
* Interesting/original idea and optimization algorithm
* Interesting links with Bregman iteration and the method of multipliers

Cons:
* The initial idea (a constrained minimization problem) is not so new
* Lack of experimental results (in terms of datasets and architectures tested)
* Interpretation not so clear: I still don't know whether this algorithm is efficient and allows one to obtain the same performance at a lower hardware cost.

=====
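To make the alternating activation/weight update scheme discussed in the reviews concrete, below is a minimal, single-machine Python/NumPy sketch. It is not the paper's algorithm, hyper-parameters, or distributed MPI implementation: it assumes one hidden layer, ReLU activations, a 1/2-squared loss, ridge-regularized least-squares sub-problems, and a Lagrange multiplier only on the output constraint; the names admm_style_fit and ridge_solve are made up for this sketch.

import numpy as np

# Toy sketch of an ADMM-style alternating scheme: the hidden pre-activations z1,
# activations a1 and outputs z2 are treated as free variables, each sub-problem
# is solved in closed form, and a multiplier lam is kept only for the output
# constraint z2 = W2 a1. Hyper-parameters and the ridge term are assumptions.

def relu(x):
    return np.maximum(x, 0.0)

def ridge_solve(A, B, eps=1e-3):
    # Closed-form solution of  min_X ||X A - B||^2 + eps ||X||^2 .
    return B @ A.T @ np.linalg.inv(A @ A.T + eps * np.eye(A.shape[0]))

def admm_style_fit(X, Y, hidden=64, iters=50, beta=1.0, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d, n = X.shape
    k = Y.shape[0]
    W1 = 0.1 * rng.standard_normal((hidden, d))
    W2 = 0.1 * rng.standard_normal((k, hidden))
    z1 = W1 @ X                # hidden pre-activations, kept as free variables
    a1 = relu(z1)              # hidden activations, kept as free variables
    z2 = W2 @ a1               # network outputs, kept as free variables
    lam = np.zeros_like(z2)    # multiplier on the constraint z2 = W2 a1
    for _ in range(iters):
        # Weight updates: regularized least-squares fits to the current variables.
        W1 = ridge_solve(X, z1)
        W2 = ridge_solve(a1, z2 + lam / beta)
        # Activation update: balance "explains the output" against "equals relu(z1)".
        M = beta * W2.T @ W2 + gamma * np.eye(hidden)
        a1 = np.linalg.solve(M, W2.T @ (beta * z2 + lam) + gamma * relu(z1))
        # Pre-activation update: elementwise choice between the two ReLU branches.
        m = W1 @ X
        z_pos = np.maximum((gamma * a1 + beta * m) / (gamma + beta), 0.0)
        z_neg = np.minimum(m, 0.0)
        cost = lambda z: gamma * (a1 - relu(z)) ** 2 + beta * (z - m) ** 2
        z1 = np.where(cost(z_pos) <= cost(z_neg), z_pos, z_neg)
        # Output update for a 1/2-squared loss, then the dual (multiplier) step.
        z2 = (Y - lam + beta * W2 @ a1) / (1.0 + beta)
        lam = lam + beta * (z2 - W2 @ a1)
    return W1, W2

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((20, 500))              # 20 features, 500 samples
    Y = relu(rng.standard_normal((5, 20)) @ X)      # synthetic 5-dimensional targets
    W1, W2 = admm_style_fit(X, Y)
    pred = W2 @ relu(W1 @ X)
    print("mean squared error:", float(np.mean((pred - Y) ** 2)))

No gradients are computed anywhere: every step is either a linear solve or an elementwise closed-form minimization, which is what makes this family of methods attractive for the coarse-grained parallelism discussed in the reviews.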