Paper ID: 1361
Title: Noisy Activation Functions

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes noisy activation functions to improve the performance of saturating activation functions, such as the sigmoid and tanh, that are widely used in recurrent neural networks. The basic idea is to add noise in the saturating regions of the activation function, with magnitude increasing as the input moves farther past the saturation threshold, so that the network can escape more easily from suboptimal solutions. Extensive experimental evaluation on several benchmark tasks shows the effectiveness of the proposed method.

Clarity - Justification:
I think the paper is okay to read, but it requires quite a bit of revision on minor points.
- In Equation (8), don't we need an absolute value to denote the difference?
- Typo, line 32: "by by" -> "by"
- Definition 2.3: "right hard saturate" is missing
- Line 184: "we will refer use" -> "we will use"
- In the caption of Table 1: 2.5% -> 1.5%?
- Line 727: "trainign" -> "training"

Significance - Justification:
A simple modification of existing activation functions for recurrent neural networks showed non-trivial improvements on many different tasks. I think the proposed method could be widely useful to the community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper introduces a simple modification of saturating activation functions with a good insight. Although the final form of the activation function looks a bit arbitrary, the effectiveness of the proposed method has been demonstrated through extensive experimental evaluation.
- In Figure 9, why are there big jumps for the noisy NTM controller?
- It would be great if the authors could provide an analysis of how the hyperparameters, such as c or alpha, affect performance.
- Equation (9) seems quite arbitrary. What other alternatives have been tried? Is there a better explanation of why the formulation in Equation (9) works better?
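For concreteness, the mechanism summarized above might be sketched in NumPy roughly as follows. This is only an illustration: the function name noisy_hard_sigmoid, the linear approximation u(x) = 0.25x + 0.5, and the specific noise scale are my own choices and do not reproduce the paper's exact Equation (9), although c and alpha are meant to play roles analogous to the hyperparameters mentioned above.

    import numpy as np

    def noisy_hard_sigmoid(x, c=0.5, alpha=1.0, rng=np.random):
        # u(x) is the first-order (linear) approximation of the sigmoid;
        # h(x) hard-saturates it by clipping to [0, 1].
        u = 0.25 * x + 0.5
        h = np.clip(u, 0.0, 1.0)
        # delta is zero in the linear region and grows in magnitude the
        # farther the input sits inside a saturated region.
        delta = h - u
        sigma = c * np.abs(delta)          # noise scale grows with saturation depth
        noise = sigma * rng.standard_normal(np.shape(x))
        # alpha blends the hard and linear parts; alpha = 1 gives h(x) + noise.
        return alpha * h + (1.0 - alpha) * u + noise

At test time one would presumably drop the noise term and use the deterministic part of the output.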
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes modified nonlinearities for neural network hidden units. The authors seek to combine the benefits of hard-saturating activation functions with non-zero activation-function derivatives across the full input domain. The proposed approach adds noise to the activation function only in regions where hard saturation would occur. The approach is evaluated on a few different tasks, with language modeling perplexity and image captioning being the closest to state-of-the-art application tasks. On almost all tasks the proposed approach very slightly edges out standard activation functions, but the improvements are not large enough to make up for the rather ad hoc way in which the noise is added.

Clarity - Justification:
The writing is fairly clear overall. The discussion and equations introducing the approach could be improved and may have some notation bugs (see below). The experiments and results are sufficiently clear that I think others could reproduce them.

Significance - Justification:
Several previous works have investigated adding noise to activation functions and/or gradients during learning. This paper offers another way to do that with an interesting tweak, but there is no fundamental contribution. As such, the value of this work really rests on how well the method works in practice. By this standard the contribution is fairly weak: it shows some very small improvements on a range of tasks, but nothing close to the gains observed for ReLU versus sigmoidal units. I feel that all of this taken together makes the paper better suited for a workshop than for the main ICML conference.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Equation 9 comes with essentially no motivation. Scaling the noise as a function of delta seems reasonable, but there is no reason why it should take the particular form given other than that it was picked empirically, which is problematic given that the experiments do not report results for any alternative choices.
Line 350: Notation error? Phi(x) is the activation function, but sigmoid(x) is the derivative of the Gaussian; it looks like this is some mix of the activation function and its derivative.
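To make the request for alternatives concrete, here are two hypothetical noise-scale functions of delta (the gap between the hard-saturating activation and its linear counterpart) that such a comparison could have included; neither is claimed to be the form actually used in the paper.

    import numpy as np

    def sigma_linear(delta, c=0.5):
        # Simplest choice: the noise standard deviation grows linearly
        # with the magnitude of the saturation gap.
        return c * np.abs(delta)

    def sigma_squashed(delta, c=0.5, p=1.0):
        # Bounded alternative: squash the gap through a sigmoid so the noise
        # standard deviation levels off instead of growing without bound.
        return c * (1.0 / (1.0 + np.exp(-p * np.abs(delta))) - 0.5) ** 2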
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Sigmoidal activation functions for neural networks are replaced by hard-saturating piecewise linear functions. Since these new functions have a zero derivative when saturated, noise is added when the function is in its saturated region. The authors propose to construct the noise so that it depends on the input to the function and increases in variance as the function moves deeper into its saturating regions. Thus the gradient is nonzero even when the activation function is saturated. Moreover, the noise is biased such that the gradients push the activations back towards the linear region of the activation function.

Clarity - Justification:
The manuscript would benefit from a thorough editing pass to improve readability and clarity. There are a number of places that are somewhat difficult to understand and could be worded better.

Significance - Justification:
These activation functions are intended to be used in place of the standard sigmoid gating functions in neural network models such as long short-term memory networks and neural Turing machines. The hard-saturating nature of the proposed functions allows effective gating behavior, while the noise allows the training procedure to explore and recover from premature convergence. Similarities to simulated annealing are discussed. Experimental results are encouraging and show improved performance over existing state-of-the-art results on several data sets/tasks. On the neural Turing machine task (identifying the number of unique elements in a sequence), the model with the new functions converges rapidly to near-perfect performance, while the reference model fails to converge and maintains high error. The authors note that results were obtained by taking the existing model being compared against and simply replacing the soft sigmoids with their new functions; in other words, no additional optimization of the typical neural network hyperparameters was performed to take advantage of the proposed method. This implies that even better results may be possible.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is an interesting paper that explores an idea that is simple and straightforward to implement; it is somewhat surprising that it has not been tried before. It would be a very well-received paper, as it addresses one of the key limitations of neural network models that rely on sigmoidal functions as gating units, and the promising results suggest that adopting the technique will allow these models to be applied successfully to more challenging tasks. The only downsides to the approach are the additional complexity over a simple sigmoid function and the addition of several hyperparameters. I believe these costs are well justified by the performance of the approach. However, there is no discussion of how to choose these hyperparameters or how they affect performance on the data sets used, and only minimal mention of the values used in the experiments. It would be valuable if the authors clarified these points.

=====