We would like to thank all reviewers for their valuable feedback. We have improved the submission; our changes are:

- We have fixed several typos and grammatical mistakes, and some of the discussions have been made more in-depth.
- We added experiments on injecting noise into the input of the activation function.
- We fixed the notation in some of the equations and provided an equivalent but simpler formulation of the noisy activation function.
- We carried out a more thorough empirical investigation of different types of noisy activation functions.

You can access the revised paper through the link below:
https://filetea.me/t1s2uX7ILzyTIGVWIxzTugQ0A

Reviewer 1&5:

> ...the addition of several hyperparameters...

We have investigated the sensitivity to the c and alpha hyperparameters on the PTB and MNIST datasets with a GRU and an MLP. The activation function is robust to the choice of c, as long as it is not very large or very small. For example, for any choice of c between 0.25 and 0.5, the model converges to a test BPC between 1.40 and 1.42 on PTB character-level language modelling with a GRU. This is mainly because the standard deviation (sigma) is learned through the parameter p; if we remove p from the definition of sigma, the model becomes more sensitive to the choice of c. We tried different alpha values, and for most tasks a value between 1.05 and 1.1 performs best. We will add these experiments to the supplementary material.

Reviewer 5:

> ...absolute value to denote difference?

In our notation, delta is a vector that indicates the magnitude and direction by which u(x) should be moved to reach h(x), which is why we did not use an absolute value. We could have treated it as a metric by taking the absolute value and modifying the rest of the equations accordingly.

> Typos

Thank you for pointing out these mistakes; we have addressed them in the revised version of the paper.

> In Figure 9, why are there big jumps for noisy NTM ...

We have realized that these jumps were mainly due to the large learning rate and the lack of gradient clipping. NTMs can in general behave very unstably during training, and these two factors aggravated that instability. With a smaller learning rate the jumps disappear, but learning becomes slower.

Reviewer 5&7 (on the justification of Equation 9):

> Equation (9) ... what other alternatives have been tried? ... any better explanation for it ...

We have tried several alternatives, among them:

1. c |delta|
2. c softplus(|delta|)
3. c softplus(p |delta|)
4. c sigmoid(|delta|)
5. c sigmoid(p delta^2)
6. c sigmoid(p |delta| / |xi|)
7. c (sigmoid(p delta^2) - 0.5)^2
8. c (sigmoid(p delta) - 0.5)^2

Alternatives 1, 2 and 3 caused numerical stability issues and disrupted the learning dynamics of recurrent models. Alternative 6 performed well only with feedforward networks. Alternatives 7 and 8 perform equally well on all the tasks we experimented with, so we chose 8 for its simplicity. Empirically, 4 performed better than 2 and 3 in both the feedforward and the recurrent models we tried. Learning p helps the model adapt to the magnitude of the noise. Subtracting 0.5 from the sigmoid makes the activation function easier to optimize, since it becomes centered at 0, and the square keeps sigma positive.
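To make alternative 8 concrete, below is a minimal NumPy sketch of a noisy saturating unit, assuming hard-tanh as the saturating nonlinearity h(x) and the identity as its linearization u(x); the way h, u, and the noise term are combined here is a simplified illustration rather than the exact Equation 9 of the paper, and all names are ours.

```python
import numpy as np

def noisy_hard_tanh(x, p, c=0.5, alpha=1.05, rng=np.random):
    """Illustrative sketch of a noisy saturating unit (not the exact Eq. 9).

    x:     pre-activation values (array)
    p:     learned parameter controlling the noise scale (scalar or per-unit)
    c:     hyperparameter discussed above (robust in roughly [0.25, 0.5])
    alpha: hyperparameter discussed above (values around 1.05-1.1 worked best)
    """
    h = np.clip(x, -1.0, 1.0)                 # saturating nonlinearity h(x)
    u = x                                     # linearization of h(x) around 0
    delta = h - u                             # how far u(x) is from h(x)
    sigma = c * (1.0 / (1.0 + np.exp(-p * delta)) - 0.5) ** 2  # alternative 8
    xi = rng.standard_normal(np.shape(x))     # unit Gaussian noise
    # Blend h and u with weight alpha and inject noise scaled by sigma.
    return alpha * h + (1.0 - alpha) * u + sigma * xi
```

Because sigmoid(0) = 0.5, sigma vanishes exactly where h(x) and u(x) coincide, so noise is injected only in the saturating regime.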
Reviewer 7:

> ...but the improvements aren't large enough ...

We observe 8 points of improvement in perplexity on PTB and about 2 points of improvement in METEOR on Flickr8k. We obtained these results with a simple drop-in replacement of the activation functions, compared against highly tuned systems, and the improvements are consistent across different tasks and architectures.

> ...there's no fundamental contribution ... contribution of the method is fairly weak -- it shows some very small improvements ...

The main contributions of this paper are:

- To the best of our knowledge, our paper is the first to apply piecewise linear activation functions to the gates of neural networks. There is related work on using ReLU activations with vanilla RNNs (Le 2015; Hannun et al. 2014, DeepSpeech 1), but our paper proposes to replace the sigmoid and tanh activations in gated recurrent units, i.e. LSTMs and GRUs. For speech recognition with deep neural networks on the TIMIT dataset, our noisy-tanh units perform significantly better than ReLU (2.0% absolute improvement).
- We investigate the effect of injecting different types of noise into the activation function, which adds to a growing body of literature on noise injection in deep learning models. Furthermore, we propose a novel and efficient way of learning the standard deviation of that noise; note that this is crucial for the annealing experiments.
- We show that annealing the noise in the activation function seems to have a profound continuation effect (see the illustrative sketch below).
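As a purely illustrative sketch of the annealing mentioned in the last point (the schedule below is hypothetical, not the one used in the paper), the idea is to multiply the injected noise by a factor that decays over training, so optimization starts on a heavily smoothed objective and gradually approaches the deterministic saturating unit.

```python
def noise_annealing_factor(step, total_steps, start=1.0, end=0.0):
    """Hypothetical linear decay of the noise magnitude over training."""
    frac = min(step / float(total_steps), 1.0)
    return start + (end - start) * frac

# Usage: scale sigma (or the sampled noise xi) by this factor at each step,
# e.g. sigma_t = noise_annealing_factor(step, total_steps) * sigma.
```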