Paper ID: 1293
Title: Energetic Natural Gradient Descent

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This article describes a generalization of the natural gradient descent algorithm that takes into account an explicit similarity measure---the energy distance---between probability distributions. A significant part of the article is devoted to motivating and illustrating the generalization (most of the technical arguments being deferred to supplementary material). It is proved that the new gradient descent is a proper (Theorem 1) strict generalization of the natural gradient based on the Fisher information matrix (Theorem 2) that preserves the nice property that the update direction is independent of the parameterization (Theorem 3). The approach is illustrated in a reinforcement learning setting where the distance is chosen so as to measure the difference in values of different actions in the same state, a choice that is empirically shown to be worthwhile (the update of the new method leading to a larger improvement).

Clarity - Justification:
In my opinion, the article is very well written. It is not the first time that I have read about the natural gradient, but this paper made me understand a lot that I had not understood before (thanks for this!). The choice to devote most of the article to motivation and illustration makes it really pleasant to read. I have not checked the details of the supplementary material, but I believe the proofs are a natural extension of the corresponding results on the natural gradient (if this is true, it might be a good idea to stress it).

Significance - Justification:
The contribution of this work is a new way to do gradient descent, which is at the heart of many machine learning problems. Obtaining a strict generalization of a state-of-the-art technique while keeping its main property makes a very nice point. The application to reinforcement learning is clean and convincing. To the best of my knowledge and expertise (which are not really high in optimization), I feel that this may be a very good paper, with many potential applications across most fields of machine learning.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I don't have many remarks, except a few minor suggestions or typos:
- The illustrative example of Section 5 and Figure 2 is set in a very specific situation (H+D=I) that may bias the claim made in the paper a little. It might be interesting to provide an illustration that does not exactly fit this perfect situation.
- line 551: paramtrization -> parametrization
- line 597: We then showed -> We then discussed
- lines 613-615: The end of the sentence looks a bit strange (and is therefore explained by Footnote 3). Making the argument positive (instead of negative) would probably help.
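As a concluding aside, and to make the baseline concrete for readers who, like me, found the exposition helpful: below is a minimal sketch of the plain natural-gradient ascent step that the paper generalizes. This is the standard textbook construction, not code from the paper; the empirical Fisher estimator and the damping term are my own assumptions.

    import numpy as np

    def natural_gradient_step(theta, grad, score_samples, lr=0.1, damping=1e-4):
        # score_samples: (N, d) array of grad_theta log p(x_n; theta)
        # for samples x_n drawn from p(.; theta).
        # Empirical Fisher matrix: second moment of the score.
        F = score_samples.T @ score_samples / score_samples.shape[0]
        # Natural-gradient direction: F^{-1} grad (damped for stability).
        direction = np.linalg.solve(F + damping * np.eye(len(theta)), grad)
        return theta + lr * direction

The point the paper makes so clearly is that this update, unlike plain gradient ascent, does not change when the model is reparameterized.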
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a new gradient-based optimization algorithm for parametric probabilistic models (PPMs) which generalizes natural gradient descent by taking into account the structure of the set of objects (the "sample space") over which the PPM provides a distribution. For instance, in reinforcement learning the PPM is a distribution over actions, and the action set may have a sensible metric which could be incorporated into an algorithm (some actions may have the same effect, so they should be regarded as more similar, for instance). This is achieved by replacing the (approximation to the) KL divergence between distributions, used as the metric on the parameter space in the natural gradient algorithm, with an energy distance, which takes into account some underlying metric on the sample space. The metric tensor for this energy distance is derived, along with some theoretical properties: the resulting algorithm is covariant, it gives an ascent direction, and the ordinary natural gradient is a special case.

Clarity - Justification:
The paper is fairly well written, with good coverage of the background methods.

Significance - Justification:
The idea is potentially useful, but it is a modification of existing methods.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I find the idea potentially promising, and I buy that there are possible benefits. The KL divergence is not the only "distance" between distributions, and it makes sense to consider other choices. In this case the considered energy metric is interesting since it takes into account an underlying metric on the sample space, unlike natural gradients (see the sketch appended after my summary). The metric tensor has a simple form that resembles (and is a generalization of) the Fisher matrix, i.e., the metric tensor of natural gradients.

Unfortunately, I find the motivating example and experiments weak. The illustrative example of Section 5 is really cheating, since the stated aim of the method is to take into account intuitive or known structure in the sample space $\Omega$, but in this example the metric on the sample space is chosen to be the one defined by the Hessian of the objective. That is, the chosen distance takes the *objective* into account, and the method becomes a sort of hybrid between Newton's method and natural gradient descent. So of course it does well on this particular quadratic objective. Is the point here that the Hessian of the objective (w.r.t. $\Omega$) should be estimated and used? And if we are going to have to start estimating Hessians, why not just do Newton's method (i.e., estimate the Hessian w.r.t. $\theta$)? Does your hybrid (with the Hessian) have some advantage over straightforward Newton's method?

I'm confused by the suggestion in Section 6 (line 570) to use 1 - H as the distance matrix. I don't buy that this is going to be generally useful at all except in your specially constructed case (e.g., just rescaling H messes it up), where it coincides with $(1/2)(e_i - e_j)^\top H (e_i - e_j)$, which seems more generally sensible.

In the RL experiment the distance on actions takes into account the effect on the objective via the Q-values. This is interesting, but the method would really benefit from some argument for why this is a good thing to do. It should be stated that the Q-values are not known in practice and so were presumably estimated. Is the method sensitive to the quality of the Q-estimates? We should also see learning curves for the whole optimization process for all three algorithms. The line search used to compute the optimal step is not a typical thing to do in RL, or in stochastic environments generally, so it does not give a picture of how the algorithm performs in a real practical setting.

minor comments (no need to respond)
There are typos on lines 143 and 230, where you state that the Euclidean distance is $\sqrt{\theta_1^\top \theta_2}$.

summary
A promising idea, and some nice fundamentals are developed, but more needs to be done to turn these observations into a convincing practical algorithm.
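Appendix to my comments: for concreteness, here is the kind of quantity the method is built on---the standard plug-in (V-statistic) estimate of the energy distance E(P,Q) = 2 E[d(X,Y)] - E[d(X,X')] - E[d(Y,Y')], with the sample-space metric d left pluggable, which is exactly the generality the paper exploits. This sketch is my own illustration, not the authors' code.

    import numpy as np

    def energy_distance(xs, ys, d):
        # Mean pairwise distance between two lists of samples.
        mean_d = lambda a, b: np.mean([d(u, v) for u in a for v in b])
        # E(P,Q) = 2 E[d(X,Y)] - E[d(X,X')] - E[d(Y,Y')]
        return 2 * mean_d(xs, ys) - mean_d(xs, xs) - mean_d(ys, ys)

    # e.g. the Euclidean metric on R^n:
    d_euclid = lambda u, v: np.linalg.norm(np.asarray(u) - np.asarray(v))

With d_euclid this reduces to the classical (Szekely-Rizzo) energy distance; the paper's point is that d can instead encode known structure on $\Omega$.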
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present an extension of the natural gradient idea with a different distance metric than the KL divergence. This distance metric depends on a general distance function in the observation space, and can therefore accommodate prior knowledge about the distance between specific outcomes into the gradient direction. It is shown that the natural gradient with this distance metric is a generalization of the natural gradient and is guaranteed to be a non-descent direction.

Clarity - Justification:
The paper is very clear, and I enjoyed reading it. The daily-exercises example is particularly illuminating. My only problem is with the application (Sec. 8): it isn't clear what the algorithm looks like---see my comments below.

Significance - Justification:
The idea of incorporating prior knowledge about the observation space (or any other prior knowledge, for that matter) into the gradient direction is potentially very strong, and to my knowledge novel. In this sense, I think the paper pushes a very important idea. The paper claims an improvement over the natural gradient. The results in the paper, however, do not contain enough evidence that the proposed approach is as useful as claimed, since the theoretical results (descent direction + covariant gradient) also hold for the natural gradient, while the empirical results are not extensive enough to be convincing.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I really like the idea of using knowledge about the problem structure in the gradient metric. This idea is excellent and very clearly described. In this sense, the paper provides an important contribution. While it seems intuitive that incorporating such knowledge would lead to better algorithms, the paper is lacking in evidence that the proposed approach (i.e., the EIM) is indeed an improvement over the natural gradient. This is mainly because the experimental evaluation is not clear, and not extensive or convincing enough. I ask the authors to clarify the following issues in their rebuttal:
1. The experimental evaluation is very strange compared to standard RL evaluations. Why not show the learning curve (perhaps with some hyperparameter search to account for the effect of step size on different gradient directions)? I would expect to see faster / more robust convergence with the energetic gradient.
2. It's not clear to me what the algorithm for updating the parameters in Sec. 8 looks like. How do you compute the update direction from samples? (A sketch of the kind of estimator I have in mind appears after my summary below.)
3. Could the authors provide an intuition for the Q-function-based distance in Sec. 8? Intuitively, I would think it makes more sense to make the update *proportional* to the Q difference, not *inversely proportional* to it! I have a feeling that what's really going on here is some form of variance reduction on the gradient estimate...
4. Does this idea (the Q-function-based distance) also work in domains other than mountain car?
On the theoretical side:
5. Is the finite-sample-space assumption in Theorems 1 and 2 really necessary, or is it just to simplify the proofs?
To summarize, a more extensive empirical evaluation would be very appropriate here. Without it, I would suggest toning down the claims of improvement over the natural gradient.
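To clarify what I am asking in question 2, here is the kind of sample-based construction I would expect, written as a sketch. The REINFORCE-style gradient estimate is standard; metric_fn stands in for whatever estimator of the metric tensor (the Fisher matrix, or the paper's EIM) the authors actually use, and that estimator is precisely what I would like spelled out.

    import numpy as np

    def update_direction(scores, returns, metric_fn, damping=1e-4):
        # scores:  (N, d) array of grad_theta log pi for the sampled actions
        # returns: (N,)  array of sampled returns (or Q-estimates)
        grad = scores.T @ returns / len(returns)   # score-function gradient estimate
        G = metric_fn(scores)                      # (d, d) metric tensor estimate
        # Update direction: G^{-1} grad, damped for numerical stability.
        return np.linalg.solve(G + damping * np.eye(G.shape[0]), grad)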
Minor comments:
306: Provide a reference for, or an explanation of, Pinsker's inequality and the result that follows (the inequality is recalled below for the record).
453: The equation after (3) is a Taylor approximation of (3).
570: Would 1/H also work here?
624: Consider clarifying the importance of this corollary, especially in the $|\Omega| < 4$ case.
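For the record, the result referred to in comment 306 is Pinsker's inequality, which bounds the total variation distance by the KL divergence (standard statement, with the KL divergence in nats):

$$ \delta(P, Q) \;=\; \sup_{A} \lvert P(A) - Q(A) \rvert \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, Q)}. $$

=====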