Paper ID: 854
Title: Softened Approximate Policy Iteration for Markov Games

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers a variant of Newton's method as a way of softening the oscillations that can occur in approximate policy iteration algorithms. It is billed as a Markov Game paper, though that may be underselling it, because the results are interesting for the special case of MDPs too.

Clarity - Justification:
The paper is nicely written, overall. It seems to assume a fair amount of familiarity with LSPI. That was not a problem for this reviewer, but I wonder if it would be for other readers.

Significance - Justification:
Oscillation in approximate policy iteration algorithms is an annoying problem, and this approach may be helpful. The presentation is rather centered around zero-sum Markov games and LSPI-like algorithms, though the application to MDPs could be emphasized more to bring in a wider audience, and the application to other forms of approximate policy iteration should also be considered.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
paper rises -> paper raises
It might be nice to connect this with some of the projected Bellman residual minimization work coming out of Alberta. Otherwise, I don't have much to add that was already said above.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper discusses policy iteration methods for two-player zero-sum Markov games. The novelty of the work is in relating the application of Newton's method to minimizing the optimal Bellman residual, to well-known existing methods such as LSPI. The authors then show that by changing from Newton's method to the quasi-Newton method in the minimization step, the stability and performance of the resulting policy iterators are improved.

Clarity - Justification:
The contributions are quite clearly described in general; in particular, I enjoyed the detailed introduction and background. Some aspects of the main theoretical contribution are however very dense, and could be improved by providing short descriptions of concepts that are now assumed to be known to the reader, in particular "Armijo's rule" and the "Wolfe conditions". Please provide at least a reference or a short intuition behind these concepts (I sketch the standard forms at the end of this section for reference).

One of the main points made is the fact that the gradient of the objective function is not always well defined (in some sense) for MDPs, which violates the assumption on which previous proofs of convergence rely. Some intuition behind this point would help the reader to appreciate the problem. It is not immediately clear to me why or when this assumption is violated, and how serious this is in practice.

I also got a bit confused by the notation in Section 3, regarding the Q functions. It seems that a and b are used for actions taken in the previous state s and the successor state s', respectively (when looking at the definition of the state-action kernel), but it is not clear to me why the Q function Q(s,a,b) depends on both. Some clarification would help here.
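For reference, a sketch of the standard forms of Armijo's rule and the Wolfe conditions, in generic notation (objective f, iterate \omega_k, descent direction d_k, step size \alpha; see, e.g., Nocedal and Wright, Numerical Optimization):

    Armijo (sufficient decrease):  f(\omega_k + \alpha d_k) \le f(\omega_k) + c_1 \alpha \nabla f(\omega_k)^\top d_k,  with 0 < c_1 < 1;
    Wolfe (adds a curvature condition):  \nabla f(\omega_k + \alpha d_k)^\top d_k \ge c_2 \nabla f(\omega_k)^\top d_k,  with c_1 < c_2 < 1.

My reading is that the paper's step is of the damped Newton form \omega_{k+1} = \omega_k - \alpha_k H_k^{-1} \nabla f(\omega_k), with the step size \alpha_k chosen to satisfy such a condition instead of the full step \alpha_k = 1, but a one-line statement of this would help the reader.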
Significance - Justification:
The theoretical contribution seems significant and novel. However, the empirical evaluation is less convincing, mostly because the results are not described in depth. From the graphs, it is not clear to me when the quasi-Newton methods perform better in practice. In particular, the differences in Figure 3 seem hardly significant. It would be interesting to discuss specific scenarios in which quasi-Newton methods outperform Newton methods. What makes the latter oscillate and the former not? What is the main difference between the scenarios of Figures 2 and 3 that makes the differences between both sets of methods disappear?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Although I am not able to fully apprehend (and therefore judge) the correctness of the mathematical proofs, the theoretical part of the work seems thorough. The contributions are nicely embedded in, and related to, historical related work. I also like the concise and clear introduction of the relevant techniques in the background section. Some parts of the main theoretical contribution are quite dense and could be clarified here and there; see detailed comments below.

In my opinion, the weakest part of the work is the empirical results. Although the chosen testbed lends itself nicely to extensive experiments, I miss quite a lot of depth in the discussion of the results (see below for comments). This is a pity, as it distracts somewhat from the main contribution and does not help to convince me of the significance of the work. Overall, I believe the contribution to be interesting and definitely relevant to ICML, although I am less convinced by the empirical part.

Other remarks:
- l.165: "average reward" should be "expected reward"
- l.618: "not to large" => "not too large"
- l.651: please explain Q*: is this the optimal (theoretical) Q function? Please also explain the performance metric (in one sentence): why is lower better in this case (regarding the figures in Section 6)?
- l.715: why suddenly use the term "softened LSPI" instead of quasi-Newton? This is somewhat distracting; similarly for the figures.
- l.734: I would suggest sticking to the term "learning rate" instead of "step-size", since the figure captions also refer to steps, but with a different meaning.
- In Section 6.1, there is a large degree of repetition in paragraphs 3 and 4; this could be more to the point.
- The last two paragraphs of Section 6.2 are very brief. In fact, Section 6.2 currently does not contribute much, given the space it takes in the paper. Either explain the results presented in the figures in detail, or perhaps leave out some of the figures to allow space to discuss the remainder more in depth.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a Policy Iteration (PI)-like algorithm with function approximation for zero-sum Markov Games (MGs). As in Markov Decision Processes (MDPs), one may define Bellman operators, including the Bellman optimality operator, for MGs. As a result, we can define the Optimal Bellman Residual (OBR), which is similar to the usual Bellman residual, with the difference that it is defined for the optimality operator as opposed to the Bellman operator of a fixed policy. Given a linear FA, one may define the l2-norm of the OBR as an objective to be minimized. Likewise, one may define the Projected OBR (POBR), which is the projection of the OBR onto the span of the features of the function approximator, and take its l2-norm as the objective to be minimized.
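Written out in generic notation (which may not match the paper's exactly), with a linear architecture Q_\omega = \Phi\omega, the Bellman optimality operator T^* of the game, \Pi the orthogonal projection onto the span of the features, and \mu some weighting distribution, the two objectives are

    J_{OBR}(\omega) = \| T^* Q_\omega - Q_\omega \|_{2,\mu}^2    and    J_{POBR}(\omega) = \| \Pi ( T^* Q_\omega - Q_\omega ) \|_{2,\mu}^2.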
The paper derives the Newton optimization procedures for these two objectives. The result is a PI-like algorithm similar to Pollatschek and Avi-Itzhak (1969)'s. Pollatschek et al.'s approach, however, is not convergent. An attempt to fix it, by Filar and Tolwinski (1991), suggests using a quasi-Newton method instead of the Newton update, so this paper also suggests using a quasi-Newton method for the case of function approximation. The problem is that, as pointed out in the paper, the solution of Filar et al. was not correct, as it relied on a smoothness assumption that is not met; the same holds here. The paper also argues that the Newton method for the POBR is similar to, but not exactly the same as, LSPI for zero-sum MGs. They are the same for MDPs. The application of the Newton method to the OBR would be Bellman Residual Minimization PI. These algorithms are derived assuming that the transition probability matrix P is known. For the case where we have data, the paper follows the standard procedure of estimating the necessary quantities from samples. As expected, for the OBR we get a biased estimate. There are some simple experiments showing that using quasi-Newton-based methods, in both MDPs and MGs, leads to more stable behavior.

Clarity - Justification:
The paper is mostly clear. Some paragraphs can be improved, e.g., LL384-484.

Significance - Justification:
I think this is a good paper. It makes interesting connections, which may attract the attention of the community to designing new algorithms for Markov Games.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Deriving PI-like algorithms for zero-sum MGs with function approximation is important. The approach of the paper, which starts from two commonly used objectives for MDPs and re-derives them for MGs, is also a reasonable thing to do. As already mentioned, since the original algorithm for the non-FA case is not guaranteed to be convergent, the same also holds for the results here. So it seems that we have some new algorithms for MGs that can work with FA, but are not guaranteed to converge. This may not be a big deal at this stage, and the paper is clear and upfront about it. Overall, I think this is a good paper. It doesn't provide a clean answer, but it makes interesting connections, which may attract the attention of the community to designing algorithms for MGs.

I have a few questions and concerns:
- It seems that a key step of all the derivations is the result by Correa and Seeger (1985): when there exists a unique pair of strategies whose Bellman operator matches the Bellman optimality operator, the gradient of the Bellman operator is the same as the gradient of the linearized operator (discussion in LL355-369). It also appears that this is the key difference with the MG version of LSPI. My question is whether the existence of such a pair is always guaranteed. If so, please discuss this in more detail in the paper.
- Focusing on the empirical version of these algorithms, when we use a batch of data, we see that the OBR estimate is biased (similar to Bellman Residual Minimization); a short sketch of this bias is given after these comments. The POBR doesn't seem to have that problem. On the other hand, the minimizer of the POBR is not necessarily a good solution (for MDPs, the minimizer has an objective value of zero, but the quality of the solution depends on how well the function space can approximate the true value function). How much bias is too much? Can we quantify when either of them is better than the other?
- Can the modified Bellman error minimization procedure be used to fix the bias of the OBR? (Antos, Szepesvári, and Munos, "Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path," MLJ, 2008.)
- Typo: at several places, \hat{T} is used instead of \tilde{T}, e.g., L359, L364.
- L420: Are you referring to the minimum of the Bellman residual or of the projected Bellman residual?
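A short sketch of the bias I have in mind (MDP notation for simplicity; for MGs, replace the max by the minimax value): for a single sampled transition (s,a,r,s'), the sampled backup \hat{T}Q(s,a) = r + \gamma \max_{a'} Q(s',a') is an unbiased estimate of T^*Q(s,a), but its squared residual is not, since

    E[ ( \hat{T}Q(s,a) - Q(s,a) )^2 ] = ( T^*Q(s,a) - Q(s,a) )^2 + Var( \hat{T}Q(s,a) ).

Minimizing the empirical squared residual therefore also penalizes the variance of the sampled backup, which itself depends on Q; as I recall, the modified criterion of Antos et al. subtracts an auxiliary regression term precisely to cancel this variance term in expectation, hence my question above.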
=====