First, we would like to thank the reviewers for their time and helpful reviews. We commit to updating the paper according to our answers in the final version if accepted.

To reviewer 1: We fully agree that, since the community interested in MDPs is wider than the one interested in MGs, this paper would reach a wider audience if we had restricted ourselves to MDPs. However, since all the contributions apply to the special case of MDPs as a direct implication of the results on MGs, we decided to present our work from the more generic perspective of zero-sum two-player MGs. Perhaps we could be more specific in the introduction about the links with MDPs. Due to lack of space and for the sake of clarity, we focused only on the linear case; yet, this paper can indeed be a good starting point to revisit other policy iteration methods. Concerning the link with the Alberta paper: we did cite Maei's paper, but you are right, and we will emphasize the connection in the revised version if accepted.

To reviewer 2: Existence of the derivative: the existence of a pair of strategies is always guaranteed; its uniqueness, however, is not. When reduced to an MDP, the problem can also occur if two actions achieve the maximum of the optimal Bellman operator (i.e., the argmax is a set with multiple elements). Correa's result provides a way to remove such ambiguities by specifying the directional derivative of the Bellman operator even when the pair of minmax strategies is not unique. Bias issues: since we are dealing with finite state spaces, all the estimators of the Bellman residual norm are biased. However, they are all consistent estimators. On line 600 we have an unbiased estimator of $\Psi_\omega \omega - \upsilon$, but the estimator of its $L_2$ norm is biased yet consistent (a short sketch of the standard argument is given at the end of this response). The choice of the Bellman residual (OBR or POBR) remains, to our knowledge, an open problem. One can say that, when the dynamics is deterministic, the OBR estimator is unbiased and should thus be preferred. Modified Bellman errors: to our knowledge, the work of Antos considers the linear Bellman residual, whereas we consider the optimal Bellman residual. We do not know to what extent their proof can be adapted to the optimal Bellman residual, but it would be a good starting point for addressing the question.

To reviewer 3: Concerning the clarity of the paper:
-Gradient of the objective function: the smoothness of the objective function is required for most optimization methods to be guaranteed to converge in theory. In practice, gradient methods work well even for non-smooth functions.
-$Q$-function: actions $a$ and $b$ are taken in state $s$, and actions $a'$ and $b'$ are taken in state $s'$. In MGs there are two players who act simultaneously; thus, the $Q$-function depends on the pair of actions of both players, $Q(s,a,b)$ (see the display at the end of this response).
Concerning the detailed comments:
-Use of quasi-Newton methods: to our knowledge, quasi-Newton methods are generally preferable to Newton's method from an optimization point of view. What we can say from a practical point of view is that quasi-Newton methods do not under-perform Newton's method; they also prevent oscillating behavior.
-The definition of $Q^*$: $Q^*$ is the fixed point of the $\mathcal{T}^*$ operator defined in section 3. It is the minmax value of the game. This will be emphasized in the final version.
-The definition of the error: the error is $\|Q^* - Q_\mu\| / \|Q^*\|$, i.e., the norm of the difference between $Q^*$ and $Q_\mu$ (where $Q_\mu$ is the value of the strategy $\mu$ when the opponent plays optimally), normalized by the norm of $Q^*$.
-The experiment section: indeed, paragraphs 3 and 4 contain repetitions; they will be merged in the final version if the paper is accepted. We believe section 6.2 is necessary since it reports results on MGs (instead of MDPs in section 6.1). We will drop figure 3 and explain the setting and results in more detail.
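As announced in the answer to reviewer 2, here is a minimal sketch of the bias/consistency argument, written with generic notation that is not necessarily the paper's: let $\hat{R}_n$ denote any unbiased and consistent estimator, built from $n$ samples, of a residual vector $R$ (on line 600, $R = \Psi_\omega \omega - \upsilon$). Then
$$\mathbb{E}\big[\|\hat{R}_n\|_2^2\big] \;=\; \big\|\mathbb{E}[\hat{R}_n]\big\|_2^2 + \operatorname{tr}\!\big(\operatorname{Var}(\hat{R}_n)\big) \;=\; \|R\|_2^2 + \operatorname{tr}\!\big(\operatorname{Var}(\hat{R}_n)\big) \;\ge\; \|R\|_2^2.$$
The plug-in estimator of the squared $L_2$ norm is therefore biased upward, but the bias term vanishes as the variance of $\hat{R}_n$ goes to zero with the number of samples, which yields consistency.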
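As referred to in the answers to reviewer 3, here is the kind of display we plan to add in the final version to make $Q(s,a,b)$, $\mathcal{T}^*$ and $Q^*$ explicit. The notation below is generic (reward $r(s,a,b)$, transition kernel $p(s'|s,a,b)$, discount factor $\gamma$, and $\Delta(A)$, $\Delta(B)$ the sets of distributions over the two action sets) and may differ slightly from the conventions of section 3:
$$(\mathcal{T}^* Q)(s,a,b) \;=\; r(s,a,b) + \gamma \sum_{s'} p(s'|s,a,b)\, \max_{\mu \in \Delta(A)} \min_{\nu \in \Delta(B)} \sum_{a',b'} \mu(a')\,\nu(b')\, Q(s',a',b').$$
$Q^*$ is the fixed point of $\mathcal{T}^*$, i.e., the minmax value of the game.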