Paper ID: 1307
Title: Near Optimal Behavior via Approximate State Abstraction

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present sufficient conditions for achieving bounded error when performing approximate state abstraction in MDPs. They present theoretical bounds on the performance loss resulting from the approximation, as well as an empirical investigation in four problems. This work is most closely related to earlier work on exact state abstraction by Li et al. (2006), where the authors introduced a unifying theoretical framework for state abstraction in MDPs and defined five classes of state-aggregation functions. The current work generalizes two of these five cases to the approximate case.

Clarity - Justification:
The paper is clear, well organized, and easy to follow. The contributions are clearly stated.

Significance - Justification:
State approximation is an important topic in learning and planning in MDPs. This work may be useful in developing algorithms for learning approximate abstraction functions online, which would be an important development.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a well-written paper that presents new results on an important topic. While the results are not immediately useful in practice, they may be useful for developing practical algorithms in the future. I have only a few comments.
-- The empirical results could be expanded to include an analysis of the population of abstract MDPs found in a particular domain as epsilon increases.
-- It would be useful to see the theoretical bounds plotted in Figure 2.
-- Thanks to the authors for providing a code base for replicating their experiments and visualizing the results!
-- I did not see gamma specified in the empirical domains.
-- Line 750: Figure 6 --> Figure 1.
-- Please increase the text size of the axis labels and legends in Figure 2.
-- Why would T be negative in lines 361-370?

=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors investigate state abstractions in MDPs. Previous work has proposed state-aggregation functions that retain optimal performance, but these are typically computationally intractable to obtain. Instead, the authors suggest using approximate state-aggregation functions that are computationally affordable yet near-optimal. The main result is the conclusion that four simple state-aggregation functions, parameterized by an error $\epsilon$, achieve a suboptimality gap that is polynomial in the reciprocal of $\epsilon$. The theoretical results are then validated on toy problems. Although the authors do not explicitly mention it, the empirical results display properties that resemble rate-distortion trade-offs.

Clarity - Justification:
The paper was very easy to review due to its clarity and simple proofs. The relation to the existing literature was adequate, although there are obvious connections to rate-distortion theory that were not explored at all.
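To spell out the connection I have in mind (the framing below is mine, not the authors'): rate-distortion theory studies the function
$$R(D) \;=\; \min_{p(\hat{x} \mid x)\,:\, \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X}),$$
the minimal rate needed to keep the expected distortion below $D$. Reading the size of the abstract state space as the rate and the value loss as the distortion, the $\epsilon$-versus-number-of-abstract-states curves in the experiments trace an analogous trade-off.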
Significance - Justification:
Scaling existing RL algorithms to larger settings is an important challenge. In particular, understanding state abstractions and their trade-offs is an integral aspect of this process. The authors have provided theoretical guarantees for four simple state-aggregation functions.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is nearly flawless. The only minor comments I have are:
- Page 6, Definition 13: could you replace $\wedge$ with AND?
- Page 6, proof sketch of Lemma 3: $\delta_1$ and $\delta_2$ are not defined.

=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper provides novel theoretical results on four different classes of approximate state abstraction methods for Markov Decision Processes. This paper addresses a very important problem: that of providing alternative models that are both faithful to an original MDP formulation and substantially reduced in dimension. The authors provide a careful analysis of four classes of abstraction methods that have been introduced in the past. In particular, the analysis is based on providing new upper bounds on the precision of optimal value function computation under the corresponding alternative models. The authors also provide an extensive empirical illustration of the approximate models that they analyse. Although the results are significant in the field of value function approximation and the presentation is relatively clear, the overall presentation of the results lacks conciseness for half of the results, and it lacks clarity for the other half. The paper needs a serious rewrite for publication. Also, the paper ignores related work on bisimulation metrics, which is quite relevant and which could easily be used to provide tighter bounds for one of the abstraction classes studied in this paper.

Clarity - Justification:
The paper is mostly clearly written, with the exception of a few small typos, albeit the proofs of the main theoretical results would benefit from a more concise presentation. See specific feedback below:
L132: You use $\mathcal{T}(s' | s,a)$ for transition probabilities, but later you use $\mathcal{T}(s,a,s')$. Be consistent.
L159: You should reference Puterman's "Markov Decision Processes: Discrete and Stochastic Dynamic Programming" (1994) for more background. It is a much more relevant resource.
L179: Please refer to work by Norm Ferns (UAI 2004, 2006) for more work on approximate state abstraction.
L245: Definitions 4, 5, and 6 can easily go together; this makes the presentation much clearer. Also, you should add here the definition of the Q-values based on this model (which you use on Line 422 but never define explicitly).
L297: The function $f$ should map to $\mathbb{R}$, not $\mathcal{R}$.
L298: $\epsilon$ should be part of the definition, i.e., $\phi_{f,\epsilon}$: the abstraction depends both on $\epsilon$ and on the function $f$ that is used for evaluation.
L316: The theorem is confusing... it is weird to call them "classes", since you never really explain what a class is or what the difference between these classes is. Instead, you should refer back to the Li et al. (2006) paper, which uses the notion of a "finer relation" (Definition 2 on page 3 of the Li et al. paper).
L316: Make it clear that you will prove this theorem in the following section, using the four lemmas.
L322: Why don't you use $V_G$ instead of $V_G^{\pi_G^*}$? You did define it in Definition 10. If you don't do that, what is the point of Definition 10? The same applies to the remainder of the paper.
L331: Definition 12 is redundant: you already stated this in Definition 9. There is no need to repeat yourself with Definitions 9 and 12.
L354-L372: The whole paragraph is untrue and unnecessary. If the model changes after T steps, then the model is no longer Markovian, so you are not defining "a temporally homogenous MDP", as you call it. Also, what is the point of the whole paragraph, when you can simply define $Q_T(s,a)$ as a function that becomes $Q_G(s,a)$ in the limit (as $T \to \infty$)? I really don't see the point of this paragraph.
L383: Is $Q_T$ a function over $\mathcal{S}_G \times \mathcal{A}$ or $\mathcal{S}_A \times \mathcal{A}$? You define $Q_T$ on line 377 for $s \in \mathcal{S}_G$, but then you define $\sigma_{T-1}$ to depend on $Q_{T-1}(s_A',a')$. This makes no sense unless you explain what $Q_T$ is for elements of $\mathcal{S}_A \times \mathcal{A}$. Please clarify.
L414: Just as above, I am again confused about what the induction hypothesis is. You want to prove something about $Q_T$ over $\mathcal{S}_G$, but then you use the i.h. on elements of $\mathcal{S}_A$. Fix this in the proof.
L462-478: Equations 17-20 could be written in a more compact way. The proof takes too much space for a derivation that is quite clear.
L483: Much as you did with $Q_T$, try to keep the analysis over Markovian models. Don't define a policy that depends on time. Instead, define $V_G^{\pi_{A,t}}$ as a function that converges in the limit to what is needed, then use induction to bound this limit.
L549: Use some marker to make it clear when your proofs are done.
L592: What are RMAX and QMAX? You write "We defined RMAX =1", but I have a hard time finding both the place where you define these quantities and the place where you set RMAX to 1.
L621: This assumption is unnecessary. It follows easily from Eq. 27, by the triangle inequality. The difference in the normalizing terms is upper bounded by $|\mathcal{A}| \times \epsilon$.
L629: $k = |\mathcal{A}|$.
L636: The sketch of the proof of Lemma 3 is misleading, and I would be curious to see a proof based on this sketch. First, $e^x$ can be approximated by $1 + x + \delta$, but we are trying to prove a lemma that holds for all $\epsilon$, which means that the difference between $1+x$ and $e^x$ could be quite significant for some errors $\epsilon$ which are small (see the remainder bound I sketch right after this list). Also, just because there is a $k\epsilon$ error in the normalizing terms, it does not mean that one can simply replace the two the way it is done in the sketch. Please simplify the proofs of Lemmas 1 and 2, and use the space to provide a proper proof of this lemma.
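To make the concern in the L636 comment concrete (the bound below is textbook calculus, not something taken from the paper): by the Lagrange form of the Taylor remainder,
$$e^x \;=\; 1 + x + \frac{x^2}{2}e^{\xi} \quad \text{for some } \xi \text{ between } 0 \text{ and } x,$$
so the neglected term $\delta$ is of order $x^2$ and does not shrink with $\epsilon$. When $\epsilon$ is small, a $\Theta(x^2)$ discrepancy can dominate the $\epsilon$-sized quantities the sketch manipulates, which is why this approximation step needs an explicit justification.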
Significance - Justification:
The contributions of the paper rely heavily on the theoretical bounds on the approximation error induced by different abstraction methods. In particular, four different classes of approximation are analysed. If all the lemmas were carefully proved, the contributions would be substantial enough for publication. As is, the paper contains a convoluted proof for one model, a second bound that is not as tight as state-of-the-art theoretical results on model similarity, and two other bounds that are provided with sketched proofs. See details below:
Lemma 1: See the section on Clarity for how the proof could be shortened.
Lemma 2: Ferns, N., Panangaden, P., and Precup, D., "Metrics for Finite Markov Decision Processes", UAI 2004, have already proven bounds for this model. Although they don't give the exact upper bound, it can easily be derived from their results that the bound in Lemma 2 is actually $\frac{2\epsilon + 2\gamma|S_G|\epsilon}{(1-\gamma)^2}$, which is a factor of $1/(1-\gamma)$ tighter.
Lemmas 3, 4: The sketches are not convincing, as already explained in the Clarity section.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See the sections on Clarity and Significance.

=====