We sincerely thank the reviewers for their detailed feedback and would like to address some of their concerns here.

To Reviewer 3, we were unaware of the connection to rate-distortion theory, but appreciate the suggestion; this is an interesting connection indeed. We plan on exploring parallels between state abstraction and the information-theoretic analysis of compression in future work.

To Reviewer 4, expanding empirical results to explore the relationship between populations of abstract MDPs for particular domains and epsilon is a great suggestion; we will investigate this thread in future work as well. We appreciate the additional commentary on typos and formatting, and will handle these as suggested.

Due to space constraints, the remainder of the rebuttal will address the feedback of Reviewer 1.

To Reviewer 1: We strongly agree that there is a close connection to work in bisimulation (indeed, Dean et al. 1996, which we cite on L166, is concerned with bisimulation). We also agree that the work of Ferns et al. (UAI 2004 and 2006, “FCPP”) is especially relevant; we have added these papers as references and included a discussion of the similarities and differences to our work.

Note that the overarching agenda and primary contributions of our paper differ significantly from FCPP. Our primary aim is to provide a general theory of abstraction that bounds the suboptimality of applying the optimal policy of an abstract MDP to its ground MDP. FCPP seeks to bound the difference in value between ground states and their corresponding abstract states for a particular abstraction.

In more detail, the main value function approximation result of the 2004 paper (mentioned immediately after Theorem 5.2) is closely related to our Claim 1 of Lemma 1, and our Lemma 2. Like Claim 1 of Lemma 1, this result of FCPP provides a bound on the difference in the value of a ground state and its corresponding abstract state. Like Lemma 2, the FCPP result is concerned with model similarity. However, Claim 1 of our Lemma 1 does not concern a model similarity abstraction. Moreover, our Lemma 2 bounds the sub-optimality of the optimal abstract policy applied to the ground MDP, not the similarity of the value of a state in the ground MDP to the value of its corresponding abstract state. 

It is not apparent to us how to leverage the bound from FCPP to shave off this factor of 1/(1-gamma) as Reviewer 1 suggests. From our perspective, removing an additional 1/(1-gamma) would be a new result that is not yet in the literature. It is possible to convert a value approximation like that in FCPP to a policy approximation like the one we present (using Theorem 1 of [Singh and Yee 1994]). However, doing so comes with a corresponding increase of 1/(1-gamma) in the bounds.

We also thank Reviewer 1 for detailed feedback on the proof of Lemma 1. We plan to amend our notation to correct the type confusion of Q_T. However, we feel that the narrative afforded by the use of non-Markovian models in our proof provides a conceptual framework and clarity that would be absent if functions that in the limit converge on the desired functions were simply sprung on the reader without explanation. Consequently, we find the argument we used to be more conceptually palatable. We acknowledge that referring to a model that is time dependent and therefore non-Markovian as a “temporally heterogenous MDP” may be confusing and have clarified this phrasing.

Reviewer 1 is incorrect to think that the normalizing constant for our approximate Boltzmann is bounded by virtue of the triangle inequality. Consider the case where there is a single action with a Q value of 1 in one state and a Q value of 1000 in a second state. Both of the induced Boltzmann distributions are identical (each takes the sole action with probability 1), and so the states could be collapsed with epsilon=0, yet the normalizing constants are clearly not within |A|epsilon=1*0 of one another.
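
The counterexample above can be checked numerically with a minimal sketch. It assumes the standard Boltzmann form exp(Q)/Z at temperature 1, which may differ in detail from the approximate Boltzmann abstraction in the paper; the function name `boltzmann` is illustrative, and the normalizer is kept in log space only because exp(1000) overflows a float.

```python
import math

def boltzmann(q_values):
    """Return (probabilities, log_normalizer) for a softmax over Q-values.

    Assumption: temperature 1 and the standard exp(Q)/Z form.
    Shifting by the maximum keeps exp() from overflowing for large Q.
    """
    m = max(q_values)
    shifted = [math.exp(q - m) for q in q_values]
    total = sum(shifted)
    probs = [s / total for s in shifted]
    log_z = m + math.log(total)  # log of the normalizing constant Z
    return probs, log_z

# One action per state: Q = 1 in the first state, Q = 1000 in the second.
probs_a, log_z_a = boltzmann([1.0])
probs_b, log_z_b = boltzmann([1000.0])

# The action distributions coincide, so the states match with epsilon = 0.
max_prob_gap = max(abs(p - q) for p, q in zip(probs_a, probs_b))

# The normalizers are e^1 and e^1000; compare them through their logs.
log_z_gap = abs(log_z_a - log_z_b)
```

Here `max_prob_gap` is zero while the log-normalizers differ by 999, so the normalizing constants themselves cannot be within |A|epsilon = 0 of one another.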

In regard to the brevity of the proofs of Lemmas 3 and 4: due to space constraints, we chose to only include sketches. Apart from the steps we note in our proof sketches, the proofs consist entirely of algebraic manipulations, so we chose to omit the full proofs in favor of providing more detailed expositions of the proofs of Lemmas 1 and 2. Reviewer 1 is right to note that the error of the approximation e^x ≈ 1+x may grow large for large x. The algebra of the proof is such that when a large x is used and the approximation becomes weaker, the bound still holds but becomes less tight. In any case, we appreciate Reviewer 1’s feedback on our proof sketches of Lemmas 3 and 4, and have added details to make the proof strategies clearer.
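
To illustrate the point about the approximation, here is a small sketch of the gap between e^x and 1+x. It assumes the proof uses the approximation in the direction 1+x <= e^x, which this rebuttal does not spell out; the helper name and the sample values of x are illustrative only.

```python
import math

def linearization_gap(x):
    """Gap between e^x and its first-order approximation 1 + x."""
    return math.exp(x) - (1.0 + x)

# Illustrative sample points, from no error (x = 0) to a large x.
xs = [0.0, 0.01, 0.1, 1.0, 5.0]
gaps = [linearization_gap(x) for x in xs]

# The gap is never negative (1 + x <= e^x for all real x),
# so an upper bound built on it stays valid as x grows...
never_negative = all(g >= 0.0 for g in gaps)

# ...but the gap increases with x, so the bound loosens.
grows_with_x = all(a < b for a, b in zip(gaps, gaps[1:]))
```

Under that assumption, a bound derived by replacing 1+x with e^x remains correct for any x >= 0 and only loses tightness as x increases.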

Thanks again to all of the reviewers for their thoughtful and detailed feedback!