Aggregate evaluations of deep learning models on popular benchmarks have incentivized the creation of ever-bigger models that are more accurate on i.i.d. data. As the research community has come to realize that these models do not generalize out of distribution, the trend has shifted toward evaluation on adversarially constructed, unnatural datasets. However, both of these extremes fall short of the goals of evaluation. In this talk, I propose that the goal of evaluation is to inform a user's next action, in the form of either (1) further analysis or (2) model patching. Treating evaluation as an iterative process dovetails with these goals. Our work on Robustness Gym (RG) proposes such an iterative evaluation process and shows how it enables users to refine their model development workflow. I will give two concrete examples from NLP demonstrating how RG supports the aforementioned evaluation goals. Toward the end of the talk, I will discuss some caveats associated with evaluating pre-trained language models (PLMs), focusing in particular on the problem of input contamination, with examples from our work on SummVis. Using these examples from RG and SummVis, I hope to draw attention to the limitations of current evaluations and the need for a more thorough process that gives us a deeper understanding of our deep learning models.