Much recent work on large-scale visual recognition aims to scale up learning to massive, noisily-annotated datasets. We address the complementary problem of scaling up the evaluation of such models on large-scale datasets with noisy labels. Current protocols for doing so require a human user either to vet (re-annotate) a small fraction of the test set and ignore the rest, or to correct annotation errors as they are found through manual inspection of results. In this work, we re-formulate the problem as one of active testing and examine strategies for efficiently querying a user so as to obtain an accurate performance estimate with minimal vetting. We demonstrate the effectiveness of our proposed active testing framework on estimating two performance metrics, Precision@K and mean Average Precision, for two popular Computer Vision tasks, multilabel classification and instance segmentation, respectively. We further show that our approach significantly reduces human annotation effort and is more robust than alternative evaluation protocols.