Much recent work on large-scale visual recognition aims to scale up learning to massive, noisily annotated datasets. We address the problem of scaling up the evaluation of such models to large-scale datasets with noisy labels. Current protocols for doing so require a human user to either vet (re-annotate) a small fraction of the test set and ignore the rest, or else correct errors in annotation as they are found through manual inspection of results. In this work, we re-formulate the problem as one of active testing, and examine strategies for efficiently querying a user so as to obtain an accurate performance estimate with minimal vetting. We demonstrate the effectiveness of our proposed active testing framework on estimating two performance metrics, Precision@K and mean Average Precision, for two popular computer vision tasks, multi-label classification and instance segmentation, respectively. We further show that our approach significantly reduces human annotation effort and is more robust than alternative evaluation protocols.