Perplexity Cannot Always Tell Right from Wrong
Abstract
Perplexity, a function measuring a model's overall level of "surprise" when encountering a particular output, has gained significant traction in recent years, both as a loss function and as a simple-to-compute metric of model quality. Prior studies have pointed out several limitations of perplexity, often from an empirical standpoint. Here we leverage recent results on Transformer continuity to show rigorously how perplexity may be an unsuitable metric for model selection. Specifically, we prove that if there is any sequence that a compact decoder-only Transformer model predicts accurately and confidently (a prerequisite for strong generalisation), then there must exist another sequence that has very low perplexity but is not predicted correctly by that same model. Further, by analytically studying iso-perplexity plots, we find that perplexity will not always select for the more accurate model: any increase in model confidence must be accompanied by a commensurate rise in accuracy for the new model to be selected.
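As a rough numerical illustration of the two claims above (a toy sketch, not the construction used in the proofs), the snippet below computes perplexity as the exponential of the mean negative log-probability assigned to the true tokens. The probabilities, the helper `toy_model_perplexity`, and its two-outcome simplification are all assumptions made for this example.

```python
import math


def perplexity(target_probs):
    """Exponential of the mean negative log-probability of the true tokens."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


# Claim 1 (illustrative numbers): 99 tokens are predicted confidently and
# correctly, but the last true token only receives probability 0.40.
# ASSUMPTION: some other token holds the remaining 0.60 and wins the argmax,
# so the sequence is NOT predicted correctly, yet its perplexity stays low.
seq_probs = [0.99] * 99 + [0.40]
ppl_seq = perplexity(seq_probs)


def toy_model_perplexity(accuracy, confidence):
    """Perplexity of a hypothetical toy model.

    ASSUMPTION: the model always puts `confidence` on its top token, which is
    correct a fraction `accuracy` of the time; whenever it is wrong, the true
    token receives all of the remaining mass, 1 - confidence.
    """
    nll = -(accuracy * math.log(confidence)
            + (1.0 - accuracy) * math.log(1.0 - confidence))
    return math.exp(nll)


# Claim 2 (illustrative numbers): model B is more accurate than model A,
# but its confidence has grown much faster than its accuracy.
ppl_a = toy_model_perplexity(accuracy=0.90, confidence=0.90)
ppl_b = toy_model_perplexity(accuracy=0.92, confidence=0.99)

print(f"mispredicted sequence perplexity: {ppl_seq:.4f}")
print(f"model A perplexity: {ppl_a:.4f}")
print(f"model B perplexity: {ppl_b:.4f}")
```

Under these assumed numbers, the mispredicted sequence still has perplexity close to 1, and the more accurate but overconfident model B ends up with the higher perplexity, so selecting by perplexity would favour the less accurate model A.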