Tutorial

Calibration and Bias in Algorithms, Data, and Models: a tutorial on metrics and plots for measuring calibration, bias, fairness, reliability, and robustness

Mark Tygert


Abstract:

There are many different notions of bias and fairness. When comparing subpopulations, an especially important dichotomy is between (1) equal or equitable average outcomes and (2) equal or equitable treatment. In the particular context considered here, "equal treatment" and "equal opportunity" are not too different. However, comparing the average outcome of one subpopulation to another is different and sometimes less desirable than comparing the outcomes of pairs of individuals (one individual from each subpopulation) for which the individuals in each pair are similar. The latter requires comparing outcomes via "conditioning on" or "controlling for" confounding covariates.

Conditioning on or controlling for covariates helps compare only those who are comparable. That often means matching up people by their age or income, for example, and then looking at differences in results between people with similar ages or similar incomes. Yet that raises the question: how many people with exactly the same age or exactly the same income are in the data? If there are too few, they will be unrepresentative, and the randomness in their results will fail to average away. This would seem to call for matching up people whose ages or incomes are only close, not exactly the same. But how close is "close"? Does it matter?

Choosing how close is "close" turns out to make all the difference. In many cases, the data can be made to support any conclusion whatsoever simply by manipulating what counts as "close." In conventional practice, adjusting data for covariates such as age and income thus often amounts to fudging the numbers or spinning the facts and figures. Even the well-intentioned are susceptible to confirmation bias, cherry-picking, or otherwise making the data merely confirm expectations.
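To see how the choice of "close" can flip a conclusion, consider the following toy illustration (the data and binning scheme are invented for this sketch, not taken from the tutorial). With narrow covariate bins, group 1 outperforms group 0 within every bin; lumping everyone into one wide bin reverses the sign, because group 1 is concentrated where outcomes are low overall:

```python
from collections import defaultdict

# Each record: (covariate value, group, outcome). Group 1 scores higher
# than group 0 at every covariate value, but most of group 1 sits at the
# covariate value where outcomes are low for everyone.
records = [
    (10, 0, 1.0),
    (10, 1, 2.0), (10, 1, 2.0), (10, 1, 2.0),
    (20, 0, 5.0), (20, 0, 5.0), (20, 0, 5.0),
    (20, 1, 6.0),
]

def binned_difference(records, bin_width):
    """Mean outcome of group 1 minus group 0, averaged over the
    covariate bins that contain members of both groups."""
    bins = defaultdict(lambda: {0: [], 1: []})
    for x, g, y in records:
        bins[x // bin_width][g].append(y)
    diffs = []
    for groups in bins.values():
        if groups[0] and groups[1]:
            diffs.append(sum(groups[1]) / len(groups[1])
                         - sum(groups[0]) / len(groups[0]))
    return sum(diffs) / len(diffs)

fine = binned_difference(records, bin_width=10)    # compares like with like
coarse = binned_difference(records, bin_width=100) # lumps everyone together
# fine is +1.0 (group 1 looks better); coarse is -1.0 (group 1 looks worse).
```

The same data thus "show" either subpopulation to be better off, depending only on the analyst's bin width.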

This tutorial shows how to avoid setting how close is "close." Without any parameter to tune, the tutorial's graphical methods and scalar summary statistics cannot mislead, not even in principle. These methods are thus well-suited for assessing bias, fairness, reliability, the calibration of predicted probabilities, and other treatment effects. The analysis applies to observational studies as well as to randomized controlled trials, including A/B tests. The most common use case is for analyzing the predictions or other outputs of machine-learned models.
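The tutorial's exact plots and statistics are not reproduced here, but cumulative-difference statistics in this spirit (a flavor used in Tygert's related work) can be sketched as follows; the matched-pairs setup and the function names are illustrative assumptions, not the tutorial's definitive construction:

```python
# Illustrative sketch: cumulative differences between paired outcomes,
# ordered by a covariate, with scalar summaries that require no bandwidth
# or bin width to tune.

def cumulative_differences(pairs):
    """pairs: list of (covariate, outcome_a, outcome_b) for matched pairs.
    Returns the cumulative average difference after each pair, taken in
    order of increasing covariate -- the curve one would plot."""
    ordered = sorted(pairs, key=lambda p: p[0])
    n = len(ordered)
    curve, total = [], 0.0
    for _, ya, yb in ordered:
        total += (ya - yb) / n
        curve.append(total)
    return curve

def scalar_summaries(curve):
    """Kolmogorov-Smirnov-like and Kuiper-like summaries of the curve:
    the maximum absolute deviation from zero, and the range."""
    ks = max(abs(c) for c in curve)
    kuiper = max(curve + [0.0]) - min(curve + [0.0])
    return ks, kuiper

# A constant treatment effect of +1 across the covariate range shows up
# as a steadily climbing curve:
curve = cumulative_differences([(1, 2.0, 1.0), (2, 3.0, 2.0),
                                (3, 4.0, 3.0), (4, 5.0, 4.0)])
ks, kuiper = scalar_summaries(curve)
```

A roughly linear climb in the curve indicates a treatment effect spread across the covariate range, while a flat stretch indicates none there; neither reading depends on any tunable parameter.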
