

Poster

The Disparate Benefits of Deep Ensembles

Kajetan Schweighofer · Adrián Arnaiz-Rodríguez · Sepp Hochreiter · Nuria Oliver

East Exhibition Hall A-B #E-1001
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Ensembles of Deep Neural Networks, known as Deep Ensembles, are widely used as a simple way to boost predictive performance. However, their impact on algorithmic fairness is not yet well understood. Algorithmic fairness examines how a model's performance varies across socially relevant groups defined by protected attributes such as age, gender, or race. In this work, we explore the interplay between the performance gains from Deep Ensembles and fairness. Our analysis reveals that these gains unevenly favor different groups, a phenomenon that we term the disparate benefits effect. We empirically investigate this effect using popular facial analysis and medical imaging datasets with protected group attributes and find that it affects multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify that per-group differences in the predictive diversity of ensemble members can explain this effect. Finally, we demonstrate that the classical Hardt post-processing method is particularly effective at mitigating the disparate benefits effect of Deep Ensembles by leveraging their better-calibrated predictive distributions.
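To make the group fairness metrics named in the abstract concrete, here is a minimal, self-contained sketch of how statistical parity and equal opportunity gaps are commonly computed from binary predictions and a protected attribute. This is an illustration of the standard definitions, not code from the paper; all names and the toy data are hypothetical.

```python
def statistical_parity_gap(y_pred, group):
    """Largest difference in positive-prediction rate between any two groups.

    Statistical parity asks that P(Y_hat = 1) be similar across groups.
    """
    rates = []
    for g in set(group):
        preds = [p for p, gr in zip(y_pred, group) if gr == g]
        rates.append(sum(preds) / len(preds))
    return max(rates) - min(rates)


def equal_opportunity_gap(y_pred, y_true, group):
    """Largest difference in true-positive rate (recall) between any two groups.

    Equal opportunity asks that P(Y_hat = 1 | Y = 1) be similar across groups.
    """
    tprs = []
    for g in set(group):
        hits = [p for p, t, gr in zip(y_pred, y_true, group)
                if gr == g and t == 1]
        tprs.append(sum(hits) / len(hits))
    return max(tprs) - min(tprs)


# Toy data (hypothetical): a classifier that is more accurate on group 'A'
# than on group 'B' -- the kind of asymmetry the paper studies.
group  = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # one missed positive in group 'B'

print(statistical_parity_gap(y_pred, group))          # 0.5 vs 0.25 -> 0.25
print(equal_opportunity_gap(y_pred, y_true, group))   # 1.0 vs 0.5  -> 0.5
```

The disparate benefits effect the paper describes corresponds to these gaps growing when single-model predictions are replaced by ensemble-averaged ones, because the ensemble's accuracy improvement is not uniform across groups.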

Lay Summary:

Deep Ensembles (combinations of multiple deep learning models) are widely used to improve accuracy. However, in this paper we show that combining models (ensembling) can lead to different groups, e.g. those defined by race or gender, being treated less fairly. This is because ensembling helps some groups more than others, a phenomenon we call the disparate benefits effect of Deep Ensembles. We investigate this effect on multiple datasets of facial and medical images, demonstrating increased unfairness under standard fairness metrics. Furthermore, we investigate the reason for this effect and find that differing levels of disagreement between the individual models may explain it. Finally, we investigate ways to mitigate this issue and find that a known method from the literature is particularly effective in this setting. This makes it possible to use the improved accuracy of Deep Ensembles without a decrease in fairness.
