Workshop Poster, ICML 2021 Workshop on Computational Biology
Data Inequality, Machine Learning and Health Disparity
Yan Gao
Over 80% of clinical genetics and omics data were collected from individuals of European ancestry (EA), who comprise approximately 16% of the world’s population. This severe data disadvantage for non-EA populations is set to generate new health disparities as machine learning-powered biomedical research and health care become increasingly common. The health disparity arising from data inequality can potentially affect every data-disadvantaged ethnic group in every disease where data inequality exists. Its negative impact is therefore not limited to diseases for which significant racial/ethnic disparities are already evident. In recent work, we showed that the currently prevalent scheme for machine learning with multiethnic data, the mixture learning scheme, and its main alternative, the independent learning scheme, are prone to producing models with relatively low performance for data-disadvantaged ethnic groups, owing to inadequate training data and data distribution discrepancies among ethnic groups. We found that transfer learning can provide improved models for data-disadvantaged ethnic groups by leveraging knowledge learned from other groups with more abundant data. These results indicate that transfer learning can be an effective approach to reducing health care disparities arising from data inequality among ethnic groups.
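The three schemes named in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' method or data: the groups are synthetic, the model is a plain logistic regression trained by gradient descent, the group sizes and weight vectors are arbitrary, and transfer learning is approximated by fine-tuning a model pretrained on the data-rich group. The helper names (`make_group`, `train`, `accuracy`, `compare_schemes`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def make_group(rng, n, weights):
    """Draw n synthetic samples whose labels follow a logistic model."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in weights]
        y = 1 if rng.random() < sigmoid(dot(weights, x)) else 0
        data.append((x, y))
    return data


def train(data, dim, init=None, epochs=50, lr=0.1):
    """Full-batch gradient descent on the logistic loss."""
    w = list(init) if init is not None else [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in data:
            err = sigmoid(dot(w, x)) - y
            for j in range(dim):
                grad[j] += err * x[j]
        w = [wj - lr * gj / len(data) for wj, gj in zip(w, grad)]
    return w


def accuracy(w, data):
    hits = sum(1 for x, y in data if (1 if dot(w, x) > 0 else 0) == y)
    return hits / len(data)


def compare_schemes(seed=0):
    rng = random.Random(seed)
    # Assumed, arbitrary "true" weights: related but not identical,
    # standing in for a distribution discrepancy between groups.
    w_rich = [1.0, -1.0, 0.5, 0.0, 0.8]
    w_poor = [1.0, -0.5, 0.5, 0.6, 0.2]
    dim = len(w_rich)

    rich_train = make_group(rng, 800, w_rich)  # data-rich group
    poor_train = make_group(rng, 40, w_poor)   # data-disadvantaged group
    poor_test = make_group(rng, 500, w_poor)

    # Mixture learning: one model trained on the pooled data.
    mixture = train(rich_train + poor_train, dim)
    # Independent learning: a model trained on the small group alone.
    independent = train(poor_train, dim)
    # Transfer learning (here: fine-tuning): pretrain on the rich group,
    # then take a few smaller steps on the small group.
    pretrained = train(rich_train, dim)
    transfer = train(poor_train, dim, init=pretrained, epochs=10, lr=0.05)

    return {
        "mixture": accuracy(mixture, poor_test),
        "independent": accuracy(independent, poor_test),
        "transfer": accuracy(transfer, poor_test),
    }


if __name__ == "__main__":
    for scheme, acc in compare_schemes().items():
        print(f"{scheme}: accuracy on the data-disadvantaged group = {acc:.3f}")
```

All three accuracies are measured on held-out samples from the data-disadvantaged group only, which is the quantity the abstract is concerned with. Which scheme comes out ahead in this toy setting depends on the assumed sample sizes and on how far apart the two weight vectors are, so the output should not be read as a reproduction of the reported findings.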