Poster in Workshop: DMLR Workshop: Data-centric Machine Learning Research
Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors
Jesse Cummings · Jonas Mueller · Elías Snorrason
We present a straightforward statistical test to detect certain violations of the assumption that the data are Independent and Identically Distributed (IID). The specific form of violation considered is common across real-world applications: whether the examples are ordered in the dataset such that almost adjacent examples tend to have more similar feature values (e.g. due to distributional drift, or attractive interactions between datapoints). Based on a k-Nearest Neighbors estimate, our approach can be used to audit any multivariate numeric data as well as other data types (image, text, audio, etc.) that can be numerically represented, perhaps via model embeddings. Compared with existing methods to detect drift or auto-correlation, our approach is both applicable to more types of data and also able to detect a wider variety of IID violations in practice.
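To make the idea concrete, here is a minimal sketch of one way a k-Nearest Neighbors ordering check could look. It is not the authors' implementation: the abstract does not specify the test statistic or the null distribution, so this sketch assumes a simple statistic (the mean gap in dataset position between each example and its k nearest neighbors in feature space) and swaps in a permutation test over the dataset ordering. The function names `knn_indices`, `mean_neighbor_index_gap`, and `ordering_permutation_test`, and the parameters `k`, `n_permutations`, and `seed`, are all hypothetical.

```python
import math
import random


def knn_indices(points, k):
    """For each point, return the indices of its k nearest neighbors (brute force)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def mean_neighbor_index_gap(neighbors, positions):
    """Average |position(i) - position(j)| over all (point, neighbor) pairs."""
    total = 0
    count = 0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            total += abs(positions[i] - positions[j])
            count += 1
    return total / count


def ordering_permutation_test(points, k=5, n_permutations=200, seed=0):
    """Assumed test: are feature-space neighbors unusually close in dataset order?

    Under IID sampling the ordering is exchangeable, so shuffling positions
    gives a null distribution for the mean neighbor index gap.
    """
    rng = random.Random(seed)
    neighbors = knn_indices(points, k)
    positions = list(range(len(points)))
    observed = mean_neighbor_index_gap(neighbors, positions)
    at_least_as_small = 0
    for _ in range(n_permutations):
        rng.shuffle(positions)
        if mean_neighbor_index_gap(neighbors, positions) <= observed:
            at_least_as_small += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + at_least_as_small) / (1 + n_permutations)
    return observed, p_value


# Synthetic data whose first feature drifts with dataset position.
data_rng = random.Random(42)
drifting = [(0.05 * i + data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for i in range(200)]

# The same examples in shuffled order, which removes the ordering effect.
shuffled = drifting[:]
random.Random(1).shuffle(shuffled)

gap_drift, p_drift = ordering_permutation_test(drifting)
gap_shuffled, p_shuffled = ordering_permutation_test(shuffled)
print(f"drifting: mean gap={gap_drift:.1f}, p={p_drift:.3f}")
print(f"shuffled: mean gap={gap_shuffled:.1f}, p={p_shuffled:.3f}")
```

A small p-value on the drifting data suggests that nearby examples in feature space are also nearby in dataset order, which is the kind of IID violation the abstract describes. For images, text, or audio, `points` would hold model embeddings rather than raw features.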