This is the third edition of a highly successful series of workshops focused on data-centric AI, following the Data-Centric AI workshop at NeurIPS 2021 and the DataPerf workshop at ICML 2022. Data, and operations over data (e.g., cleaning, debugging, curation), have been fueling the success of machine learning for decades. While the ML community has historically focused primarily on model development, data quality has recently attracted intensive interest: the creation of the NeurIPS Datasets and Benchmarks track, several data-centric AI benchmarks (e.g., DataPerf), and the flourishing of data consortiums such as LAION have all directed the community's attention to the quality of data used for ML training and evaluation. The goal of this workshop is to facilitate these important topics in what we call Data-centric Machine Learning Research (DMLR), which includes not only datasets and benchmarks but also tooling and governance, as well as fundamental research on topics such as data quality and data acquisition for dataset creation and optimization.
Sat 12:00 p.m. - 12:05 p.m. | Introduction and Opening (Opening Remarks) | Praveen Paritosh
Sat 12:05 p.m. - 12:40 p.m. | Keynote 1: Andrew Ng (Landing AI) (Keynote) | Andrew Ng
Sat 12:40 p.m. - 1:10 p.m. | Data-centric Ecosystem: Croissant and DataPerf - Peter Mattson (Google & MLCommons) (Talk) | Peter Mattson · Praveen Paritosh
Sat 1:10 p.m. - 1:25 p.m. | Coffee break / networking break (Break)
Sat 1:25 p.m. - 2:00 p.m. | Keynote 2: Mihaela van der Schaar (University of Cambridge) - Reality-Centric AI (Keynote) | Mihaela van der Schaar
Sat 2:00 p.m. - 2:30 p.m. | Invited Talk 2: Olga Russakovsky (Princeton University) (Talk) | Olga Russakovsky · Vikram V Ramaswamy
Sat 2:30 p.m. - 3:00 p.m. | Invited Talk 3: Masashi Sugiyama (RIKEN & UTokyo) - Data distribution shift (Talk) | Masashi Sugiyama
Sat 3:00 p.m. - 4:00 p.m. | Lunch break / networking break
Sat 4:00 p.m. - 4:35 p.m. | Keynote 3: Isabelle Guyon (Google Brain) - Towards Data-Centric AutoML (Keynote) | Isabelle Guyon
Sat 4:35 p.m. - 5:05 p.m. | Invited Talk 1: Dina Machuve (DevData Analytics) - Data for Agriculture (Talk) | Dina Machuve
Sat 5:05 p.m. - 5:20 p.m. | Announcement and open discussion on DMLR (selected members of the DMLR Advisory Board) (Discussion Panel) | Ce Zhang
Sat 5:20 p.m. - 6:15 p.m. | Panel Discussion (Discussion Panel) | Megan Ansdell · Nathan Lambert · Ludwig Schmidt · Praveen Paritosh · Sang Michael Xie
Sat 6:15 p.m. - 6:30 p.m. | Coffee break / networking break (Break)
Sat 6:30 p.m. - 7:30 p.m. | Poster Session 1 (Poster Session - In Person)
Sat 7:30 p.m. - 8:00 p.m. | Poster Session 2 (Poster Session - Virtual)
Training on Thin Air: Improve Image Classification with Generated Data (Poster)
Acquiring high-quality data for training discriminative models is a crucial yet challenging aspect of building effective predictive systems. In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model, Stable Diffusion, to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion, and generates diverse novel training images by conditioning the generative model on noisy versions of these vectors. We identify three key components that allow our generated images to successfully supplant the original dataset, leading to a 2-3x enhancement in sample complexity and a 6.5x decrease in sampling time. Furthermore, our approach consistently outperforms generic prompt-based steering methods and KNN retrieval baseline across a wide range of datasets, exhibiting especially remarkable results in specialized fields like medical imaging. Furthermore, we demonstrate the compatibility of our approach with widely-used data augmentation techniques, as well as the reliability of the generated data in supporting various neural architectures and enhancing few-shot learning performance. |
Yongchao Zhou · Hshmat Sahak · Jimmy Ba

DMOps: Data Management Operations and Recipes (Poster)
Data-centric AI has shed light on the significance of data within the machine learning (ML) pipeline. Recognizing its significance, academia, industry, and government departments have suggested various NLP data research initiatives. While the ability to utilize existing data is essential, the ability to build a dataset has become more critical than ever, especially in the industry. In consideration of this trend, we propose a "Data Management Operations and Recipes" to guide the industry in optimizing the building of datasets for NLP products. This paper presents the concept of DMOps which is derived from real-world experiences with NLP data management and aims to streamline data operations by offering a baseline. |
Eujeong Choi · Chanjun Park

Transcending Traditional Boundaries: Leveraging Inter-Annotator Agreement (IAA) for Enhancing Data Management Operations (DMOps) (Poster)
This paper presents a novel approach of leveraging Inter-Annotator Agreement (IAA), traditionally used for assessing labeling consistency, to optimize Data Management Operations (DMOps). We advocate for the use of IAA in predicting the labeling quality of individual annotators, leading to cost and time efficiency in data production. Additionally, our work highlights the potential of IAA in forecasting document difficulty, thereby boosting the data construction process's overall efficiency. This research underscores IAA's broader application potential in data-driven research optimization and holds significant implications for large-scale data projects prioritizing efficiency, cost reduction, and high-quality data. |
Damrin Kim · NamHyeok Kim · Chanjun Park · Harksoo Kim

To Aggregate or Not? Learning with Separate Noisy Labels (Poster)
Raw training data often comes with separate noisy labels collected from multiple imperfect annotators (e.g., via crowdsourcing). A typical way of using these separate labels is to first aggregate them into one and then apply standard training methods. The literature has also extensively studied effective aggregation approaches. This paper revisits this choice and aims to answer the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including the ones designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusions.
Jiaheng Wei · Zhaowei Zhu · Tianyi Luo · Ehsan Amid · Abhishek Kumar · Yang Liu
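As a concrete illustration of the two options compared above, the sketch below (an editor's illustration, not the authors' code; all names and data are made up) contrasts majority-vote aggregation with keeping the separate labels by repeating each example once per annotation.

```python
# Two ways to use multiple noisy labels per example when building a training set.
import numpy as np

def aggregate_majority(labels_per_example):
    """Collapse each example's annotations into a single majority-vote label."""
    return [np.bincount(labels).argmax() for labels in labels_per_example]

def separate_labels(features, labels_per_example):
    """Keep annotations separate: repeat each example once per annotation."""
    X, y = [], []
    for x, labels in zip(features, labels_per_example):
        for label in labels:
            X.append(x)
            y.append(label)
    return np.array(X), np.array(y)

# Toy example: 3 examples, each annotated by 3 labelers.
features = np.random.randn(3, 5)
labels_per_example = [np.array([1, 1, 0]), np.array([0, 0, 0]), np.array([1, 0, 1])]

y_agg = aggregate_majority(labels_per_example)                # one label per example
X_sep, y_sep = separate_labels(features, labels_per_example)  # nine (x, y) pairs
print(y_agg, X_sep.shape, y_sep.shape)
```

Per the paper's conclusion, the second option tends to be preferable when noise rates are high or when there are few annotations per example.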
On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training (Poster)
Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application of predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application by an improvement of around 2 points on the downstream tasks when using ImageNet as the pre-training dataset. |
Jieyu Zhang · Bohan Wang · zhengyu hu · Pang Wei Koh · Alex Ratner
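A minimal sketch of how one might sweep the intra-/inter-class trade-off studied above: build fixed-budget pre-training subsets that vary the number of classes versus samples per class. This is an editor's illustration under assumed variable names, not the paper's code.

```python
# Build fixed-size pre-training subsets with different class-to-sample ratios.
import numpy as np

def subsample_fixed_budget(labels, budget, num_classes, seed=0):
    """Pick `num_classes` classes and roughly `budget // num_classes` samples per class."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(labels), size=num_classes, replace=False)
    per_class = budget // num_classes
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=min(per_class, len(pool)), replace=False))
    return np.array(idx)

labels = np.random.randint(0, 1000, size=100_000)   # toy label array
for k in (100, 250, 500, 1000):                     # sweep the class-to-sample ratio
    subset = subsample_fixed_budget(labels, budget=50_000, num_classes=k)
    print(k, len(subset))
```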
Inter-Annotator Agreement in the Wild: Uncovering Its Emerging Roles and Considerations in Real-World Scenarios (Poster)
Inter-Annotator Agreement (IAA) is commonly used as a measure of label consistency in natural language processing tasks. However, in real-world scenarios, IAA has various roles and implications beyond its traditional usage. In this paper, we not only consider IAA as a measure of consistency but also as a versatile tool that can be effectively utilized in practical applications. Moreover, we discuss various considerations and potential concerns when applying IAA and suggest strategies for effectively navigating these challenges. |
NamHyeok Kim · Chanjun Park
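For readers unfamiliar with the underlying metric, here is a minimal sketch of pairwise Cohen's kappa plus a simple per-annotator agreement score, one possible proxy for the annotator-quality use discussed above. The data and scoring rule are illustrative only.

```python
# Pairwise Cohen's kappa between annotators and a mean-agreement score per annotator.
import numpy as np
from sklearn.metrics import cohen_kappa_score

annotations = np.array([   # rows = items, columns = annotators
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
])

n_annotators = annotations.shape[1]
pairwise = np.eye(n_annotators)
for i in range(n_annotators):
    for j in range(i + 1, n_annotators):
        kappa = cohen_kappa_score(annotations[:, i], annotations[:, j])
        pairwise[i, j] = pairwise[j, i] = kappa

# Mean kappa of each annotator against the others (excluding self-agreement).
annotator_score = (pairwise.sum(axis=1) - 1) / (n_annotators - 1)
print(pairwise.round(2), annotator_score.round(2))
```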
Algorithm Selection for Deep Active Learning with Imbalanced Datasets (Poster)
Label efficiency has become an increasingly important objective in deep learning applications. Active learning aims to reduce the number of labeled examples needed to train deep networks, but the empirical performance of active learning algorithms can vary dramatically across datasets and applications. It is difficult to know in advance which active learning strategy will perform well or best in a given application. To address this, we propose the first adaptive algorithm selection strategy for deep active learning. For any unlabeled dataset, our (meta) algorithm TAILOR (Thompson ActIve Learning algORithm selection) iteratively and adaptively chooses among a set of candidate active learning algorithms. TAILOR uses novel reward functions aimed at gathering class-balanced examples. Extensive experiments in multi-class and multi-label applications demonstrate TAILOR's effectiveness in achieving accuracy comparable to or better than that of the best of the candidate algorithms.
Jifan Zhang · Shuai Shao · Saurabh Verma · Robert Nowak
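A conceptual sketch of the meta-selection loop described above, using Beta-Bernoulli Thompson sampling over candidate strategies. The reward here is a stand-in for the paper's richer class-balance-aware reward functions, and `query_batch` / `reward_fn` are hypothetical callables.

```python
# Thompson sampling over candidate active-learning strategies (conceptual sketch).
import numpy as np

def thompson_select(successes, failures, rng):
    """Sample one score per arm from its Beta posterior and pick the best arm."""
    samples = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

def run_meta_al(strategies, query_batch, reward_fn, rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    s = np.zeros(len(strategies))   # pseudo-successes per strategy
    f = np.zeros(len(strategies))   # pseudo-failures per strategy
    for _ in range(rounds):
        arm = thompson_select(s, f, rng)
        batch = query_batch(strategies[arm])   # ask the chosen AL algorithm for a batch
        reward = reward_fn(batch)              # e.g. fraction of rare-class labels obtained
        s[arm] += reward
        f[arm] += 1.0 - reward
    return s, f

# Toy usage with dummy strategies and a random reward.
rng = np.random.default_rng(1)
s, f = run_meta_al(
    strategies=["margin", "entropy", "coreset"],
    query_batch=lambda name: name,
    reward_fn=lambda batch: rng.random(),
)
print(s.round(1), f.round(1))
```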
How to Improve Imitation Learning Performance with Sub-optimal Supplementary Data? (Poster)
Imitation learning (IL) is a machine learning technique that involves learning from examples provided by an expert. IL algorithms can solve sequential decision-making tasks, but their performance usually suffers when the amount of expert data is limited. To address this challenge, a new data-centric framework called (offline) IL with supplementary data has emerged, which additionally utilizes an imperfect dataset inexpensively collected from sub-optimal policies. However, the supplementary data may contain out-of-expert-distribution samples, making it tricky to utilize the supplementary data to improve performance. In this paper, we focus on a classic offline IL algorithm called behavioral cloning (BC) and its variants, studying the imitation gap bounds in the context of IL with supplementary data. Our theoretical results show that a naive method, which applies BC on the union of expert and supplementary data, has a non-vanishing imitation error. As a result, its performance may be worse than BC relying solely on the expert data. To address this issue, we propose an importance-sampling-based approach for selecting in-expert-distribution samples from the supplementary dataset. The proposed method theoretically eliminates the gap of the naive method. Empirical studies demonstrate that our method can perform better than prior state-of-the-art methods on tasks including locomotion control, Atari games, and object recognition.
Ziniu Li · Tian Xu · Zeyu Qin · Yang Yu · Zhiquan Luo

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Poster)
The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to find domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks. |
Sang Michael Xie · Hieu Pham · Xuanyi Dong · Nan Du · Hanxiao Liu · Yifeng Lu · Percy Liang · Quoc Le · Tengyu Ma · Adams Wei Yu
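A simplified sketch of the Group-DRO-style domain-weight update at the core of this approach. Proxy-model training and the final resampling step are omitted, and the domain names, losses, and hyperparameters below are illustrative, not taken from the paper.

```python
# Exponentiated-gradient update that upweights domains with large excess loss.
import numpy as np

def update_domain_weights(weights, proxy_losses, reference_losses,
                          step_size=1.0, smoothing=1e-3):
    """One Group-DRO-style step: upweight domains where the proxy model lags the reference."""
    excess = np.maximum(proxy_losses - reference_losses, 0.0)
    new_w = weights * np.exp(step_size * excess)
    new_w = new_w / new_w.sum()
    uniform = np.full_like(new_w, 1.0 / len(new_w))
    return (1 - smoothing) * new_w + smoothing * uniform   # keep every domain alive

domains = ["wikipedia", "books", "web", "code"]
weights = np.full(len(domains), 1.0 / len(domains))
proxy_losses = np.array([2.1, 2.8, 3.0, 1.9])       # toy per-domain losses
reference_losses = np.array([2.0, 2.5, 2.6, 2.0])
for _ in range(5):
    weights = update_domain_weights(weights, proxy_losses, reference_losses)
print(dict(zip(domains, weights.round(3))))
```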
How to Cope with Gradual Data Drift? (Poster)
Real-world deployment of machine learning models is challenging because data evolves over time. While no model can work when data evolves in an arbitrary fashion, if there is some pattern to these changes, we might be able to design methods to address it. This paper addresses situations when data evolves gradually. We introduce a time-varying propensity score that can detect gradual shifts in the distribution of data which allows us to selectively sample past data to update the model---not just similar data from the past like that of a standard propensity score but also data that evolved in a similar fashion in the past. The time-varying propensity score is quite general: we demonstrate different ways of implementing it and evaluate it on a variety of problems ranging from supervised learning (e.g., image classification problems) where data undergoes a sequence of gradual shifts, to reinforcement learning tasks (e.g., robotic manipulation and continuous control) where data shifts as the policy or the task changes. |
Rasool Fakoor · Jonas Mueller · Zachary Lipton · Pratik Chaudhari · Alex Smola

Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction (Poster)
The data-centric AI approach aims to enhance model performance without modifying the model and has been shown to impact model performance positively. While recent attention has been given to data-centric AI based on synthetic data, due to its potential for performance improvement, data-centric AI has long been validated exclusively using real-world data and publicly available benchmark datasets. In this respect, data-centric AI still depends heavily on real-world data, and the verification of models using synthetic data has not yet been thoroughly carried out. Given the challenges above, we ask the question: does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data? To address this question, we conducted comparative analyses between models trained on synthetic and real-world data for the grammatical error correction (GEC) task. Our experimental results reveal that the data quality control method has a positive impact on models trained with real-world data, as previously reported in existing studies, while a negative impact is observed in models trained solely on synthetic data.
Chanjun Park · Seonmin Koo · Seolhwa Lee · Jaehyung Seo · Sugyeong Eo · Hyeonseok Moon · HEUISEOK LIM

Programmable Synthetic Tabular Data Generation (Poster)
Large amounts of tabular data remain underutilized due to privacy, data quality, and data sharing limitations. While training a generative model producing synthetic data resembling the original distribution addresses some of these issues, most applications require additional constraints from the generated data. Existing synthetic data approaches are limited as they typically only handle specific constraints, e.g., differential privacy (DP) or increased fairness, and lack an accessible interface for declaring general specifications. In this work, we introduce ProgSyn, the first programmable synthetic tabular data generation algorithm that allows for comprehensive customization over the generated data. To ensure high data quality while adhering to custom specifications, ProgSyn pre-trains a generative model on the original dataset and fine-tunes it on a differentiable loss automatically derived from the provided specifications. These can be programmatically declared using statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). We conduct an extensive experimental evaluation of ProgSyn on a number of constraints, achieving a new state-of-the-art on some, while remaining general. For instance, at the same fairness level we achieve 2.3% higher downstream accuracy than the state-of-the-art in fair synthetic data generation on the Adult dataset. Overall, ProgSyn provides a versatile and accessible framework for generating constrained synthetic tabular data, allowing for specifications that generalize beyond the capabilities of prior work. |
Mark Vero · Mislav Balunovic · Martin Vechev

Unitail: A Benchmark for Detecting, Reading, and Matching in Retail Scene (Poster)
In order to fully utilize computer vision technology in retail stores, we present the United Retail Datasets (Unitail), an extensive large-scale benchmark of basic visual tasks on products that challenge algorithms for detecting, reading, and matching. The Unitail includes 1.8M quadrilateral-shaped instances annotated to improve product detection and offers a gallery-style OCR dataset comprising 1454 product categories, 30k text regions, and 21k transcriptions to enable reliable text recognition of products and encourage advanced product matching. In addition to evaluating the datasets using different state-of-the-art methods, we have developed a customized product detection model and a straightforward OCR-based matching solution, both of which demonstrate their effectiveness. |
Fangyi Chen · Han Zhang · Hao Chen · Kai Hu · Jiachen Dou · zaiwang li · Chenchen Zhu · Marios Savvides

Understanding Unfairness via Training Concept Influence (Poster)
Knowing the causes of a model's unfairness helps practitioners better understand their data and algorithms. This is an important yet relatively unexplored task. We look into this problem through the lens of the training data - one of the major sources of unfairness. We ask the following questions: how would a model's fairness performance change if, in its training data, some samples (1) were collected from a different (e.g. demographic) group, (2) were labeled differently, or (3) some features were changed? In other words, we quantify the fairness influence of training samples by counterfactually intervening and changing samples based on predefined concepts, i.e. data attributes such as features (X), labels (Y), or sensitive attributes (A). To calculate a training sample's influence on the model's unfairness w.r.t a concept, we first generate counterfactual samples based on the concept, i.e. the counterfactual versions of the sample if the concept were changed. We then calculate the resulting impact on the unfairness, via influence function, if the counterfactual samples were used in training. Our framework not only helps practitioners understand the observed unfairness and repair their training data, but also leads to many other applications, e.g. detecting mislabeling, fixing imbalanced representations, and detecting fairness-targeted poisoning attacks. |
Yuanshun Yao · Yang Liu

Promises and Pitfalls of Threshold-based Auto-labeling (Poster)
Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets. |
Harit Vishwakarma · Heguang Lin · Frederic Sala · Ramya Korlakai Vinayak
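A rough sketch of the TBAL workflow described above: use human-labeled validation data to pick a confidence threshold that meets a target accuracy, then machine-label only the pool points above it. Variable names and the toy data are the editor's, not the authors'.

```python
# Threshold-based auto-labeling: calibrate on validation data, then auto-label the pool.
import numpy as np

def find_threshold(val_conf, val_correct, target_accuracy=0.95):
    """Smallest confidence threshold whose validation accuracy meets the target."""
    for t in np.sort(val_conf):
        mask = val_conf >= t
        if mask.any() and val_correct[mask].mean() >= target_accuracy:
            return t
    return np.inf   # no threshold is safe; auto-label nothing

rng = np.random.default_rng(0)
val_conf = rng.uniform(0.5, 1.0, size=500)
val_correct = rng.random(500) < val_conf          # toy: higher confidence, more often correct
threshold = find_threshold(val_conf, val_correct)

pool_conf = rng.uniform(0.5, 1.0, size=10_000)
auto_labeled = pool_conf >= threshold
print(threshold, auto_labeled.mean())             # threshold and fraction auto-labeled
```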
Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors (Poster)
We present a straightforward statistical test to detect certain violations of the assumption that the data are Independent and Identically Distributed (IID). The specific form of violation considered is common across real-world applications: whether the examples are ordered in the dataset such that almost adjacent examples tend to have more similar feature values (e.g. due to distributional drift, or attractive interactions between datapoints). Based on a k-Nearest Neighbors estimate, our approach can be used to audit any multivariate numeric data as well as other data types (image, text, audio, etc.) that can be numerically represented, perhaps via model embeddings. Compared with existing methods to detect drift or auto-correlation, our approach is both applicable to more types of data and also able to detect a wider variety of IID violations in practice.
Jesse Cummings · Jonas Mueller · Elías Snorrason
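The sketch below shows a simplified variant of this idea (an editor's illustration, not the exact published statistic): if examples that are close in feature space are also suspiciously close in dataset order, the mean index gap between k-nearest-neighbor pairs will be small relative to a permutation baseline.

```python
# Permutation test: are k-NN pairs closer in dataset order than chance allows?
import numpy as np
from sklearn.neighbors import NearestNeighbors

def index_gap_pvalue(features, k=5, n_permutations=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(features)
    _, neighbor_idx = NearestNeighbors(n_neighbors=k + 1).fit(features).kneighbors(features)
    neighbor_idx = neighbor_idx[:, 1:]                       # drop the self-match
    observed = np.abs(neighbor_idx - np.arange(n)[:, None]).mean()
    null = np.empty(n_permutations)
    for b in range(n_permutations):
        perm = rng.permutation(n)                            # break any order/feature link
        null[b] = np.abs(perm[neighbor_idx] - perm[:, None]).mean()
    return (np.sum(null <= observed) + 1) / (n_permutations + 1)

rng = np.random.default_rng(1)
drifting = rng.normal(loc=np.linspace(0, 3, 2000)[:, None], size=(2000, 4))  # gradual drift
shuffled = rng.permutation(drifting)                          # same data, IID ordering
print(index_gap_pvalue(drifting), index_gap_pvalue(shuffled))  # small vs. large p-value
```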
Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation (Poster)
In this paper, we introduce a data-driven approach for Formality-Sensitive Machine Translation (FSMT) that caters to the unique linguistic properties of four target languages. Our methodology centers on two core strategies: 1) language-specific data handling, and 2) synthetic data generation using large-scale language models and empirical prompt engineering. This approach demonstrates a considerable improvement over the baseline, highlighting the effectiveness of data-centric techniques. Our prompt engineering strategy further improves performance by producing superior synthetic translation examples. |
Seungjun Lee · Hyeonseok Moon · Chanjun Park · HEUISEOK LIM

Prioritized Trajectory Replay: A Replay Memory for Data-driven Reinforcement Learning (Poster)
In recent years, data-driven reinforcement learning (RL), also known as offline RL, has gained significant attention. However, the role of data sampling techniques in offline RL has been overlooked despite its potential to enhance online RL performance. Recent research suggests that applying sampling techniques directly to state-transitions does not consistently improve performance in offline RL. Therefore, in this study, we propose a memory technique, (Prioritized) Trajectory Replay (TR/PTR), which extends the sampling perspective to trajectories for more comprehensive information extraction from limited data. TR enhances learning efficiency by backward sampling of trajectories that optimizes the use of subsequent state information. Building on TR, we build a weighted critic target to avoid sampling unseen actions in offline training, and Prioritized Trajectory Replay (PTR), which enables more efficient trajectory sampling, prioritized by various trajectory priority metrics. We demonstrate the benefits of integrating TR and PTR with existing offline RL algorithms on D4RL. In summary, our research emphasizes the significance of trajectory-based data sampling techniques in enhancing the efficiency and performance of offline RL algorithms.
Jinyi Liu · Yi Ma · Jianye Hao · Yujing Hu · Yan Zheng · Tangjie Lv · Changjie Fan

CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training (Poster)
Recent research on online Gradient Balancing (GraB) reveals that there exist permutation-based data example orders that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training data examples, GraB leverages information in stale example gradients from prior epochs to order examples for the next epoch, achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms baselines empirically, including distributed RR, on a variety of benchmark tasks.
A. Feder Cooper · Wentao Guo · Duc Khiem Pham · Tiancheng Yuan · Charlie Ruan · Yucheng Lu · Chris De Sa

Data-Centric Defense: Shaping Loss Landscape with Augmentations to Counter Model Inversion (Poster)
Machine learning models have shown susceptibility to various privacy attacks such as model inversion. Current defense techniques are mostly model-centric, which are computationally expensive and often result in a significant privacy-utility tradeoff. This paper proposes a novel data-centric approach to mitigate model inversion attacks, which offers the unique advantage of enabling each individual user to control their data's privacy risk. We introduce several privacy-focused data augmentations which make it challenging for attackers to generate private target samples. We provide theoretical analysis and evaluate our approach against state-of-the-art model inversion attacks. Specifically, in standard face recognition benchmarks, we reduce face reconstruction success rates to ≤1%, while maintaining high utility with only a 2% classification accuracy drop, significantly surpassing state-of-the-art model-centric defenses. This is the first study to propose a data-centric approach for mitigating model inversion attacks, showing promising potential for decentralized privacy protection.
Si Chen · Feiyang Kang · Nikhil Abhyankar · Ming Jin · Ruoxi Jia

Probing Heterogeneous Pretraining Datasets with Small Curated Datasets (Poster)
Language models rely on increasingly large web-scraped datasets for pretraining. The size of these datasets prevents manual curation, and existing automated quality filters are heuristic and limited. Characterizing these datasets is an open problem. We present preliminary work on documenting and visualizing pretraining datasets by mapping their similarity to downstream benchmark datasets, which are often hand-curated and more focused in style and content. We show this method finely characterizes popular pretraining datasets, supplementing existing characterizations that can be used for quality filtering. |
Gregory Yauney · Emily Reif · David Mimno

Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation (Poster)
Distribution shift is a major source of failure for machine learning models. However, evaluating model reliability under distribution shift can be challenging, especially since it may be difficult to acquire counterfactual examples that exhibit a specified shift. In this work, we introduce the notion of a dataset interface: a framework that, given an input dataset and a user-specified shift, returns instances from that input distribution that exhibit the desired shift. We study a number of natural implementations for such an interface, and find that they often introduce confounding shifts that complicate model evaluation. Motivated by this, we propose a new implementation that leverages Textual Inversion to tailor generation to the input distribution. We then demonstrate how applying this dataset interface to the ImageNet dataset enables studying model behavior across a diverse array of distribution shifts, including variations in background, lighting, and attributes of the objects. |
Joshua Vendrow · Saachi Jain · Logan Engstrom · Aleksander Madry

EPIC: Graph Augmentation with Edit Path Interpolation via Learnable Cost (Poster)
Graph-based models have become increasingly important in various domains, but the limited size and diversity of existing graph datasets often limit their performance. To address this issue, we propose EPIC (Edit Path Interpolation via learnable Cost), a novel interpolation-based method for augmenting graph datasets. Our approach leverages graph edit distance to generate new graphs that are similar to the original ones but exhibit some variation in their structures. To achieve this, we learn the graph edit distance through a comparison of labeled graphs and utilize this knowledge to create graph edit paths between pairs of original graphs. With randomly sampled graphs from a graph edit path, we enrich the training set to enhance the generalization capability of classification models. We demonstrate the effectiveness of our approach on several benchmark datasets and show that it outperforms existing augmentation methods in graph classification tasks. |
Jaeseung Heo · Seungbeom Lee · Sungsoo Ahn · Dongwoo Kim

Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline (Poster)
Automatic speech recognition (ASR) outcomes serve as input for downstream tasks, substantially impacting the satisfaction level of end-users. Hence, the diagnosis and enhancement of the vulnerabilities present in the ASR model bear significant importance. However, traditional evaluation methodologies of ASR systems generate a singular, composite quantitative metric, which fails to provide comprehensive insight into specific vulnerabilities. This lack of detail extends to the post-processing stage, resulting in further obfuscation of potential weaknesses. Despite an ASR model's ability to recognize utterances accurately, subpar readability can negatively affect user satisfaction, giving rise to a trade-off between recognition accuracy and user-friendliness. To effectively address this, it is imperative to consider both the speech-level, crucial for recognition accuracy, and the text-level, critical for user-friendliness. Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. This dataset, while considering both speech- and text-level, enables a granular understanding of the model's shortcomings. Our proposition provides a structured pathway for a more `real-world-centric' evaluation, a marked shift away from abstracted, traditional methods, allowing for the detection and rectification of nuanced system weaknesses, ultimately aiming for an improved user experience. |
Seonmin Koo · Chanjun Park · Jinsung Kim · Jaehyung Seo · Sugyeong Eo · Hyeonseok Moon · HEUISEOK LIM

Contrastive clustering of tabular data (Poster)
Contrastive self-supervised learning has significantly improved the performance of deep learning methods, such as representation learning and clustering. However, due to their dependence on data augmentation, these methods are mostly utilized in computer vision. In this paper, we investigate the adaptation of the recent contrastive clustering approach in the case of tabular data. Our experiments show that it outperforms typical clustering methods applicable to tabular data in most cases. Our findings affirm the potential adaptability of successful contrastive clustering techniques from other fields, such as image processing, to the realm of tabular data. |
Piotr Przemielewski · Witold Wydmański · Marek Śmieja

Investigating minimizing the training set fill distance in machine learning regression (Poster)
Many machine learning regression methods leverage large datasets for training predictive models. However, using large datasets may not be feasible due to computational limitations or high labelling costs. Therefore, sampling small training sets from large pools of unlabelled data points is essential to maximize model performance while maintaining computational efficiency. In this work, we study a sampling approach aimed to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error that linearly depends on the training set fill distance, conditional to the knowledge of data features. For empirical validation, we perform experiments using two regression models on two datasets. We empirically show that selecting a training set by aiming to minimize the fill distance, thereby minimizing the bound, significantly reduces the maximum prediction error of various regression models, outperforming existing sampling approaches by a large margin. |
Paolo Climaco · Jochen Garcke
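Greedy farthest-point selection is a standard way to approximately minimize the fill distance of a selected subset; the sketch below illustrates that baseline strategy and is not necessarily the exact algorithm used in the paper.

```python
# Greedy farthest-point sampling: each step adds the pool point farthest from the selection.
import numpy as np

def farthest_point_sampling(points, budget, seed=0):
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(points)))]
    dist_to_set = np.linalg.norm(points - points[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(dist_to_set))                 # current fill-distance witness
        selected.append(nxt)
        dist_to_set = np.minimum(dist_to_set,
                                 np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected), dist_to_set.max()          # indices and resulting fill distance

pool = np.random.default_rng(2).random((5000, 8))         # unlabelled candidate pool
idx, fill = farthest_point_sampling(pool, budget=100)
print(len(idx), round(float(fill), 3))
```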
Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning (Poster)
Methods for carefully selecting or generating a small set of training data to learn from, i.e., data pruning, coreset selection, and data distillation, have been shown to be effective in reducing the ever-increasing cost of training neural networks. Behind this success are rigorously designed strategies for identifying informative training examples out of large datasets. However, these strategies come with additional computational costs associated with subset selection or data distillation before training begins, and furthermore, many are shown to under-perform random sampling in high data compression regimes. As such, many data pruning, coreset selection, or distillation methods may not reduce 'time-to-accuracy', which has become a critical efficiency measure of training deep neural networks over large datasets. In this work, we revisit a powerful yet overlooked random sampling strategy to address these challenges and introduce an approach called Repeated Sampling of Random Subsets (RSRS or RS2), where we randomly sample the subset of training data for each epoch of model training. We test RS2 against thirty state-of-the-art data pruning and data distillation methods across four datasets including ImageNet. Our results demonstrate that RS2 significantly reduces time-to-accuracy compared to existing techniques. For example, when training on ImageNet in the high-compression regime (less than 10% of the data each epoch), RS2 yields accuracy improvements up to 29% compared to competing pruning methods while offering a runtime reduction of 7x. Beyond the above meta-study, we provide a convergence analysis for RS2 and discuss its generalization capability. The primary goal of our work is to establish RS2 as a competitive baseline for future data selection or distillation techniques aimed at efficient training. |
Patrik Okanovic · Roger Waleffe · Vasileios Mageirakos · Konstantinos Nikolakakis · Amin Karbasi · Dionysios Kalogerias · Nezihe Merve Gürel · Theodoros Rekatsinas
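Because RS2 is deliberately simple, it is easy to sketch: draw a fresh random subset of the training data every epoch instead of pruning once up front. The model, data, and hyperparameters below are placeholders, not the paper's setup.

```python
# Repeated sampling of random subsets: a new random 10% of the data each epoch.
import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))
subset_size = int(0.1 * len(dataset))
model = torch.nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    # Resample the training subset at the start of every epoch.
    subset_idx = np.random.choice(len(dataset), size=subset_size, replace=False).tolist()
    loader = DataLoader(dataset, batch_size=256, sampler=SubsetRandomSampler(subset_idx))
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```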
Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data (Poster)
The impressive advances and applications of large language and joint language-and-visual understanding models have led to an increased need for methods of probing their potential reasoning capabilities. However, the difficulty of gathering naturally occurring data for complex multi-modal reasoning tasks bottlenecks the evaluation of AI methods on tasks which are not already covered by an academic dataset. In this work, we leverage recent advances in high-resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks. We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset on a challenging task which is not well covered by existing datasets. We benchmark the performance of a state-of-the-art visual question answering (VQA) model against data generated with this method, and demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.
Nathan Vaska · Victoria Helus

Addressing Discrepancies in Semantic and Visual Alignment in Neural Networks (Poster)
For the task of image classification, neural networks primarily rely on visual patterns. In robust networks, we would expect for visually similar classes to be represented similarly. We consider the problem of when semantically similar classes are visually dissimilar, and when visual similarity is present among non-similar classes. We propose a data augmentation technique with the goal of better aligning semantically similar classes with arbitrary (non-visual) semantic relationships. We leverage recent work in diffusion-based semantic mixing to generate semantic hybrids of two classes, and these hybrids are added to the training set as augmented data. We evaluate whether the method increases semantic alignment by evaluating model performance on adversarially perturbed data, with the idea that it should be easier for an adversary to switch one class to a similarly represented class. Results demonstrate that there is an increase in alignment of semantically similar classes when using our proposed data augmentation method. |
Natalie Abreu · Nathan Vaska · Victoria Helus

Fair Machine Unlearning: Data Removal while Mitigating Disparities (Poster)
As public consciousness regarding the collection and use of personal information by corporations grows, it is of increasing importance that consumers be active participants in the curation of corporate datasets. In light of this, data governance frameworks such as the General Data Protection Regulation (GDPR) have outlined the right to be forgotten as a key principle allowing individuals to request that their personal data be deleted from the databases and models used by organizations. To achieve forgetting in practice, several machine unlearning methods have been proposed to address the computational inefficiencies of retraining a model from scratch with each unlearning request. While these methods are efficient online alternatives to retraining, it is unclear how they impact other properties critical to real-world applications, such as fairness. In this work, we propose the first fair machine unlearning method that can provably and efficiently unlearn data instances while preserving group fairness. We derive theoretical results which demonstrate that our method can provably unlearn data instances while maintaining fairness objectives. Extensive experimentation with real-world datasets highlights the efficacy of our method at unlearning data instances while preserving fairness.
Alex Oesterling · Jiaqi Ma · Flavio Calmon · Hima Lakkaraju

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning (Poster)
Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features---by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the ``medium scale'' of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. |
Pratyush Maini · Sachin Goyal · Zachary Lipton · Zico Kolter · Aditi Raghunathan

Do Machine Learning Models Learn Statistical Rules Inferred from Data? (Poster)
Machine learning models can make basic errors that are easily hidden within vast amounts of data. Such errors often run counter to rules based on human intuition. However, rules based on human knowledge are challenging to scale or even to formalize. We thereby seek to infer statistical rules from the data, and quantify the extent to which a model has learned them. We propose a framework SQRL that integrates logic-based methods with statistical inference to derive these rules from a model’s training data without supervision. We further show how to adapt models at test-time to reduce rule violations and produce more coherent predictions. In an object detection task, SQRL generates 252 rules without human supervision, which uncovers up to 8.1k violations of those rules by state-of-the-art object detection models. Test-time adaptation reduces these violations by up to 31.4% without impacting overall model accuracy. |
Aaditya Naik · Yinjun Wu · Mayur Naik · Eric Wong

Predicting Article Time Periods with Text2Time: A Transformer-based Approach (Poster)
The prediction of the publication period of textual documents, such as news articles, represents a significant and relatively understudied problem within the realm of natural language processing. Determining the year in which a news article was published holds relevance in various domains, including historical research, sentiment analysis, and media monitoring. In this research, our focus is on investigating the prediction of publication periods specifically for news articles, leveraging their textual content. To tackle this challenge, we curated an extensive labeled dataset consisting of over 350,000 news articles published by The New York Times over a span of six decades. This dataset forms the foundation of our investigation. Our approach involves utilizing a pretrained BERT model that has been fine-tuned for the task of text classification, specifically tailored for time period prediction. The performance of our model surpasses our initial expectations, demonstrating impressive results in accurately classifying news articles into their respective publication decades. Through rigorous evaluation, our model outperforms the baseline model for this relatively unexplored task of predicting time periods based on textual content. This research sheds light on the potential for effectively predicting the publication periods of news articles and presents promising outcomes achieved by leveraging a pretrained BERT model fine-tuned for time period classification. The results obtained contribute to the advancement of this underexplored task, demonstrating the viability and accuracy of time prediction from textual data.
KARTHICK GUNASEKARAN

Knowledge Graph-Augmented Korean Generative Commonsense Reasoning (Poster)
Generative commonsense reasoning refers to the task of generating acceptable and logical assumptions about everyday situations based on commonsense understanding. By utilizing an existing dataset such as Korean CommonGen, language generation models can learn commonsense reasoning specific to the Korean language. However, language models often fail to consider the relationships between concepts and the deep knowledge inherent to concepts. To address these limitations, we propose a method to utilize the Korean knowledge graph data for text generation. Our experimental result shows that the proposed method can enhance the efficiency of Korean commonsense inference, thereby underlining the significance of employing supplementary data. |
Dahyun Jung · Jaehyung Seo · Jaewook Lee · Chanjun Park · HEUISEOK LIM

Accelerating Batch Active Learning Using Continual Learning Techniques (Poster)
A major problem with Active Learning (AL) is high training costs since models are typically retrained from scratch after every query round. We start by demonstrating that standard AL on neural networks with warm starting fails, both to accelerate training and to avoid catastrophic forgetting when using fine-tuning over AL query rounds. We then develop a new class of techniques, circumventing this problem, by biasing further training towards previously labeled sets, thereby complementing existing work on AL acceleration. We accomplish this by employing existing, and developing novel, replay-based Continual Learning (CL) algorithms that are effective at quickly learning the new without forgetting the old, especially when data comes from an evolving distribution. We call this paradigm "Continual Active Learning" (CAL). We show CAL achieves significant speedups using a plethora of replay schemes that use model distillation and that select diverse/uncertain points from the history.We conduct experiments across many diverse data domains, including natural language, vision, medical imaging, and computational biology, each with very different neural architectures (transformers/CNNs/MLPs) and dataset sizes. CAL consistently provides a 3x reduction in training time, while retaining performance and out-of-distribution robustness, showing its wide applicability. |
Gantavya Bhatt · Arnav M Das · Rui Yang · Vianne Gao · Jeff Bilmes

RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting (Poster)
Large Language Models (LLMs) have demonstrated impressive zero-shot capabilities in long-form text generation tasks expressed through natural language instructions. However, user expectations for long-form text rewriting are high, and unintended rewrites ("hallucinations") produced by the model can negatively impact its overall performance. Existing evaluation benchmarks primarily focus on limited rewriting styles and sentence-level rewriting rather than long-form open-ended rewriting. We introduce OpenRewriteEval, a novel benchmark that covers a wide variety of rewriting types expressed through natural language instructions. It is specifically designed to facilitate the evaluation of open-ended rewriting of long-form texts. In addition, we propose a strong baseline model, RewriteLM, an instruction-tuned large language model for long-form text rewriting. We develop new strategies that facilitate the generation of diverse instructions and preference data with minimal human intervention. We conduct empirical experiments and demonstrate that our model outperforms the current state-of-the-art LLMs in text rewriting. Specifically, it excels in preserving the essential content and meaning of the source text, minimizing the generation of "hallucinated" content, while showcasing the ability to generate rewrites with diverse wording and structures.
Liangchen Luo · Lei Shu · Jayakumar Hoskere · Yun Zhu · Canoee Liu · Simon Tong · Jindong Chen · Lei Meng

Data Similarity is Not Enough to Explain Language Model Performance (Poster)
Large language models achieve high few-shot performance on many but not all downstream tasks. The interaction between pretraining and downstream data is commonly assumed to influence this variance: a task with data that is more similar to a model's pretraining dataset is assumed to be easier for that model. We test whether general textual similarity measures (embedding-, token- and model-based) correlate with large language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Surprisingly, we find no correlation between performance and similarity across various models and dataset similarities. |
Gregory Yauney · Emily Reif · David Mimno

Enhancing Time Series Forecasting Models under Concept Drift by Data-centric Online Ensembling (Poster)
Online updating of time series forecasting models aims to address the concept drift problem by efficiently updating forecasting models based on streaming data. Many algorithms have been proposed recently, with some exploiting cross-variable dependency while others assume independence among variables. Given that every data assumption has its own pros and cons in online time series modeling, we propose the Data-centric Online ensembling Network, which allows for the linear combination of the two models with dynamically adjusted weights based on the data bias. Empirical results show that the proposed network reduces online forecasting error by more than 50% compared to the State-Of-The-Art (SOTA) method.
Yi-Fan Zhang · Qingsong Wen · Xue Wang · Weiqi Chen · Liang Sun · Zhang Zhang · Liang Wang · Rong Jin · Tieniu Tan

A Privacy-Friendly Approach to Data Valuation (Poster)
Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. However, data valuation faces significant yet frequently overlooked privacy challenges despite its importance. This paper studies these privacy challenges with a focus on KNN-Shapley, one of the most practical data valuation methods nowadays. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical difficulties in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate a DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves comparable performance as KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data. The full version of the paper is attached in the Appendix.
Jiachen Wang · Yuqing Zhu · Yu-Xiang Wang · Ruoxi Jia · Prateek Mittal

Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources (Poster)
Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any size and data source composition using the limited available samples. However, these scaling functions are black-box, computationally expensive to fit, highly susceptible to overfitting, or/and difficult to optimize for data selection. This paper proposes a framework called "projektor", which predicts model performance and supports data selection decisions based on partial samples of prospective data sources. Our approach distinguishes itself from existing work by introducing a novel two-stage performance inference process. In the first stage, we leverage the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. In the second stage, we extrapolate the performance to larger undisclosed data sizes based on a novel parameter-free mapping technique inspired by neural scaling laws. We further derive an efficient gradient-based method to select data sources based on the projected model performance. Evaluation over a diverse range of applications demonstrates that projektor significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor. Also, projektor outperforms by a wide margin in data selection effectiveness compared to a range of other off-the-shelf solutions. |
Feiyang Kang · Hoang Anh Just · Anit Kumar Sahu · Ruoxi Jia

Improve Model Inference Cost with Image Gridding (Poster)
The success of AI has spurred the rise of Machine Learning as a Service (MLaaS), where companies develop, maintain, and serve general-purpose models such as object detectors and image classifiers for users that pay a fixed rate per inference. As more organizations rely on AI, the MLaaS market is set to expand, necessitating cost optimization for these services. We explore how a simple yet effective method of increasing model efficiency, aggregating multiple images into a grid before inference, can significantly reduce the required number of inferences for processing a batch of images with varying drops in accuracy. Experiments on open-source and commercial models show that image gridding reduces inferences by 50%, while maintaining low impact on mean average precision (mAP) over the Pascal VOC object detection task. |
Shreyas Krishnaswamy · Lisa Dunlap · Lingjiao Chen · Matei Zaharia · James Zou · Joseph Gonzalez
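An illustrative sketch of the gridding idea: tile four images onto one canvas, run a single inference, and map detections back to their source image by offset. The `detect` call is a placeholder for whatever detection API is being paid for, and the box handling is deliberately simplistic (boxes that span tiles are not split).

```python
# Tile a batch of equal-sized images into one grid to trade 4 inferences for 1.
from PIL import Image

def make_grid(images, cols=2):
    w, h = images[0].size                      # assume equal-sized images
    rows = (len(images) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * w, rows * h))
    offsets = []
    for i, img in enumerate(images):
        dx, dy = (i % cols) * w, (i // cols) * h
        canvas.paste(img, (dx, dy))
        offsets.append((dx, dy))
    return canvas, offsets

def ungrid_boxes(boxes, offsets, size):
    """Assign each (x1, y1, x2, y2) box to the tile containing its top-left corner."""
    w, h = size
    per_image = [[] for _ in offsets]
    for x1, y1, x2, y2 in boxes:
        tile = next(i for i, (dx, dy) in enumerate(offsets)
                    if dx <= x1 < dx + w and dy <= y1 < dy + h)
        dx, dy = offsets[tile]
        per_image[tile].append((x1 - dx, y1 - dy, x2 - dx, y2 - dy))
    return per_image

images = [Image.new("RGB", (640, 480), c) for c in ("red", "green", "blue", "white")]
grid, offsets = make_grid(images)
# boxes = detect(grid)   # one inference instead of four; `detect` is hypothetical
boxes = [(10, 10, 50, 50), (700, 20, 760, 90)]   # stand-in detections on the grid
print(ungrid_boxes(boxes, offsets, images[0].size))
```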
THOS: A Benchmark Dataset for Targeted Hate and Offensive Speech (Poster)
Detecting harmful content on social media, such as Twitter, is made difficult by the fact that the seemingly simple yes/no classification conceals a significant amount of complexity.Unfortunately, while several datasets have been collected for training classifiers in hate and offensive speech, there is a scarcity of datasets labeled with a finer granularity of target classes and specific targets. In this paper, we introduce THOS, a dataset of 8.3k tweets manually labeled with fine-grained annotations about the target of the message. We demonstrate that this dataset makes it feasible to train classifiers, based on Large Language Models, to perform classification at this level of granularity. |
Saad Almohaimeed · Saleh Almohaimeed · Saleh Almohaimeed · Ashfaq Ali Shafin · Bogdan Carbunar · Ladislau Boloni

On Robustness-Accuracy Characterization of Large Language Models using Synthetic Datasets (Poster)
Despite the impressive capability of large language models (LLMs) in solving different downstream tasks, new concerns about proper performance evaluation have been raised, especially regarding test-data leakage caused by accidentally including test data during pretraining, or by indirectly exposing it through API calls for evaluation. Motivated by these concerns, in this paper we propose a new evaluation workflow that generates steerable synthetic language datasets and proxy tasks for benchmarking the performance of pretrained LLMs on sentence classification tasks. This approach allows for better characterization of the joint analysis of the robustness and accuracy of LLMs without risking sensitive information leakage. Verified on various pretrained LLMs, the proposed approach demonstrates a promising high correlation with real downstream performance.
Ching-Yun (Irene) Ko · Pin-Yu Chen · Payel Das · Yung-Sung Chuang · Luca Daniel

Partial Label Learning meets Active Learning: Enhancing Annotation Efficiency through Binary Questioning (Poster)
Supervised learning is an effective approach to machine learning, but it can be expensive to acquire labeled data. Active learning (AL) and partial label learning (PLL) are two techniques that can be used to reduce the annotation costs of supervised learning. AL is a strategy for reducing the annotation budget by selecting and labeling the most informative samples, while PLL is a weakly supervised learning approach to learn from partially annotated data by identifying the true hidden label. In this paper, we propose a novel approach that combines AL and PLL techniques to improve annotation efficiency. Our method leverages AL to select informative binary questions and PLL to identify the true label from the set of possible answers. We conduct extensive experiments on various benchmark datasets and show that our method achieves state-of-the-art (SoTA) performance with significantly reduced annotation costs. Our findings suggest that our method is a promising solution for cost-effective annotation in real-world applications. |
Shivangana Rawat · Chaitanya Devaguptapu · Vineeth Balasubramanian 🔗 |
-
|
Towards an Efficient Algorithm for Time Series Forecasting with Anomalies
(
Poster
)
Most time series forecasting techniques assume that the training data is clean and free of anomalies. This assumption is unrealistic since collected time series data can be contaminated in practice. The forecasting model will be inferior if it is trained directly on time series with anomalies. In this paper, we aim to develop methods that automatically learn a robust forecasting model from a data-centric perspective. Specifically, we first statistically define three types of anomalies in time series data, then theoretically and experimentally analyze the loss robustness and sample robustness when these anomalies exist. Based on our analyses, we propose a simple and efficient algorithm to learn a robust forecasting model which outperforms all existing approaches. |
Hao Cheng · Qingsong Wen · Yang Liu · Liang Sun 🔗 |
-
|
Towards Declarative Systems for Data-Centric Machine Learning
(
Poster
)
We argue for a declarative approach to simplify the application of data-centric ML in real-world scenarios, and present our prototypical system AutoDC, which takes a first step in this direction. |
Stefan Grafberger · Bojan Karlaš · Paul Groth · Sebastian Schelter 🔗 |
-
|
Data Banzhaf: A Robust Data Valuation Framework for Machine Learning
(
Poster
)
Data valuation, a growing field dedicated to measuring the usefulness of individual data sources in training machine learning (ML) models, plays a critical role in data-centric ML research; it has wide-ranging applications from improving data quality to incentivizing data sharing. This paper studies the robustness of data valuation techniques to noisy model performance scores. In particular, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a well-known value notion originating in the cooperative game theory literature, achieves the largest safety margin among a large class of value notions. Our evaluation demonstrates that the Banzhaf value outperforms existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the well-known Shapley value given its computational advantage and ability to robustly differentiate data quality. |
Jiachen Wang · Ruoxi Jia 🔗 |
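The Banzhaf value itself has a compact definition: a point's value is its expected marginal contribution to a utility function over subsets in which every other point is included independently with probability 1/2. The sketch below is a naive Monte Carlo estimator of that definition, not the paper's estimator; `utility` is a placeholder for training a model on a subset and returning a validation score.

```python
# Naive Monte Carlo Banzhaf values for n_points training examples.
import numpy as np

def banzhaf_values(n_points, utility, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    values = np.zeros(n_points)
    for i in range(n_points):
        contribs = []
        for _ in range(n_samples):
            mask = rng.random(n_points) < 0.5   # include each point w.p. 1/2
            mask[i] = False                     # subset S that excludes point i
            subset = np.flatnonzero(mask)
            with_i = np.append(subset, i)
            # utility(indices) -> e.g. validation accuracy of a model trained
            # on those indices (placeholder, supplied by the caller)
            contribs.append(utility(with_i) - utility(subset))
        values[i] = np.mean(contribs)
    return values
```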
-
|
No Imputation without Representation
(
Poster
)
By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical reasons why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. In a first follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance. In a second follow-up experiment, we evaluate numerical imputation of one-hot encoded categorical attributes. We reach the following conclusions. Firstly, missing-indicators generally increase classification performance. Secondly, with missing-indicators, nearest neighbour and iterative imputation do not lead to better performance than simple mean/mode imputation. Thirdly, for decision trees, pruning is necessary to prevent overfitting. Fourthly, the thresholds above which missing-indicators are more likely than not to improve performance are lower for categorical attributes than for numerical attributes. Lastly, mean imputation of numerical attributes preserves some of the information from missing values. Consequently, when not using missing-indicators it can be advantageous to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation. |
Oliver Lenz · Daniel Peralta · 🔗 |
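As a concrete illustration of combining imputation with missing-indicators (not the paper's experimental code), scikit-learn exposes this directly via `SimpleImputer(add_indicator=True)`:

```python
# Mean-impute numerical features while appending one binary indicator
# column per feature that contained missing values.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0],
              [np.nan, 3.0],
              [4.0, np.nan]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)
# X_out contains the two imputed columns followed by two missing-indicator
# columns, so downstream classifiers retain the missingness information.
print(X_out)
```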
-
|
L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models
(
Poster
)
The exploration of sentiment analysis in low-resource languages, such as Marathi, has been limited by the lack of suitable datasets. In this work, we present L3Cube-MahaSent-MD, a multi-domain Marathi sentiment analysis dataset with four different domains - movie reviews, general tweets, TV show subtitles, and political tweets. The dataset consists of around 60,000 manually tagged samples covering 3 distinct sentiments - positive, negative, and neutral. We create a sub-dataset for each domain comprising 15k samples. MahaSent-MD is the first comprehensive multi-domain sentiment analysis dataset within the Indic sentiment landscape. We fine-tune different monolingual and multilingual BERT models on these datasets and report the best accuracy with the MahaBERT model. We also present an extensive in-domain and cross-domain analysis, thus highlighting the need for low-resource multi-domain datasets. |
Aabha Pingle · Aditya Vyawahare · Isha Joshi · Rahul Tangsali · Raviraj Joshi 🔗 |
-
|
Point Cloud Classification with ModelNet40: What is left?
(
Poster
)
State-of-the-art 3D classification models are showing saturating performance on the popular ModelNet40 benchmark. We investigate possible causes for the remaining mistakes and find various data-related issues. In summary, our goal is 1) to give suggestions for future dataset creation in 3D deep learning and 2) to provide ground-truth information on mistakes for evaluation of (future) automated data cleaning methods. |
Jarne Van den Herrewegen · Tom Tourwé · Francis Wyffels 🔗 |
-
|
Does Progress On Object Recognition Benchmarks Improve Real-World Generalization?
(
Poster
)
Recent advances in foundation models, trained on orders of magnitude more data, have begun to saturate standard generalization benchmarks, which tend to focus on predefined or synthetic alterations of images. Despite this progress, even today’s best models are brittle in practice. Consequently, we propose studying generalization across geography as a more realistic measure of progress using two datasets of objects from households across the globe. We conduct an extensive empirical evaluation of nearly 100 vision models, including the most recent foundation models. We examine both the rate of progress and disparities in performance not revealed by average accuracy. We first identify a progress gap between standard benchmarks and real-world, geographical shifts: progress on ImageNet results in up to 2.5x more progress on standard generalization benchmarks than real-world distribution shifts. Second, we study model generalization across geographies by measuring the disparities in performance across regions, a more fine-grained measure of real world generalization. We observe all models have large geographic disparities, even foundation CLIP models, with differences of 7% - 20% in accuracy between regions. Counter to modern intuition, we discover progress on standard benchmarks fails to improve geographic disparities and in many cases exacerbates them: geographic disparities between the least performant models and today's best models have more than tripled. Our results suggest scaling alone is insufficient for consistent robustness to real-world distribution shifts. We highlight the need for more representative benchmarking and more precise measures of generalization progress. |
Megan Richards · Diane Bouchacourt · Mark Ibrahim · Polina Kirichenko 🔗 |
-
|
In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation
(
Poster
)
Out-of-distribution (OOD) detection is the problem of identifying inputs which are unrelated to the in-distribution task. OOD detection performance when the in-distribution (ID) is ImageNet-1K is commonly tested on a small range of test OOD datasets. We find that most of the currently used test OOD datasets, including datasets from the open set recognition (OSR) literature, have severe issues: in some cases more than 50% of the dataset contains objects belonging to one of the ID classes. These erroneous samples heavily distort the evaluation of OOD detectors. As a solution, we introduce NINCO, a novel test OOD dataset in which each sample is checked to be free of ID objects; its fine-grained range of OOD classes allows for a detailed analysis of an OOD detector's strengths and failure modes, particularly when paired with a number of synthetic “OOD unit-tests”. We provide detailed evaluations across a large set of architectures and OOD detection methods on NINCO and the unit-tests, revealing new insights about model weaknesses and the effects of pretraining on OOD detection performance. We provide code and data at https://github.com/NINCO-Dataset/NINCO. |
Julian Bitterwolf · Maximilian Müller · Matthias Hein 🔗 |
-
|
Localized Data Work as a Precondition for Data-Centric ML: A Case Study of Full Lifecycle Crop Disease Identification in Ghana
(
Poster
)
The Ghana Cashew Disease Identification with Artificial Intelligence (CADI AI) project demonstrates the importance of sound data work as a precondition for the delivery of useful, localized data-centric solutions for public good tasks such as agricultural productivity and food security. Drone-collected data and machine learning are utilized to determine crop stressors. Data, model and the final app are developed jointly and made available to local farmers via a desktop application. |
Darlington Akogo · Issah Samori · Cyril Akafia · Harriet Fiagbor · Andrews Kangah · Donald Donald · Kwabena Fuachie · Luis Oala 🔗 |
-
|
On the Reproducibility of Data Valuation under Learning Stochasticity
(
Poster
)
Data valuation, which quantifies how individual data points contribute to machine learning (ML) model training, is an important question in data-centric ML research and has empowered a broad variety of applications. Popular data value notions such as the Shapley value are computed from the performance scores of models trained on different data subsets. Recent studies, however, reveal that stochasticity in neural network training algorithms can adversely affect the consistency of data value rankings. Yet how to effectively mitigate the impact of the perturbations arising from model training remains an open question. This work introduces TinyMV, a new data value notion developed for improved reproducibility against stochasticity stemming from stochastic gradient descent (SGD) or its variants. TinyMV is inspired by a surprising yet consistent pattern of learning stochasticity from SGD: the signal-to-noise ratio (SNR) of a model's performance change caused by the addition of a training point is maximized on very small datasets (e.g., <=15 data points for CIFAR10). Our experiments demonstrate that TinyMV exhibits state-of-the-art reproducibility and surpasses existing data valuation techniques across a broad range of applications. |
Jiachen Wang · Feiyang Kang · Chiyuan Zhang · Ruoxi Jia · Prateek Mittal 🔗 |
-
|
On the Usefulness of Synthetic Tabular Data Generation
(
Poster
)
Despite recent advances in synthetic data generation, the scientific community still lacks a unified consensus on its usefulness. It is commonly believed that synthetic data can be used for both data exchange and boosting machine learning (ML) training. Privacy-preserving synthetic data generation can accelerate data exchange for downstream tasks, but there is not enough evidence to show how or why synthetic data can boost ML training. In this study, we benchmarked ML performance using synthetic tabular data for three use cases: data augmentation, class balancing, and data summarization. We observed marginal improvements for the balancing use case on some datasets. However, we conclude that there is not enough evidence to claim that synthetic tabular data is useful for ML training. |
Dionysis Manousakas · Sergul Aydore 🔗 |
-
|
Bayesian Optimisation Against Climate Change: Applications and Benchmarks
(
Poster
)
Bayesian optimisation is a powerful method for optimising black-box functions, popular in settings where the true function is expensive to evaluate and no gradient information is available. Bayesian optimisation can improve responses to many optimisation problems within climate change for which simulator models are unavailable or expensive to sample from. While there have been several feasibility demonstrations of Bayesian optimisation in climate-related applications, there has been no unifying review of applications and benchmarks. We provide such a review here, to encourage the use of Bayesian optimisation in important and well-suited application domains. We identify four main application domains: material discovery, wind farm layout, optimal renewable control and environmental monitoring. Our contributions are: a) identifying a representative range of benchmarks, providing example code where necessary; b) introducing a new benchmark, LAQN-BO; and c) promoting a wider use of climate change applications among Bayesian optimisation practitioners. |
Sigrid Passano Hellan · Chris Lucas · Nigel Goddard 🔗 |
-
|
Suboptimal Data Can Bottleneck Scaling
(
Poster
)
Deep learning has been shown to reliably improve in performance on supervised learning tasks when scaling up data, compute, and parameters. In this work, we argue that properly understanding the impact of scale requires a nuanced understanding of dataset composition. Towards this end, we design experiments in the domain of offline reinforcement learning to disentangle the effects of data quantity and quality. Our results comprehensively confirm that performance is bottlenecked by the quality of the data, even in the limit of parameters, compute, and dataset size. Furthermore, we show that the performance of offline reinforcement learning algorithms obeys reliable scaling laws in these settings, allowing performance-at-scale to be extrapolated from a smaller set of experiments. |
Jacob Buckman · Kshitij Gupta · Ethan Caballero · Rishabh Agarwal · Marc Bellemare 🔗 |
-
|
Speech Wikimedia: A 77 Language Multilingual Speech Dataset
(
Poster
)
The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models. |
Rafael Mosquera Gómez · Julian Eusse · Juan Ciro · Daniel Galvez · Ryan Hileman · Kurt Bollacker · David Kanter 🔗 |
-
|
Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least
(
Poster
)
Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data. As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations. This enables efficient SSL by reducing the volume of data required. Nevertheless, quantifying the value of examples for SSL has remained an open question. In this work, we address this problem for the first time, by proving that examples that contribute the most to contrastive SSL are those that have the most similar augmentations to other examples, in expectation. We provide rigorous guarantees for the generalization performance of contrastive learning on such subsets. Through extensive experiments, we show that we can safely exclude 20% of examples from CIFAR100 and 40% from STL10 and TinyImageNet, without affecting downstream task performance. Additionally, we show that subsets selected by our method outperform random subsets by over 3% across these datasets. Interestingly, we also discover the subsets that contribute the most to contrastive learning are those that contribute the least to supervised learning. |
Siddharth Joshi · Baharan Mirzasoleiman 🔗 |
-
|
Active learning for time instant classification
(
Poster
)
Active learning is a common strategy for reducing the dependency of model training on large labeled datasets by selecting only the most useful data for labeling. In this work, we consider the problem of actively selecting labels for time instant classification using neural network classifiers. We propose a novel method that selects samples based on a combination of factors that includes uncertainty, diversity, and data density. The performance of the proposed method is demonstrated on synthetic and robot activity datasets. |
Nauman Ahad · Namrata Nadagouda · Eva Dyer · Mark Davenport 🔗 |
-
|
Prediction without Preclusion Recourse Verification with Reachable Sets
(
Poster
)
Machine learning models are often used to decide who will receive a loan, a job interview, or a public service. Standard techniques to build these models use features that characterize people but overlook their actionability. In domains like lending and hiring, models can assign predictions that are fixed – meaning that consumers who are denied loans and interviews are permanently locked out from access to credit and employment. In this work, we introduce a formal testing procedure to flag models that assign these "predictions without recourse," called recourse verification. We develop machinery to reliably test the feasibility of recourse for any model given user-specified actionability constraints. We take a data-centric approach to demonstrate how these tools can ensure recourse and adversarial robustness in real-world datasets and use them to study the infeasibility of recourse in real-world lending datasets. Our results highlight how models can inadvertently assign fixed predictions that permanently bar access and the need to design algorithms that account for actionability when developing models and providing recourse. |
Avni Kothari · Berk Ustun · Lily Weng · Bogdan Kulynych 🔗 |
-
|
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
(
Poster
)
The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, it is unclear what data to best select for the model's performance across tasks. To study this, we develop a new framework based on a simple hypothesis: similar to how humans acquire interdependent skills in a deliberate order, there exists a natural order in how the LM best learns a set of skills from its training data. If such order exists, it can be exploited for improved understanding of LMs and data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of their associated data. We demonstrate that these ordered skill sets exist on synthetic and real data, and their existence enables skills to be learned with less data given that we train on their prerequisite skills. Using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for learning skills more quickly for both continual pre-training and fine-tuning regimes, where we aim to learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic dataset in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on the skill itself. We apply our skills framework on the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than uniform sampling over data sources with 3B tokens. |
Mayee Chen · Nicholas Roberts · Kush Bhatia · Jue Wang · Ce Zhang · Frederic Sala · Christopher Ré 🔗 |
-
|
Birds of an Odd Feather: Guaranteed Out-of-Distribution (OOD) Novel Category Detection
(
Poster
)
In this work, we solve the problem of novel category detection under distribution shift. This problem is critical to ensuring the safety and efficacy of machine learning models, particularly in domains such as healthcare where timely detection of novel categories of patients is crucial. To address this problem, we propose a method based on constrained learning. Our approach is guaranteed to detect a novel category under a relatively weak assumption, namely that rare events in past data have bounded frequency under the shifted distribution. Prior works on the problem do not provide such guarantees, as they either attend to very specific types of distribution shift or make stringent assumptions that limit their guarantees. We demonstrate favorable performance of our method on challenging novel category detection problems over real world datasets. |
Yoav Wald · Suchi Saria 🔗 |
-
|
Mobile Internet Quality Estimation using Self-Tuning Kernel Regression
(
Poster
)
Modeling and estimation for spatial data are ubiquitous in real life, frequently appearing in weather forecasting, pollution detection, and agriculture. Spatial data analysis often involves processing datasets of enormous scale. In this work, we focus on large-scale internet-quality open datasets from Ookla. We look into estimating mobile (cellular) internet quality at the scale of a state in the United States. In particular, we aim to conduct estimation based on highly imbalanced data: most of the samples are concentrated in limited areas, while very few are available in the rest, posing significant challenges to modeling efforts. We propose a new adaptive kernel regression approach that employs self-tuning kernels to alleviate the adverse effects of data imbalance in this problem. Through comparative experimentation on two distinct mobile network measurement datasets, we demonstrate that the proposed self-tuning kernel regression method produces more accurate predictions, with the potential to be applied in other applications. |
Hanyang Jiang · Yao Xie · Ellen Zegura · Elizabeth Belding · Shaowu Yuchi 🔗 |
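To make the idea of self-tuning kernels concrete, the sketch below shows one common way to adapt bandwidths to local data density, setting each query's bandwidth from its k-th nearest-neighbour distance. This is an illustrative stand-in under that assumption, not the authors' estimator.

```python
# Kernel regression with a per-query bandwidth chosen from local density:
# sparse regions get wider kernels than dense ones.
import numpy as np

def adaptive_kernel_regression(X_train, y_train, X_query, k=10):
    preds = []
    for x in X_query:
        d = np.linalg.norm(X_train - x, axis=1)
        h = np.sort(d)[k]                       # local bandwidth from k-NN distance
        w = np.exp(-(d / (h + 1e-12)) ** 2)     # Gaussian weights
        preds.append(np.sum(w * y_train) / (np.sum(w) + 1e-12))
    return np.array(preds)
```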
-
|
Estimating label quality and errors in semantic segmentation data via any model
(
Poster
)
The labor-intensive annotation process of semantic segmentation datasets is often prone to errors, since humans struggle to label every pixel correctly. We study algorithms to automatically detect such annotation errors, in particular methods to score label quality, such that the images with the lowest scores are least likely to be correctly labeled. This helps prioritize what data to review in order to ensure a high-quality training/evaluation dataset, which is critical in sensitive applications such as medical imaging and autonomous vehicles. Widely applicable, our label quality scores rely on probabilistic predictions from a trained segmentation model -- any model architecture and training procedure can be utilized. Here we study 7 different label quality scoring methods used in conjunction with either a DeepLabV3+ or FPN segmentation model to detect annotation errors in a version of the SYNTHIA dataset. Precision-recall evaluations reveal a score -- the soft-minimum of the model-estimated likelihoods of each pixel's annotated class -- that is particularly effective to identify images that are mislabeled, across multiple types of annotation error. |
Vedang Lad · Jonas Mueller 🔗 |
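The best-performing score described above has a simple form: gather each pixel's predicted probability for its annotated class and aggregate with a soft minimum. A minimal sketch follows; the temperature value and NumPy interface are illustrative assumptions.

```python
# Image-level label-quality score: soft minimum over pixels of the
# model-estimated probability of each pixel's annotated class.
import numpy as np

def softmin_label_quality(probs, labels, temperature=0.1):
    """probs: (H, W, C) softmax output; labels: (H, W) annotated class ids."""
    h, w, _ = probs.shape
    per_pixel = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    x = per_pixel.ravel()
    # soft minimum via a temperature-scaled log-mean-exp of the negatives;
    # as temperature -> 0 this approaches the hard minimum
    return -temperature * np.log(np.mean(np.exp(-x / temperature)))
```

Images with the lowest scores are the most likely to contain annotation errors and can be prioritized for review.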
-
|
STG-MTL: Scalable Task Grouping for Multi-Task Learning Using Data Maps
(
Poster
)
Multi-Task Learning (MTL) is a powerful technique that has gained popularity due to its performance improvement over traditional Single-Task Learning (STL). However, MTL is often challenging because there is an exponential number of possible task groupings, which can make it difficult to choose the best one, and some groupings might produce performance degradation due to negative interference between tasks. Furthermore, existing solutions are severely suffering from scalability issues, limiting any practical application. In our paper, we propose a new data-driven method that addresses these challenges and provides a scalable and modular solution for classification task grouping based on hand-crafted features, specifically Data Maps, which capture the training behavior for each classification task during the MTL training. We experiment with the method demonstrating its effectiveness, even on an unprecedented number of tasks (up to 100). |
Ammar Sherif · Abubakar Abid · Mustafa Elattar · Mohamed ElHelw 🔗 |
-
|
Detecting Errors in Numerical Data via any Regression Model
(
Poster
)
Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values due to reasons including: erroneous sensors, data entry/processing mistakes, or imperfect human estimates. Here we consider estimating which data values are incorrect along a numerical column. We present a model-agnostic approach that can utilize any regressor (i.e. statistical or machine learning model) which was fit to predict values in this column based on the other variables in the dataset. By accounting for various uncertainties, our approach distinguishes between genuine anomalies and natural data fluctuations, conditioned on the available information in the dataset. We establish theoretical guarantees for our method and show that other approaches like conformal inference struggle to detect errors. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches. |
Hang Zhou · Jonas Mueller · Mayank Kumar · Jane-Ling Wang · Jing Lei 🔗 |
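A rough sketch of the model-agnostic recipe, simplified from the description above: obtain out-of-fold predictions for the numerical column from any regressor and flag values whose residual is large relative to the typical error. The choice of random forest and the median-based scaling are assumptions for illustration, not the paper's exact uncertainty treatment.

```python
# Score each value in a numerical column by how far it deviates from a
# cross-validated prediction based on the other features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def error_scores(X, y, n_splits=5):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    y_hat = cross_val_predict(model, X, y, cv=n_splits)  # out-of-fold predictions
    resid = np.abs(y - y_hat)
    scale = np.median(resid) + 1e-12        # crude stand-in for uncertainty
    return resid / scale                    # larger score = more suspicious value
```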
-
|
ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data
(
Poster
)
Despite powering sensitive systems like autonomous vehicles, object detection remains fairly brittle, in part due to annotation errors that plague most real-world training datasets. We propose ObjectLab, a straightforward algorithm to detect diverse errors in object detection labels, including: overlooked bounding boxes, badly located boxes, and incorrect class label assignments. ObjectLab utilizes any trained object detection model to score the label quality of each image, such that mislabeled images can be prioritized for label review/correction. Properly handling the erroneous data enables training a better version of the same object detection model, without any change in existing modeling code. Benchmarks on SYNTHIA and naturally-occurring annotation errors in COCO reveal that across different object detection models/datasets, ObjectLab consistently detects errors with much better precision/recall compared to other label quality scores. |
Ulyana Tkachenko · Aditya Thyagarajan · Jonas Mueller 🔗 |
-
|
Characterizing Risk Regimes for Safe Deployment of Deep Regression Models
(
Poster
)
To ensure the safe deployment of AI models, it is crucial to identify potential failure modes to prevent costly errors. While failure detection in classification problems has received significant attention, characterizing failure or risk in regression is more complex and less explored. In this paper, we propose a new framework to characterize risk regimes in regression models. Our framework leverages the principle of anchoring to estimate both uncertainties and non-conformity scores, that can be used to jointly categorize samples into distinct risk regimes, thus enabling a fine-grained analysis of model failure. Additionally, we introduce a suite of metrics for evaluating such failure detectors in regression settings. Our results on synthetic and real-world benchmarks demonstrate the effectiveness of our framework over existing methods that rely solely on predictive uncertainties or feature inconsistency to assess risk. |
Jayaraman J. Thiagarajan · Vivek Narayanaswamy · Puja Trivedi · Rushil Anirudh 🔗 |
-
|
Offline Reinforcement Learning with Imbalanced Datasets
(
Poster
)
The prevalent use of benchmarks in current offline reinforcement learning (RL) research has led to a neglect of the imbalance of real-world dataset distributions in the development of models. Real-world offline RL datasets are often imbalanced over the state space due to the challenge of exploration or safety considerations. In this paper, we specify properties of imbalanced datasets in offline RL, where the state coverage follows a power law distribution characterized by skewed policies. Theoretically and empirically, we show that typical offline RL methods based on distributional constraints, such as conservative Q-learning (CQL), are ineffective in extracting policies from imbalanced datasets. Inspired by natural intelligence, we propose a novel offline RL method that augments CQL with a retrieval process to recall past related experiences, effectively alleviating the challenges posed by imbalanced datasets. We evaluate our method on several tasks in the context of imbalanced datasets with varying levels of imbalance, utilizing a variant of D4RL. Empirical results demonstrate the superiority of our method over other baselines. |
Li Jiang · Sijie Cheng · Jielin Qiu · Victor Chan · Ding Zhao 🔗 |
-
|
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data
(
Poster
)
Current trends to pre-train capable Large Language Models (LLMs) mostly focus on scaling of model and dataset size. However, the quality of pre-training data is an important factor for training powerful LLMs, yet it is a nebulous concept that has not been fully characterized. Therefore, we use the recently proposed Task2Vec diversity coefficient to ground and understand formal aspects of data quality, going beyond scale alone. Specifically, we measure the diversity coefficient of publicly available pre-training datasets to demonstrate that their formal diversity is high when compared to theoretical lower and upper bounds. In addition, to build confidence in the diversity coefficient, we conduct interpretability experiments and find that the coefficient aligns with intuitive properties of diversity, e.g., it increases as the number of latent concepts increases. We conclude that the diversity coefficient is reliable, show that it is high for publicly available LLM datasets, and conjecture it can be used to build useful diverse datasets for LLMs. |
Alycia Lee · Brando Miranda · Brando Miranda · Sanmi Koyejo 🔗 |
-
|
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
(
Poster
)
Large language models (LLMs) have been recently leveraged as training data generators for text classification. While previous research has explored different approaches to training models using generated data, there is a tendency to rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying the length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. Additionally, we present a comprehensive empirical study on data generation encompassing vital aspects like bias, diversity, and efficiency. Importantly, our findings highlight two key observations: firstly, synthetic datasets generated by simple prompts exhibit significant biases, such as regional bias; secondly, attribute diversity plays a pivotal role in enhancing model performance. |
Yue Yu · Yuchen Zhuang · Jieyu Zhang · Yu Meng · Alex Ratner · Ranjay Krishna · Jiaming Shen · Chao Zhang 🔗 |
-
|
Is Pre-training Truly Better Than Meta-Learning?
(
Poster
)
In the context of few-shot learning, it is currently believed that a fixed pre-trained (PT) model, along with fine-tuning the final layer during evaluation, outperforms standard meta-learning algorithms. We re-evaluate these claims under an in-depth empirical examination of an extensive set of formally diverse datasets and compare PT to Model Agnostic Meta-Learning (MAML). Unlike previous work, we emphasize a fair comparison by using the same architecture, the same optimizer, and all models trained to convergence. Crucially, we use a more rigorous statistical tool -- the effect size (Cohen's d) -- to determine the practical significance of the difference between a model trained with PT vs. MAML. We then use a previously proposed metric -- the diversity coefficient -- to compute the average formal diversity of a dataset. Using this analysis, we demonstrate the following: 1) when the formal diversity of a dataset is low, PT beats MAML on average, and 2) when the formal diversity is high, MAML beats PT on average. The caveat is that the magnitude of the average difference between PT and MAML, measured with the effect size, is low (according to classical statistical thresholds) -- less than 0.2. Nevertheless, this observation is contrary to the currently held belief that a pre-trained model is always better than a meta-learning model. Our extensive experiments consider 21 few-shot learning benchmarks, including the large-scale few-shot learning dataset Meta-Dataset. We also show no significant difference between a MAML model and a PT model with GPT-2 on Openwebtext. We therefore conclude that a pre-trained model does not always beat a meta-learned model and that the formal diversity of a dataset is a driving factor. |
Brando Miranda · Patrick Yu · Saumya Goyal · Yu-Xiong Wang · Sanmi Koyejo 🔗 |
-
|
Characterizing the Impacts of Semi-supervised Learning for Weak Supervision
(
Poster
)
Labeling training data is a critical and expensive step in producing high accuracy ML models, whether training from scratch or fine-tuning. To make labeling more efficient, two major approaches are programmatic weak supervision (WS) and semi-supervised learning (SSL). More recent works have either explicitly or implicitly used techniques at their intersection, but in various complex and ad hoc ways. In this work, we define a simple, modular design space to study the use of SSL techniques for WS more systematically. Surprisingly, we find that fairly simple methods from our design space match the performance of more complex state-of-the-art methods, averaging a 3 p.p. increase in accuracy/F1-score across 8 standard WS benchmarks. Further, we provide practical guidance on when different components are worth their added complexity and training costs. Contrary to current understanding, we find using SSL is not necessary to obtain the best performance on most WS benchmarks but is more effective when: (1) end models are smaller, and (2) WS provides labels for only a small portion of training examples. |
Jeffrey Li · Jieyu Zhang · Ludwig Schmidt · Alex Ratner 🔗 |
-
|
A Skew-Sensitive Evaluation Framework for Imbalanced Data Classification
(
Poster
)
Class distribution skews in imbalanced datasets may lead to models with prediction bias towards majority classes, making fair assessment of classifiers a challenging task. Metrics such as Balanced Accuracy are commonly used to evaluate a classifier’s prediction performance under such scenarios. However, these metrics fall short when classes vary in importance. In this paper, we propose a simple and general-purpose evaluation framework for imbalanced data classification that is sensitive to arbitrary skews in class cardinalities and importances. Experiments with several state-of-the-art classifiers tested on real-world datasets from three different domains show the effectiveness of our framework – not only in evaluating and ranking classifiers, but also in training them. |
Min Du · Nesime Tatbul · Brian Rivers · Akhilesh Kumar Gupta · Lucas Hu · Wei Wang · Ryan Marcus · Shengtian Zhou · Insup Lee · Justin Gottschlich 🔗 |
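One simple instance of a skew-sensitive metric in this spirit, not the paper's exact formulation, is per-class recall weighted by user-specified class importances:

```python
# Importance-weighted per-class recall: rare-but-critical classes are not
# drowned out by majority classes.
import numpy as np

def weighted_class_recall(y_true, y_pred, importance):
    classes = np.unique(y_true)
    w = np.array([importance.get(c, 1.0) for c in classes], dtype=float)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return np.sum(w * recalls) / np.sum(w)

# Example: class 1 is rare but three times as important as class 0.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
print(weighted_class_recall(y_true, y_pred, {0: 1.0, 1: 3.0}))
```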
-
|
Learning pipeline-invariant representation for robust brain phenotype prediction
(
Poster
)
Deep learning has been widely applied in neuroimaging, including predicting brain-phenotype relationships from magnetic resonance imaging (MRI) volumes. MRI data usually requires extensive preprocessing prior to modeling, but the variation introduced by different MRI preprocessing pipelines may lead to different scientific findings, even when using identical data. Meanwhile, the machine learning community has emphasized the importance of shifting from model-centric to data-centric approaches, considering the essential role of data quality in deep learning applications. Motivated by this recent data-centric perspective, we first evaluate how preprocessing pipeline selection can affect the downstream performance of a supervised learning model. We then propose two pipeline-invariant representation learning methodologies, MPSL and PXL, to improve robustness in classification performance and to capture similar neural network representations. Using a wide range of sample sizes from the UK Biobank dataset, we demonstrate that the two models present common advantages: in particular, MPSL and PXL can be used to improve within-sample prediction performance and out-of-sample generalization. Both PXL and MPSL also learn more similar between-pipeline representations. These results suggest that our proposed models can be applied to mitigate pipeline-related biases and to improve prediction robustness in brain-phenotype modeling. |
Xinhui Li · Alex Fedorov · Mrinal Mathur · Anees Abrol · Gregory Kiar · Sergey Plis · Vince Calhoun 🔗 |
-
|
Improving multimodal datasets with image captioning
(
Poster
)
Massive web datasets play a key role in the success of large vision-language models such as CLIP and Flamingo. However, the raw data is noisy, and existing methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies the effectiveness of synthetic captions in increasing the utility of web-scraped datapoints with poorly aligned captions. Through exploring different mixing strategies for raw and synthetic captions, we achieve state-of-the-art performance at the small and medium scales of the DataComp benchmark (Gadre et al., 2023), improving ImageNet accuracy by 2% and average accuracy (over 38 tasks) by 4% compared to the previous best baseline, given a candidate pool of 128M image-text pairs. The best-performing approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions so effective, and explore the impact of image captioning model and sampling temperature on the resulting training set. Overall our findings demonstrate the potential of leveraging image-captioning models as a way to improve multimodal datasets, as (i) we show that progress in image captioning models can translate to better captions and boost accuracy, and (ii) this unlocks a plethora of web images without accompanying captions that can now be used for training. |
Thao Nguyen · · Gabriel Ilharco · Sewoong Oh · Ludwig Schmidt 🔗 |
-
|
Adaptive Aggregated Drift Detector
(
Poster
)
There needs to be an adaptive approach that combines both performance- and distribution-based concept drift detectors in order to harness the benefits of unlabeled data and the ability to detect varying types of drift. This paper proposes the Adaptive Aggregated Drift Detector (A2D2), which consists of a suite of performance- and data-distribution-based detectors and adaptively selects among them based on rankings of least cost. The notable contribution is that it enables an ecosystem not only to adaptively combat drift, but also to expand the information learned across a suite of detectors. |
Beverly Quon · Jean-Luc Gaudiot 🔗 |
-
|
On Estimating the Epistemic Uncertainty of Graph Neural Networks using Stochastic Centering
(
Poster
)
Graph neural networks (GNNs) are known to have limited expressivity (poor size generalization; over-smoothing; over-squashing). However, at test time, GNNs may encounter distributions where such factors are present. For example, test datasets may have larger sizes than those used for training. In such settings, to ensure safe deployment, it is necessary that GNNs provide accurate confidence indicators that can then be utilized in a variety of downstream safety tasks (generalization gap prediction; calibration; OOD detection). Here, we assess the ability of several baseline uncertainty estimators (Monte Carlo Dropout, Deep Ensembles, Temperature Scaling) in producing well-calibrated confidence estimates under covariate and concept shifts, and study the impact of architecture and model size on the quality of these estimates. Moreover, we adapt a recently proposed stochastic centering framework to graph datasets/GNNs, identifying several graph-specific challenges in the process. Overall, our work not only rigorously studies UQ under challenging graph distribution shifts, but also provides multiple insights into designing effective UQ estimators on graphs that are effective on a variety of safety-critical tasks. |
Puja Trivedi · Mark Heimann · Rushil Anirudh · Danai Koutra · Jayaraman J. Thiagarajan 🔗 |
-
|
Identifying Implicit Social Biases in Vision-Language Models
(
Poster
)
Vision-language models like CLIP are widely used for multimodal retrieval tasks. However, they can learn historical biases from their training data, resulting in the perpetuation of stereotypes and potential harm. In this study, we analyze the social biases present in CLIP, particularly in the interaction between image and text. We introduce a taxonomy of social biases called So-B-IT, consisting of 374 words categorized into ten types of bias. These biases can have negative societal effects when associated with specific demographic groups. Using this taxonomy, we investigate the images retrieved by CLIP from a facial image dataset using each word as a prompt. We observe that CLIP often exhibits undesirable associations between harmful words and particular demographic groups. Furthermore, we explore the source of these biases by demonstrating their presence in a large image-text dataset used to train CLIP models. Our findings emphasize the significance of evaluating and mitigating bias in vision-language models, underscoring the necessity for transparent and fair curation of extensive pre-training datasets. |
Kimia Hamidieh · Haoran Zhang · Thomas Hartvigsen · Marzyeh Ghassemi 🔗 |
-
|
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
(
Poster
)
Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove “semantic duplicates”: data pairs which are semantically similar, but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time. Moreover, performance increases out of distribution. Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains. SemDeDup provides an example of how simple ways of leveraging quality embeddings can be used to make models learn faster with less data. |
Amro Abbas · Daniel Simig · Surya Ganguli · Ari Morcos · Kushal Tirumala 🔗 |
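A toy sketch of semantic deduplication in this spirit: cluster pre-trained embeddings, then within each cluster drop one member of every pair whose cosine similarity exceeds a threshold. The cluster count, threshold, and use of k-means here are illustrative assumptions rather than the paper's exact pipeline.

```python
# Remove "semantic duplicates" by thresholding cosine similarity inside
# embedding-space clusters.
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings, n_clusters=100, threshold=0.95):
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    keep = np.ones(len(emb), dtype=bool)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        sims = emb[idx] @ emb[idx].T            # pairwise cosine similarities
        for a in range(len(idx)):
            if not keep[idx[a]]:
                continue
            dup = np.flatnonzero(sims[a, a + 1:] > threshold) + a + 1
            keep[idx[dup]] = False              # drop near-duplicates of idx[a]
    return keep                                 # boolean mask of examples to retain
```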
-
|
LabelBench: A Comprehensive Framework for Benchmarking Label-Efficient Learning
(
Poster
)
Labeled data are critical to modern machine learning applications, but obtaining labels can be expensive. To mitigate this cost, machine learning methods, such as transfer learning, semi-supervised learning and active learning, aim to be label-efficient: achieving high predictive performance from relatively few labeled examples. While obtaining the best label-efficiency in practice often requires combinations of these techniques, existing benchmark and evaluation frameworks do not capture a concerted combination of all such techniques. This paper addresses this deficiency by introducing LabelBench, a new computationally-efficient framework for joint evaluation of multiple label-efficient learning techniques. As an application of LabelBench, we introduce a novel benchmark of state-of-the-art active learning methods in combination with semi-supervised learning for fine-tuning pretrained vision transformers. Our benchmark demonstrates better label-efficiencies than previously reported in active learning. LabelBench's modular codebase will be open-sourced for the broader community to contribute label-efficient learning methods and benchmarks. |
Jifan Zhang · Yifang Chen · Gregory Canal · Stephen Mussmann · Yinglun Zhu · Simon Du · Kevin Jamieson · Robert Nowak 🔗 |
-
|
Internet Explorer: Targeted Representation Learning on the Open Web
(
Poster
)
Modern vision models typically rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only capture the knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet--where billions of images are uploaded each day. Rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that excels at the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next. We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30-40 hours. |
Alexander Li · Ellis Brown · Alexei Efros · Deepak Pathak 🔗 |
-
|
Graphtester: Exploring Theoretical Boundaries of GNNs on Graph Datasets
(
Poster
)
Graph Neural Networks (GNNs) have emerged as a powerful tool for learning from graph-structured data. However, even state-of-the-art architectures have limitations on what structures they can distinguish, imposing theoretical limits on what the networks can achieve on different datasets. In this paper, we provide a new tool called Graphtester for comprehensive analysis of the theoretical capabilities of GNNs for various datasets, tasks, and scores. We use Graphtester to analyze over 40 different graph datasets, determining upper bounds on the performance of various GNNs based on the number of layers. Further, we show that the tool can also be used for Graph Transformers using positional node encodings, thereby expanding its scope. Finally, we demonstrate that features generated by Graphtester can be used for practical applications such as Graph Transformers, and provide a synthetic dataset to benchmark node and edge features, such as positional encodings. The package is freely available at the following URL: https://anonymous.4open.science/r/graphtester |
M. Eren Akbiyik · Florian Grötschla · Beni Egressy · Roger Wattenhofer 🔗 |
-
|
Early Experiments in Scalable Dataset Selection for Self-Supervised Learning in Geospatial Imagery Models
(
Poster
)
Dataset selection plays a crucial role in large-scale self-supervised geospatial imagery models, particularly with regard to the impact of dataset diversity on model efficacy. This study investigates the effectiveness of diverse geospatial imagery datasets in enhancing the downstream task performance of a self-supervised model trained on such data. To address this, we propose a scalable online clustering method for dataset selection that is designed to maximize diversity. Through a series of experiments on BigEarthNet, we demonstrate both the efficacy of our approach for increasing downstream task performance and its ability to significantly enhance dataset diversity. The results reveal substantial improvements in both supervised and self-supervised training performance. Specifically, our findings demonstrate up to a ~5% increase in accuracy for supervised tasks and a notable ~6% improvement on downstream tasks following self-supervised learning, surpassing the capabilities of traditional dataset selection methods used in the geospatial domain. These early results highlight the practical value of our approach in constructing robust self-supervised datasets from extensive archives of geospatial imagery, thereby unlocking new possibilities for advanced geospatial analysis and applications. |
Muhammed Razzak · Anthony Ortiz · Caleb Robinson 🔗 |
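The sketch below illustrates one way diversity-driven selection with online clustering could look: cluster image embeddings with scikit-learn's MiniBatchKMeans and keep a few samples per cluster so all clusters are represented. Function names, cluster count, and per-cluster budget are assumptions; this is not the authors' pipeline.

```python
# Select a diverse subset by keeping the samples closest to each centroid.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def select_diverse(embeddings, n_clusters=256, per_cluster=10):
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0, n_init=3)
    labels = km.fit_predict(embeddings)
    selected = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        if len(idx) == 0:
            continue
        d = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        selected.extend(idx[np.argsort(d)[:per_cluster]])  # closest to centroid
    return np.array(selected)
```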
-
|
Uncovering Neural Scaling Law in Molecular Representation Learning
(
Poster
)
Molecular Representation Learning (MRL) has demonstrated great potential in a variety of tasks such as virtual screening for drug and materials discovery. Despite the widespread interests in advancing model-centric techniques, how the quantity and quality of molecular data affect the learned representation remains an open question in this field. In light of this, we investigate the neural scaling behaviors of MRL from a data-centric perspective across various dimensions, including (1) data modality, (2) data distribution, (3) pre-training intervention, and (4) model capacity. Our empirical studies confirm that the performance of MRL exhibits a power-law relationship with data quantity across aforementioned four dimensions. Moreover, our fine-grained analysis uncovers valuable factors that can be used to improve the learning efficiency. To seek the possibility to beat the scaling law, we adapt seven popular data pruning strategies to molecular data and benchmark their performances. Drawing from our experimental findings, we underscore the importance of data-centric MRL and discuss their potential for future research. |
Dingshuo Chen · Yanqiao Zhu · Jieyu Zhang · Yuanqi Du · Zhixun Li · Qiang Liu · Shu Wu · Liang Wang 🔗 |
-
|
MultiLegalPile: A 689GB Multilingual Legal Corpus
(
Poster
)
Large, high-quality datasets are crucial for training large language models (LLMs). However, so far, there are few datasets available for specialized critical domains such as law and the available ones are often only for the English language. We curate and release MultiLegalPile, a 689GB corpus in 24 languages from 17 jurisdictions. The MultiLegalPile corpus, which includes diverse legal data sources with varying licenses, allows for pretraining NLP models under fair use, with more permissive licenses for the Eurlex Resources and Legal mC4 subsets. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on each of the language-specific subsets and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, the trained models, and all of the code under the most open possible licenses. |
Joel Niklaus · Veton Matoshi · Matthias Stürmer · Ilias Chalkidis · Daniel Ho 🔗 |
-
|
On Memorization and Privacy risks of Sharpness Aware Minimization
(
Poster
)
In many recent works, there is an increased focus on designing algorithms that seek wider optima for neural network loss optimization, as there is empirical evidence that this leads to better generalization performance on many datasets. In this work, we dissect these performance gains through the lens of data memorization in overparameterized models. We define a new metric that helps identify on which data points algorithms seeking wider optima do better than vanilla SGD. This insight helps us unearth data privacy risks associated with such algorithms, which we verify through exhaustive empirical evaluations. Finally, we propose mitigation strategies to achieve a more desirable accuracy vs. privacy trade-off. The proposed metric and insights are also applicable more generally when analyzing the performance and risks of a novel optimization algorithm. |
Young In Kim · Pratiksha Agrawal · Johannes Royset · RAJIV KHANNA 🔗 |
-
|
Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose?
(
Poster
)
Numerous benchmarks for Few-Shot Learning have been proposed in the last decade. However, all of these benchmarks focus on performance averaged over many tasks, and the question of how to reliably evaluate and tune models trained for individual tasks in this regime has not been addressed. This paper presents the first investigation into task-level evaluation---a fundamental step when deploying a model. We measure the accuracy of performance estimators in the few-shot setting, consider strategies for model selection, and examine the reasons for the failure of evaluators usually thought of as being robust. We conclude that cross-validation with a low number of folds is the best choice for directly estimating the performance of a model, whereas using bootstrapping or cross-validation with a large number of folds is better for model selection purposes. Overall, we find that existing benchmarks for few-shot learning are not designed in such a way that one can get a reliable picture of how effectively methods can be used on individual tasks. |
Luísa Shimabucoro · Timothy Hospedales · Henry Gouk 🔗 |
-
|
Can Expert Demonstration Guarantee Offline Performance in Sparse Reward Environment?
(
Poster
)
The reinforcement learning paradigm has shifted from online to offline, drawing on insights from supervised learning. Interestingly, we empirically find that expert demonstration datasets underperform in sparse reward environments. We conjecture that this result originates from two properties of the given dataset: reward ratio and trajectory diversity. These properties relate to reward experience and trajectory stitching ability, which are significant factors in the sparse reward problem. This study investigates the aforementioned properties to better understand the dataset's influence on offline performance in sparse reward environments. Experimental results demonstrate that offline RL performance is proportional to the product of reward ratio and trajectory diversity. Moreover, we identify that these two properties are in a trade-off. |
Jeyeon Eo · Dongsu Lee · Minhae Kwon 🔗 |
-
|
The Matrix Reloaded: A Counterfactual Perspective on Bias in Machine Learning
(
Poster
)
This paper introduces a novel data-centric framework for bias analysis in machine learning, leveraging the power of counterfactual reasoning. We propose a Counterfactual Confusion Matrix, from which we derive a suite of metrics that provide a comprehensive view of a model's behaviour under counterfactual conditions. These metrics offer unique insights into the model's resilience and susceptibility to changes in sensitive attributes such as sex or race. We demonstrate their utility and complementarity with standard fairness metrics through experiments on synthetic data and known real-world datasets. Our results show that our metrics can reveal subtle biases that traditional bias evaluation strategies may overlook, providing a more nuanced understanding of potential model bias. |
Andre Carreiro · Mariana Pinto · Pedro Madeira · Alberto Lopez · Hugo Gamboa 🔗 |
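A minimal sketch of tabulating a counterfactual confusion matrix, under the strong simplifying assumption that a counterfactual is produced by flipping a binary sensitive attribute and re-querying the model (real counterfactual generation is usually more involved):

```python
# Cross-tabulate original vs. counterfactual predictions; off-diagonal cells
# count predictions that change when only the sensitive attribute changes.
import pandas as pd

def counterfactual_confusion(model, X, sensitive_col):
    """X is assumed to be a pandas DataFrame with a binary sensitive column."""
    X_cf = X.copy()
    X_cf[sensitive_col] = 1 - X_cf[sensitive_col]   # flip the sensitive attribute
    orig = model.predict(X)
    cf = model.predict(X_cf)
    return pd.crosstab(pd.Series(orig, name="original"),
                       pd.Series(cf, name="counterfactual"))
```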
-
|
D4: Document Deduplication and Diversification
(
Poster
)
Over recent years, practitioners have poured an increasing amount of compute and data into training large language models (LLMs), usually by doing one-pass learning on randomly selected tokens from large-scale web corpora. While training on ever-larger portions of web scrapes leads to consistent performance improvement, there has been little work exploring the effect of data selection on pre-training and downstream performance outside of simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve downstream accuracy in LLMs (up to 2%). Furthermore, we show that repeating data intelligently selected by D4 consistently outperforms baseline training (while repeating random data performs worse than baseline training). This calls into question common practitioner intuition that randomly selecting new data is optimal for LLM pre-training. We hope our results motivate the community to rethink current standards in data selection for LLM pre-training. |
Kushal Tirumala · Daniel Simig · Armen Aghajanyan · Ari Morcos 🔗 |
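For readers unfamiliar with embedding-based selection, here is a minimal sketch of an assumed two-stage pipeline in the spirit of D4 (semantic deduplication followed by cluster-based diversification); it is not the D4 implementation and uses random vectors in place of pre-trained model embeddings.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(3)
    emb = rng.normal(size=(500, 32))                 # stand-in document embeddings
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)

    # 1) Semantic dedup: greedily drop documents too similar to one already kept.
    keep, threshold = [], 0.95
    for i in range(len(emb)):
        if not keep or cosine_similarity(emb[i:i + 1], emb[keep]).max() < threshold:
            keep.append(i)

    # 2) Diversification: cluster the survivors and draw evenly from each cluster.
    survivors = emb[keep]
    k, budget_per_cluster = 10, 20
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(survivors)
    selected = [idx for c in range(k)
                for idx in np.where(labels == c)[0][:budget_per_cluster]]
    print(f"kept {len(keep)} after dedup, selected {len(selected)} diverse documents")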
-
|
On Data Quality and Speed of Training: Bad Data Slows Training
(
Poster
)
While the effect of model architecture and hardware on training speed is well understood and appreciated, the role of data quality and quantity is often overlooked. In this paper, we quantify data quality across four dimensions, namely label correctness, information density, input coverage, and input space resolution, and conduct a data-driven analysis to understand the impact of data quality on training time as well as generalization error. We show that poor data quality can slow down the training process by one to two orders of magnitude, in an ablation study conducted over various domains (vision and text), datasets (Kaggle's Cat vs Dog, CIFAR10, CIFAR100, ImageNet, WikiText, and GLUE) and model architectures (MobileNet_v2, ResNet18, VGG19, ViT, BERT, and OPT). |
Newsha Ardalani · Mostafa Elhoushi · Carole-Jean Wu 🔗 |
-
|
Decoupled Graph Label Denoising for Robust Semi-Supervised Node Classification
(
Poster
)
Graph neural networks (GNNs) based on message passing have achieved remarkable performance in (semi-supervised) node classification. However, most existing works assume that node labels are noise-free, even though learning errors on mislabeled nodes can easily be propagated to unlabeled nodes along the graph structure. In this paper, we perform a preliminary study showing that message passing can potentially hurt the performance of GNN-based node classification in the presence of label noise. To address this issue, we propose to decouple the processes of message passing and node classification. Specifically, we first train a message-passing GNN in a self-supervised manner to learn informative node representations. Next, we propose a novel topology-aware noise transition matrix estimation algorithm to learn a robust node classifier without using GNNs. We conduct extensive experiments on real-world datasets for semi-supervised node classification with different levels of class-dependent and instance-dependent label noise and show new state-of-the-art performance. |
Kaize Ding · Yancheng Wang · Huan Liu 🔗 |
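The paper's topology-aware estimator is not detailed in the abstract; as background, the sketch below shows the standard way a class-to-class noise transition matrix T, once estimated, is used to correct a classifier's training loss (generic forward correction with a hypothetical symmetric-noise T, not the paper's algorithm).

    import numpy as np

    def forward_corrected_nll(logits, noisy_labels, T):
        """Cross-entropy against T-adjusted class probabilities (forward correction)."""
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        noisy_probs = probs @ T                      # P(observed noisy label | x)
        picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
        return -np.mean(np.log(picked + 1e-12))

    # Hypothetical 3-class transition matrix with 30% symmetric label noise.
    T = np.full((3, 3), 0.15) + np.eye(3) * 0.55
    logits = np.random.default_rng(0).normal(size=(16, 3))
    noisy_labels = np.random.default_rng(1).integers(0, 3, size=16)
    print(forward_corrected_nll(logits, noisy_labels, T))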
-
|
Ensemble Fractional Imputation for Incomplete Categorical Data with a Graphical Model
(
Poster
)
Missing data is common in practice, and standard statistical inference can be biased when missingness is related to the outcome of interest. We present a frequentist approach using a graphical model and fractional imputation, which can handle missing data for multivariate categorical variables under the missing-at-random assumption. To avoid problems caused by the curse of dimensionality in multivariate data, we adopt the idea of a random forest to fit multiple reduced models and then combine them using model weights. The model weights are computed with a novel method, double projection, in which the observed likelihood is projected onto the class of graphical mixture models. The performance of the proposed method is investigated through an extensive simulation study. |
Yonghyun Kwon · Jae-kwang Kim 🔗 |
-
|
Put on your detective hat: What's wrong in this video?
(
Poster
)
Following step-by-step procedures is an essential component of various activities carried out by individuals in their everyday lives. These procedures serve as a guiding framework that helps achieve goals efficiently, whether assembling furniture or preparing a recipe. However, the complexity and duration of procedural activities inherently increase the likelihood of making errors. Understanding such procedural activities from a sequence of frames is a challenging task that demands an accurate interpretation of visual information and an ability to reason about the structure of the activity. To this end, we collected a new ego-centric 4D dataset comprising 380 recordings (90 hrs) of people performing recipes in kitchen environments. This dataset consists of two distinct activity types: one in which participants adhere to the provided recipe instructions and another where they deviate and induce errors. We provide 5K step annotations and 10K fine-grained action annotations for 20% of the collected data and benchmark it on two tasks: error detection and procedure learning. |
Rohith Peddi · Shivvrat Arya · Bharath Challa · Likhitha Pallapothula · Akshay Vyas · Qifan Zhang · Jikai Wang · Vasundhara Komaragiri · Eric Ragan · Nicholas Ruozzi · Yu Xiang · Vibhav Gogate
|
-
|
Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value
(
Poster
)
Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging, as they require training a large number of models. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient. Specifically, Data-OOB takes less than $2.25$ hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is $100$. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data, highlighting the potential for applying data values in real-world applications.
|
Yongchan Kwon · James Zou 🔗 |
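Because the estimator only needs a single bagging ensemble, its flavour is easy to reproduce; the sketch below is an assumed form of the out-of-bag value (average correctness of a point's given label under trees that did not see it), not the authors' released code.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    flip = np.random.default_rng(0).choice(len(y), size=200, replace=False)
    y_noisy = y.copy()
    y_noisy[flip] = 1 - y_noisy[flip]                # inject label noise

    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            bootstrap=True, random_state=0).fit(X, y_noisy)

    values, counts = np.zeros(len(X)), np.zeros(len(X))
    for est, in_bag in zip(bag.estimators_, bag.estimators_samples_):
        oob = np.setdiff1d(np.arange(len(X)), in_bag)  # points this tree never saw
        values[oob] += (est.predict(X[oob]) == y_noisy[oob])
        counts[oob] += 1
    data_oob = values / np.maximum(counts, 1)

    # Mislabeled points should tend to receive low values.
    print("flipped:", data_oob[flip].mean(), "clean:", np.delete(data_oob, flip).mean())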
-
|
Regularizing Neural Networks with Meta-Learning Generative Models
(
Poster
)
This paper investigates benchmarking and improving generative data augmentation. Generative data augmentation leverages the synthetic samples produced by generative models as an additional dataset for classification in small-dataset settings. However, through benchmarking on multiple datasets, we found that generative data augmentation fails to boost performance when the training dataset is small, even when state-of-the-art generative models are used. This is because the synthetic samples do not perfectly represent class categories in real data and uniform sampling does not necessarily provide useful samples for the task. In this paper, we present a novel strategy for generative data augmentation called meta generative regularization (MGR). To avoid the degradation of generative data augmentation, MGR utilizes synthetic samples in the regularization term for feature extractors instead of in the loss function, e.g., cross-entropy. These synthetic samples are dynamically determined to minimize the validation losses through meta-learning. We observed that MGR can avoid the performance degradation of naïve generative data augmentation and boost the baselines. Experiments on six datasets showed that MGR is particularly effective when datasets are small and stably outperforms baselines. |
Shin'ya Yamaguchi · Daiki Chijiwa · Sekitoshi Kanai · Atsutoshi Kumagai · Hisashi Kashima 🔗 |
-
|
Taming Small-sample Bias in Low-budget Active Learning
(
Poster
)
Active learning (AL) aims to minimize the annotation cost by querying only a few informative examples for each model training stage. However, training a model on a few queried examples suffers from small-sample bias. In this paper, we address this small-sample bias issue in low-budget AL by exploring a regularizer called Firth bias reduction, which can provably reduce the bias during the model training process but might hinder learning if its coefficient is not adaptive to the learning progress. Instead of tuning the coefficient for each query round, which is sensitive and time-consuming, we propose curriculum Firth bias reduction (CHAIN), which can automatically adjust the coefficient to be adaptive to the training process. Under both deep learning and linear model settings, experiments on three benchmark datasets with several widely used query strategies and hyperparameter search methods show that CHAIN can be used to build more efficient AL and can substantially improve the progress made by each active learning query. |
Linxin Song · Jieyu Zhang · Xiaotian Lu · Tianyi Zhou 🔗 |
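As background on the regularizer itself, here is a minimal sketch of Firth-penalized logistic regression (the penalty is half the log-determinant of the Fisher information), with the penalty coefficient simply stepped down across query rounds. The schedule values are placeholders; CHAIN's adaptive rule and its deep-learning variant are not reproduced here.

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=40, n_features=5, random_state=0)  # tiny labeled pool
    X = np.column_stack([np.ones(len(X)), X])        # add an intercept column

    def neg_penalized_loglik(beta, X, y, lam):
        z = np.clip(X @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        loglik = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        W = p * (1 - p)
        fisher = X.T @ (X * W[:, None])              # Fisher information X^T W X
        _, logdet = np.linalg.slogdet(fisher)
        return -(loglik + lam * 0.5 * logdet)        # Firth term: +0.5 * lam * log|I(beta)|

    # Placeholder curriculum: decay the coefficient as the labeled set grows.
    for round_idx, lam in enumerate([1.0, 0.5, 0.25, 0.1]):
        res = minimize(neg_penalized_loglik, np.zeros(X.shape[1]),
                       args=(X, y, lam), method="BFGS")
        print(f"round {round_idx}: lam={lam}, converged={res.success}")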
-
|
PhysicsCAP: Natural Scene Understanding By Semantic Segmentation, CLIP And Physical Models Through Refined and Enriched Captions
(
Poster
)
Vision-Language Models (VLMs) trained on image-text pairs, such as CLIP, have boosted image-based Deep Learning (DL). With the help of language models pre-trained only on text, unseen images can be handled by transferring semantic knowledge from seen classes, capturing two-dimensional spatial relationships and a higher semantic level. Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative text, in captions. However, the capability of VLMs is still far from that of human perception: captions from state-of-the-art (SOTA) VLMs do not contain physical scales derived from images. Prepositions in captions, such as "left" and "on", convey the relative positions of objects, but in addition to such two-dimensional clues, three-dimensional clues such as "far" may be more helpful. Physical scales are therefore needed for better natural scene understanding. For example, visibility affects traffic flow and control on city roads, highways, and runways, and visibility distance or level is an important measure for predicting road risk; yet only a few papers have tackled such nighttime vision with visibility estimation. This paper proposes PhysicsCAP, which combines multiple DL models and VLMs to produce refined and enriched qualitative and quantitative captions that are closer to what humans recognize. In particular, captions with physical scales and objects' surface properties are integrated through water level, counting, depth maps, visibility distance, and road conditions. Fine-tuned VLMs are also used, together with an iteratively refined caption model trained with a new physics-based contrastive loss function. Experimental results on images with adverse weather conditions (rain, snow, fog, landslide, flooding) and traffic events (accidents) outperform SOTA DL models and VLMs, demonstrating a higher semantic level in captions for real-world scene descriptions. |
Hidetomo Sakaino 🔗 |
-
|
Training with Low-Label-Quality Data: Rank Pruning and Multi-Review
(
Poster
)
Inaccurate labels in training data are a common problem in machine learning. Algorithms have been proposed to prune samples with label noise (i.e., samples that are far from the decision boundary but whose label is nonetheless inaccurate); training models on such samples can cause poor model performance. However, in many real applications there exist samples around the decision boundary that are inherently difficult to label, leading to label error. Such samples are important for model training because of their high learning value. Existing pruning algorithms do not differentiate between samples with label noise and samples with label error, and therefore prune both kinds. This paper improves an existing pruning algorithm in two ways: it (a) prunes noisy samples and high-confidence samples (with less learning value), and (b) preserves the samples (potentially) with label error that have a high learning value and gets accurate labels for them (using multiple reviews). Our evaluation using publicly available and Meta internal de-identified and aggregated data sets shows that the combination of these ideas improves the baseline pruning algorithm. |
Yue Xing · Ashutosh Pandey · David Yan · Fei Wu · Michael Fronda · Pamela Bhattacharya 🔗 |
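A minimal sketch of the distinction the paper draws, with hypothetical thresholds (a generic heuristic, not Meta's production system): suspect samples that the model contradicts confidently are pruned as label noise, while suspect samples near the decision boundary are kept and routed to multi-review.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    X, y = make_classification(n_samples=3000, n_features=10, flip_y=0.05, random_state=0)
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")
    p_given = proba[np.arange(len(y)), y]            # probability of the given label
    margin = np.abs(proba[:, 1] - 0.5)               # distance from the decision boundary

    prune_mask = (p_given < 0.2) & (margin > 0.3)    # confident disagreement: prune
    review_mask = (p_given < 0.5) & (margin <= 0.3)  # boundary samples: send to review
    print(f"pruned: {prune_mask.sum()}, sent to multi-review: {review_mask.sum()}")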
-
|
DataCI: A Platform for Data-Centric AI on Streaming Data
(
Poster
)
We introduce DataCI, a comprehensive open-source platform designed specifically for data-centric AI in dynamic streaming data settings. DataCI provides 1) an infrastructure with rich APIs for seamless streaming dataset management and data-centric pipeline development and evaluation in streaming scenarios, 2) a carefully designed version control function to track pipeline lineage, and 3) an intuitive graphical interface for a better interactive user experience. Preliminary studies and demonstrations attest to the ease of use and effectiveness of DataCI, highlighting its potential to revolutionize the practice of data-centric AI in streaming data contexts. |
Huaizheng Zhang · Liao Chang · Yuanming Li 🔗 |
-
|
Participatory Personalization in Classification
(
Poster
)
Machine learning models are often personalized based on information that is protected, sensitive, self-reported, or costly to acquire. These models use information about people, but neither facilitate nor inform their consent. Individuals cannot opt out of reporting information that a model needs to personalize their predictions, nor can they tell whether they would benefit from personalization in the first place. We introduce a new family of prediction models, called participatory systems, that let individuals opt into personalization at prediction time. We present a model-agnostic algorithm to learn participatory systems for supervised learning tasks where models are personalized with categorical group attributes. We conduct a comprehensive empirical study of participatory systems in clinical prediction tasks, comparing them to common approaches for personalization and imputation. Experimental results demonstrate that participatory systems can facilitate and inform consent in a way that improves performance and privacy across all groups who report personal data. |
Hailey Joren · Chirag Nagpal · Katherine Heller · Berk Ustun 🔗 |
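The core interface is simple to picture; the sketch below is an assumed illustration of opt-in personalization at prediction time (a generic model for individuals who opt out, a personalized model for those who report a group attribute), not the authors' model-agnostic learning algorithm.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
    group = rng.integers(0, 2, size=len(y))          # hypothetical categorical group attribute

    generic = LogisticRegression(max_iter=1000).fit(X, y)
    personalized = LogisticRegression(max_iter=1000).fit(np.column_stack([X, group]), y)

    def participatory_predict(x, group_value=None):
        """Use the personalized model only if the individual reports the attribute."""
        if group_value is None:                      # opted out: no personal data required
            return generic.predict(x.reshape(1, -1))[0]
        return personalized.predict(np.append(x, group_value).reshape(1, -1))[0]

    print(participatory_predict(X[0]))               # opted out
    print(participatory_predict(X[0], group_value=1))  # opted in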
-
|
Making Scalable Meta Learning Practical
(
Poster
)
Despite its flexibility to learn diverse inductive biases in machine learning programs, meta learning (i.e., learning to learn) has long been recognized to suffer from poor scalability due to its tremendous compute/memory costs, training instability, and a lack of efficient distributed training support. In this work, we focus on making scalable meta learning practical by introducing SAMA, which combines advances in both implicit differentiation algorithms and systems. Specifically, SAMA is designed to support arbitrary optimizers in the base level of meta learning programs, while reducing computational burden by avoiding explicit computation of second-order gradient information, and exploiting efficient distributed training techniques implemented for first-order gradients. Evaluated on multiple large-scale meta learning benchmarks, SAMA showcases up to 1.7/4.8x increase in throughput and 2.0/3.8x decrease in memory consumption respectively on single-/multi-GPU setups compared to other baseline meta learning algorithms. Furthermore, we show that SAMA-based data optimization leads to consistent improvements in text classification accuracy with BERT and RoBERTa large language models, and achieves state-of-the-art results in both small- and large-scale data pruning on image classification tasks, demonstrating the practical applicability of scalable meta learning across language and vision domains. |
Sang Keun Choe · Sanket Vaibhav Mehta · Hwijeen Ahn · Willie Neiswanger · Pengtao Xie · Emma Strubell · Eric Xing 🔗 |
-
|
Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning
(
Poster
)
Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms. Notably, employing simple observation transformations alone can yield outstanding performance without extra auxiliary representation tasks or pre-trained encoders. However, it remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL. To investigate this issue and further explore the potential of DA, this work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy and provides the following insights and improvements: (1) For individual DA operations, we reveal that both ample spatial diversity and slight hardness are indispensable. Building on this finding, we introduce Random PadResize (Rand PR), a new DA operation that offers abundant spatial diversity with minimal hardness. (2) For multi-type DA fusion schemes, the increased DA hardness and unstable data distribution result in the current fusion schemes being unable to achieve higher sample efficiency than their corresponding individual operations. Taking the non-stationary nature of RL into account, we propose an RL-tailored multi-type DA fusion scheme called Cycling Augmentation (CycAug), which performs periodic cycles of different DA operations to increase type diversity while maintaining data distribution consistency. Extensive evaluations on the DeepMind Control suite and CARLA driving simulator demonstrate that our methods achieve superior sample efficiency compared with the prior state-of-the-art methods. |
Guozheng Ma · · Haoyu Wang · Lu Li · Zilin Wang · Zhen Wang · Li Shen · Xueqian Wang · Dacheng Tao 🔗 |
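From the description, Random PadResize pads the observation by a random amount on each side and resizes it back to the original resolution. The sketch below is an assumed reconstruction (nearest-neighbour resize used for brevity), not the authors' released implementation.

    import numpy as np

    def rand_pad_resize(obs: np.ndarray, max_pad: int = 8) -> np.ndarray:
        """obs: HxWxC uint8 observation; random edge padding, then resize back."""
        h, w, _ = obs.shape
        rng = np.random.default_rng()
        top, bottom, left, right = rng.integers(0, max_pad + 1, size=4)
        padded = np.pad(obs, ((top, bottom), (left, right), (0, 0)), mode="edge")
        ph, pw = padded.shape[:2]
        rows = (np.arange(h) * ph / h).astype(int)   # nearest-neighbour resize indices
        cols = (np.arange(w) * pw / w).astype(int)
        return padded[rows][:, cols]

    obs = (np.random.rand(84, 84, 3) * 255).astype(np.uint8)
    print(rand_pad_resize(obs).shape)                # (84, 84, 3)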
-
|
Data Integration for Driver Telematics with Selection Biases
(
Poster
)
While driver telematics has gained attention for risk classification in auto insurance, the scarcity of observations with telematics features has been problematic, owing either to privacy concerns or to adverse selection relative to data points with traditional features. To handle this issue, we explore multiple data integration approaches and assess their performance, both in inference and in prediction, in a case study. We show that one of the approaches, the propensity score approach, can efficiently integrate so-called traditional data with telematics data pre-processed into a tabular format, and can also cope with possible adverse selection issues in the availability of telematics data better than other existing approaches. We expect that this research will encourage further discussion of and interest in telematics data handling within the ML community. |
Hashan Peiris · Himchan Jeong · Jae-kwang Kim 🔗 |
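As a rough illustration of the propensity score idea in this setting (a simplified weighting sketch, not the authors' estimator): model the probability that a policy has telematics data as a function of traditional features, then reweight the telematics subsample by the inverse of that probability so it better represents the full portfolio.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 5000
    df = pd.DataFrame({"age": rng.integers(18, 80, size=n),
                       "vehicle_age": rng.integers(0, 20, size=n)})
    # Telematics availability depends on traditional covariates (selection bias).
    p_avail = 1.0 / (1.0 + np.exp(0.05 * (df["age"] - 40)))
    df["has_telematics"] = rng.random(n) < p_avail

    # Estimate the propensity of having telematics data from traditional features.
    ps_model = LogisticRegression().fit(df[["age", "vehicle_age"]], df["has_telematics"])
    ps = ps_model.predict_proba(df[["age", "vehicle_age"]])[:, 1]

    # Inverse-probability weights make the telematics subsample representative.
    weights = np.where(df["has_telematics"], 1.0 / np.clip(ps, 1e-3, None), 0.0)
    print("effective sample size:", weights.sum() ** 2 / (weights ** 2).sum())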
-
|
Self-supervised Autoencoder for Correlation-Preserving in Tabular GANs
(
Poster
)
Preserving relationships and interactions between columns (or variables) is crucial for any synthetic tabular data generation approach. Despite their performance, existing generative adversarial network (GAN)-based methods do not place much importance on this aspect. In this work, we propose VSA+GAN, a framework that augments existing GANs to capture and learn inter-variable interactions with a self-supervised autoencoder trained on a novel pretext task. We show that the method is versatile and applicable to any variation of tabular generative adversarial network implementations, and empirically show that our framework significantly improves their performance in terms of data similarity, pair-wise correlation, and machine-learning utility metrics. |
Siddarth Ramesh · Surgan Jandial · Gauri Gupta · Piyush Gupta · Balaji Krishnamurthy 🔗 |
-
|
Why Do Self-Supervised Models Transfer? On Data Augmentation and Feature Properties
(
Poster
)
Self-supervised learning is a powerful paradigm for representation learning on unlabelled images. A wealth of effective new methods based on instance matching rely on data augmentation to drive learning, and these have reached a rough agreement on an augmentation scheme that optimises popular recognition benchmarks. However, there is strong reason to suspect that different tasks in computer vision require features to encode different (in)variances, and therefore likely require different augmentation strategies. In this paper, we measure the invariances learned by contrastive methods and confirm that they do learn invariance to the augmentations used, and further show that this invariance largely transfers to related real-world changes in pose and lighting. We show that learned invariances strongly affect downstream task performance and confirm that different downstream tasks benefit from polar opposite (in)variances, leading to performance loss when the standard augmentation strategy is used. Finally, we demonstrate that a simple fusion of representations with complementary invariances ensures wide transferability to all the diverse downstream tasks considered. |
Linus Ericsson · Henry Gouk · Timothy Hospedales 🔗 |
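A simple way to picture the kind of invariance measurement discussed here is the similarity of features before and after an augmentation. The sketch below is an assumed measurement (not the paper's exact protocol), with a random linear projection standing in for a self-supervised backbone and horizontal flipping as the augmentation.

    import numpy as np

    def cosine(a, b):
        return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

    def invariance_score(encode, images, augment):
        """Mean cosine similarity between features of original and augmented inputs."""
        feats = encode(images)
        feats_aug = encode(np.stack([augment(im) for im in images]))
        return float(np.mean(cosine(feats, feats_aug)))

    # Toy stand-ins for a real encoder and augmentation pipeline.
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(32 * 32 * 3, 128))
    encode = lambda ims: ims.reshape(len(ims), -1) @ proj
    augment = lambda im: im[:, ::-1, :]              # horizontal flip

    images = rng.random((64, 32, 32, 3))
    print("flip invariance:", invariance_score(encode, images, augment))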
-
|
Principlism Guided Responsible Data Curation
(
Poster
)
Human-centric computer vision (HCCV) data curation practices often neglect privacy and bias concerns, leading to dataset retractions and unfair models. Further, HCCV datasets constructed through nonconsensual web scraping lack the necessary metadata for comprehensive fairness and robustness evaluations. Current remedies address issues post hoc, lack persuasive justification for adoption, or fail to provide proper contextualization for appropriate application. Our research focuses on proactive, domain-specific recommendations for curating HCCV datasets, addressing privacy and bias. We adopt an ante hoc reflective perspective and draw from current practices and guidelines, guided by the ethical framework of principlism. |
Jerone Andrews · Dora Zhao · William Thong · Apostolos Modas · Orestis Papakyriakopoulos · Alice Xiang 🔗 |
Author Information
Ce Zhang (ETH Zurich)
Praveen Paritosh (Google)
Newsha Ardalani (Meta AI Research (FAIR))
Nezihe Merve Gürel (TU Delft)
William Gaviria Rojas (Coactive AI)
Yang Liu (UC Santa Cruz/ByteDance Research)
Rotem Dror (University of Pennsylvania)
Manil Maskey (NASA)
Lilith Bat-Leah (dPrism Advisors)
Lilith Bat-Leah is Vice President, Data Services at dPrism, responsible for consulting on use cases for data analytics, data science, and machine learning. Lilith has over 11 years of experience managing, delivering, and consulting on identification, preservation, collection, processing, review, annotation, analysis, and legal production of data. She also has experience in research and development of machine learning software for eDiscovery. She speaks and writes about various topics in eDiscovery, such as evaluation of machine learning systems, ESI protocols, and discovery of databases. Lilith holds a BSGS in Organization Behavior from Northwestern University, where she graduated magna cum laude. She is a current member of MLCommons/DataPerf/DynaBench and formerly served as Co-Trustee of the EDRM Analytics and Machine Learning project, as a member of the EDRM Global Advisory Council, as Vice President of the Chicago ACEDS chapter, and as President of the New York Metro ACEDS Chapter.
Tzu-Sheng Kuo (CMU)
Luis Oala (Dotphoton)
Max Bartolo (Cohere, UCL)
Ludwig Schmidt (University of Washington)
Alicia Parrish (Google)
Daniel Kondermann (Quality Match GmbH)
Najoung Kim (Boston University)
More from the Same Authors
-
2020 : Contributed Talk: Incentives for Federated Learning: a Hypothesis Elicitation Approach »
Yang Liu · Jiaheng Wei -
2020 : Contributed Talk: Linear Models are Robust Optimal Under Strategic Behavior »
Wei Tang · Chien-Ju Ho · Yang Liu -
2021 : Linear Classifiers that Encourage Constructive Adaptation »
Yatong Chen · Jialu Wang · Yang Liu -
2021 : When Optimizing f-divergence is Robust with Label Noise »
Jiaheng Wei · Yang Liu -
2022 : Adaptive Data Debiasing Through Bounded Exploration »
Yifan Yang · Yang Liu · Parinaz Naghizadeh -
2023 : Data Models for Dataset Drift Controls in Machine Learning With Optical Images »
Luis Oala · Marco Aversa · Gabriel Nobis · Kurt Willis · Yoan Neuenschwander · Michèle Buck · Christian Matek · Jerome Extermann · Enrico Pomarico · Wojciech Samek · Roderick Murray-Smith · Christoph Clausen · Bruno Sanguinetti -
2023 : To Aggregate or Not? Learning with Separate Noisy Labels »
Jiaheng Wei · Zhaowei Zhu · Tianyi Luo · Ehsan Amid · Abhishek Kumar · Yang Liu -
2023 : Understanding Unfairness via Training Concept Influence »
Yuanshun Yao · Yang Liu -
2023 : Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning »
Patrik Okanovic · Roger Waleffe · Vasileios Mageirakos · Konstantinos Nikolakakis · Amin Karbasi · Dionysios Kalogerias · Nezihe Merve Gürel · Theodoros Rekatsinas -
2023 : Towards an Efficient Algorithm for Time Series Forecasting with Anomalies »
Hao Cheng · Qingsong Wen · Yang Liu · Liang Sun -
2023 : Localized Data Work as a Precondition for Data-Centric ML: A Case Study of Full Lifecycle Crop Disease Identification in Ghana »
Darlington Akogo · Issah Samori · Cyril Akafia · Harriet Fiagbor · Andrews Kangah · Donald Donald · Kwabena Fuachie · Luis Oala -
2023 : Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models »
Mayee Chen · Nicholas Roberts · Kush Bhatia · Jue Wang · Ce Zhang · Frederic Sala · Christopher Ré -
2023 : On Data Quality and Speed of Training: Bad Data Slows Training »
Newsha Ardalani · Mostafa Elhoushi · Carole-Jean Wu -
2023 : GPT-Zip: Deep Compression of Finetuned Large Language Models »
Berivan Isik · Hermann Kumbong · Wanyi Ning · Xiaozhe Yao · Sanmi Koyejo · Ce Zhang -
2023 : Data Models for Dataset Drift Controls in Machine Learning With Optical Images »
Luis Oala · Marco Aversa · Gabriel Nobis · Kurt Willis · Yoan Neuenschwander · Michèle Buck · Christian Matek · Jerome Extermann · Enrico Pomarico · Wojciech Samek · Roderick Murray-Smith · Christoph Clausen · Bruno Sanguinetti -
2023 : Panel Discussion »
Megan Ansdell · Nathan Lambert · Ludwig Schmidt · Praveen Paritosh · Sang Michael Xie -
2023 : Announcement and open discussion on DMLR (Selected members of DMLR Advisory Board) »
Ce Zhang -
2023 : Data-centric Ecosystem: Croissant and Dataperf - Peter Mattson (Google & MLCommons) »
Peter Mattson · Praveen Paritosh -
2023 : Introduction and Opening »
Praveen Paritosh -
2023 Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning »
Nezihe Merve Gürel · Bo Li · Theodoros Rekatsinas · Beliz Gunel · Alberto Sangiovanni Vincentelli · Paroma Varma -
2023 : Opening Remarks »
Nezihe Merve Gürel -
2023 Social: AI Data Underground DMLR Social - Discussing Data-centric Machine Learning Research »
Luis Oala -
2023 Oral: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time »
Zichang Liu · Jue Wang · Tri Dao · Tianyi Zhou · Binhang Yuan · Zhao Song · Anshumali Shrivastava · Ce Zhang · Yuandong Tian · Christopher Re · Beidi Chen -
2023 Poster: Identifiability of Label Noise Transition Matrix »
Yang Liu · Hao Cheng · Kun Zhang -
2023 Poster: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU »
Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang -
2023 Poster: Weak Proxies are Sufficient and Preferable for Fairness with Missing Sensitive Attributes »
Zhaowei Zhu · Yuanshun Yao · Jiankai Sun · Hang Li · Yang Liu -
2023 Oral: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU »
Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Re · Ion Stoica · Ce Zhang -
2023 Poster: CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks »
Jue Wang · Yucheng Lu · Binhang Yuan · Beidi Chen · Percy Liang · Chris De Sa · Christopher Re · Ce Zhang -
2023 Poster: Model Transferability with Responsive Decision Subjects »
Yatong Chen · Zeyu Tang · Kun Zhang · Yang Liu -
2023 Poster: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time »
Zichang Liu · Jue Wang · Tri Dao · Tianyi Zhou · Binhang Yuan · Zhao Song · Anshumali Shrivastava · Ce Zhang · Yuandong Tian · Christopher Re · Beidi Chen -
2023 Poster: FedHPO-Bench: A Benchmark Suite for Federated Hyperparameter Optimization »
Zhen WANG · Weirui Kuang · Ce Zhang · Bolin Ding · Yaliang Li -
2022 : Data Valuation »
Newsha Ardalani -
2022 : Model Transferability With Responsive Decision Subjects »
Yang Liu · Yatong Chen · Zeyu Tang · Kun Zhang -
2022 Workshop: DataPerf: Benchmarking Data for Data-Centric AI »
Lora Aroyo · Newsha Ardalani · Colby Banbury · Gregory Diamos · William Gaviria Rojas · Tzu-Sheng Kuo · Mark Mazumder · Peter Mattson · Praveen Paritosh -
2022 Poster: Estimating Instance-dependent Bayes-label Transition Matrix using a Deep Neural Network »
Shuo Yang · Erkun Yang · Bo Han · Yang Liu · Min Xu · Gang Niu · Tongliang Liu -
2022 Poster: Detecting Corrupted Labels Without Training a Model to Predict »
Zhaowei Zhu · Zihao Dong · Yang Liu -
2022 Poster: Understanding Instance-Level Impact of Fairness Constraints »
Jialu Wang · Xin Eric Wang · Yang Liu -
2022 Spotlight: Understanding Instance-Level Impact of Fairness Constraints »
Jialu Wang · Xin Eric Wang · Yang Liu -
2022 Spotlight: Estimating Instance-dependent Bayes-label Transition Matrix using a Deep Neural Network »
Shuo Yang · Erkun Yang · Bo Han · Yang Liu · Min Xu · Gang Niu · Tongliang Liu -
2022 Poster: Metric-Fair Classifier Derandomization »
Jimmy Wu · Yatong Chen · Yang Liu -
2022 Poster: Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower-Quality Features »
Zhaowei Zhu · Jialu Wang · Yang Liu -
2022 Spotlight: Detecting Corrupted Labels Without Training a Model to Predict »
Zhaowei Zhu · Zihao Dong · Yang Liu -
2022 Spotlight: Metric-Fair Classifier Derandomization »
Jimmy Wu · Yatong Chen · Yang Liu -
2022 Spotlight: Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower-Quality Features »
Zhaowei Zhu · Jialu Wang · Yang Liu -
2022 Poster: To Smooth or Not? When Label Smoothing Meets Noisy Labels »
Jiaheng Wei · Hangyu Liu · Tongliang Liu · Gang Niu · Masashi Sugiyama · Yang Liu -
2022 Poster: Certifying Out-of-Domain Generalization for Blackbox Functions »
Maurice Weber · Linyi Li · Boxin Wang · Zhikuan Zhao · Bo Li · Ce Zhang -
2022 Oral: To Smooth or Not? When Label Smoothing Meets Noisy Labels »
Jiaheng Wei · Hangyu Liu · Tongliang Liu · Gang Niu · Masashi Sugiyama · Yang Liu -
2022 Spotlight: Certifying Out-of-Domain Generalization for Blackbox Functions »
Maurice Weber · Linyi Li · Boxin Wang · Zhikuan Zhao · Bo Li · Ce Zhang -
2021 Poster: Knowledge Enhanced Machine Learning Pipeline against Diverse Adversarial Attacks »
Nezihe Merve Gürel · Xiangyu Qi · Luka Rimanic · Ce Zhang · Bo Li -
2021 Spotlight: Knowledge Enhanced Machine Learning Pipeline against Diverse Adversarial Attacks »
Nezihe Merve Gürel · Xiangyu Qi · Luka Rimanic · Ce Zhang · Bo Li -
2021 Poster: Clusterability as an Alternative to Anchor Points When Learning with Noisy Labels »
Zhaowei Zhu · Yiwen Song · Yang Liu -
2021 Spotlight: Clusterability as an Alternative to Anchor Points When Learning with Noisy Labels »
Zhaowei Zhu · Yiwen Song · Yang Liu -
2021 Poster: Understanding Instance-Level Label Noise: Disparate Impacts and Treatments »
Yang Liu -
2021 Oral: Understanding Instance-Level Label Noise: Disparate Impacts and Treatments »
Yang Liu -
2021 Poster: 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed »
Hanlin Tang · Shaoduo Gan · Ammar Ahmad Awan · Samyam Rajbhandari · Conglong Li · Xiangru Lian · Ji Liu · Ce Zhang · Yuxiong He -
2021 Spotlight: 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed »
Hanlin Tang · Shaoduo Gan · Ammar Ahmad Awan · Samyam Rajbhandari · Conglong Li · Xiangru Lian · Ji Liu · Ce Zhang · Yuxiong He -
2021 Poster: Evolving Attention with Residual Convolutions »
Yujing Wang · Yaming Yang · Jiangang Bai · Mingliang Zhang · Jing Bai · JING YU · Ce Zhang · Gao Huang · Yunhai Tong -
2021 Spotlight: Evolving Attention with Residual Convolutions »
Yujing Wang · Yaming Yang · Jiangang Bai · Mingliang Zhang · Jing Bai · JING YU · Ce Zhang · Gao Huang · Yunhai Tong -
2020 Workshop: Incentives in Machine Learning »
Boi Faltings · Yang Liu · David Parkes · Goran Radanovic · Dawn Song -
2020 : Spotlight Talk 5: Detecting Failure Modes in Image Reconstructions with Interval Neural Network Uncertainty »
Luis Oala -
2020 Poster: Don't Waste Your Bits! Squeeze Activations and Gradients for Deep Neural Networks via TinyScript »
Fangcheng Fu · Yuzheng Hu · Yihan He · Jiawei Jiang · Yingxia Shao · Ce Zhang · Bin Cui -
2020 Poster: Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates »
Yang Liu · Hongyi Guo -
2019 : Networking Lunch (provided) + Poster Session »
Abraham Stanway · Alex Robson · Aneesh Rangnekar · Ashesh Chattopadhyay · Ashley Pilipiszyn · Benjamin LeRoy · Bolong Cheng · Ce Zhang · Chaopeng Shen · Christian Schroeder · Christian Clough · Clement DUHART · Clement Fung · Cozmin Ududec · Dali Wang · David Dao · di wu · Dimitrios Giannakis · Dino Sejdinovic · Doina Precup · Duncan Watson-Parris · Gege Wen · George Chen · Gopal Erinjippurath · Haifeng Li · Han Zou · Herke van Hoof · Hillary A Scannell · Hiroshi Mamitsuka · Hongbao Zhang · Jaegul Choo · James Wang · James Requeima · Jessica Hwang · Jinfan Xu · Johan Mathe · Jonathan Binas · Joonseok Lee · Kalai Ramea · Kate Duffy · Kevin McCloskey · Kris Sankaran · Lester Mackey · Letif Mones · Loubna Benabbou · Lynn Kaack · Matthew Hoffman · Mayur Mudigonda · Mehrdad Mahdavi · Michael McCourt · Mingchao Jiang · Mohammad Mahdi Kamani · Neel Guha · Niccolo Dalmasso · Nick Pawlowski · Nikola Milojevic-Dupont · Paulo Orenstein · Pedram Hassanzadeh · Pekka Marttinen · Ramesh Nair · Sadegh Farhang · Samuel Kaski · Sandeep Manjanna · Sasha Luccioni · Shuby Deshpande · Soo Kim · Soukayna Mouatadid · Sunghyun Park · Tao Lin · Telmo Felgueira · Thomas Hornigold · Tianle Yuan · Tom Beucler · Tracy Cui · Volodymyr Kuleshov · Wei Yu · yang song · Ydo Wexler · Yoshua Bengio · Zhecheng Wang · Zhuangfang Yi · Zouheir Malki -
2019 Poster: Fairness without Harm: Decoupled Classifiers with Preference Guarantees »
Berk Ustun · Yang Liu · David Parkes -
2019 Poster: Distributed Learning over Unreliable Networks »
Chen Yu · Hanlin Tang · Cedric Renggli · Simon Kassing · Ankit Singla · Dan Alistarh · Ce Zhang · Ji Liu -
2019 Oral: Fairness without Harm: Decoupled Classifiers with Preference Guarantees »
Berk Ustun · Yang Liu · David Parkes -
2019 Oral: Distributed Learning over Unreliable Networks »
Chen Yu · Hanlin Tang · Cedric Renggli · Simon Kassing · Ankit Singla · Dan Alistarh · Ce Zhang · Ji Liu -
2019 Poster: Exploring the Landscape of Spatial Robustness »
Logan Engstrom · Brandon Tran · Dimitris Tsipras · Ludwig Schmidt · Aleksander Madry -
2019 Oral: Exploring the Landscape of Spatial Robustness »
Logan Engstrom · Brandon Tran · Dimitris Tsipras · Ludwig Schmidt · Aleksander Madry -
2019 Poster: DL2: Training and Querying Neural Networks with Logic »
Marc Fischer · Mislav Balunovic · Dana Drachsler-Cohen · Timon Gehr · Ce Zhang · Martin Vechev -
2019 Oral: DL2: Training and Querying Neural Networks with Logic »
Marc Fischer · Mislav Balunovic · Dana Drachsler-Cohen · Timon Gehr · Ce Zhang · Martin Vechev -
2018 Poster: On the Limitations of First-Order Approximation in GAN Dynamics »
Jerry Li · Aleksander Madry · John Peebles · Ludwig Schmidt -
2018 Oral: On the Limitations of First-Order Approximation in GAN Dynamics »
Jerry Li · Aleksander Madry · John Peebles · Ludwig Schmidt -
2018 Poster: A Classification-Based Study of Covariate Shift in GAN Distributions »
Shibani Santurkar · Ludwig Schmidt · Aleksander Madry -
2018 Poster: Asynchronous Decentralized Parallel Stochastic Gradient Descent »
Xiangru Lian · Wei Zhang · Ce Zhang · Ji Liu -
2018 Poster: $D^2$: Decentralized Training over Decentralized Data »
Hanlin Tang · Xiangru Lian · Ming Yan · Ce Zhang · Ji Liu -
2018 Oral: A Classification-Based Study of Covariate Shift in GAN Distributions »
Shibani Santurkar · Ludwig Schmidt · Aleksander Madry -
2018 Oral: $D^2$: Decentralized Training over Decentralized Data »
Hanlin Tang · Xiangru Lian · Ming Yan · Ce Zhang · Ji Liu -
2018 Oral: Asynchronous Decentralized Parallel Stochastic Gradient Descent »
Xiangru Lian · Wei Zhang · Ce Zhang · Ji Liu -
2017 Poster: ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning »
Hantian Zhang · Jerry Li · Kaan Kara · Dan Alistarh · Ji Liu · Ce Zhang -
2017 Talk: ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning »
Hantian Zhang · Jerry Li · Kaan Kara · Dan Alistarh · Ji Liu · Ce Zhang