Workshop
Subset Selection in Machine Learning: From Theory to Applications
Rishabh Iyer · Abir De · Ganesh Ramakrishnan · Jeff Bilmes
A growing number of machine learning problems involve finding subsets of data points. Examples range from selecting subset of labeled or unlabeled data points, to subsets of features or model parameters, to selecting subsets of pixels, keypoints, sentences etc. in image segmentation, correspondence and summarization problems. The workshop would encompass a wide variety of topics ranging from theoretical aspects of subset selection e.g. coresets, submodularity, determinantal point processes, to several practical applications, {\em e.g.}, time and energy efficient learning, learning under resource constraints, active learning, human assisted learning, feature selection, model compression, feature induction, {\em etc.}
We believe that this workshop is very timely since, a) subset selection is naturally emerging and has often been considered in isolation in many of the above applications, and b) by connecting researchers working on both the theoretical and application domains above, we can foster a much needed discussion on reusing a several technical innovations across these subareas and applications. Furthermore, we would also like to connect researchers working on the theoretical foundations of subset selection (in areas such as coresets and submodularity) with researchers working in applications (such as feature selection, active learning, data efficient learning, model compression, and human assisted machine learning).
Schedule
Sat 6:15 a.m.  6:30 a.m.

Introduction by the Organizers
(
Live Intro
)
SlidesLive Video In this introduction, we will cover:

Abir De · Rishabh Iyer · Ganesh Ramakrishnan · Jeff Bilmes 🔗 
Sat 6:30 a.m.  7:00 a.m.

Introduction to Coresets and Open Problems
(
Invited Talk
)
SlidesLive Video When we want to discover patterns in data, we usually use the best available algorithm/software or try to improve it. In recent years we have started exploring a different approach: instead of improving the algorithm, reduce the input data and run the existing algorithm on the reduced data to obtain the desired output much faster. A coreset for a given problem is a semantic compression of its input, in the sense that a solution for the problem with the (small) coreset as input yields an approximate solution to the problem with the original (Big) data. Using existing techniques it can be computed on a streaming input that may be distributed among several machines on device, and in parallel (e.g. clouds or GPUs). 
Dan Feldman 🔗 
Sat 7:00 a.m.  7:25 a.m.

Differentiable learning Under Algorithmic Triage
(
Invited Talk
)
SlidesLive Video In recent years, there have a raising interest on learning under algorithmic triage, a new learning paradigm which seeks the development of machine learning models that operate under different automation levelsmodels that take decisions for a given fraction of instances and leave the remaining ones to human experts. However, the interplay between the prediction accuracy of the models and the human experts under algorithmic triage is not well understood. In this talk, we will start by formally characterizing under which circumstances a predictive model may benefit from algorithmic triage. In doing so, we will also demonstrate that models trained for full automation may be suboptimal under triage. Then, given any model and desired level of triage, we will show that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically by thresholding the difference between the model and human errors on a perinstance level. Building upon these results, we will introduce a practical gradientbased algorithm that is guaranteed to find a sequence of triage policies and predictive models of increasing performance. Finally, we will use real data from two important applications, content moderation and scientific discovery, to illustrate our theoretical results and show that the models and triage policies provided by our gradientbased algorithm outperform those provided by several competitive baselines. 
Manuel GomezRodriguez 🔗 
Sat 7:25 a.m.  7:30 a.m.

Differentiable learning Under Algorithmic Triage Q&A
(
Live Q&A
)

🔗 
Sat 7:30 a.m.  7:55 a.m.

Data Summarization via Bilevel Coresets
(
Invited Talk
)
SlidesLive Video Coresets are small data summaries that are sufficient for model training. They can be maintained online, enabling efficient handling of large data streams under resource constraints. However, existing constructions are limited to simple models such as kmeans and logistic regression. In this work, we propose a novel coreset construction via cardinalityconstrained bilevel optimization. We show how our framework can efficiently generate coresets for deep neural networks, and demonstrate its empirical benefits in continual and streaming deep learning, as well as active semisupervised learning. Joint work with Zalán Borsos, Mojmír Mutny and Marco Tagliasacchi 
Andreas Krause 🔗 
Sat 7:55 a.m.  8:00 a.m.

Data Summarization via Bilevel Coresets: Live Q&A
(
Live Q&A
)

🔗 
Sat 8:00 a.m.  8:25 a.m.

Learning Constraints from Examples
(
Invited Talk
)
SlidesLive Video While constraints are ubiquitous in artificial intelligence and constraints are also commonly used in machine learning and data mining, the problem of learning constraints from examples has received less attention. I will discuss the problem of constraint learning in general, and present some recent contributions in the learning of various types of constraints and models for combinatorial optimisation. This includes the learning of formulae in Excel, learning SMT formulae, MAXSAT and Linear Programs. 
Luc De Raedt 🔗 
Sat 8:25 a.m.  8:30 a.m.

Learning Constraints from Examples Live Q&A
(
Live Q&A
)

🔗 
Sat 8:30 a.m.  8:51 a.m.

Greedy and Its Friends
(
Invited Talk
)
SlidesLive Video In this talk, I will introduce 3 close friends of the greedy algorithm who can maximize a general submodular function (monotone or not) subject to very general constraints. They come in different flavors and guarantees. 
Amin Karbasi 🔗 
Sat 8:51 a.m.  9:00 a.m.

Greedy and Its Friends Live Q&A
(
Live Q&A
)

🔗 
Sat 9:00 a.m.  9:30 a.m.

Poster Session 1
(
Poster Session
)
We will be using gather town for the poster sessions. The links are below. Room 1: [ protected link dropped ] Room 2: [ protected link dropped ] 
🔗 
Sat 9:30 a.m.  10:30 a.m.

Panel Discussion on Subset Selection for ML Problems in the Real World (Speakers, Organizers, and a few more invited panelists)
(
Panel Discussion
)
SlidesLive Video Panelists: Andreas Krause Dan Feldman Amin Karbasi Luc De Raedt Manuel Gomez Rodriguez Rajiv Khanna Baharan Mirzasoleiman Cody Coleman Matthai Phillipose Gaurav Aggarwal 
🔗 
Sat 10:30 a.m.  10:44 a.m.

Benchmarks and Toolkits for Data Subset Selection in ML through DECILE: Part I
(
Invited Talk
)
link
SlidesLive Video In this talk (Part I and II), we will cover the different functionalities of DECILE (www.decile.org) which include modules like a) SUBMODLIB, b) CORDS, c) TRUST, d) DISTIL, e) SPEAR. SUBMODLIB is a library for submodular optimization which implements a number of submodular optimization algorithms and functions (including the submodular mutual information and conditional gain functions). CORDS is a library for data subset selection and coresets for computeefficient training of deep models. TRUST is targeted subset selection for personalization and model remediation. DISTIL is an active learning toolkit for deep models and SPEAR is a library for weak supervision via labeling functions. We will also focus on the different SOTA algorithms implemented and the benchmarks enabled through these toolkits. In Part I, we will cover submodlib (a toolkit for submodular optimization), and CORDS (a toolkit for data subset selection and coresets for efficient training of deep models). CORDS: https://github.com/decileteam/cords/ DISTIL: https://github.com/decileteam/distil/ TRUST: https://github.com/decileteam/trust/ SubmodLib: https://github.com/decileteam/submodlib/ SPEAR: https://github.com/decileteam/spear/ 
Rishabh Iyer 🔗 
Sat 10:44 a.m.  10:58 a.m.

Benchmarks and Toolkits for Data Subset Selection in ML through DECILE: Part II
(
Invited Talk
)
link
SlidesLive Video In this talk (Part I and II), we will cover the different functionalities of DECILE (www.decile.org) which include modules like a) SUBMODLIB, b) CORDS, c) TRUST, d) DISTIL, e) SPEAR. SUBMODLIB is a library for submodular optimization which implements a number of submodular optimization algorithms and functions (including the submodular mutual information and conditional gain functions). CORDS is a library for data subset selection and coresets for computeefficient training of deep models. TRUST is targeted subset selection for personalization and model remediation. DISTIL is an active learning toolkit for deep models and SPEAR is a library for weak supervision via labeling functions. We will also focus on the different SOTA algorithms implemented and the benchmarks enabled through these toolkits. In Part II, we will cover TRUST, DISTIL, and SPEAR. CORDS: https://github.com/decileteam/cords/ DISTIL: https://github.com/decileteam/distil/ TRUST: https://github.com/decileteam/trust/ SubmodLib: https://github.com/decileteam/submodlib/ SPEAR: https://github.com/decileteam/spear/ 
Ganesh Ramakrishnan 🔗 
Sat 10:58 a.m.  11:00 a.m.

Benchmarks and Toolkits for Data Subset Selection in ML through DECILE: Live Q&A
(
Live Q&A
)

🔗 
Sat 11:00 a.m.  11:20 a.m.

More Information, Less Data
(
Invited Talk
)
link
SlidesLive Video In this talk, we introduce several features of Summary Analytics, a new company that makes summarizationlike abilities accessible to the masses, and scalable to petabytesize datasets. We demonstrate wide applicability to different data modalities via a number of case studies. We demonstrate scalability by showing that it is possible to summarize 100 million records, each with 1700 features, down to 1000 records in 60 seconds using one CPU thread, and 14 seconds using four CPU threads, on a 2019era laptop. 
Jeff Bilmes 🔗 
Sat 11:20 a.m.  11:30 a.m.

More Information, Less Data: Q&A Session
(
Live Q&A
)

🔗 
Sat 11:30 a.m.  11:50 a.m.

Theory of feature selection
(
Invited Talk
)
SlidesLive Video In this talk, I will present novel theoretical results to bridge the gap in theory and practice for interpretable dimensionality reduction aka feature selection. Specifically, I will show that feature selection satisfies a weaker form of submodularity. Because of this connection, for any function, one can provide constantfactor approximation guarantees that are solely dependent on the condition number of the function. Moreover, I will discuss that the "cost of interpretability" accrued because of selecting features as opposed to principal components is not as high as was previously thought to be. 
Rajiv Khanna 🔗 
Sat 11:50 a.m.  12:00 p.m.

Theory of feature selection Live Q&A
(
Live Q&A
)

🔗 
Sat 12:00 p.m.  12:04 p.m.

Online and Non Parametric Coresets for Bregman Divergence
(
Spotlight
)
SlidesLive Video
We present algorithms that create coresets for clustering problems according to a wide subset of Bregman divergences. Our coresets have a small additive error, similar in magnitude to the lightweight coresets~\cite{bachem2018scalable}, and take an update time $O(d)$. We first give an algorithm that returns {\em online} coresets of size $\tilde{O}((dk\log k)/\epsilon^2\mu^2)$ for $k$clustering according to any $\mu$similar Bregman divergence. We also show the existence of {\em nonparametric} coresets, i.e. the coreset size is independent of the number of clusters $k$, as well the dimension of points $d$, for the same subclass of Bregman divergences.

Supratim Shit · Rachit Chhaya · Anirban Dasgupta · Jayesh Choudhari 🔗 
Sat 12:04 p.m.  12:09 p.m.

Unconstrained Submodular Maximization with Modular Costs: Tight Approximation and Application to Profit Maximization
(
Spotlight
)
SlidesLive Video Given a set V, the problem of unconstrained submodular maximization with modular costs (USMMC) asks for a subset S \subseteq V that maximizes f(S)  c(S), where f is a nonnegative, monotone, and submodular function that gauges the utility of S, and c is a nonnegative and modular function that measures the cost of S. This problem finds applications in numerous practical scenarios, such as profit maximization in viral marketing on social networks. This paper presents ROIGreedy, a polynomial time algorithm for USMMC that returns a solution S satisfying f(S)  c(S) >= f(S)  c(S)  c(S)\ln(f(S)/c(S)), where S is the optimal solution to USMMC. To our knowledge, ROIGreedy is the first algorithm that provides such a strong approximation guarantee. In addition, We show that this worstcase guarantee is tight, in the sense that no polynomial time algorithm can ensure f(S)  c(S) >= (1+\epsilon)(f(S)  c(S)  c(S) \ln(f(S)/c(S*)), for any \epsilon > 0. Further, we devise a nontrivial extension of ROIGreedy to solve the profit maximization problem, where the precise value of f(S) for any set S is unknown and can only be approximated via sampling. Extensive experiments on benchmark datasets demonstrate that ROIGreedy significantly outperforms competing methods in terms of the tradeoff between efficiency and solution quality. 
Tianyuan Jin · Yu Yang · Renchi Yang · Jieming Shi · Keke Huang · Xiaokui Xiao 🔗 
Sat 12:09 p.m.  12:14 p.m.

SVPCF: Selection via Proxy for Collaborative Filtering Data
(
Spotlight
)
SlidesLive Video We study the practical consequences of dataset sampling strategies on the performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or adhoc fashion: e.g. by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonlyused data sampling schemes can have significant consequences on algorithm performancemasking performance deficiencies in algorithms or altering the relative performance of algorithms, as compared to models trained on the complete dataset. Following this observation, this paper makes the following main contributions: (1) characterizing the effect of sampling on algorithm performance, in terms of algorithm and dataset characteristics (e.g. sparsity characteristics, sequential dynamics, etc); and (2) designing SVPCF, which is a dataspecific sampling strategy, that aims to preserve the relative performance of models after sampling, and is especially suited to longtail interaction data. Detailed experiments show that SVPCF is more accurate than commonly used sampling schemes in retaining the relative ranking of different recommendation algorithms. 
Noveen Sachdeva · Julian McAuley · CaroleJean Wu 🔗 
Sat 12:14 p.m.  12:19 p.m.

Bayesian decision analysis for collecting nearlyoptimal subsets
(
Spotlight
)
SlidesLive Video Subset selection is valuable for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often eschewed due to instability and computational bottlenecks. We address these challenges using Bayesian decision analysis. For any Bayesian predictive model, we elicit an acceptable family of subsets that provide nearlyoptimal linear prediction. The Bayesian model regularizes the linear coefficients and propagates uncertainty quantification for key predictive comparisons. The acceptable family spawns new variable importance metrics based on whether a variable appear in all, some, or no acceptable subsets. The proposed approach exhibits excellent prediction and stability for both simulated data (including p = 400 > n) and a large education dataset with highly correlated covariates. 
Daniel Kowal 🔗 
Sat 12:19 p.m.  12:23 p.m.

Fast Estimation Method for the Stability of Ensemble Feature Selectors
(
Spotlight
)
SlidesLive Video It is preferred that feature selectors be \textit{stable} for better interpretabity and robust prediction. Ensembling is known to be effective for improving the stability of feature selectors. Since ensembling is timeconsuming, it is desirable to reduce the computational cost to estimate the stability of the ensemble feature selectors. % for the data we use. We propose a simulator of a feature selector, and apply it to a fast estimation of the stability of ensemble feature selectors. To the best of our knowledge, this is the first study that estimates the stability of ensemble feature selectors and reduces the computation time theoretically and empirically. 
Rina Onda · Kenta Oono 🔗 
Sat 12:23 p.m.  12:27 p.m.

Selective Focusing Learning in Conditional GANs
(
Spotlight
)
SlidesLive Video Conditional generative adversarial networks (cGANs) have demonstrated remarkable success due to their classwise controllability and superior quality for complex generation tasks. Typical cGANs solve the joint distribution matching problem by decomposing two easier subproblems: marginal matching and conditional matching. From our toy experiments, we found that it is the best to apply only conditional matching to certain samples due to the contentaware optimization of the discriminator. This paper proposes a simple (a few lines of code) but effective training methodology, selective focusing learning, which enforces the discriminator and generator to learn easy samples of each class rapidly while maintaining diversity. Our key idea is to selectively apply conditional and joint matching for the data in each minibatch. We conducted experiments on recent cGAN variants in ImageNet (64x64 and 128x128), CIFAR10, and CIFAR100 datasets, and improved the performance significantly (up to 35.18% in terms of FID) without sacrificing diversity. 
Kyeongbo Kong · Kyunghun Kim · Woojin Song · SukJu Kang 🔗 
Sat 12:27 p.m.  12:32 p.m.

Kernel Thinning
(
Spotlight
)
SlidesLive Video
We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning.
Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$point approximation to $\mathbb{P}$ into a $\sqrt{n}$point approximation with comparable worstcase integration error across the associated reproducing kernel Hilbert space.
With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for subexponential $\mathbb{P}$ on $\mathbb{R}^d$.
In contrast, an equalsized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{\frac14})$ integration error.
Our subexponential guarantees resemble
the classical quasiMonte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$
and a wide range of common kernels.
We use our results to derive explicit nonasymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern, and Bspline kernels
and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.

Raaz Dwivedi · Lester Mackey 🔗 
Sat 12:32 p.m.  12:37 p.m.

Multiplecriteria Based Active Learning with Fixedsize Determinantal Point Processes
(
Spotlight
)
SlidesLive Video Active learning (AL) aims to achieve greater accuracy with less training data by selecting the most useful data samples from which it learns. Singlecriterion (i.e., informativeness and representativeness) based methods are simple and efficient; however, they lack adaptability to different realworld scenarios. In this paper, we introduce a multiplecriteria based AL algorithm, which incorporates three complementary criteria, i.e., informativeness, representativeness and diversity, to make appropriate selections in the AL rounds under different data types. We consider the selection process as a Determinantal Point Process, which good balance among these criteria. We refine the query selection strategy by both selecting the hardest unlabeled data sample and biasing towards the classifiers that are more suitable for the current data distribution. In addition, we also consider the dependencies and relationships between these data points in data selection by means of centroidbased clustering approaches. 
Xueying ZHAN · Qing Li · Antoni Chan 🔗 
Sat 12:37 p.m.  12:42 p.m.

Coresets for Classification – Simplified and Strengthened
(
Spotlight
)
SlidesLive Video
We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $(1\pm \epsilon)$ relative error with $\tilde O(d \cdot \mu_y(X)^2/\epsilon^2)$ points, where $\mu_y(X)$ is a natural complexity measure of the data matrix $X \in \R^{n \times d}$ and label vector $y \in \{1,1\}^n$, introduced in \cite{munteanu2018coresets}. Our result is based on subsampling data points with probabilities proportional to their \emph{$\ell_1$ Lewis weights}. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios.

Anup Rao · Tung Mai · Cameron Musco 🔗 
Sat 12:42 p.m.  12:47 p.m.

Using Machine Learning to Recognise Statistical Dependence
(
Spotlight
)
SlidesLive Video Recognising statistical dependence of variables is a fundamentally prominent task in modelling data relatedness, having impact on multiple data disciplines. However, it is a peculiarly challenging task often requiring manual scrutiny of data in order to uncover the correct relatedness function. Inspired by this human ability, we cast the named task as machine learning problem. Solving such problem results in models encoding abstract meaning of statistical dependence. In this paper, we report overwhelmingly and significantly superior performance to known models, especially when few data points are available and/or high noise is present. Such novel view of the named task is transformative for numerous data disciplines such as unsupervised learning, feature selection and causal inference. 
Ubai Sandouk 🔗 
Sat 12:47 p.m.  12:52 p.m.

Sparsifying Transformer Models with Trainable Representation Pooling
(
Spotlight
)
SlidesLive Video
We propose a novel method to sparsify attention in the Transformer model by learning to select the mostinformative token representations during the training process, thus focusing on the taskspecific parts of an input.
A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top$k$ operator.
Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with the trainable selection we can retain its top quality while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $16\times$ more computationally efficient in the decoder.
The method can be effortlessly applied to many models used in NLP and CV.

Michał Pietruszka · Łukasz Borchmann · Łukasz Garncarek 🔗 
Sat 12:52 p.m.  12:57 p.m.

Continual Learning via FunctionSpace Variational Inference: A Unifying View
(
Spotlight
)
SlidesLive Video Continual learning is the process of developing new abilities while retaining existing ones. Sequential Bayesian inference is a natural framework for this, but applying it successfully to deep neural networks remains a challenge. We propose continual functionspace variational inference (CFSVI), in which the variational distribution over functions induced by stochastic model parameters is encouraged to match the variational distribution over functions induced by stochastic parameters inferred on previous tasks. Unlike approaches that explicitly penalize changes in the model parameters, functionspace regularization allows parameters to vary widely during training, resulting in greater flexibility to fit new data. CFSVI improves on existing approaches to functionspace regularization by performing inference entirely in function space and without relying on carefully selected coreset points. We show that CFSVI outperforms alternative methods based on parameterspace and functionspace regularization on a range of tasks. 
Tim G. J. Rudner · Freddie Bickford Smith · Qixuan Feng · YeeWhye Teh · Yarin Gal 🔗 
Sat 12:57 p.m.  1:02 p.m.

Active Learning under Pool Set Distribution Shift and Noisy Data
(
Spotlight
)
SlidesLive Video Bayesian Active Learning has focused on BALD, which reduces model parameter uncertainty. However, we show that BALD gets stuck on outofdistribution or junk data that is not relevant for the task. We examine a novel Expected Predictive Information Gain (EPIG) to deal with distribution shifts of the pool set. EPIG reduces the uncertainty of predictions on an unlabelled evaluation set sampled from the test data distribution whose distribution might be different to the pool set distribution. Based on this, our new EPIGBALD acquisition function for Bayesian Neural Networks selects samples to improve the performance on the test data distribution instead of selecting samples that reduce model uncertainty everywhere, including for outofdistribution regions with low density in the test data distribution. Our method outperforms stateoftheart Bayesian active learning methods on highdimensional datasets and avoids outofdistribution junk data in cases where current stateoftheart methods fail. 
Andreas Kirsch · Tom Rainforth · Yarin Gal 🔗 
Sat 1:02 p.m.  1:07 p.m.

Sparse Bayesian Learning via Stepwise Regression
(
Spotlight
)
SlidesLive Video Sparse Bayesian Learning (SBL) is a powerful paradigm for attaining sparsity in probabilistic models. Herein, we propose a coordinate ascent algorithm for SBL termed Relevance Matching Pursuit (RMP) and show that, as its noise variance parameter goes to zero, RMP exhibits a surprising connection to Stepwise Regression. Further, we derive novel guarantees for Stepwise Regression algorithms, which also shed light on RMP. Our guarantees for Forward Regression improve on deterministic and probabilistic results for Orthogonal Matching Pursuit with noise. Our analysis of Backward Regression on determined systems culminates in a bound on the residual of the optimal solution to the {\it subset selection} problem that, if satisfied, guarantees the {\it optimality} of the result. To our knowledge, this bound is the first that can be computed in polynomial time and depends chiefly on the smallest singular value of the matrix. 
Sebastian Ament · Carla Gomes 🔗 
Sat 1:07 p.m.  1:10 p.m.

Mitigating Memorization in Sample Selection for Learning with Noisy Labels
(
Spotlight
)
SlidesLive Video Because deep learning is vulnerable to noisy labels, sample selection techniques, which train networks with only clean labeled data, have attracted a great attention. However, if the labels are dominantly corrupted by few classes, these noisy samples are called dominantnoisylabeled samples, the network also learns dominantnoisylabeled samples rapidly via contentaware optimization. In this study, we propose a compelling criteria to penalize dominantnoisylabeled samples intensively through classwise penalty labels. By averaging prediction confidences for the each observed label, we obtain suitable penalty labels that have high values if the labels are largely corrupted by some classes. Experiments were performed using benchmarks (CIFAR10, CIFAR100, TinyImageNet) and realworld datasets (ANIMAL10N, Clothing1M) to evaluate the proposed criteria in various scenarios with different noise rates. Using the proposed sample selection, the learning process of the network becomes significantly robust to noisy labels compared to existing methods in several noise types. 
Kyeongbo Kong · Junggi Lee · Youngchul Kwak · YoungRae Cho · SeongEun Kim · Woojin Song 🔗 
Sat 1:10 p.m.  1:59 p.m.

Poster Session 2
(
Poster Session
)
We will be using gather town for the poster sessions. The links are below. Room 1: [ protected link dropped ] Room 2: [ protected link dropped ] 
🔗 
Sat 1:59 p.m.  2:00 p.m.

Introduction to Invited Talk
(
Live Intro
)

🔗 
Sat 2:00 p.m.  2:29 p.m.

Dataefficient and Robust Learning from Massive Datasets
(
Invited Talk
)
SlidesLive Video Large datasets have been crucial to the success of modern machine learning models. However, training on massive data has two major limitations. First, it is contingent on exceptionally large and expensive computational resources, and incurs a substantial cost due to the significant energy consumption. Second, in many realworld applications such as medical diagnosis, selfdriving cars, and fraud detection, big data contains highly imbalanced classes and noisy labels. In such cases, training on the entire data does not result in a highquality model. In this talk, I will argue that we can address the above limitations by developing techniques that can identify and extract the representative subsets for learning from massive datasets. Training on representative subsets not only reduces the substantial costs of learning from big data, but also improves their accuracy and robustness against noisy labels. I will discuss how we can develop theoretically rigorous techniques that provide strong guarantees for the quality of the extracted subsets, as well as the learned models’ quality and robustness against noisy labels. I will also show the effectiveness of such methods in practice for dataefficient and robust learning. 
Baharan Mirzasoleiman 🔗 
Sat 2:29 p.m.  2:30 p.m.

Dataefficient and Robust Learning from Massive Datasets Live Q&A
(
Live Q&A
)

🔗 
Sat 2:30 p.m.  2:50 p.m.

Computationally Efficient Data Selection for Deep Learning
(
Invited Talk
)
SlidesLive Video Data selection methods, such as active learning and coreset selection, improve the data efficiency of machine learning by identifying the most informative data points to label or train on. Across the data selection literature, there are many ways to identify these training examples. However, classical data selection methods are prohibitively expensive to apply in deep learning because of the larger datasets and models. To make these methods tractable, we propose (1) “selection via proxy” (SVP) to avoid expensive training and reduce the computation per example and (2) “similarity search for efficient active learning and search” (SEALS) to reduce the number of examples processed. Both methods lead to order of magnitude performance improvements, making techniques like active learning on billions of unlabeled images practical for the first time. 
Cody Coleman 🔗 
Sat 2:50 p.m.  3:00 p.m.

Computationally Efficient Data Selection for Deep Learning Live Q&A
(
Live Q&A
)

🔗 
Sat 3:00 p.m.  3:05 p.m.

HighDimensional Variable Selection and NonLinear Interaction Discovery in Linear Time
(
Spotlight
)
SlidesLive Video
Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. The Bayesian framework makes it straightforward to simultaneously express sparsity, nonlinearity, and interactions in a hierarchical model. But, as for the few other methods that handle this trifecta, inference is computationally intractable  with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We first show that suitable Bayesian models can be represented as Gaussian processes (GPs). We then demonstrate how a kernel trick can reduce computation with these GPs to $O$(\# covariates) time for both variable selection and estimation. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lowerorder effects. On a variety of synthetic and real datasets, our approach outperforms existing methods used for large, highdimensional datasets while remaining competitive (or being orders of magnitude faster) in runtime.

Raj Agrawal · Tamara Broderick 🔗 
Sat 3:05 p.m.  3:10 p.m.

Errordriven FixedBudget ASR Personalization for Accented Speakers
(
Spotlight
)
SlidesLive Video We consider the task of personalizing ASR models while being constrained by a fixed budget on recording speaker specific utterances. Given a speaker and an ASR model, we propose a method of identifying sentences for which the speaker's utterances are likely to be harder for the given ASR model to recognize. We assume a tiny amount of speakerspecific data to learn phonemelevel error models which help us select such sentences. We show that speaker's utterances on the sentences selected using our error model indeed have larger error rates when compared to speaker's utterances on randomly selected sentences. We find that finetuning the ASR model on the sentence utterances selected with the help of error models yield higher WER improvements in comparison to finetuning on an equal number of randomly selected sentence utterances. Thus, our method provides an efficient way of collecting speaker utterances under budget constraints for personalizing ASR models. 
Abhijeet Awasthi · Sunita Sarawagi · Preethi Jyothi 🔗 
Sat 3:10 p.m.  3:15 p.m.

MISNN: Multiple Imputation via Semiparametric Neural Networks
(
Spotlight
)
SlidesLive Video
Multiple imputation (MI) has been widely applied to missing value problems in biomedical, social and econometric research, in order to avoid improper inference in the downstream data analysis. In the presence of highdimensional data, imputation models that include feature selection, especially $\ell_1$ regularized regression (such as Lasso, adaptive Lasso and Elastic Net), are common choices to prevent the model from underdetermination. However, conducting MI with feature selection is difficult: existing methods are often computationally inefficient and poor in performance. We propose MISNN, a novel and efficient algorithm that incorporates feature selection for MI. Leveraging the approximation power of neural networks, MISNN is a general and flexible framework, compatible to any feature selection method, any neural network architecture, high/lowdimensional data and general missing patterns. Through empirical experiments, MISNN has demonstrated great advantages over stateoftheart imputation methods (e.g. Bayesian Lasso and matrix completion), in terms of imputation accuracy, statistical consistency and computation speed.

Zhiqi Bu · Zongyu Dai · Yiliang Zhang · Qi Long 🔗 
Sat 3:15 p.m.  3:20 p.m.

Towards Active Air Quality Station Deployment
(
Spotlight
)
SlidesLive Video Air pollution is a global problem and has a severe impact on human health. Finegrained air quality (AQ) monitoring is important in mitigating air pollution. However, existing AQ station deployments are sparse due to installation and operational costs. In this work, we propose an Active Learningbased method for air quality station deployment. We use Gaussian processes and several classical machine learning algorithms (Random Forest, KNeighbors and Support Vector Machine) to benchmark several active learning strategies for two realworld air quality datasets (Delhi, India and Beijing, China). 
Zeel B Patel · Nipun Batra 🔗 
Sat 3:20 p.m.  3:25 p.m.

Coreset Sampling for Efficient Neural Architecture Search
(
Spotlight
)
SlidesLive Video Neural architecture search (NAS), an important branch of automatic machine learning, has become an effective approach to automate the design of deep learning models. However, the major issue in NAS is how to reduce the large search time imposed by the heavy computational burden. While most recent approaches focus on pruning redundant sets or developing new search methodologies, this paper attempts to formulate the problem based on the data curation manner. Our key strategy is to search the architecture using summarized data distribution, i.e., coreset. Typically, many NAS algorithms separate searching and training stages, and the proposed coreset methodology is only used in search stage, thus their performance degradation can be minimized. In our experiments, we were able to save overall computational time from 30.8 hours to 3.5 hours, 8.8x reduction, on a single RTX 3090 GPU without sacrificing accuracy. 
Jaehun Shim · Kyeongbo Kong · SukJu Kang 🔗 
Sat 3:25 p.m.  3:30 p.m.

On Coresets For Fair Regression
(
Spotlight
)
SlidesLive Video
In this work we present an algorithm to construct coresets for fair regression
for a dataset of $n$ points in $d$ dimensions and having some discretevalued protected attribute. For (fair) regression with statistical parity we first define the notion of a coreset and give a coreset that incurs only a small additive error in satisfying the constraints and a relative error in the objective function. The coreset size is linear in $\ell$, the number of unique values the protected attribute can take, and the dependence on dimension is near linear ($d \log{d}$). To the best of our knowledge this is the first coreset for fair regression problem with statistical parity. We also empirically demonstrate the performance of our coresets on real data sets.

Rachit Chhaya · Anirban Dasgupta · Supratim Shit · Jayesh Choudhari 🔗 
Sat 3:30 p.m.  3:35 p.m.

A Comparison of Contextual and NonContextual Preference Ranking for Set Addition Problems
(
Spotlight
)
SlidesLive Video In this paper, we study the problem of evaluating the addition of elements to a set. This problem is difficult, because it can, in the general case, not be reduced to unconditional preferences between the choices. Therefore, we model preferences based on the context of the decision. We discuss and compare two different Siamese network architectures for this task: a twin network that compares the two sets resulting after the addition, and a triplet network that models the contribution of each candidate to the existing set. We evaluate the two settings on a realworld task; learning human card preferences for deck building in the collectible card game Magic: The Gathering. We show that the triplet approach achieves a better result than the twin network and that both outperform previous results on this task. 
Timo Bertram · Johannes Fürnkranz · Martin Müller 🔗 
Sat 3:35 p.m.  3:40 p.m.

Statistical Measures For Defining Curriculum Scoring Function
(
Spotlight
)
SlidesLive Video Curriculum learning is a training strategy that sorts the training examples by some measure of their difficulty and gradually exposes them in that order to the learner to improve the network performance. In this work, we first propose and study a dynamic curriculum learning algorithm. Our dynamic curriculum algorithm greedily samples a subset of the training data whose gradients are directed towards the optimal weight. Motivated by our insights from the dynamic curriculum learning ordering and implicit curriculum ordering, we introduce a simple, practical curriculum learning strategy that uses statistical measures such as standard deviation and entropy values to score the difficulty of data points for real image classification tasks. Further, we also use our algorithms to discuss why curriculum learning is helpful. 
Vinu Sankar Sadasivan · Anirban Dasgupta 🔗 
Sat 3:40 p.m.  3:45 p.m.

An Extreme Point Approach to Subset Selection
(
Spotlight
)
SlidesLive Video Given a training data set, we develop a method to obtain a representative subset of training data meant for classification. Our focus is on subsets that are particularly suitable for training linear classification models. Our method is to reduce data sets in a manner that essentially preserves the convex hulls of the various classes of the data set. We develop scalable randomized algorithms for convex hull approximation, and obtain rigorous guarantees for models trained on subsets of linearly separable data sets. We empirically demonstrate the effectiveness of our approach for both training and data augmentation on the MNIST data set. 
Viveck Cadambe · Bill Kay 🔗 
Sat 3:45 p.m.  3:50 p.m.

Tighter mDPP Coreset Sample Complexity Bounds
(
Spotlight
)
SlidesLive Video Many machine learning problems involve processing large datasets, thus making them expensive computational tasks. One mitigation strategy selects representative subsets of the data and one way to achieve this is via coresets~ a weighted subset of the given data, such that the cost function under consideration can be additively and/or multiplicatively approximated. While much previous work chooses these subsets using independent importancebased sampling, such methods can be suboptimal when the system possesses redundancy. Recently, Tremblay et al. 2019 considered sampling from fixedsized determinantal point process (DPP) and gave lower bounds on the coreset sample complexity. While DPPs often empirically outperform importancebased sampling methods, their coreset size bound is worse compared to independent sampling. In this work, we extend their theoretical results by improving the lower bound on sample complexity for coreset approximation by a log(1 / epsilon) factor using chaining, a technique in empirical process literature. This, therefore, moves a step further in bridging the gap between theoretical justification and empirical observations. We provide additive approximation guarantees for the associated cost function and allow the possibility of negative weights, which has a direct application in layerwise pruning of deep neural networks. On top of additive approximation guarantees, we also extend our bounds to the scenarios with multiplicative approximations. 
Gantavya Bhatt · Jeff Bilmes 🔗 
Sat 3:50 p.m.  3:55 p.m.

SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios
(
Spotlight
)
SlidesLive Video Active learning has proven to be useful for minimizing labeling costs by selecting the most informative samples. However, existing active learning methods do not work well in realistic scenarios such as imbalance or rare classes, outofdistribution data in the unlabeled set, and redundancy. In this work, we propose SIMILAR (Submodular Information Measures based actIve LeARning), a unified active learning framework using recently proposed submodular information measures (SIM) as acquisition functions. We argue that SIMILAR not only works in standard active learning but also easily extends to the realistic settings considered above and acts as a onestop solution for active learning that is scalable to large realworld datasets. Empirically, we show that SIMILAR significantly outperforms existing active learning algorithms by as much as ≈ 5% − 18% in the case of rare classes and ≈ 5% − 10% in the case of outofdistribution data on CIFAR10 image classification. 
Suraj Kothawade · Krishnateja Killamsetty · Rishabh Iyer 🔗 
Sat 3:55 p.m.  3:59 p.m.

Minimax Optimization: The Case of ConvexSubmodular
(
Spotlight
)
SlidesLive Video In this paper, we study mixed continuousdiscrete minimax problems where the minimization is over a continuous variable belonging to Euclidean space and the maximization is over subsets of a given ground set. We introduce the class of convexsubmodular minimax problems, where the objective is convex with respect to the continuous variable and submodular with respect to the discrete variable. Even though such problems appear frequently in machine learning applications, little is known about how to address them from algorithmic and theoretical perspectives. For such problems, we first show that obtaining saddle points are hard up to any approximation, and thus introduce new notions of (near) optimality. We then provide several algorithmic procedures for solving convex and monotonesubmodular minimax problems and characterize their convergence rates, computational complexity, and quality of the final solution. Finally, we provide experiments to showcase the effectiveness of our purposed methods on training robust recommender systems in the presence of poisoning attacks. 
Arman Adibi · Aryan Mokhtari · Hamed Hassani 🔗 
Sat 3:59 p.m.  4:04 p.m.

Improved Regret Bounds for Online Submodular Maximization
(
Spotlight
)
SlidesLive Video
In this paper, we consider an online optimization problem over $T$ rounds where at each step $t\in[T]$, the algorithm chooses an action $x_t$ from the fixed convex and compact domain set $\mathcal{K}$. A utility function $f_t(\cdot)$ is then revealed and the algorithm receives the payoff $f_t(x_t)$. This problem has been previously studied under the assumption that the utilities are adversarially chosen monotone DRsubmodular functions and $\mathcal{O}(\sqrt{T})$ regret bounds have been derived. We first characterize the class of strongly DRsubmodular functions and then, we derive regret bounds for the following new online settings: $(1)$ $\{f_t\}_{t=1}^T$ are monotone strongly DRsubmodular and chosen adversarially, $(2)$ $\{f_t\}_{t=1}^T$ are monotone submodular (while the average $\frac{1}{T}\sum_{t=1}^T f_t$ is strongly DRsubmodular) and chosen by an adversary but they arrive in a uniformly random order, $(3)$ $\{f_t\}_{t=1}^T$ are drawn i.i.d. from some unknown distribution $f_t\sim \mathcal{D}$ where the expected function $f(\cdot)=\mathbb{E}_{f_t\sim\mathcal{D}}[f_t(\cdot)]$ is monotone DRsubmodular. For $(1)$, we obtain the first logarithmic regret bounds. In terms of the second framework, we show that it is possible to obtain similar logarithmic bounds with high probability. Finally, for the i.i.d. model, we provide algorithms with $\tilde{\mathcal{O}}(\sqrt{T})$ stochastic regret bound, both in expectation and with high probability.

Omid Sadeghi · Maryam Fazel 🔗 
Sat 4:04 p.m.  4:09 p.m.

Differentially Private Monotone Submodular Maximization Under Matroid and Knapsack Constraints
(
Spotlight
)
SlidesLive Video
Numerous tasks in machine learning and artificial intelligence have been modeled as submodular maximization problems. These problems usually involve sensitive data about individuals, and in addition to maximizing the utility, privacy concerns should be considered. In this paper, we study the general framework of nonnegative monotone submodular maximization subject to matroid or knapsack constraints in both offline and online settings. For the offline setting, we propose a differentially private $(1\frac{\kappa}{e})$approximation algorithm, where $\kappa\in[0,1]$ is the total curvature of the submodular set function, which improves upon prior works in terms of approximation guarantee and query complexity under the same privacy budget. In the online setting, we propose the first differentially private algorithm, and we specify the conditions under which the regret bound scales as $\O(\sqrt{T})$, i.e., privacy could be ensured while maintaining the same regret bound as the optimal regret guarantee in the nonprivate setting.

Omid Sadeghi · Maryam Fazel 🔗 
Sat 4:09 p.m.  4:14 p.m.

Effective Evaluation of Deep Active Learning on Image Classification Tasks
(
Spotlight
)
SlidesLive Video With the goal of making deep learning more labelefficient, a growing number of papers have been studying active learning (AL) for deep models. However, there are a number of issues in the prevalent experimental settings, mainly stemming from a lack of unified implementation and benchmarking. Issues in the current literature include sometimes contradictory observations on the performance of different AL algorithms, unintended exclusion of important generalization approaches such as data augmentation and SGD for optimization, a lack of study of evaluation facets like the labeling efficiency of AL, and little or no clarity on the scenarios in which AL outperforms random sampling (RS). In this work, we present a unified reimplementation of stateoftheart AL algorithms in the context of image classification, and we carefully study these issues as facets of effective evaluation. On the positive side, we show that AL techniques are 2x to 4x more labelefficient compared to RS with the use of data augmentation. Surprisingly, when data augmentation is included, there is no longer a consistent gain in using BADGE, a stateoftheart approach, over simple uncertainty sampling. We then do a careful analysis of how existing approaches perform with varying number of examples per class. Finally, we provide other insights for AL practitioners to consider in future work, such as the use of data subset selection techniques for intermediate model training. 
Nathan Beck · Durga Sivasubramanian · Ganesh Ramakrishnan · Rishabh Iyer 🔗 
Sat 4:14 p.m.  4:19 p.m.

Active Learning Convex Halfspaces on Graphs
(
Spotlight
)
SlidesLive Video We systematically study the query complexity of learning geodesically convex halfspaces on graphs. Geodesic convexity is a natural generalisation of Euclidean convexity and allows the definition of convex sets and halfspaces on graphs. We prove upper bounds on the query complexity linear in the treewidth and the minimum hull set size but only logarithmic in the diameter. We show tight lower bounds corresponding to wellestablished separation axioms and identify the Radon number as a central parameter bounding the query complexity and the VC dimension. While previous bounds typically depend on the cut size of the labelling, all parameters in our bounds can be computed from the unlabelled graph. We empirically compare our proposed approach with other active learning algorithms and provide evidence that groundtruth communities in realworld graphs are often convex. 
Maximilian Thiessen · Thomas Gärtner 🔗 
Sat 4:19 p.m.  4:23 p.m.

Parallel Quasiconcave set optimization: A new frontier that scales without needing submodularity
(
Spotlight
)
SlidesLive Video
Classes of set functions along with a choice of ground set are a bedrock to determine and develop corresponding variants of greedy algorithms to suitably obtain approximate and efficient solutions for combinatorial optimization. The class of constrained submodular optimization has seen huge advances at the intersection of good computational efficiency, versatility and approximation guarantees while unconstrained submodular optimization is NPhard. What is an alternative to situations when submodularity does not hold? Can efficient and globally exact solutions be obtained? We introduce one such new frontier: The class of quasiconcave set functions induced as a dual class to monotone linkage functions. We provide a parallel algorithm with a time complexity over $n$ processors of $\mathcal{O}(n^2g) +\mathcal{O}(\log{\log{n}})$ where $n$ is the cardinality of the ground set and $g$ is the complexity to compute the monotone linkage function that induces a corresponding quasiconcave set function via a duality. The complexity reduces to $\mathcal{O}(gn\log(n))$ on $n^2$ processors and to $\mathcal{O}(gn)$ on $n^3$ processors. Our approach reduces the currently existing cubic computational complexity to those mentioned above. Our algorithm provides a globally optimal solution to a maximin problem as opposed to submodular optimization which is approximate. We show a potential for widespread applications via an example of diverse feature subset selection with exact global maximin guarantees upon showing that a statistical dependency measure called distance correlation can be used to induce a quasiconcave set function.

Praneeth Vepakomma · Ramesh Raskar 🔗 
Sat 4:23 p.m.  4:30 p.m.

Concluding Remarks
(
Live Intro
)
SlidesLive Video Concluding Remarks to the Workshop. Thank you for attending SubSetML 2021! 
🔗 


Data efficiency in graph networks through equivariance
(
Poster
)
We introduce a novel architecture for graph networks which is equivariant to any transformation in the coordinate embeddings that preserves the distance between neighbouring nodes. In particular, it is equivariant the Euclidean and conformal orthogonal groups in ndimensions. Thanks to its equivariance properties, the proposed model is extremely more data efficient with respect to classical graph architectures and also intrinsically equipped with a better inductive bias. We show that, learning on a minimal amount of data, the architecture we propose can perfectly generalise to unseen data in a synthetic problem, while much more training data are required from a standard model to reach comparable performance. 
Francesco Farina · Emma Slade 🔗 


SubsetGAN: Pattern detection in the activation space for Identifying Synthesised Content
(
Poster
)
Generative Adversarial Networks (GANs) have recently achieved unprecedented success in photorealistic image synthesis from lowdimensional random noise. The ability to synthesize highquality content at a large scale brings potential risks as the generated samples may lead to misinformation that can create severe social, political, health, and business hazards. We propose SubsetGAN to identify generated content by detecting a subset of anomalous nodeactivations in the inner layers of pretrained neural networks. These nodes, as a group, maximize a nonparametric measure of divergence away from the expected distribution of activations created from real data. This enable us to identify synthesised images without prior knowledge of their distribution. SubsetGAN efficiently scores subsets of nodes and returns the group of nodes within the pretrained classifier that contributed to the maximum score. The classifier can be a general fake classifier trained over samples from multiple sources or the discriminator network from different GANs. Our approach shows consistently higher detection power than existing detection methods across several stateoftheart GANs (PGGAN, StarGAN, and CycleGAN) and over different proportions of generated content. 
Celia Cintas · Skyler Speakman · Girmaw Abebe Tadesse · Victor Akinwande · Kommy Weldemariam 🔗 


Ordinal Embedding for Sets
(
Poster
)
Ordinal embedding is the task of computing a meaningful multidimensional representation of objects, for which only qualitative constraints on their distance functions are known. In particular, we consider comparisons of the form “Which object from the pair (j, k) is more similar to object i?”. In this paper, we generalize this framework to the case where the ordinal constraints are not given at the level of individual points, but at the level of sets, and propose a distributional triplet embedding approach in a scalable learning framework. We show that the query complexity of our approach is on par with the single item approach. Without having access to features of the items to be embedded, we show the applicability of our model on toy datasets for the task of reconstruction, and demonstrate the validity of the obtained embeddings in experiments on synthetic and realworld datasets. 
Aissatou Diallo · Johannes Fürnkranz 🔗 


Differentiable architecture pruning for transfer learning
(
Poster
)
We propose a new gradientbased approach for extracting subarchitectures from a given large model. Contrarily to existing pruning methods, which are unable to disentangle the network architecture and the corresponding weights, our architecturepruning scheme produces transferable new structures that can be successfully retrained to solve different tasks. We focus on a transferlearning setup where architectures can be trained on a large data set but very few data points are available for finetuning them on new tasks. We define a new gradientbased algorithm that trains architectures of arbitrarily low complexity independently from the attached weights. Given a search space defined by an existing large neural mode, we reformulate the architecture search task as a complexitypenalized subsetselection problem and solve it through a twotemperature relaxation scheme. We provide theoretical convergence guarantees and validate the proposed transferlearning strategy on real data. 
Nicolo Colombo · Yang Gao 🔗 


When does lossbased prioritization fail?
(
Poster
)
Not all examples are created equal, but standard deep neural network training protocols treat each training point uniformly. Each example is propagated forward and backwards through the network the same amount of times, independent of how much the example contributes to the learning protocol. Recent work has proposed ways to accelerate training by deviating from this uniform treatment. Popular methods entail upweighting examples that contribute more to the loss with the intuition that examples with low loss have already been learned by the model, so their marginal value to the training procedure should be lower. This view assumes that updating the model with high loss examples will be beneficial to the model. However, this may not hold for noisy, realworld data. In this paper, we theorize and then empirically demonstrate that lossbased acceleration methods degrade in scenarios with noisy and corrupted data. Our work suggests measures of example difficulty need to correctly separate out noise from other types of challenging examples. 
Niel Hu · Xinyu Hu · Rosanne Liu · Sara Hooker · Jason Yosinski 🔗 


Geometrical Homogeneous Clustering for Image Data Reduction
(
Poster
)
In this paper, we present novel variations of an earlier approach called homogeneous clustering algorithm for reducing dataset size. The intuition behind the approaches proposed in this paper is to partition the dataset into homogeneous clusters and select some images which contribute significantly to the accuracy. We place an important criteria on sampled images that they should be humanreadable. We propose four variations: RHCKON, KONCW, CWKC & GHCIDR upon the baseline algorithm  RHC to achieve better accuracy. RHCKON involves selecting k farthest and one nearest neighbour of the centroid of the clusters. The intuition behind RHCKON is that the boundary points contribute significantly towards the representation of clusters. KONCW and CWKC introduce cluster weights to RHCKON. They are based on the fact that larger clusters contribute more than smaller sized clusters. The final variation is GHCIDR which selects points based on the geometrical aspect of data distribution. We performed the experiments on two deep learning models Fully Connected Networks (FCN), and VGG1. We experimented the four variants on three datasets MNIST, CIFAR10, and FashionMNIST. We found that GHCIDR gave the best accuracy of 99.35%, 81.10%, and 91.66% and a training data reduction of 87.27%, 32.34%, and 76.80% on MNIST, CIFAR10, and FashionMNIST respectively. 
Shril Mody · Janvi Thakkar · Devvrat Joshi · Siddharth Soni · Nipun Batra · Rohan Patil 🔗 


Interactive Teaching for Imbalanced Data Summarization
(
Poster
)
A fundamental problem in machine learning is developing data summarization, also known as coreset construction, techniques with minimal impact on model accuracy. However, a practical difficulty of the majority of existing methods is the lack of rigorous strategies for identifying the optimal coreset size. Moreover, these methods are often built around specific machine learning models, making it difficult for practitioners to apply them in various applications. This paper presents an interactive teaching method for adaptively constructing a small subset of representative samples to address these problems. Numerical experiments on three imbalanced data sets indicate the great potential and applicability of the proposed approach. 
Farhad PourkamaliAnaraki · Walter Bennette 🔗 


A Practical Notation for InformationTheoretic Quantities between Outcomes and Random Variables
(
Poster
)
Information theory is of importance to machine learning, but the notation for informationtheoretic quantities is not always clear. The right notation can convey valuable intuitions and concisely express new ideas. We propose such a notation for informationtheoretic quantities between events (outcomes) and random variables. We apply this notation to BALD (Bayesian Active Learning by Disagreement), an acquisition function in Bayesian active learning: it selects the most informative (unlabelled) samples for labeling by an expert; and extend BALD to the coreset problem, which consists of selecting the most informative samples given the labels. 
Andreas Kirsch · Yarin Gal 🔗 


Learning to Delegate for Largescale Vehicle Routing
(
Poster
)
Vehicle routing problems (VRPs) are a class of combinatorial problems with wide practical applications. While previous heuristic or learningbased works achieve decent solutions on small problem instances of up to 100 customers, their performance does not scale to large problems. This article presents a novel learningaugmented local search algorithm to solve largescale VRP. The method iteratively improves the solution by identifying appropriate subproblems and delegating their improvement to a black box subsolver. At each step, we leverage spatial locality to consider only a linear number of subproblems, rather than exponential. We frame subproblem selection as a regression problem and train a Transformer on a generated training set of problem instances. We show that our method achieves stateoftheart performance, with a speedup of up to 15 times over strong baselines, on VRPs with sizes ranging from 500 to 3000. 
Sirui Li · Zhongxia Yan · Cathy Wu 🔗 


Multiobjective diversification via Submodular Counterfactual Scoring for Track Sequencing on Spotify
(
Poster
)
Recommender systems powering online multistakeholder platforms often face the challenge of jointly optimizing for multiple objectives, in an attempt to efficiently match suppliers and consumers. Examples of such objectives include user behavioral metrics (e.g. clicks, streams, dwell time, etc), supplier exposure objectives (e.g. diversity) and platform centric objectives (e.g. promotions). Jointly optimizing multiple metrics in online recommender systems remains a challenging task. In this work, we propose a multiobjective diversification technique via submodular optimization and counterfactual scoring to sequence music tracks in sequential recommendation scenarios. Based on a large scale live AB test deployment of the proposed sequencing method, we empirically demonstrate the value of submodular diversification in balancing across multiple objectives. 
Rishabh Mehrotra 🔗 


A Data Subset Selection Framework for Efficient HyperParameter Tuning and Automatic Machine Learning
(
Poster
)
In recent years, deep learning models have found great success in various tasks viz., object detection, speech recognition, and translation, making the everyday lives of people easier. Despite the success, training a deep learning model is often challenging as its performance depends mainly on the hyperparameters used. Moreover, finding the best hyperparameter configuration is often timeconsuming, even when using stateoftheart (SOTA) hyperparameter optimization algorithms as they require multiple training runs over the entire dataset for different possible sets of hyperparameters. Our main insight is that using a subset of the dataset representing the entire dataset for model training runs involved in hyperparameter optimization allows us to find the optimal hyperparameter configuration significantly faster. In this work, we explore using the data subsets selected using the existing supervised learningbased data subset selection methods, namely \textsc{Craig}, \textsc{Glister}, \textsc{GradMatch}, for model training runs involved in hyperparameter optimization. Further, we empirically demonstrate through several experiments on realworld datasets that using data subsets for hyperparameter optimization achieves significantly faster turnaround times for hyperparameter selection that achieves comparable performance to the hyperparameters found using the entire dataset. 
Savan Amitbhai Visalpara · Krishnateja Killamsetty · Rishabh Iyer 🔗 


GoldiProx Selection: Faster training by learning what is learnable, not yet learned, and worth learning
(
Poster
)
We introduce GoldiProx Selection, a technique for faster model training which selects a sequence of training points that are 
Sören Mindermann · Muhammed Razzak · Adrien Morisot · Aidan Gomez · Sebastian Farquhar · Jan Brauner · Yarin Gal 🔗 


Online and Non Parametric Coresets for Bregman Divergence
(
Poster
)
We present algorithms that create coresets for clustering problems according to a wide subset of Bregman divergences. Our coresets have a small additive error, similar in magnitude to the lightweight coresets~\cite{bachem2018scalable}, and take an update time $O(d)$. We first give an algorithm that returns {\em online} coresets of size $\tilde{O}((dk\log k)/\epsilon^2\mu^2)$ for $k$clustering according to any $\mu$similar Bregman divergence. We also show the existence of {\em nonparametric} coresets, i.e. the coreset size is independent of the number of clusters $k$, as well the dimension of points $d$, for the same subclass of Bregman divergences.

Supratim Shit · Rachit Chhaya · Anirban Dasgupta · Jayesh Choudhari 🔗 


Unconstrained Submodular Maximization with Modular Costs: Tight Approximation and Application to Profit Maximization
(
Poster
)
Given a set V, the problem of unconstrained submodular maximization with modular costs (USMMC) asks for a subset S \subseteq V that maximizes f(S)  c(S), where f is a nonnegative, monotone, and submodular function that gauges the utility of S, and c is a nonnegative and modular function that measures the cost of S. This problem finds applications in numerous practical scenarios, such as profit maximization in viral marketing on social networks. This paper presents ROIGreedy, a polynomial time algorithm for USMMC that returns a solution S satisfying f(S)  c(S) >= f(S)  c(S)  c(S)\ln(f(S)/c(S)), where S is the optimal solution to USMMC. To our knowledge, ROIGreedy is the first algorithm that provides such a strong approximation guarantee. In addition, We show that this worstcase guarantee is tight, in the sense that no polynomial time algorithm can ensure f(S)  c(S) >= (1+\epsilon)(f(S)  c(S)  c(S) \ln(f(S)/c(S*)), for any \epsilon > 0. Further, we devise a nontrivial extension of ROIGreedy to solve the profit maximization problem, where the precise value of f(S) for any set S is unknown and can only be approximated via sampling. Extensive experiments on benchmark datasets demonstrate that ROIGreedy significantly outperforms competing methods in terms of the tradeoff between efficiency and solution quality. 
Tianyuan Jin · Yu Yang · Renchi Yang · Jieming Shi · Keke Huang · Xiaokui Xiao 🔗 


SVPCF: Selection via Proxy for Collaborative Filtering Data
(
Poster
)
We study the practical consequences of dataset sampling strategies on the performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or adhoc fashion: e.g. by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonlyused data sampling schemes can have significant consequences on algorithm performancemasking performance deficiencies in algorithms or altering the relative performance of algorithms, as compared to models trained on the complete dataset. Following this observation, this paper makes the following main contributions: (1) characterizing the effect of sampling on algorithm performance, in terms of algorithm and dataset characteristics (e.g. sparsity characteristics, sequential dynamics, etc); and (2) designing SVPCF, which is a dataspecific sampling strategy, that aims to preserve the relative performance of models after sampling, and is especially suited to longtail interaction data. Detailed experiments show that SVPCF is more accurate than commonly used sampling schemes in retaining the relative ranking of different recommendation algorithms. 
Noveen Sachdeva · Julian McAuley · CaroleJean Wu 🔗 


Bayesian decision analysis for collecting nearlyoptimal subsets
(
Poster
)
Subset selection is valuable for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often eschewed due to instability and computational bottlenecks. We address these challenges using Bayesian decision analysis. For any Bayesian predictive model, we elicit an acceptable family of subsets that provide nearlyoptimal linear prediction. The Bayesian model regularizes the linear coefficients and propagates uncertainty quantification for key predictive comparisons. The acceptable family spawns new variable importance metrics based on whether a variable appear in all, some, or no acceptable subsets. The proposed approach exhibits excellent prediction and stability for both simulated data (including p = 400 > n) and a large education dataset with highly correlated covariates. 
Daniel Kowal 🔗 


Fast Estimation Method for the Stability of Ensemble Feature Selectors
(
Poster
)
It is preferred that feature selectors be \textit{stable} for better interpretabity and robust prediction. Ensembling is known to be effective for improving the stability of feature selectors. Since ensembling is timeconsuming, it is desirable to reduce the computational cost to estimate the stability of the ensemble feature selectors. % for the data we use. We propose a simulator of a feature selector, and apply it to a fast estimation of the stability of ensemble feature selectors. To the best of our knowledge, this is the first study that estimates the stability of ensemble feature selectors and reduces the computation time theoretically and empirically. 
Rina Onda · Kenta Oono 🔗 


Selective Focusing Learning in Conditional GANs
(
Poster
)
Conditional generative adversarial networks (cGANs) have demonstrated remarkable success due to their classwise controllability and superior quality for complex generation tasks. Typical cGANs solve the joint distribution matching problem by decomposing two easier subproblems: marginal matching and conditional matching. From our toy experiments, we found that it is the best to apply only conditional matching to certain samples due to the contentaware optimization of the discriminator. This paper proposes a simple (a few lines of code) but effective training methodology, selective focusing learning, which enforces the discriminator and generator to learn easy samples of each class rapidly while maintaining diversity. Our key idea is to selectively apply conditional and joint matching for the data in each minibatch. We conducted experiments on recent cGAN variants in ImageNet (64x64 and 128x128), CIFAR10, and CIFAR100 datasets, and improved the performance significantly (up to 35.18% in terms of FID) without sacrificing diversity. 
Kyeongbo Kong · Kyunghun Kim · Woojin Song · SukJu Kang 🔗 


Kernel Thinning
(
Poster
)
We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning.
Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$point approximation to $\mathbb{P}$ into a $\sqrt{n}$point approximation with comparable worstcase integration error across the associated reproducing kernel Hilbert space.
With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for subexponential $\mathbb{P}$ on $\mathbb{R}^d$.
In contrast, an equalsized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{\frac14})$ integration error.
Our subexponential guarantees resemble
the classical quasiMonte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$
and a wide range of common kernels.
We use our results to derive explicit nonasymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern, and Bspline kernels
and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.

Raaz Dwivedi · Lester Mackey 🔗 


Multiplecriteria Based Active Learning with Fixedsize Determinantal Point Processes
(
Poster
)
Active learning (AL) aims to achieve greater accuracy with less training data by selecting the most useful data samples from which it learns. Singlecriterion (i.e., informativeness and representativeness) based methods are simple and efficient; however, they lack adaptability to different realworld scenarios. In this paper, we introduce a multiplecriteria based AL algorithm, which incorporates three complementary criteria, i.e., informativeness, representativeness and diversity, to make appropriate selections in the AL rounds under different data types. We consider the selection process as a Determinantal Point Process, which good balance among these criteria. We refine the query selection strategy by both selecting the hardest unlabeled data sample and biasing towards the classifiers that are more suitable for the current data distribution. In addition, we also consider the dependencies and relationships between these data points in data selection by means of centroidbased clustering approaches. 
Xueying ZHAN · Qing Li · Antoni Chan 🔗 


Coresets for Classification – Simplified and Strengthened
(
Poster
)
We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $(1\pm \epsilon)$ relative error with $\tilde O(d \cdot \mu_y(X)^2/\epsilon^2)$ points, where $\mu_y(X)$ is a natural complexity measure of the data matrix $X \in \R^{n \times d}$ and label vector $y \in \{1,1\}^n$, introduced in \cite{munteanu2018coresets}. Our result is based on subsampling data points with probabilities proportional to their \emph{$\ell_1$ Lewis weights}. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios.

Anup Rao · Tung Mai · Cameron Musco 🔗 


Using Machine Learning to Recognise Statistical Dependence
(
Poster
)
Recognising statistical dependence of variables is a fundamentally prominent task in modelling data relatedness, having impact on multiple data disciplines. However, it is a peculiarly challenging task often requiring manual scrutiny of data in order to uncover the correct relatedness function. Inspired by this human ability, we cast the named task as machine learning problem. Solving such problem results in models encoding abstract meaning of statistical dependence. In this paper, we report overwhelmingly and significantly superior performance to known models, especially when few data points are available and/or high noise is present. Such novel view of the named task is transformative for numerous data disciplines such as unsupervised learning, feature selection and causal inference. 
Ubai Sandouk 🔗 


Mitigating Memorization in Sample Selection for Learning with Noisy Labels
(
Poster
)
Because deep learning is vulnerable to noisy labels, sample selection techniques, which train networks with only clean labeled data, have attracted a great attention. However, if the labels are dominantly corrupted by few classes, these noisy samples are called dominantnoisylabeled samples, the network also learns dominantnoisylabeled samples rapidly via contentaware optimization. In this study, we propose a compelling criteria to penalize dominantnoisylabeled samples intensively through classwise penalty labels. By averaging prediction confidences for the each observed label, we obtain suitable penalty labels that have high values if the labels are largely corrupted by some classes. Experiments were performed using benchmarks (CIFAR10, CIFAR100, TinyImageNet) and realworld datasets (ANIMAL10N, Clothing1M) to evaluate the proposed criteria in various scenarios with different noise rates. Using the proposed sample selection, the learning process of the network becomes significantly robust to noisy labels compared to existing methods in several noise types. 
Kyeongbo Kong · Junggi Lee · Youngchul Kwak · YoungRae Cho · SeongEun Kim · Woojin Song 🔗 


Sparsifying Transformer Models with Trainable Representation Pooling
(
Poster
)
We propose a novel method to sparsify attention in the Transformer model by learning to select the mostinformative token representations during the training process, thus focusing on the taskspecific parts of an input.
A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top$k$ operator.
Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with the trainable selection we can retain its top quality while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $16\times$ more computationally efficient in the decoder.
The method can be effortlessly applied to many models used in NLP and CV.

Michał Pietruszka · Łukasz Borchmann · Łukasz Garncarek 🔗 


Continual Learning via FunctionSpace Variational Inference: A Unifying View
(
Poster
)
Continual learning is the process of developing new abilities while retaining existing ones. Sequential Bayesian inference is a natural framework for this, but applying it successfully to deep neural networks remains a challenge. We propose continual functionspace variational inference (CFSVI), in which the variational distribution over functions induced by stochastic model parameters is encouraged to match the variational distribution over functions induced by stochastic parameters inferred on previous tasks. Unlike approaches that explicitly penalize changes in the model parameters, functionspace regularization allows parameters to vary widely during training, resulting in greater flexibility to fit new data. CFSVI improves on existing approaches to functionspace regularization by performing inference entirely in function space and without relying on carefully selected coreset points. We show that CFSVI outperforms alternative methods based on parameterspace and functionspace regularization on a range of tasks. 
Tim G. J. Rudner · Freddie Bickford Smith · Qixuan Feng · YeeWhye Teh · Yarin Gal 🔗 


Active Learning under Pool Set Distribution Shift and Noisy Data
(
Poster
)
Bayesian Active Learning has focused on BALD, which reduces model parameter uncertainty. However, we show that BALD gets stuck on outofdistribution or junk data that is not relevant for the task. We examine a novel Expected Predictive Information Gain (EPIG) to deal with distribution shifts of the pool set. EPIG reduces the uncertainty of predictions on an unlabelled evaluation set sampled from the test data distribution whose distribution might be different to the pool set distribution. Based on this, our new EPIGBALD acquisition function for Bayesian Neural Networks selects samples to improve the performance on the test data distribution instead of selecting samples that reduce model uncertainty everywhere, including for outofdistribution regions with low density in the test data distribution. Our method outperforms stateoftheart Bayesian active learning methods on highdimensional datasets and avoids outofdistribution junk data in cases where current stateoftheart methods fail. 
Andreas Kirsch · Tom Rainforth · Yarin Gal 🔗 


Batch Active Learning with Stochastic Acquisition Functions
(
Poster
)
In active learning, new labels are commonly acquired in batches. However, common acquisition functions are only meant for onesample acquisition rounds, and when their scores are used naively for batch acquisition, they result in batches lacking diversity, which negatively impacts performance. Stateoftheart batch acquisition functions are very costly to compute on the other hand. In this paper, we present a novel class of stochastic acquisition functions that extend onesample acquisition functions to the batch setting by observing that the computed acquisition scores are only really valid for the first sample that is selected in every batch acquisition round and that there is an increasing error in the scores for future samples in the batch. We model this error in the scores for additional batch samples. We acquire new samples by sampling from the pool set using the adapted scores. Our acquisition functions are both vastly cheaper to compute and outperform other batch acquisition functions. 
Andreas Kirsch · Sebastian Farquhar · Yarin Gal 🔗 


Sparse Bayesian Learning via Stepwise Regression
(
Poster
)
Sparse Bayesian Learning (SBL) is a powerful paradigm for attaining sparsity in probabilistic models. Herein, we propose a coordinate ascent algorithm for SBL termed Relevance Matching Pursuit (RMP) and show that, as its noise variance parameter goes to zero, RMP exhibits a surprising connection to Stepwise Regression. Further, we derive novel guarantees for Stepwise Regression algorithms, which also shed light on RMP. Our guarantees for Forward Regression improve on deterministic and probabilistic results for Orthogonal Matching Pursuit with noise. Our analysis of Backward Regression on determined systems culminates in a bound on the residual of the optimal solution to the {\it subset selection} problem that, if satisfied, guarantees the {\it optimality} of the result. To our knowledge, this bound is the first that can be computed in polynomial time and depends chiefly on the smallest singular value of the matrix. 
Sebastian Ament · Carla Gomes 🔗 


HighDimensional Variable Selection and NonLinear Interaction Discovery in Linear Time
(
Poster
)
Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. The Bayesian framework makes it straightforward to simultaneously express sparsity, nonlinearity, and interactions in a hierarchical model. But, as for the few other methods that handle this trifecta, inference is computationally intractable  with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We first show that suitable Bayesian models can be represented as Gaussian processes (GPs). We then demonstrate how a kernel trick can reduce computation with these GPs to $O$(\# covariates) time for both variable selection and estimation. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lowerorder effects. On a variety of synthetic and real datasets, our approach outperforms existing methods used for large, highdimensional datasets while remaining competitive (or being orders of magnitude faster) in runtime.

Raj Agrawal · Tamara Broderick 🔗 


Errordriven FixedBudget ASR Personalization for Accented Speakers
(
Poster
)
We consider the task of personalizing ASR models while being constrained by a fixed budget on recording speaker specific utterances. Given a speaker and an ASR model, we propose a method of identifying sentences for which the speaker's utterances are likely to be harder for the given ASR model to recognize. We assume a tiny amount of speakerspecific data to learn phonemelevel error models which help us select such sentences. We show that speaker's utterances on the sentences selected using our error model indeed have larger error rates when compared to speaker's utterances on randomly selected sentences. We find that finetuning the ASR model on the sentence utterances selected with the help of error models yield higher WER improvements in comparison to finetuning on an equal number of randomly selected sentence utterances. Thus, our method provides an efficient way of collecting speaker utterances under budget constraints for personalizing ASR models. 
Abhijeet Awasthi · Sunita Sarawagi · Preethi Jyothi 🔗 


MISNN: Multiple Imputation via Semiparametric Neural Networks
(
Poster
)
Multiple imputation (MI) has been widely applied to missing value problems in biomedical, social and econometric research, in order to avoid improper inference in the downstream data analysis. In the presence of highdimensional data, imputation models that include feature selection, especially $\ell_1$ regularized regression (such as Lasso, adaptive Lasso and Elastic Net), are common choices to prevent the model from underdetermination. However, conducting MI with feature selection is difficult: existing methods are often computationally inefficient and poor in performance. We propose MISNN, a novel and efficient algorithm that incorporates feature selection for MI. Leveraging the approximation power of neural networks, MISNN is a general and flexible framework, compatible to any feature selection method, any neural network architecture, high/lowdimensional data and general missing patterns. Through empirical experiments, MISNN has demonstrated great advantages over stateoftheart imputation methods (e.g. Bayesian Lasso and matrix completion), in terms of imputation accuracy, statistical consistency and computation speed.

Zhiqi Bu · Zongyu Dai · Yiliang Zhang · Qi Long 🔗 


Towards Active Air Quality Station Deployment
(
Poster
)
Air pollution is a global problem and has a severe impact on human health. Finegrained air quality (AQ) monitoring is important in mitigating air pollution. However, existing AQ station deployments are sparse due to installation and operational costs. In this work, we propose an Active Learningbased method for air quality station deployment. We use Gaussian processes and several classical machine learning algorithms (Random Forest, KNeighbors and Support Vector Machine) to benchmark several active learning strategies for two realworld air quality datasets (Delhi, India and Beijing, China). 
Zeel B Patel · Nipun Batra 🔗 


Coreset Sampling for Efficient Neural Architecture Search
(
Poster
)
Neural architecture search (NAS), an important branch of automatic machine learning, has become an effective approach to automate the design of deep learning models. However, the major issue in NAS is how to reduce the large search time imposed by the heavy computational burden. While most recent approaches focus on pruning redundant sets or developing new search methodologies, this paper attempts to formulate the problem based on the data curation manner. Our key strategy is to search the architecture using summarized data distribution, i.e., coreset. Typically, many NAS algorithms separate searching and training stages, and the proposed coreset methodology is only used in search stage, thus their performance degradation can be minimized. In our experiments, we were able to save overall computational time from 30.8 hours to 3.5 hours, 8.8x reduction, on a single RTX 3090 GPU without sacrificing accuracy. 
Jaehun Shim · Kyeongbo Kong · SukJu Kang 🔗 


On Coresets For Fair Regression
(
Poster
)
In this work we present an algorithm to construct coresets for fair regression
for a dataset of $n$ points in $d$ dimensions and having some discretevalued protected attribute. For (fair) regression with statistical parity we first define the notion of a coreset and give a coreset that incurs only a small additive error in satisfying the constraints and a relative error in the objective function. The coreset size is linear in $\ell$, the number of unique values the protected attribute can take, and the dependence on dimension is near linear ($d \log{d}$). To the best of our knowledge this is the first coreset for fair regression problem with statistical parity. We also empirically demonstrate the performance of our coresets on real data sets.

Rachit Chhaya · Anirban Dasgupta · Supratim Shit · Jayesh Choudhari 🔗 


A Comparison of Contextual and NonContextual Preference Ranking for Set Addition Problems
(
Poster
)
In this paper, we study the problem of evaluating the addition of elements to a set. This problem is difficult, because it can, in the general case, not be reduced to unconditional preferences between the choices. Therefore, we model preferences based on the context of the decision. We discuss and compare two different Siamese network architectures for this task: a twin network that compares the two sets resulting after the addition, and a triplet network that models the contribution of each candidate to the existing set. We evaluate the two settings on a realworld task; learning human card preferences for deck building in the collectible card game Magic: The Gathering. We show that the triplet approach achieves a better result than the twin network and that both outperform previous results on this task. 
Timo Bertram · Johannes Fürnkranz · Martin Müller 🔗 


Statistical Measures For Defining Curriculum Scoring Function
(
Poster
)
Curriculum learning is a training strategy that sorts the training examples by some measure of their difficulty and gradually exposes them in that order to the learner to improve the network performance. In this work, we first propose and study a dynamic curriculum learning algorithm. Our dynamic curriculum algorithm greedily samples a subset of the training data whose gradients are directed towards the optimal weight. Motivated by our insights from the dynamic curriculum learning ordering and implicit curriculum ordering, we introduce a simple, practical curriculum learning strategy that uses statistical measures such as standard deviation and entropy values to score the difficulty of data points for real image classification tasks. Further, we also use our algorithms to discuss why curriculum learning is helpful. 
Vinu Sankar Sadasivan · Anirban Dasgupta 🔗 


An Extreme Point Approach to Subset Selection
(
Poster
)
Given a training data set, we develop a method to obtain a representative subset of training data meant for classification. Our focus is on subsets that are particularly suitable for training linear classification models. Our method is to reduce data sets in a manner that essentially preserves the convex hulls of the various classes of the data set. We develop scalable randomized algorithms for convex hull approximation, and obtain rigorous guarantees for models trained on subsets of linearly separable data sets. We empirically demonstrate the effectiveness of our approach for both training and data augmentation on the MNIST data set. 
Viveck Cadambe · Bill Kay 🔗 


Tighter mDPP Coreset Sample Complexity Bounds
(
Poster
)
Many machine learning problems involve processing large datasets, thus making them expensive computational tasks. One mitigation strategy selects representative subsets of the data and one way to achieve this is via coresets~ a weighted subset of the given data, such that the cost function under consideration can be additively and/or multiplicatively approximated. While much previous work chooses these subsets using independent importancebased sampling, such methods can be suboptimal when the system possesses redundancy. Recently, Tremblay et al. 2019 considered sampling from fixedsized determinantal point process (DPP) and gave lower bounds on the coreset sample complexity. While DPPs often empirically outperform importancebased sampling methods, their coreset size bound is worse compared to independent sampling. In this work, we extend their theoretical results by improving the lower bound on sample complexity for coreset approximation by a log(1 / epsilon) factor using chaining, a technique in empirical process literature. This, therefore, moves a step further in bridging the gap between theoretical justification and empirical observations. We provide additive approximation guarantees for the associated cost function and allow the possibility of negative weights, which has a direct application in layerwise pruning of deep neural networks. On top of additive approximation guarantees, we also extend our bounds to the scenarios with multiplicative approximations. 
Gantavya Bhatt · Jeff Bilmes 🔗 


SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios
(
Poster
)
Active learning has proven to be useful for minimizing labeling costs by selecting the most informative samples. However, existing active learning methods do not work well in realistic scenarios such as imbalance or rare classes, outofdistribution data in the unlabeled set, and redundancy. In this work, we propose SIMILAR (Submodular Information Measures based actIve LeARning), a unified active learning framework using recently proposed submodular information measures (SIM) as acquisition functions. We argue that SIMILAR not only works in standard active learning but also easily extends to the realistic settings considered above and acts as a onestop solution for active learning that is scalable to large realworld datasets. Empirically, we show that SIMILAR significantly outperforms existing active learning algorithms by as much as ≈ 5% − 18% in the case of rare classes and ≈ 5% − 10% in the case of outofdistribution data on CIFAR10 image classification. 
Suraj Kothawade · Krishnateja Killamsetty · Rishabh Iyer 🔗 


Minimax Optimization: The Case of ConvexSubmodular
(
Poster
)
In this paper, we study mixed continuousdiscrete minimax problems where the minimization is over a continuous variable belonging to Euclidean space and the maximization is over subsets of a given ground set. We introduce the class of convexsubmodular minimax problems, where the objective is convex with respect to the continuous variable and submodular with respect to the discrete variable. Even though such problems appear frequently in machine learning applications, little is known about how to address them from algorithmic and theoretical perspectives. For such problems, we first show that obtaining saddle points are hard up to any approximation, and thus introduce new notions of (near) optimality. We then provide several algorithmic procedures for solving convex and monotonesubmodular minimax problems and characterize their convergence rates, computational complexity, and quality of the final solution. Finally, we provide experiments to showcase the effectiveness of our purposed methods on training robust recommender systems in the presence of poisoning attacks. 
Arman Adibi · Aryan Mokhtari · Hamed Hassani 🔗 


Improved Regret Bounds for Online Submodular Maximization
(
Poster
)
In this paper, we consider an online optimization problem over $T$ rounds where at each step $t\in[T]$, the algorithm chooses an action $x_t$ from the fixed convex and compact domain set $\mathcal{K}$. A utility function $f_t(\cdot)$ is then revealed and the algorithm receives the payoff $f_t(x_t)$. This problem has been previously studied under the assumption that the utilities are adversarially chosen monotone DRsubmodular functions and $\mathcal{O}(\sqrt{T})$ regret bounds have been derived. We first characterize the class of strongly DRsubmodular functions and then, we derive regret bounds for the following new online settings: $(1)$ $\{f_t\}_{t=1}^T$ are monotone strongly DRsubmodular and chosen adversarially, $(2)$ $\{f_t\}_{t=1}^T$ are monotone submodular (while the average $\frac{1}{T}\sum_{t=1}^T f_t$ is strongly DRsubmodular) and chosen by an adversary but they arrive in a uniformly random order, $(3)$ $\{f_t\}_{t=1}^T$ are drawn i.i.d. from some unknown distribution $f_t\sim \mathcal{D}$ where the expected function $f(\cdot)=\mathbb{E}_{f_t\sim\mathcal{D}}[f_t(\cdot)]$ is monotone DRsubmodular. For $(1)$, we obtain the first logarithmic regret bounds. In terms of the second framework, we show that it is possible to obtain similar logarithmic bounds with high probability. Finally, for the i.i.d. model, we provide algorithms with $\tilde{\mathcal{O}}(\sqrt{T})$ stochastic regret bound, both in expectation and with high probability.

Omid Sadeghi · Maryam Fazel 🔗 


Differentially Private Monotone Submodular Maximization Under Matroid and Knapsack Constraints
(
Poster
)
Numerous tasks in machine learning and artificial intelligence have been modeled as submodular maximization problems. These problems usually involve sensitive data about individuals, and in addition to maximizing the utility, privacy concerns should be considered. In this paper, we study the general framework of nonnegative monotone submodular maximization subject to matroid or knapsack constraints in both offline and online settings. For the offline setting, we propose a differentially private $(1\frac{\kappa}{e})$approximation algorithm, where $\kappa\in[0,1]$ is the total curvature of the submodular set function, which improves upon prior works in terms of approximation guarantee and query complexity under the same privacy budget. In the online setting, we propose the first differentially private algorithm, and we specify the conditions under which the regret bound scales as $\O(\sqrt{T})$, i.e., privacy could be ensured while maintaining the same regret bound as the optimal regret guarantee in the nonprivate setting.

Omid Sadeghi · Maryam Fazel 🔗 


Effective Evaluation of Deep Active Learning on Image Classification Tasks
(
Poster
)
With the goal of making deep learning more labelefficient, a growing number of papers have been studying active learning (AL) for deep models. However, there are a number of issues in the prevalent experimental settings, mainly stemming from a lack of unified implementation and benchmarking. Issues in the current literature include sometimes contradictory observations on the performance of different AL algorithms, unintended exclusion of important generalization approaches such as data augmentation and SGD for optimization, a lack of study of evaluation facets like the labeling efficiency of AL, and little or no clarity on the scenarios in which AL outperforms random sampling (RS). In this work, we present a unified reimplementation of stateoftheart AL algorithms in the context of image classification, and we carefully study these issues as facets of effective evaluation. On the positive side, we show that AL techniques are 2x to 4x more labelefficient compared to RS with the use of data augmentation. Surprisingly, when data augmentation is included, there is no longer a consistent gain in using BADGE, a stateoftheart approach, over simple uncertainty sampling. We then do a careful analysis of how existing approaches perform with varying number of examples per class. Finally, we provide other insights for AL practitioners to consider in future work, such as the use of data subset selection techniques for intermediate model training. 
Nathan Beck · Durga Sivasubramanian · Ganesh Ramakrishnan · Rishabh Iyer 🔗 


Active Learning Convex Halfspaces on Graphs
(
Poster
)
We systematically study the query complexity of learning geodesically convex halfspaces on graphs. Geodesic convexity is a natural generalisation of Euclidean convexity and allows the definition of convex sets and halfspaces on graphs. We prove upper bounds on the query complexity linear in the treewidth and the minimum hull set size but only logarithmic in the diameter. We show tight lower bounds corresponding to wellestablished separation axioms and identify the Radon number as a central parameter bounding the query complexity and the VC dimension. While previous bounds typically depend on the cut size of the labelling, all parameters in our bounds can be computed from the unlabelled graph. We empirically compare our proposed approach with other active learning algorithms and provide evidence that groundtruth communities in realworld graphs are often convex. 
Maximilian Thiessen · Thomas Gärtner 🔗 


Parallel Quasiconcave set optimization: A new frontier that scales without needing submodularity
(
Poster
)
Classes of set functions along with a choice of ground set are a bedrock to determine and develop corresponding variants of greedy algorithms to suitably obtain approximate and efficient solutions for combinatorial optimization. The class of constrained submodular optimization has seen huge advances at the intersection of good computational efficiency, versatility and approximation guarantees while unconstrained submodular optimization is NPhard. What is an alternative to situations when submodularity does not hold? Can efficient and globally exact solutions be obtained? We introduce one such new frontier: The class of quasiconcave set functions induced as a dual class to monotone linkage functions. We provide a parallel algorithm with a time complexity over $n$ processors of $\mathcal{O}(n^2g) +\mathcal{O}(\log{\log{n}})$ where $n$ is the cardinality of the ground set and $g$ is the complexity to compute the monotone linkage function that induces a corresponding quasiconcave set function via a duality. The complexity reduces to $\mathcal{O}(gn\log(n))$ on $n^2$ processors and to $\mathcal{O}(gn)$ on $n^3$ processors. Our approach reduces the currently existing cubic computational complexity to those mentioned above. Our algorithm provides a globally optimal solution to a maximin problem as opposed to submodular optimization which is approximate. We show a potential for widespread applications via an example of diverse feature subset selection with exact global maximin guarantees upon showing that a statistical dependency measure called distance correlation can be used to induce a quasiconcave set function.

Praneeth Vepakomma · Ramesh Raskar 🔗 