Timezone: »

 
Workshop
Subset Selection in Machine Learning: From Theory to Applications
Rishabh Iyer · Abir De · Ganesh Ramakrishnan · Jeff Bilmes

Sat Jul 24 06:00 AM -- 04:30 PM (PDT) @ None
Event URL: https://sites.google.com/view/icml-2021-subsetml/home »

A growing number of machine learning problems involve finding subsets of data points. Examples range from selecting subset of labeled or unlabeled data points, to subsets of features or model parameters, to selecting subsets of pixels, keypoints, sentences etc. in image segmentation, correspondence and summarization problems. The workshop would encompass a wide variety of topics ranging from theoretical aspects of subset selection e.g. coresets, submodularity, determinantal point processes, to several practical applications, {\em e.g.}, time and energy efficient learning, learning under resource constraints, active learning, human assisted learning, feature selection, model compression, feature induction, {\em etc.}

We believe that this workshop is very timely since, a) subset selection is naturally emerging and has often been considered in isolation in many of the above applications, and b) by connecting researchers working on both the theoretical and application domains above, we can foster a much needed discussion on reusing a several technical innovations across these subareas and applications. Furthermore, we would also like to connect researchers working on the theoretical foundations of subset selection (in areas such as coresets and submodularity) with researchers working in applications (such as feature selection, active learning, data efficient learning, model compression, and human assisted machine learning).

Sat 6:15 a.m. - 6:30 a.m.

In this introduction, we will cover:

  • Why this workshop?
  • Goals of this workshop and what all will be covered?
  • A list of talks
  • List of papers and areas covered
  • Panel discussion and topic of the panel discussion
Abir De, Rishabh Iyer, Ganesh Ramakrishnan, Jeff Bilmes
Sat 6:30 a.m. - 7:00 a.m.

When we want to discover patterns in data, we usually use the best available algorithm/software or try to improve it. In recent years we have started exploring a different approach: instead of improving the algorithm, reduce the input data and run the existing algorithm on the reduced data to obtain the desired output much faster. A core-set for a given problem is a semantic compression of its input, in the sense that a solution for the problem with the (small) core-set as input yields an approximate solution to the problem with the original (Big) data. Using existing techniques it can be computed on a streaming input that may be distributed among several machines on device, and in parallel (e.g. clouds or GPUs).

Dan Feldman
Sat 7:00 a.m. - 7:25 a.m.
  

In recent years, there have a raising interest on learning under algorithmic triage, a new learning paradigm which seeks the development of machine learning models that operate under different automation levels---models that take decisions for a given fraction of instances and leave the remaining ones to human experts. However, the interplay between the prediction accuracy of the models and the human experts under algorithmic triage is not well understood. In this talk, we will start by formally characterizing under which circumstances a predictive model may benefit from algorithmic triage. In doing so, we will also demonstrate that models trained for full automation may be suboptimal under triage. Then, given any model and desired level of triage, we will show that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically by thresholding the difference between the model and human errors on a per-instance level. Building upon these results, we will introduce a practical gradient-based algorithm that is guaranteed to find a sequence of triage policies and predictive models of increasing performance. Finally, we will use real data from two important applications, content moderation and scientific discovery, to illustrate our theoretical results and show that the models and triage policies provided by our gradient-based algorithm outperform those provided by several competitive baselines.

Manuel Gomez Rodriguez
Sat 7:25 a.m. - 7:30 a.m.
Differentiable learning Under Algorithmic Triage Q&A (Live Q&A)
Sat 7:30 a.m. - 7:55 a.m.
  

Coresets are small data summaries that are sufficient for model training. They can be maintained online, enabling efficient handling of large data streams under resource constraints. However, existing constructions are limited to simple models such as k-means and logistic regression. In this work, we propose a novel coreset construction via cardinality-constrained bilevel optimization. We show how our framework can efficiently generate coresets for deep neural networks, and demonstrate its empirical benefits in continual and streaming deep learning, as well as active semi-supervised learning.

Joint work with Zalán Borsos, Mojmír Mutny and Marco Tagliasacchi

Andreas Krause
Sat 7:55 a.m. - 8:00 a.m.
Data Summarization via Bilevel Coresets: Live Q&A (Live Q&A)
Sat 8:00 a.m. - 8:25 a.m.
  

While constraints are ubiquitous in artificial intelligence and constraints are also commonly used in machine learning and data mining, the problem of learning constraints from examples has received less attention. I will discuss the problem of constraint learning in general, and present some recent contributions in the learning of various types of constraints and models for combinatorial optimisation. This includes the learning of formulae in Excel, learning SMT formulae, MAX-SAT and Linear Programs.

Luc De Raedt
Sat 8:25 a.m. - 8:30 a.m.
Learning Constraints from Examples Live Q&A (Live Q&A)
Sat 8:30 a.m. - 8:51 a.m.
  

In this talk, I will introduce 3 close friends of the greedy algorithm who can maximize a general submodular function (monotone or not) subject to very general constraints. They come in different flavors and guarantees.

Amin Karbasi
Sat 8:51 a.m. - 9:00 a.m.
Greedy and Its Friends Live Q&A (Live Q&A)
Sat 9:00 a.m. - 9:30 a.m.

We will be using gather town for the poster sessions. The links are below.

Room 1: https://eventhosts.gather.town/nI2bwxKP1HljbkId/subsetml-room-1 Room 2: https://eventhosts.gather.town/GV1QzCxwr3qkeeLU/subsetml-room-2

Sat 9:30 a.m. - 10:30 a.m.

Panelists:

Andreas Krause Dan Feldman Amin Karbasi Luc De Raedt Manuel Gomez Rodriguez Rajiv Khanna Baharan Mirzasoleiman Cody Coleman Matthai Phillipose Gaurav Aggarwal

Sat 10:30 a.m. - 10:44 a.m.
 link »   

In this talk (Part I and II), we will cover the different functionalities of DECILE (www.decile.org) which include modules like a) SUBMODLIB, b) CORDS, c) TRUST, d) DISTIL, e) SPEAR. SUBMODLIB is a library for submodular optimization which implements a number of submodular optimization algorithms and functions (including the submodular mutual information and conditional gain functions). CORDS is a library for data subset selection and coresets for compute-efficient training of deep models. TRUST is targeted subset selection for personalization and model remediation. DISTIL is an active learning toolkit for deep models and SPEAR is a library for weak supervision via labeling functions. We will also focus on the different SOTA algorithms implemented and the benchmarks enabled through these toolkits.

In Part I, we will cover submodlib (a toolkit for submodular optimization), and CORDS (a toolkit for data subset selection and coresets for efficient training of deep models).

CORDS: https://github.com/decile-team/cords/ DISTIL: https://github.com/decile-team/distil/ TRUST: https://github.com/decile-team/trust/ SubmodLib: https://github.com/decile-team/submodlib/ SPEAR: https://github.com/decile-team/spear/

Rishabh Iyer
Sat 10:44 a.m. - 10:58 a.m.
 link »   

In this talk (Part I and II), we will cover the different functionalities of DECILE (www.decile.org) which include modules like a) SUBMODLIB, b) CORDS, c) TRUST, d) DISTIL, e) SPEAR. SUBMODLIB is a library for submodular optimization which implements a number of submodular optimization algorithms and functions (including the submodular mutual information and conditional gain functions). CORDS is a library for data subset selection and coresets for compute-efficient training of deep models. TRUST is targeted subset selection for personalization and model remediation. DISTIL is an active learning toolkit for deep models and SPEAR is a library for weak supervision via labeling functions. We will also focus on the different SOTA algorithms implemented and the benchmarks enabled through these toolkits.

In Part II, we will cover TRUST, DISTIL, and SPEAR.

CORDS: https://github.com/decile-team/cords/ DISTIL: https://github.com/decile-team/distil/ TRUST: https://github.com/decile-team/trust/ SubmodLib: https://github.com/decile-team/submodlib/ SPEAR: https://github.com/decile-team/spear/

Ganesh Ramakrishnan
Sat 10:58 a.m. - 11:00 a.m.
Benchmarks and Toolkits for Data Subset Selection in ML through DECILE: Live Q&A (Live Q&A)
Sat 11:00 a.m. - 11:20 a.m.
 link »   

In this talk, we introduce several features of Summary Analytics, a new company that makes summarization-like abilities accessible to the masses, and scalable to petabyte-size datasets. We demonstrate wide applicability to different data modalities via a number of case studies. We demonstrate scalability by showing that it is possible to summarize 100 million records, each with 1700 features, down to 1000 records in 60 seconds using one CPU thread, and 14 seconds using four CPU threads, on a 2019-era laptop.

Jeff Bilmes
Sat 11:20 a.m. - 11:30 a.m.
More Information, Less Data: Q&A Session (Live Q&A)
Sat 11:30 a.m. - 11:50 a.m.
  

In this talk, I will present novel theoretical results to bridge the gap in theory and practice for interpretable dimensionality reduction aka feature selection. Specifically, I will show that feature selection satisfies a weaker form of submodularity. Because of this connection, for any function, one can provide constant-factor approximation guarantees that are solely dependent on the condition number of the function. Moreover, I will discuss that the "cost of interpretability" accrued because of selecting features as opposed to principal components is not as high as was previously thought to be.

Rajiv Khanna
Sat 11:50 a.m. - 12:00 p.m.
Theory of feature selection Live Q&A (Live Q&A)
Sat 12:00 p.m. - 12:04 p.m.
  
We present algorithms that create coresets for clustering problems according to a wide subset of Bregman divergences. Our coresets have a small additive error, similar in magnitude to the lightweight coresets~\cite{bachem2018scalable}, and take an update time $O(d)$. We first give an algorithm that returns {\em online} coresets of size $\tilde{O}((dk\log k)/\epsilon^2\mu^2)$ for $k$-clustering according to any $\mu$-similar Bregman divergence. We also show the existence of {\em non-parametric} coresets, i.e. the coreset size is independent of the number of clusters $k$, as well the dimension of points $d$, for the same subclass of Bregman divergences.
Jayesh Choudhari, Anirban Dasgupta, Rachit Chhaya, Supratim Shit
Sat 12:04 p.m. - 12:09 p.m.
[ Visit Poster at Spot A5 in Virtual World ]   

Given a set V, the problem of unconstrained submodular maximization with modular costs (USM-MC) asks for a subset S \subseteq V that maximizes f(S) - c(S), where f is a non-negative, monotone, and submodular function that gauges the utility of S, and c is a non-negative and modular function that measures the cost of S. This problem finds applications in numerous practical scenarios, such as profit maximization in viral marketing on social networks.

This paper presents ROI-Greedy, a polynomial time algorithm for USM-MC that returns a solution S satisfying f(S) - c(S) >= f(S) - c(S) - c(S)\ln(f(S)/c(S)), where S is the optimal solution to USM-MC. To our knowledge, ROI-Greedy is the first algorithm that provides such a strong approximation guarantee. In addition, We show that this worst-case guarantee is tight, in the sense that no polynomial time algorithm can ensure f(S) - c(S) >= (1+\epsilon)(f(S) - c(S) - c(S) \ln(f(S)/c(S*)), for any \epsilon > 0. Further, we devise a non-trivial extension of ROI-Greedy to solve the profit maximization problem, where the precise value of f(S) for any set S is unknown and can only be approximated via sampling. Extensive experiments on benchmark datasets demonstrate that ROI-Greedy significantly outperforms competing methods in terms of the trade-off between efficiency and solution quality.

Xiaokui Xiao, Keke Huang, Jieming Shi, Renchi Yang, Yu Yang, Tianyuan Jin
Sat 12:09 p.m. - 12:14 p.m.
  

We study the practical consequences of dataset sampling strategies on the performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or ad-hoc fashion: e.g. by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonly-used data sampling schemes can have significant consequences on algorithm performance---masking performance deficiencies in algorithms or altering the relative performance of algorithms, as compared to models trained on the complete dataset. Following this observation, this paper makes the following main contributions: (1) characterizing the effect of sampling on algorithm performance, in terms of algorithm and dataset characteristics (e.g. sparsity characteristics, sequential dynamics, etc); and (2) designing SVP-CF, which is a data-specific sampling strategy, that aims to preserve the relative performance of models after sampling, and is especially suited to long-tail interaction data. Detailed experiments show that SVP-CF is more accurate than commonly used sampling schemes in retaining the relative ranking of different recommendation algorithms.

Carole-Jean Wu, Julian McAuley, Noveen Sachdeva
Sat 12:14 p.m. - 12:19 p.m.
  

Subset selection is valuable for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often eschewed due to instability and computational bottlenecks. We address these challenges using Bayesian decision analysis. For any Bayesian predictive model, we elicit an acceptable family of subsets that provide nearly-optimal linear prediction. The Bayesian model regularizes the linear coefficients and propagates uncertainty quantification for key predictive comparisons. The acceptable family spawns new variable importance metrics based on whether a variable appear in all, some, or no acceptable subsets. The proposed approach exhibits excellent prediction and stability for both simulated data (including p = 400 > n) and a large education dataset with highly correlated covariates.

Daniel Kowal
Sat 12:19 p.m. - 12:23 p.m.
  

It is preferred that feature selectors be \textit{stable} for better interpretabity and robust prediction. Ensembling is known to be effective for improving the stability of feature selectors. Since ensembling is time-consuming, it is desirable to reduce the computational cost to estimate the stability of the ensemble feature selectors. % for the data we use. We propose a simulator of a feature selector, and apply it to a fast estimation of the stability of ensemble feature selectors. To the best of our knowledge, this is the first study that estimates the stability of ensemble feature selectors and reduces the computation time theoretically and empirically.

Kenta Oono, Rina Onda
Sat 12:23 p.m. - 12:27 p.m.
  

Conditional generative adversarial networks (cGANs) have demonstrated remarkable success due to their class-wise controllability and superior quality for complex generation tasks. Typical cGANs solve the joint distribution matching problem by decomposing two easier sub-problems: marginal matching and conditional matching. From our toy experiments, we found that it is the best to apply only conditional matching to certain samples due to the content-aware optimization of the discriminator. This paper proposes a simple (a few lines of code) but effective training methodology, selective focusing learning, which enforces the discriminator and generator to learn easy samples of each class rapidly while maintaining diversity. Our key idea is to selectively apply conditional and joint matching for the data in each mini-batch. We conducted experiments on recent cGAN variants in ImageNet (64x64 and 128x128), CIFAR-10, and CIFAR-100 datasets, and improved the performance significantly (up to 35.18% in terms of FID) without sacrificing diversity.

Suk-Ju Kang, Woo-jin Song, Kyunghun Kim, Kyeongbo Kong
Sat 12:27 p.m. - 12:32 p.m.
  
We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{-\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-\frac14})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.
Lester Mackey, Raaz Dwivedi
Sat 12:32 p.m. - 12:37 p.m.
  

Active learning (AL) aims to achieve greater accuracy with less training data by selecting the most useful data samples from which it learns. Single-criterion (i.e., informativeness and representativeness) based methods are simple and efficient; however, they lack adaptability to different real-world scenarios. In this paper, we introduce a multiple-criteria based AL algorithm, which incorporates three complementary criteria, i.e., informativeness, representativeness and diversity, to make appropriate selections in the AL rounds under different data types. We consider the selection process as a Determinantal Point Process, which good balance among these criteria. We refine the query selection strategy by both selecting the hardest unlabeled data sample and biasing towards the classifiers that are more suitable for the current data distribution. In addition, we also consider the dependencies and relationships between these data points in data selection by means of centroid-based clustering approaches.

Antoni Chan, Qing Li, Xueying ZHAN
Sat 12:37 p.m. - 12:42 p.m.
  
We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $(1\pm \epsilon)$ relative error with $\tilde O(d \cdot \mu_y(X)^2/\epsilon^2)$ points, where $\mu_y(X)$ is a natural complexity measure of the data matrix $X \in \R^{n \times d}$ and label vector $y \in \{-1,1\}^n$, introduced in \cite{munteanu2018coresets}. Our result is based on subsampling data points with probabilities proportional to their \emph{$\ell_1$ Lewis weights}. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios.
Cameron Musco, Tung Mai, Anup Rao
Sat 12:42 p.m. - 12:47 p.m.
  

Recognising statistical dependence of variables is a fundamentally prominent task in modelling data relatedness, having impact on multiple data disciplines. However, it is a peculiarly challenging task often requiring manual scrutiny of data in order to uncover the correct relatedness function. Inspired by this human ability, we cast the named task as machine learning problem. Solving such problem results in models encoding abstract meaning of statistical dependence. In this paper, we report overwhelmingly and significantly superior performance to known models, especially when few data points are available and/or high noise is present. Such novel view of the named task is transformative for numerous data disciplines such as unsupervised learning, feature selection and causal inference.

Ubai Sandouk
Sat 12:47 p.m. - 12:52 p.m.
  
We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top-$k$ operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with the trainable selection we can retain its top quality while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $16\times$ more computationally efficient in the decoder. The method can be effortlessly applied to many models used in NLP and CV.
Łukasz Garncarek, Łukasz Borchmann, Michał Pietruszka
Sat 12:52 p.m. - 12:57 p.m.
  

Continual learning is the process of developing new abilities while retaining existing ones. Sequential Bayesian inference is a natural framework for this, but applying it successfully to deep neural networks remains a challenge. We propose continual function-space variational inference (C-FSVI), in which the variational distribution over functions induced by stochastic model parameters is encouraged to match the variational distribution over functions induced by stochastic parameters inferred on previous tasks. Unlike approaches that explicitly penalize changes in the model parameters, function-space regularization allows parameters to vary widely during training, resulting in greater flexibility to fit new data. C-FSVI improves on existing approaches to function-space regularization by performing inference entirely in function space and without relying on carefully selected coreset points. We show that C-FSVI outperforms alternative methods based on parameter-space and function-space regularization on a range of tasks.

Yarin Gal, Yee-Whye Teh, Qixuan Feng, Freddie Bickford Smith, Tim G. J. Rudner
Sat 12:57 p.m. - 1:02 p.m.
  

Bayesian Active Learning has focused on BALD, which reduces model parameter uncertainty. However, we show that BALD gets stuck on out-of-distribution or junk data that is not relevant for the task. We examine a novel Expected Predictive Information Gain (EPIG) to deal with distribution shifts of the pool set. EPIG reduces the uncertainty of predictions on an unlabelled evaluation set sampled from the test data distribution whose distribution might be different to the pool set distribution. Based on this, our new EPIG-BALD acquisition function for Bayesian Neural Networks selects samples to improve the performance on the test data distribution instead of selecting samples that reduce model uncertainty everywhere, including for out-of-distribution regions with low density in the test data distribution. Our method outperforms state-of-the-art Bayesian active learning methods on high-dimensional datasets and avoids out-of-distribution junk data in cases where current state-of-the-art methods fail.

Yarin Gal, Tom Rainforth, Andreas Kirsch
Sat 1:02 p.m. - 1:07 p.m.
  

Sparse Bayesian Learning (SBL) is a powerful paradigm for attaining sparsity in probabilistic models. Herein, we propose a coordinate ascent algorithm for SBL termed Relevance Matching Pursuit (RMP) and show that, as its noise variance parameter goes to zero, RMP exhibits a surprising connection to Stepwise Regression. Further, we derive novel guarantees for Stepwise Regression algorithms, which also shed light on RMP. Our guarantees for Forward Regression improve on deterministic and probabilistic results for Orthogonal Matching Pursuit with noise. Our analysis of Backward Regression on determined systems culminates in a bound on the residual of the optimal solution to the {\it subset selection} problem that, if satisfied, guarantees the {\it optimality} of the result. To our knowledge, this bound is the first that can be computed in polynomial time and depends chiefly on the smallest singular value of the matrix.

Carla Gomes, Sebastian Ament
Sat 1:07 p.m. - 1:10 p.m.
  

Because deep learning is vulnerable to noisy labels, sample selection techniques, which train networks with only clean labeled data, have attracted a great attention. However, if the labels are dominantly corrupted by few classes, these noisy samples are called dominant-noisy-labeled samples, the network also learns dominant-noisy-labeled samples rapidly via content-aware optimization. In this study, we propose a compelling criteria to penalize dominant-noisy-labeled samples intensively through class-wise penalty labels. By averaging prediction confidences for the each observed label, we obtain suitable penalty labels that have high values if the labels are largely corrupted by some classes. Experiments were performed using benchmarks (CIFAR-10, CIFAR-100, Tiny-ImageNet) and real-world datasets (ANIMAL-10N, Clothing1M) to evaluate the proposed criteria in various scenarios with different noise rates. Using the proposed sample selection, the learning process of the network becomes significantly robust to noisy labels compared to existing methods in several noise types.

Woo-jin Song, Seong-Eun Kim, Young-Rae Cho, Youngchul Kwak, Junggi Lee, Kyeongbo Kong
Sat 1:10 p.m. - 1:59 p.m.

We will be using gather town for the poster sessions. The links are below.

Room 1: https://eventhosts.gather.town/nI2bwxKP1HljbkId/subsetml-room-1 Room 2: https://eventhosts.gather.town/GV1QzCxwr3qkeeLU/subsetml-room-2

Sat 1:59 p.m. - 2:00 p.m.
Introduction to Invited Talk (Live Intro)
Sat 2:00 p.m. - 2:29 p.m.
  

Large datasets have been crucial to the success of modern machine learning models. However, training on massive data has two major limitations. First, it is contingent on exceptionally large and expensive computational resources, and incurs a substantial cost due to the significant energy consumption. Second, in many real-world applications such as medical diagnosis, self-driving cars, and fraud detection, big data contains highly imbalanced classes and noisy labels. In such cases, training on the entire data does not result in a high-quality model.

In this talk, I will argue that we can address the above limitations by developing techniques that can identify and extract the representative subsets for learning from massive datasets. Training on representative subsets not only reduces the substantial costs of learning from big data, but also improves their accuracy and robustness against noisy labels. I will discuss how we can develop theoretically rigorous techniques that provide strong guarantees for the quality of the extracted subsets, as well as the learned models’ quality and robustness against noisy labels. I will also show the effectiveness of such methods in practice for data-efficient and robust learning.

Baharan Mirzasoleiman
Sat 2:29 p.m. - 2:30 p.m.
Data-efficient and Robust Learning from Massive Datasets Live Q&A (Live Q&A)
Sat 2:30 p.m. - 2:50 p.m.
  

Data selection methods, such as active learning and core-set selection, improve the data efficiency of machine learning by identifying the most informative data points to label or train on. Across the data selection literature, there are many ways to identify these training examples. However, classical data selection methods are prohibitively expensive to apply in deep learning because of the larger datasets and models. To make these methods tractable, we propose (1) “selection via proxy” (SVP) to avoid expensive training and reduce the computation per example and (2) “similarity search for efficient active learning and search” (SEALS) to reduce the number of examples processed. Both methods lead to order of magnitude performance improvements, making techniques like active learning on billions of unlabeled images practical for the first time.

Cody Coleman
Sat 2:50 p.m. - 3:00 p.m.
Computationally Efficient Data Selection for Deep Learning Live Q&A (Live Q&A)
Sat 3:00 p.m. - 3:05 p.m.
  
Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are non-linear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. The Bayesian framework makes it straightforward to simultaneously express sparsity, non-linearity, and interactions in a hierarchical model. But, as for the few other methods that handle this trifecta, inference is computationally intractable --- with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We first show that suitable Bayesian models can be represented as Gaussian processes (GPs). We then demonstrate how a kernel trick can reduce computation with these GPs to $O$(\# covariates) time for both variable selection and estimation. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real datasets, our approach outperforms existing methods used for large, high-dimensional datasets while remaining competitive (or being orders of magnitude faster) in runtime.
Tamara Broderick, Raj Agrawal
Sat 3:05 p.m. - 3:10 p.m.
  

We consider the task of personalizing ASR models while being constrained by a fixed budget on recording speaker specific utterances. Given a speaker and an ASR model, we propose a method of identifying sentences for which the speaker's utterances are likely to be harder for the given ASR model to recognize. We assume a tiny amount of speaker-specific data to learn phoneme-level error models which help us select such sentences. We show that speaker's utterances on the sentences selected using our error model indeed have larger error rates when compared to speaker's utterances on randomly selected sentences. We find that fine-tuning the ASR model on the sentence utterances selected with the help of error models yield higher WER improvements in comparison to fine-tuning on an equal number of randomly selected sentence utterances. Thus, our method provides an efficient way of collecting speaker utterances under budget constraints for personalizing ASR models.

Preethi Jyothi, Sunita Sarawagi, Abhijeet Awasthi
Sat 3:10 p.m. - 3:15 p.m.
  
Multiple imputation (MI) has been widely applied to missing value problems in biomedical, social and econometric research, in order to avoid improper inference in the downstream data analysis. In the presence of high-dimensional data, imputation models that include feature selection, especially $\ell_1$ regularized regression (such as Lasso, adaptive Lasso and Elastic Net), are common choices to prevent the model from underdetermination. However, conducting MI with feature selection is difficult: existing methods are often computationally inefficient and poor in performance. We propose MISNN, a novel and efficient algorithm that incorporates feature selection for MI. Leveraging the approximation power of neural networks, MISNN is a general and flexible framework, compatible to any feature selection method, any neural network architecture, high/low-dimensional data and general missing patterns. Through empirical experiments, MISNN has demonstrated great advantages over state-of-the-art imputation methods (e.g. Bayesian Lasso and matrix completion), in terms of imputation accuracy, statistical consistency and computation speed.
Qi Long, Yiliang Zhang, Zongyu Dai, Woody Bu
Sat 3:15 p.m. - 3:20 p.m.
  

Air pollution is a global problem and has a severe impact on human health. Fine-grained air quality (AQ) monitoring is important in mitigating air pollution. However, existing AQ station deployments are sparse due to installation and operational costs. In this work, we propose an Active Learning-based method for air quality station deployment. We use Gaussian processes and several classical machine learning algorithms (Random Forest, K-Neighbors and Support Vector Machine) to benchmark several active learning strategies for two real-world air quality datasets (Delhi, India and Beijing, China).

Nipun Batra, Zeel B Patel
Sat 3:20 p.m. - 3:25 p.m.
  

Neural architecture search (NAS), an important branch of automatic machine learning, has become an effective approach to automate the design of deep learning models. However, the major issue in NAS is how to reduce the large search time imposed by the heavy computational burden. While most recent approaches focus on pruning redundant sets or developing new search methodologies, this paper attempts to formulate the problem based on the data curation manner. Our key strategy is to search the architecture using summarized data distribution, i.e., core-set. Typically, many NAS algorithms separate searching and training stages, and the proposed core-set methodology is only used in search stage, thus their performance degradation can be minimized. In our experiments, we were able to save overall computational time from 30.8 hours to 3.5 hours, 8.8x reduction, on a single RTX 3090 GPU without sacrificing accuracy.

Suk-Ju Kang, Kyeongbo Kong, Jae-hun Shim
Sat 3:25 p.m. - 3:30 p.m.
  
In this work we present an algorithm to construct coresets for fair regression for a dataset of $n$ points in $d$ dimensions and having some discrete-valued protected attribute. For (fair) regression with statistical parity we first define the notion of a coreset and give a coreset that incurs only a small additive error in satisfying the constraints and a relative error in the objective function. The coreset size is linear in $\ell$, the number of unique values the protected attribute can take, and the dependence on dimension is near linear ($d \log{d}$). To the best of our knowledge this is the first coreset for fair regression problem with statistical parity. We also empirically demonstrate the performance of our coresets on real data sets.
Jayesh Choudhari, Supratim Shit, Anirban Dasgupta, Rachit Chhaya
Sat 3:30 p.m. - 3:35 p.m.
  

In this paper, we study the problem of evaluating the addition of elements to a set. This problem is difficult, because it can, in the general case, not be reduced to unconditional preferences between the choices. Therefore, we model preferences based on the context of the decision. We discuss and compare two different Siamese network architectures for this task: a twin network that compares the two sets resulting after the addition, and a triplet network that models the contribution of each candidate to the existing set. We evaluate the two settings on a real-world task; learning human card preferences for deck building in the collectible card game Magic: The Gathering. We show that the triplet approach achieves a better result than the twin network and that both outperform previous results on this task.

Martin Müller, Johannes Fürnkranz, Timo Bertram
Sat 3:35 p.m. - 3:40 p.m.
  

Curriculum learning is a training strategy that sorts the training examples by some measure of their difficulty and gradually exposes them in that order to the learner to improve the network performance. In this work, we first propose and study a dynamic curriculum learning algorithm. Our dynamic curriculum algorithm greedily samples a subset of the training data whose gradients are directed towards the optimal weight. Motivated by our insights from the dynamic curriculum learning ordering and implicit curriculum ordering, we introduce a simple, practical curriculum learning strategy that uses statistical measures such as standard deviation and entropy values to score the difficulty of data points for real image classification tasks. Further, we also use our algorithms to discuss why curriculum learning is helpful.

Anirban Dasgupta, Vinu Sankar Sadasivan
Sat 3:40 p.m. - 3:45 p.m.
  

Given a training data set, we develop a method to obtain a representative subset of training data meant for classification. Our focus is on subsets that are particularly suitable for training linear classification models. Our method is to reduce data sets in a manner that essentially preserves the convex hulls of the various classes of the data set. We develop scalable randomized algorithms for convex hull approximation, and obtain rigorous guarantees for models trained on subsets of linearly separable data sets. We empirically demonstrate the effectiveness of our approach for both training and data augmentation on the MNIST data set.

Bill Kay, Viveck Cadambe
Sat 3:45 p.m. - 3:50 p.m.
  

Many machine learning problems involve processing large datasets, thus making them expensive computational tasks. One mitigation strategy selects representative subsets of the data and one way to achieve this is via coresets~- a weighted subset of the given data, such that the cost function under consideration can be additively and/or multiplicatively approximated. While much previous work chooses these subsets using independent importance-based sampling, such methods can be suboptimal when the system possesses redundancy. Recently, Tremblay et al. 2019 considered sampling from fixed-sized determinantal point process (DPP) and gave lower bounds on the coreset sample complexity. While DPPs often empirically outperform importance-based sampling methods, their coreset size bound is worse compared to independent sampling. In this work, we extend their theoretical results by improving the lower bound on sample complexity for coreset approximation by a log(1 / epsilon) factor using chaining, a technique in empirical process literature. This, therefore, moves a step further in bridging the gap between theoretical justification and empirical observations. We provide additive approximation guarantees for the associated cost function and allow the possibility of negative weights, which has a direct application in layer-wise pruning of deep neural networks. On top of additive approximation guarantees, we also extend our bounds to the scenarios with multiplicative approximations.

Jeff Bilmes, Gantavya Bhatt
Sat 3:50 p.m. - 3:55 p.m.
  

Active learning has proven to be useful for minimizing labeling costs by selecting the most informative samples. However, existing active learning methods do not work well in realistic scenarios such as imbalance or rare classes, out-of-distribution data in the unlabeled set, and redundancy. In this work, we propose SIMILAR (Submodular Information Measures based actIve LeARning), a unified active learning framework using recently proposed submodular information measures (SIM) as acquisition functions. We argue that SIMILAR not only works in standard active learning but also easily extends to the realistic settings considered above and acts as a one-stop solution for active learning that is scalable to large real-world datasets. Empirically, we show that SIMILAR significantly outperforms existing active learning algorithms by as much as ≈ 5% − 18% in the case of rare classes and ≈ 5% − 10% in the case of out-of-distribution data on CIFAR-10 image classification.

Rishabh Iyer, Krishnateja Killamsetty, Suraj N Kothawade
Sat 3:55 p.m. - 3:59 p.m.
  

In this paper, we study mixed continuous-discrete minimax problems where the minimization is over a continuous variable belonging to Euclidean space and the maximization is over subsets of a given ground set. We introduce the class of convex-submodular minimax problems, where the objective is convex with respect to the continuous variable and submodular with respect to the discrete variable. Even though such problems appear frequently in machine learning applications, little is known about how to address them from algorithmic and theoretical perspectives. For such problems, we first show that obtaining saddle points are hard up to any approximation, and thus introduce new notions of (near-) optimality. We then provide several algorithmic procedures for solving convex and monotone-submodular minimax problems and characterize their convergence rates, computational complexity, and quality of the final solution. Finally, we provide experiments to showcase the effectiveness of our purposed methods on training robust recommender systems in the presence of poisoning attacks.

Hamed Hassani, Aryan Mokhtari, Arman Adibi
Sat 3:59 p.m. - 4:04 p.m.
  
In this paper, we consider an online optimization problem over $T$ rounds where at each step $t\in[T]$, the algorithm chooses an action $x_t$ from the fixed convex and compact domain set $\mathcal{K}$. A utility function $f_t(\cdot)$ is then revealed and the algorithm receives the payoff $f_t(x_t)$. This problem has been previously studied under the assumption that the utilities are adversarially chosen monotone DR-submodular functions and $\mathcal{O}(\sqrt{T})$ regret bounds have been derived. We first characterize the class of strongly DR-submodular functions and then, we derive regret bounds for the following new online settings: $(1)$ $\{f_t\}_{t=1}^T$ are monotone strongly DR-submodular and chosen adversarially, $(2)$ $\{f_t\}_{t=1}^T$ are monotone submodular (while the average $\frac{1}{T}\sum_{t=1}^T f_t$ is strongly DR-submodular) and chosen by an adversary but they arrive in a uniformly random order, $(3)$ $\{f_t\}_{t=1}^T$ are drawn i.i.d. from some unknown distribution $f_t\sim \mathcal{D}$ where the expected function $f(\cdot)=\mathbb{E}_{f_t\sim\mathcal{D}}[f_t(\cdot)]$ is monotone DR-submodular. For $(1)$, we obtain the first logarithmic regret bounds. In terms of the second framework, we show that it is possible to obtain similar logarithmic bounds with high probability. Finally, for the i.i.d. model, we provide algorithms with $\tilde{\mathcal{O}}(\sqrt{T})$ stochastic regret bound, both in expectation and with high probability.
Maryam Fazel, Omid Sadeghi
Sat 4:04 p.m. - 4:09 p.m.
  
Numerous tasks in machine learning and artificial intelligence have been modeled as submodular maximization problems. These problems usually involve sensitive data about individuals, and in addition to maximizing the utility, privacy concerns should be considered. In this paper, we study the general framework of non-negative monotone submodular maximization subject to matroid or knapsack constraints in both offline and online settings. For the offline setting, we propose a differentially private $(1-\frac{\kappa}{e})$-approximation algorithm, where $\kappa\in[0,1]$ is the total curvature of the submodular set function, which improves upon prior works in terms of approximation guarantee and query complexity under the same privacy budget. In the online setting, we propose the first differentially private algorithm, and we specify the conditions under which the regret bound scales as $\O(\sqrt{T})$, i.e., privacy could be ensured while maintaining the same regret bound as the optimal regret guarantee in the non-private setting.
Maryam Fazel, Omid Sadeghi
Sat 4:09 p.m. - 4:14 p.m.
  

With the goal of making deep learning more label-efficient, a growing number of papers have been studying active learning (AL) for deep models. However, there are a number of issues in the prevalent experimental settings, mainly stemming from a lack of unified implementation and benchmarking. Issues in the current literature include sometimes contradictory observations on the performance of different AL algorithms, unintended exclusion of important generalization approaches such as data augmentation and SGD for optimization, a lack of study of evaluation facets like the labeling efficiency of AL, and little or no clarity on the scenarios in which AL outperforms random sampling (RS). In this work, we present a unified re-implementation of state-of-the-art AL algorithms in the context of image classification, and we carefully study these issues as facets of effective evaluation. On the positive side, we show that AL techniques are 2x to 4x more label-efficient compared to RS with the use of data augmentation. Surprisingly, when data augmentation is included, there is no longer a consistent gain in using BADGE, a state-of-the-art approach, over simple uncertainty sampling. We then do a careful analysis of how existing approaches perform with varying number of examples per class. Finally, we provide other insights for AL practitioners to consider in future work, such as the use of data subset selection techniques for intermediate model training.

Rishabh Iyer, Ganesh Ramakrishnan, Durga S, Nathan Beck
Sat 4:14 p.m. - 4:19 p.m.
  

We systematically study the query complexity of learning geodesically convex halfspaces on graphs. Geodesic convexity is a natural generalisation of Euclidean convexity and allows the definition of convex sets and halfspaces on graphs. We prove upper bounds on the query complexity linear in the treewidth and the minimum hull set size but only logarithmic in the diameter. We show tight lower bounds corresponding to well-established separation axioms and identify the Radon number as a central parameter bounding the query complexity and the VC dimension. While previous bounds typically depend on the cut size of the labelling, all parameters in our bounds can be computed from the unlabelled graph. We empirically compare our proposed approach with other active learning algorithms and provide evidence that ground-truth communities in real-world graphs are often convex.

Thomas Gärtner, Maximilian Thiessen
Sat 4:19 p.m. - 4:23 p.m.
  
Classes of set functions along with a choice of ground set are a bedrock to determine and develop corresponding variants of greedy algorithms to suitably obtain approximate and efficient solutions for combinatorial optimization. The class of constrained submodular optimization has seen huge advances at the intersection of good computational efficiency, versatility and approximation guarantees while unconstrained submodular optimization is NP-hard. What is an alternative to situations when submodularity does not hold? Can efficient and globally exact solutions be obtained? We introduce one such new frontier: The class of quasi-concave set functions induced as a dual class to monotone linkage functions. We provide a parallel algorithm with a time complexity over $n$ processors of $\mathcal{O}(n^2g) +\mathcal{O}(\log{\log{n}})$ where $n$ is the cardinality of the ground set and $g$ is the complexity to compute the monotone linkage function that induces a corresponding quasi-concave set function via a duality. The complexity reduces to $\mathcal{O}(gn\log(n))$ on $n^2$ processors and to $\mathcal{O}(gn)$ on $n^3$ processors. Our approach reduces the currently existing cubic computational complexity to those mentioned above. Our algorithm provides a globally optimal solution to a maxi-min problem as opposed to submodular optimization which is approximate. We show a potential for widespread applications via an example of diverse feature subset selection with exact global maxi-min guarantees upon showing that a statistical dependency measure called distance correlation can be used to induce a quasi-concave set function.
Ramesh Raskar, Praneeth Vepakomma
Sat 4:23 p.m. - 4:30 p.m.

Concluding Remarks to the Workshop. Thank you for attending SubSetML 2021!

-
[ Visit Poster at Spot C6 in Virtual World ]

We introduce a novel architecture for graph networks which is equivariant to any transformation in the coordinate embeddings that preserves the distance between neighbouring nodes. In particular, it is equivariant the Euclidean and conformal orthogonal groups in n-dimensions. Thanks to its equivariance properties, the proposed model is extremely more data efficient with respect to classical graph architectures and also intrinsically equipped with a better inductive bias. We show that, learning on a minimal amount of data, the architecture we propose can perfectly generalise to unseen data in a synthetic problem, while much more training data are required from a standard model to reach comparable performance.

Francesco Farina, Emma Slade
-
[ Visit Poster at Spot C5 in Virtual World ]

Generative Adversarial Networks (GANs) have recently achieved unprecedented success in photo-realistic image synthesis from low-dimensional random noise. The ability to synthesize high-quality content at a large scale brings potential risks as the generated samples may lead to misinformation that can create severe social, political, health, and business hazards. We propose SubsetGAN to identify generated content by detecting a subset of anomalous node-activations in the inner layers of pre-trained neural networks. These nodes, as a group, maximize a non-parametric measure of divergence away from the expected distribution of activations created from real data. This enable us to identify synthesised images without prior knowledge of their distribution. SubsetGAN efficiently scores subsets of nodes and returns the group of nodes within the pre-trained classifier that contributed to the maximum score. The classifier can be a general fake classifier trained over samples from multiple sources or the discriminator network from different GANs. Our approach shows consistently higher detection power than existing detection methods across several state-of-the-art GANs (PGGAN, StarGAN, and CycleGAN) and over different proportions of generated content.

Celia Cintas, Skyler Speakman, Girmaw Abebe Tadesse, Victor Akinwande, Kommy Weldemariam
-
[ Visit Poster at Spot C4 in Virtual World ]

Ordinal embedding is the task of computing a meaningful multi-dimensional representation of objects, for which only qualitative constraints on their distance functions are known. In particular, we consider comparisons of the form “Which object from the pair (j, k) is more similar to object i?”. In this paper, we generalize this framework to the case where the ordinal constraints are not given at the level of individual points, but at the level of sets, and propose a distributional triplet embedding approach in a scalable learning framework. We show that the query complexity of our approach is on par with the single item approach. Without having access to features of the items to be embedded, we show the applicability of our model on toy datasets for the task of reconstruction, and demonstrate the validity of the obtained embeddings in experiments on synthetic and real-world datasets.

Aissatou Diallo, Johannes Fürnkranz
-
[ Visit Poster at Spot C3 in Virtual World ]

We propose a new gradient-based approach for extracting sub-architectures from a given large model. Contrarily to existing pruning methods, which are unable to disentangle the network architecture and the corresponding weights, our architecture-pruning scheme produces transferable new structures that can be successfully re-trained to solve different tasks. We focus on a transfer-learning setup where architectures can be trained on a large data set but very few data points are available for fine-tuning them on new tasks. We define a new gradient-based algorithm that trains architectures of arbitrarily low complexity independently from the attached weights. Given a search space defined by an existing large neural mode, we reformulate the architecture search task as a complexity-penalized subset-selection problem and solve it through a two-temperature relaxation scheme. We provide theoretical convergence guarantees and validate the proposed transfer-learning strategy on real data.

Nicolo Colombo, Yang Gao
-
[ Visit Poster at Spot C2 in Virtual World ]

Not all examples are created equal, but standard deep neural network training protocols treat each training point uniformly. Each example is propagated forward and backwards through the network the same amount of times, independent of how much the example contributes to the learning protocol. Recent work has proposed ways to accelerate training by deviating from this uniform treatment. Popular methods entail up-weighting examples that contribute more to the loss with the intuition that examples with low loss have already been learned by the model, so their marginal value to the training procedure should be lower. This view assumes that updating the model with high loss examples will be beneficial to the model. However, this may not hold for noisy, real-world data. In this paper, we theorize and then empirically demonstrate that loss-based acceleration methods degrade in scenarios with noisy and corrupted data. Our work suggests measures of example difficulty need to correctly separate out noise from other types of challenging examples.

Niel Hu, Xinyu Hu, Rosanne Liu, Sara Hooker, Jason Yosinski
-
[ Visit Poster at Spot C1 in Virtual World ]

In this paper, we present novel variations of an earlier approach called homogeneous clustering algorithm for reducing dataset size. The intuition behind the approaches proposed in this paper is to partition the dataset into homogeneous clusters and select some images which contribute significantly to the accuracy. We place an important criteria on sampled images that they should be human-readable. We propose four variations: RHCKON, KONCW, CWKC & GHCIDR upon the baseline algorithm - RHC to achieve better accuracy. RHCKON involves selecting k farthest and one nearest neighbour of the centroid of the clusters. The intuition behind RHCKON is that the boundary points contribute significantly towards the representation of clusters. KONCW and CWKC introduce cluster weights to RHCKON. They are based on the fact that larger clusters contribute more than smaller sized clusters. The final variation is GHCIDR which selects points based on the geometrical aspect of data distribution. We performed the experiments on two deep learning models- Fully Connected Networks (FCN), and VGG1. We experimented the four variants on three datasets- MNIST, CIFAR10, and Fashion-MNIST. We found that GHCIDR gave the best accuracy of 99.35%, 81.10%, and 91.66% and a training data reduction of 87.27%, 32.34%, and 76.80% on MNIST, CIFAR10, and Fashion-MNIST respectively.

Shril Mody, Janvi Thakkar, Devvrat Joshi, Siddharth Soni, Nipun Batra, Rohan Patil
-
[ Visit Poster at Spot C0 in Virtual World ]

A fundamental problem in machine learning is developing data summarization, also known as coreset construction, techniques with minimal impact on model accuracy. However, a practical difficulty of the majority of existing methods is the lack of rigorous strategies for identifying the optimal coreset size. Moreover, these methods are often built around specific machine learning models, making it difficult for practitioners to apply them in various applications. This paper presents an interactive teaching method for adaptively constructing a small subset of representative samples to address these problems. Numerical experiments on three imbalanced data sets indicate the great potential and applicability of the proposed approach.

Farhad Pourkamali-Anaraki, Walter Bennette
-
[ Visit Poster at Spot B6 in Virtual World ]

Information theory is of importance to machine learning, but the notation for information-theoretic quantities is not always clear. The right notation can convey valuable intuitions and concisely express new ideas. We propose such a notation for information-theoretic quantities between events (outcomes) and random variables. We apply this notation to BALD (Bayesian Active Learning by Disagreement), an acquisition function in Bayesian active learning: it selects the most informative (unlabelled) samples for labeling by an expert; and extend BALD to the core-set problem, which consists of selecting the most informative samples given the labels.

Andreas Kirsch, Yarin Gal
-
[ Visit Poster at Spot B5 in Virtual World ]

Vehicle routing problems (VRPs) are a class of combinatorial problems with wide practical applications. While previous heuristic or learning-based works achieve decent solutions on small problem instances of up to 100 customers, their performance does not scale to large problems. This article presents a novel learning-augmented local search algorithm to solve large-scale VRP. The method iteratively improves the solution by identifying appropriate subproblems and delegating their improvement to a black box subsolver. At each step, we leverage spatial locality to consider only a linear number of subproblems, rather than exponential. We frame subproblem selection as a regression problem and train a Transformer on a generated training set of problem instances. We show that our method achieves state-of-the-art performance, with a speed-up of up to 15 times over strong baselines, on VRPs with sizes ranging from 500 to 3000.

Sirui Li, Zhongxia Yan, Cathy Wu
-
[ Visit Poster at Spot B4 in Virtual World ]

Recommender systems powering online multi-stakeholder platforms often face the challenge of jointly optimizing for multiple objectives, in an attempt to efficiently match suppliers and consumers. Examples of such objectives include user behavioral metrics (e.g. clicks, streams, dwell time, etc), supplier exposure objectives (e.g. diversity) and platform centric objectives (e.g. promotions). Jointly optimizing multiple metrics in online recommender systems remains a challenging task. In this work, we propose a multi-objective diversification technique via submodular optimization and counterfactual scoring to sequence music tracks in sequential recommendation scenarios. Based on a large scale live AB test deployment of the proposed sequencing method, we empirically demonstrate the value of submodular diversification in balancing across multiple objectives.

Rishabh Mehrotra
-
[ Visit Poster at Spot B3 in Virtual World ]

In recent years, deep learning models have found great success in various tasks viz., object detection, speech recognition, and translation, making the everyday lives of people easier. Despite the success, training a deep learning model is often challenging as its performance depends mainly on the hyperparameters used. Moreover, finding the best hyperparameter configuration is often time-consuming, even when using state-of-the-art (SOTA) hyper-parameter optimization algorithms as they require multiple training runs over the entire dataset for different possible sets of hyperparameters. Our main insight is that using a subset of the dataset representing the entire dataset for model training runs involved in hyper-parameter optimization allows us to find the optimal hyperparameter configuration significantly faster. In this work, we explore using the data subsets selected using the existing supervised learning-based data subset selection methods, namely \textsc{Craig}, \textsc{Glister}, \textsc{Grad-Match}, for model training runs involved in hyper-parameter optimization. Further, we empirically demonstrate through several experiments on real-world datasets that using data subsets for hyper-parameter optimization achieves significantly faster turnaround times for hyper-parameter selection that achieves comparable performance to the hyper-parameters found using the entire dataset.

Savan Amitbhai Visalpara, Krishnateja Killamsetty, Rishabh Iyer
-
[ Visit Poster at Spot B2 in Virtual World ]

We introduce GoldiProx Selection, a technique for faster model training which selects a sequence of training points that are just right’’. We propose an information-theoretic acquisition function---the reducible validation loss---and compute it with a small proxy model to efficiently choose training points that maximally inform predictions on a validation set. We show that thehard’’ (e.g. high loss) points usually selected in the optimization literature are typically noisy, while the easy’’ (e.g. low noise) samples often prioritized for curriculum learning confer less information. Further, points with uncertain labels, typically targeted by active learning, tend to be less relevant to the task. In contrast, GoldiProx Selection chooses points that arejust right’’ and empirically outperforms the above approaches. Moreover, a single GoldiProx Sequence can accelerate training across architectures; practitioners can share and reuse it without the need to recompute it.

Sören Mindermann, Muhammed Razzak, Adrien Morisot, Aidan Gomez, Sebastian Farquhar, Jan Brauner, Yarin Gal
-
[ Visit Poster at Spot B1 in Virtual World ]
We present algorithms that create coresets for clustering problems according to a wide subset of Bregman divergences. Our coresets have a small additive error, similar in magnitude to the lightweight coresets~\cite{bachem2018scalable}, and take an update time $O(d)$. We first give an algorithm that returns {\em online} coresets of size $\tilde{O}((dk\log k)/\epsilon^2\mu^2)$ for $k$-clustering according to any $\mu$-similar Bregman divergence. We also show the existence of {\em non-parametric} coresets, i.e. the coreset size is independent of the number of clusters $k$, as well the dimension of points $d$, for the same subclass of Bregman divergences.
Supratim Shit, Rachit Chhaya, Anirban Dasgupta, Jayesh Choudhari
-
[ Visit Poster at Spot A2 in Virtual World ]

Given a set V, the problem of unconstrained submodular maximization with modular costs (USM-MC) asks for a subset S \subseteq V that maximizes f(S) - c(S), where f is a non-negative, monotone, and submodular function that gauges the utility of S, and c is a non-negative and modular function that measures the cost of S. This problem finds applications in numerous practical scenarios, such as profit maximization in viral marketing on social networks.

This paper presents ROI-Greedy, a polynomial time algorithm for USM-MC that returns a solution S satisfying f(S) - c(S) >= f(S) - c(S) - c(S)\ln(f(S)/c(S)), where S is the optimal solution to USM-MC. To our knowledge, ROI-Greedy is the first algorithm that provides such a strong approximation guarantee. In addition, We show that this worst-case guarantee is tight, in the sense that no polynomial time algorithm can ensure f(S) - c(S) >= (1+\epsilon)(f(S) - c(S) - c(S) \ln(f(S)/c(S*)), for any \epsilon > 0. Further, we devise a non-trivial extension of ROI-Greedy to solve the profit maximization problem, where the precise value of f(S) for any set S is unknown and can only be approximated via sampling. Extensive experiments on benchmark datasets demonstrate that ROI-Greedy significantly outperforms competing methods in terms of the trade-off between efficiency and solution quality.

Tianyuan Jin, Yu Yang, Renchi Yang, Jieming Shi, Keke Huang, Xiaokui Xiao
-
[ Visit Poster at Spot B0 in Virtual World ]

We study the practical consequences of dataset sampling strategies on the performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or ad-hoc fashion: e.g. by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonly-used data sampling schemes can have significant consequences on algorithm performance---masking performance deficiencies in algorithms or altering the relative performance of algorithms, as compared to models trained on the complete dataset. Following this observation, this paper makes the following main contributions: (1) characterizing the effect of sampling on algorithm performance, in terms of algorithm and dataset characteristics (e.g. sparsity characteristics, sequential dynamics, etc); and (2) designing SVP-CF, which is a data-specific sampling strategy, that aims to preserve the relative performance of models after sampling, and is especially suited to long-tail interaction data. Detailed experiments show that SVP-CF is more accurate than commonly used sampling schemes in retaining the relative ranking of different recommendation algorithms.

Noveen Sachdeva, Julian McAuley, Carole-Jean Wu
-
[ Visit Poster at Spot A6 in Virtual World ]

Subset selection is valuable for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often eschewed due to instability and computational bottlenecks. We address these challenges using Bayesian decision analysis. For any Bayesian predictive model, we elicit an acceptable family of subsets that provide nearly-optimal linear prediction. The Bayesian model regularizes the linear coefficients and propagates uncertainty quantification for key predictive comparisons. The acceptable family spawns new variable importance metrics based on whether a variable appear in all, some, or no acceptable subsets. The proposed approach exhibits excellent prediction and stability for both simulated data (including p = 400 > n) and a large education dataset with highly correlated covariates.

Daniel Kowal
-
[ Visit Poster at Spot A5 in Virtual World ]

It is preferred that feature selectors be \textit{stable} for better interpretabity and robust prediction. Ensembling is known to be effective for improving the stability of feature selectors. Since ensembling is time-consuming, it is desirable to reduce the computational cost to estimate the stability of the ensemble feature selectors. % for the data we use. We propose a simulator of a feature selector, and apply it to a fast estimation of the stability of ensemble feature selectors. To the best of our knowledge, this is the first study that estimates the stability of ensemble feature selectors and reduces the computation time theoretically and empirically.

Rina Onda, Kenta Oono
-
[ Visit Poster at Spot A4 in Virtual World ]

Conditional generative adversarial networks (cGANs) have demonstrated remarkable success due to their class-wise controllability and superior quality for complex generation tasks. Typical cGANs solve the joint distribution matching problem by decomposing two easier sub-problems: marginal matching and conditional matching. From our toy experiments, we found that it is the best to apply only conditional matching to certain samples due to the content-aware optimization of the discriminator. This paper proposes a simple (a few lines of code) but effective training methodology, selective focusing learning, which enforces the discriminator and generator to learn easy samples of each class rapidly while maintaining diversity. Our key idea is to selectively apply conditional and joint matching for the data in each mini-batch. We conducted experiments on recent cGAN variants in ImageNet (64x64 and 128x128), CIFAR-10, and CIFAR-100 datasets, and improved the performance significantly (up to 35.18% in terms of FID) without sacrificing diversity.

Kyeongbo Kong, Kyunghun Kim, Woo-jin Song, Suk-Ju Kang
-
[ Visit Poster at Spot A3 in Virtual World ]
We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{-\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-\frac14})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.
Raaz Dwivedi, Lester Mackey
-
[ Visit Poster at Spot A2 in Virtual World ]

Active learning (AL) aims to achieve greater accuracy with less training data by selecting the most useful data samples from which it learns. Single-criterion (i.e., informativeness and representativeness) based methods are simple and efficient; however, they lack adaptability to different real-world scenarios. In this paper, we introduce a multiple-criteria based AL algorithm, which incorporates three complementary criteria, i.e., informativeness, representativeness and diversity, to make appropriate selections in the AL rounds under different data types. We consider the selection process as a Determinantal Point Process, which good balance among these criteria. We refine the query selection strategy by both selecting the hardest unlabeled data sample and biasing towards the classifiers that are more suitable for the current data distribution. In addition, we also consider the dependencies and relationships between these data points in data selection by means of centroid-based clustering approaches.

Xueying ZHAN, Qing Li, Antoni Chan
-
[ Visit Poster at Spot A0 in Virtual World ]
We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $(1\pm \epsilon)$ relative error with $\tilde O(d \cdot \mu_y(X)^2/\epsilon^2)$ points, where $\mu_y(X)$ is a natural complexity measure of the data matrix $X \in \R^{n \times d}$ and label vector $y \in \{-1,1\}^n$, introduced in \cite{munteanu2018coresets}. Our result is based on subsampling data points with probabilities proportional to their \emph{$\ell_1$ Lewis weights}. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios.
Anup Rao, Tung Mai, Cameron Musco
-
[ Visit Poster at Spot A1 in Virtual World ]

Recognising statistical dependence of variables is a fundamentally prominent task in modelling data relatedness, having impact on multiple data disciplines. However, it is a peculiarly challenging task often requiring manual scrutiny of data in order to uncover the correct relatedness function. Inspired by this human ability, we cast the named task as machine learning problem. Solving such problem results in models encoding abstract meaning of statistical dependence. In this paper, we report overwhelmingly and significantly superior performance to known models, especially when few data points are available and/or high noise is present. Such novel view of the named task is transformative for numerous data disciplines such as unsupervised learning, feature selection and causal inference.

Ubai Sandouk
-
[ Visit Poster at Spot D2 in Virtual World ]

Because deep learning is vulnerable to noisy labels, sample selection techniques, which train networks with only clean labeled data, have attracted a great attention. However, if the labels are dominantly corrupted by few classes, these noisy samples are called dominant-noisy-labeled samples, the network also learns dominant-noisy-labeled samples rapidly via content-aware optimization. In this study, we propose a compelling criteria to penalize dominant-noisy-labeled samples intensively through class-wise penalty labels. By averaging prediction confidences for the each observed label, we obtain suitable penalty labels that have high values if the labels are largely corrupted by some classes. Experiments were performed using benchmarks (CIFAR-10, CIFAR-100, Tiny-ImageNet) and real-world datasets (ANIMAL-10N, Clothing1M) to evaluate the proposed criteria in various scenarios with different noise rates. Using the proposed sample selection, the learning process of the network becomes significantly robust to noisy labels compared to existing methods in several noise types.

Kyeongbo Kong, Junggi Lee, Youngchul Kwak, Young-Rae Cho, Seong-Eun Kim, Woo-jin Song
-
[ Visit Poster at Spot D1 in Virtual World ]
We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top-$k$ operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with the trainable selection we can retain its top quality while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $16\times$ more computationally efficient in the decoder. The method can be effortlessly applied to many models used in NLP and CV.
Michał Pietruszka, Łukasz Borchmann, Łukasz Garncarek
-
[ Visit Poster at Spot D0 in Virtual World ]

Continual learning is the process of developing new abilities while retaining existing ones. Sequential Bayesian inference is a natural framework for this, but applying it successfully to deep neural networks remains a challenge. We propose continual function-space variational inference (C-FSVI), in which the variational distribution over functions induced by stochastic model parameters is encouraged to match the variational distribution over functions induced by stochastic parameters inferred on previous tasks. Unlike approaches that explicitly penalize changes in the model parameters, function-space regularization allows parameters to vary widely during training, resulting in greater flexibility to fit new data. C-FSVI improves on existing approaches to function-space regularization by performing inference entirely in function space and without relying on carefully selected coreset points. We show that C-FSVI outperforms alternative methods based on parameter-space and function-space regularization on a range of tasks.

Tim G. J. Rudner, Freddie Bickford Smith, Qixuan Feng, Yee-Whye Teh, Yarin Gal
-
[ Visit Poster at Spot C6 in Virtual World ]

Bayesian Active Learning has focused on BALD, which reduces model parameter uncertainty. However, we show that BALD gets stuck on out-of-distribution or junk data that is not relevant for the task. We examine a novel Expected Predictive Information Gain (EPIG) to deal with distribution shifts of the pool set. EPIG reduces the uncertainty of predictions on an unlabelled evaluation set sampled from the test data distribution whose distribution might be different to the pool set distribution. Based on this, our new EPIG-BALD acquisition function for Bayesian Neural Networks selects samples to improve the performance on the test data distribution instead of selecting samples that reduce model uncertainty everywhere, including for out-of-distribution regions with low density in the test data distribution. Our method outperforms state-of-the-art Bayesian active learning methods on high-dimensional datasets and avoids out-of-distribution junk data in cases where current state-of-the-art methods fail.

Andreas Kirsch, Tom Rainforth, Yarin Gal
-
[ Visit Poster at Spot C5 in Virtual World ]

In active learning, new labels are commonly acquired in batches. However, common acquisition functions are only meant for one-sample acquisition rounds, and when their scores are used naively for batch acquisition, they result in batches lacking diversity, which negatively impacts performance. State-of-the-art batch acquisition functions are very costly to compute on the other hand. In this paper, we present a novel class of stochastic acquisition functions that extend one-sample acquisition functions to the batch setting by observing that the computed acquisition scores are only really valid for the first sample that is selected in every batch acquisition round and that there is an increasing error in the scores for future samples in the batch. We model this error in the scores for additional batch samples. We acquire new samples by sampling from the pool set using the adapted scores. Our acquisition functions are both vastly cheaper to compute and out-perform other batch acquisition functions.

Andreas Kirsch, Sebastian Farquhar, Yarin Gal
-
[ Visit Poster at Spot C4 in Virtual World ]

Sparse Bayesian Learning (SBL) is a powerful paradigm for attaining sparsity in probabilistic models. Herein, we propose a coordinate ascent algorithm for SBL termed Relevance Matching Pursuit (RMP) and show that, as its noise variance parameter goes to zero, RMP exhibits a surprising connection to Stepwise Regression. Further, we derive novel guarantees for Stepwise Regression algorithms, which also shed light on RMP. Our guarantees for Forward Regression improve on deterministic and probabilistic results for Orthogonal Matching Pursuit with noise. Our analysis of Backward Regression on determined systems culminates in a bound on the residual of the optimal solution to the {\it subset selection} problem that, if satisfied, guarantees the {\it optimality} of the result. To our knowledge, this bound is the first that can be computed in polynomial time and depends chiefly on the smallest singular value of the matrix.

Sebastian Ament, Carla Gomes
-
[ Visit Poster at Spot C3 in Virtual World ]
Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are non-linear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. The Bayesian framework makes it straightforward to simultaneously express sparsity, non-linearity, and interactions in a hierarchical model. But, as for the few other methods that handle this trifecta, inference is computationally intractable --- with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We first show that suitable Bayesian models can be represented as Gaussian processes (GPs). We then demonstrate how a kernel trick can reduce computation with these GPs to $O$(\# covariates) time for both variable selection and estimation. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real datasets, our approach outperforms existing methods used for large, high-dimensional datasets while remaining competitive (or being orders of magnitude faster) in runtime.
Raj Agrawal, Tamara Broderick
-
[ Visit Poster at Spot C2 in Virtual World ]

We consider the task of personalizing ASR models while being constrained by a fixed budget on recording speaker specific utterances. Given a speaker and an ASR model, we propose a method of identifying sentences for which the speaker's utterances are likely to be harder for the given ASR model to recognize. We assume a tiny amount of speaker-specific data to learn phoneme-level error models which help us select such sentences. We show that speaker's utterances on the sentences selected using our error model indeed have larger error rates when compared to speaker's utterances on randomly selected sentences. We find that fine-tuning the ASR model on the sentence utterances selected with the help of error models yield higher WER improvements in comparison to fine-tuning on an equal number of randomly selected sentence utterances. Thus, our method provides an efficient way of collecting speaker utterances under budget constraints for personalizing ASR models.

Abhijeet Awasthi, Sunita Sarawagi, Preethi Jyothi
-
[ Visit Poster at Spot C1 in Virtual World ]
Multiple imputation (MI) has been widely applied to missing value problems in biomedical, social and econometric research, in order to avoid improper inference in the downstream data analysis. In the presence of high-dimensional data, imputation models that include feature selection, especially $\ell_1$ regularized regression (such as Lasso, adaptive Lasso and Elastic Net), are common choices to prevent the model from underdetermination. However, conducting MI with feature selection is difficult: existing methods are often computationally inefficient and poor in performance. We propose MISNN, a novel and efficient algorithm that incorporates feature selection for MI. Leveraging the approximation power of neural networks, MISNN is a general and flexible framework, compatible to any feature selection method, any neural network architecture, high/low-dimensional data and general missing patterns. Through empirical experiments, MISNN has demonstrated great advantages over state-of-the-art imputation methods (e.g. Bayesian Lasso and matrix completion), in terms of imputation accuracy, statistical consistency and computation speed.
Woody Bu, Zongyu Dai, Yiliang Zhang, Qi Long
-
[ Visit Poster at Spot C0 in Virtual World ]

Air pollution is a global problem and has a severe impact on human health. Fine-grained air quality (AQ) monitoring is important in mitigating air pollution. However, existing AQ station deployments are sparse due to installation and operational costs. In this work, we propose an Active Learning-based method for air quality station deployment. We use Gaussian processes and several classical machine learning algorithms (Random Forest, K-Neighbors and Support Vector Machine) to benchmark several active learning strategies for two real-world air quality datasets (Delhi, India and Beijing, China).

Zeel B Patel, Nipun Batra
-
[ Visit Poster at Spot B6 in Virtual World ]

Neural architecture search (NAS), an important branch of automatic machine learning, has become an effective approach to automate the design of deep learning models. However, the major issue in NAS is how to reduce the large search time imposed by the heavy computational burden. While most recent approaches focus on pruning redundant sets or developing new search methodologies, this paper attempts to formulate the problem based on the data curation manner. Our key strategy is to search the architecture using summarized data distribution, i.e., core-set. Typically, many NAS algorithms separate searching and training stages, and the proposed core-set methodology is only used in search stage, thus their performance degradation can be minimized. In our experiments, we were able to save overall computational time from 30.8 hours to 3.5 hours, 8.8x reduction, on a single RTX 3090 GPU without sacrificing accuracy.

Jae-hun Shim, Kyeongbo Kong, Suk-Ju Kang
-
[ Visit Poster at Spot B5 in Virtual World ]
In this work we present an algorithm to construct coresets for fair regression for a dataset of $n$ points in $d$ dimensions and having some discrete-valued protected attribute. For (fair) regression with statistical parity we first define the notion of a coreset and give a coreset that incurs only a small additive error in satisfying the constraints and a relative error in the objective function. The coreset size is linear in $\ell$, the number of unique values the protected attribute can take, and the dependence on dimension is near linear ($d \log{d}$). To the best of our knowledge this is the first coreset for fair regression problem with statistical parity. We also empirically demonstrate the performance of our coresets on real data sets.
Rachit Chhaya, Anirban Dasgupta, Supratim Shit, Jayesh Choudhari
-
[ Visit Poster at Spot B4 in Virtual World ]

In this paper, we study the problem of evaluating the addition of elements to a set. This problem is difficult, because it can, in the general case, not be reduced to unconditional preferences between the choices. Therefore, we model preferences based on the context of the decision. We discuss and compare two different Siamese network architectures for this task: a twin network that compares the two sets resulting after the addition, and a triplet network that models the contribution of each candidate to the existing set. We evaluate the two settings on a real-world task; learning human card preferences for deck building in the collectible card game Magic: The Gathering. We show that the triplet approach achieves a better result than the twin network and that both outperform previous results on this task.

Timo Bertram, Johannes Fürnkranz, Martin Müller
-
[ Visit Poster at Spot B3 in Virtual World ]

Curriculum learning is a training strategy that sorts the training examples by some measure of their difficulty and gradually exposes them in that order to the learner to improve the network performance. In this work, we first propose and study a dynamic curriculum learning algorithm. Our dynamic curriculum algorithm greedily samples a subset of the training data whose gradients are directed towards the optimal weight. Motivated by our insights from the dynamic curriculum learning ordering and implicit curriculum ordering, we introduce a simple, practical curriculum learning strategy that uses statistical measures such as standard deviation and entropy values to score the difficulty of data points for real image classification tasks. Further, we also use our algorithms to discuss why curriculum learning is helpful.

Vinu Sankar Sadasivan, Anirban Dasgupta
-
[ Visit Poster at Spot B2 in Virtual World ]

Given a training data set, we develop a method to obtain a representative subset of training data meant for classification. Our focus is on subsets that are particularly suitable for training linear classification models. Our method is to reduce data sets in a manner that essentially preserves the convex hulls of the various classes of the data set. We develop scalable randomized algorithms for convex hull approximation, and obtain rigorous guarantees for models trained on subsets of linearly separable data sets. We empirically demonstrate the effectiveness of our approach for both training and data augmentation on the MNIST data set.

Viveck Cadambe, Bill Kay
-
[ Visit Poster at Spot B1 in Virtual World ]

Many machine learning problems involve processing large datasets, thus making them expensive computational tasks. One mitigation strategy selects representative subsets of the data and one way to achieve this is via coresets~- a weighted subset of the given data, such that the cost function under consideration can be additively and/or multiplicatively approximated. While much previous work chooses these subsets using independent importance-based sampling, such methods can be suboptimal when the system possesses redundancy. Recently, Tremblay et al. 2019 considered sampling from fixed-sized determinantal point process (DPP) and gave lower bounds on the coreset sample complexity. While DPPs often empirically outperform importance-based sampling methods, their coreset size bound is worse compared to independent sampling. In this work, we extend their theoretical results by improving the lower bound on sample complexity for coreset approximation by a log(1 / epsilon) factor using chaining, a technique in empirical process literature. This, therefore, moves a step further in bridging the gap between theoretical justification and empirical observations. We provide additive approximation guarantees for the associated cost function and allow the possibility of negative weights, which has a direct application in layer-wise pruning of deep neural networks. On top of additive approximation guarantees, we also extend our bounds to the scenarios with multiplicative approximations.

Gantavya Bhatt, Jeff Bilmes
-
[ Visit Poster at Spot B0 in Virtual World ]

Active learning has proven to be useful for minimizing labeling costs by selecting the most informative samples. However, existing active learning methods do not work well in realistic scenarios such as imbalance or rare classes, out-of-distribution data in the unlabeled set, and redundancy. In this work, we propose SIMILAR (Submodular Information Measures based actIve LeARning), a unified active learning framework using recently proposed submodular information measures (SIM) as acquisition functions. We argue that SIMILAR not only works in standard active learning but also easily extends to the realistic settings considered above and acts as a one-stop solution for active learning that is scalable to large real-world datasets. Empirically, we show that SIMILAR significantly outperforms existing active learning algorithms by as much as ≈ 5% − 18% in the case of rare classes and ≈ 5% − 10% in the case of out-of-distribution data on CIFAR-10 image classification.

Suraj N Kothawade, Krishnateja Killamsetty, Rishabh Iyer
-
[ Visit Poster at Spot A6 in Virtual World ]

In this paper, we study mixed continuous-discrete minimax problems where the minimization is over a continuous variable belonging to Euclidean space and the maximization is over subsets of a given ground set. We introduce the class of convex-submodular minimax problems, where the objective is convex with respect to the continuous variable and submodular with respect to the discrete variable. Even though such problems appear frequently in machine learning applications, little is known about how to address them from algorithmic and theoretical perspectives. For such problems, we first show that obtaining saddle points are hard up to any approximation, and thus introduce new notions of (near-) optimality. We then provide several algorithmic procedures for solving convex and monotone-submodular minimax problems and characterize their convergence rates, computational complexity, and quality of the final solution. Finally, we provide experiments to showcase the effectiveness of our purposed methods on training robust recommender systems in the presence of poisoning attacks.

Arman Adibi, Aryan Mokhtari, Hamed Hassani
-
[ Visit Poster at Spot A5 in Virtual World ]
In this paper, we consider an online optimization problem over $T$ rounds where at each step $t\in[T]$, the algorithm chooses an action $x_t$ from the fixed convex and compact domain set $\mathcal{K}$. A utility function $f_t(\cdot)$ is then revealed and the algorithm receives the payoff $f_t(x_t)$. This problem has been previously studied under the assumption that the utilities are adversarially chosen monotone DR-submodular functions and $\mathcal{O}(\sqrt{T})$ regret bounds have been derived. We first characterize the class of strongly DR-submodular functions and then, we derive regret bounds for the following new online settings: $(1)$ $\{f_t\}_{t=1}^T$ are monotone strongly DR-submodular and chosen adversarially, $(2)$ $\{f_t\}_{t=1}^T$ are monotone submodular (while the average $\frac{1}{T}\sum_{t=1}^T f_t$ is strongly DR-submodular) and chosen by an adversary but they arrive in a uniformly random order, $(3)$ $\{f_t\}_{t=1}^T$ are drawn i.i.d. from some unknown distribution $f_t\sim \mathcal{D}$ where the expected function $f(\cdot)=\mathbb{E}_{f_t\sim\mathcal{D}}[f_t(\cdot)]$ is monotone DR-submodular. For $(1)$, we obtain the first logarithmic regret bounds. In terms of the second framework, we show that it is possible to obtain similar logarithmic bounds with high probability. Finally, for the i.i.d. model, we provide algorithms with $\tilde{\mathcal{O}}(\sqrt{T})$ stochastic regret bound, both in expectation and with high probability.
Omid Sadeghi, Maryam Fazel
-
[ Visit Poster at Spot A4 in Virtual World ]
Numerous tasks in machine learning and artificial intelligence have been modeled as submodular maximization problems. These problems usually involve sensitive data about individuals, and in addition to maximizing the utility, privacy concerns should be considered. In this paper, we study the general framework of non-negative monotone submodular maximization subject to matroid or knapsack constraints in both offline and online settings. For the offline setting, we propose a differentially private $(1-\frac{\kappa}{e})$-approximation algorithm, where $\kappa\in[0,1]$ is the total curvature of the submodular set function, which improves upon prior works in terms of approximation guarantee and query complexity under the same privacy budget. In the online setting, we propose the first differentially private algorithm, and we specify the conditions under which the regret bound scales as $\O(\sqrt{T})$, i.e., privacy could be ensured while maintaining the same regret bound as the optimal regret guarantee in the non-private setting.
Omid Sadeghi, Maryam Fazel
-
[ Visit Poster at Spot A3 in Virtual World ]

With the goal of making deep learning more label-efficient, a growing number of papers have been studying active learning (AL) for deep models. However, there are a number of issues in the prevalent experimental settings, mainly stemming from a lack of unified implementation and benchmarking. Issues in the current literature include sometimes contradictory observations on the performance of different AL algorithms, unintended exclusion of important generalization approaches such as data augmentation and SGD for optimization, a lack of study of evaluation facets like the labeling efficiency of AL, and little or no clarity on the scenarios in which AL outperforms random sampling (RS). In this work, we present a unified re-implementation of state-of-the-art AL algorithms in the context of image classification, and we carefully study these issues as facets of effective evaluation. On the positive side, we show that AL techniques are 2x to 4x more label-efficient compared to RS with the use of data augmentation. Surprisingly, when data augmentation is included, there is no longer a consistent gain in using BADGE, a state-of-the-art approach, over simple uncertainty sampling. We then do a careful analysis of how existing approaches perform with varying number of examples per class. Finally, we provide other insights for AL practitioners to consider in future work, such as the use of data subset selection techniques for intermediate model training.

Nathan Beck, Durga S, Ganesh Ramakrishnan, Rishabh Iyer
-
[ Visit Poster at Spot A2 in Virtual World ]

We systematically study the query complexity of learning geodesically convex halfspaces on graphs. Geodesic convexity is a natural generalisation of Euclidean convexity and allows the definition of convex sets and halfspaces on graphs. We prove upper bounds on the query complexity linear in the treewidth and the minimum hull set size but only logarithmic in the diameter. We show tight lower bounds corresponding to well-established separation axioms and identify the Radon number as a central parameter bounding the query complexity and the VC dimension. While previous bounds typically depend on the cut size of the labelling, all parameters in our bounds can be computed from the unlabelled graph. We empirically compare our proposed approach with other active learning algorithms and provide evidence that ground-truth communities in real-world graphs are often convex.

Maximilian Thiessen, Thomas Gärtner
-
[ Visit Poster at Spot A1 in Virtual World ]
Classes of set functions along with a choice of ground set are a bedrock to determine and develop corresponding variants of greedy algorithms to suitably obtain approximate and efficient solutions for combinatorial optimization. The class of constrained submodular optimization has seen huge advances at the intersection of good computational efficiency, versatility and approximation guarantees while unconstrained submodular optimization is NP-hard. What is an alternative to situations when submodularity does not hold? Can efficient and globally exact solutions be obtained? We introduce one such new frontier: The class of quasi-concave set functions induced as a dual class to monotone linkage functions. We provide a parallel algorithm with a time complexity over $n$ processors of $\mathcal{O}(n^2g) +\mathcal{O}(\log{\log{n}})$ where $n$ is the cardinality of the ground set and $g$ is the complexity to compute the monotone linkage function that induces a corresponding quasi-concave set function via a duality. The complexity reduces to $\mathcal{O}(gn\log(n))$ on $n^2$ processors and to $\mathcal{O}(gn)$ on $n^3$ processors. Our approach reduces the currently existing cubic computational complexity to those mentioned above. Our algorithm provides a globally optimal solution to a maxi-min problem as opposed to submodular optimization which is approximate. We show a potential for widespread applications via an example of diverse feature subset selection with exact global maxi-min guarantees upon showing that a statistical dependency measure called distance correlation can be used to induce a quasi-concave set function.
Praneeth Vepakomma, Ramesh Raskar

Author Information

Rishabh Iyer (University of Texas at Dallas)
Abir De (IIT Bombay)
Ganesh Ramakrishnan (IIT Bombay)
Jeff Bilmes (UW)

More from the Same Authors