# Accepted papers

Printed proceedings are available for purchase.

All of the talk videos are available online.

## Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

– Invited applications paper

**Abstract: **Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network pre-training. CD-DNN-HMMs greatly outperform conventional CD-GMM (Gaussian mixture model) HMMs: The word error rate is reduced by up to one third on the difficult benchmarking task of speaker-independent single-pass transcription of telephone conversations.

## Data-driven Web Design

– Invited applications paper

**Abstract: **This short paper summarizes challenges and opportunities of applying machine learning methods to Web design problems, and describes how structured prediction, deep learning, and probabilistic program induction can enable useful interactions for designers. We intend for these techniques to foster new work in data-driven Web design.

## Learning the Central Events and Participants in Unlabeled Text

– Invited applications paper

**Abstract: **The majority of information on the Internet is expressed in written text. Understanding and extracting this information is crucial to building intelligent systems that can organize this knowledge. Today, most algorithms focus on learning atomic facts and relations. For instance, we can reliably extract facts like 'Annapolis is a City' by observing redundant word patterns across a corpus. However, these facts do not capture richer knowledge like the way detonating a bomb is related to destroying a building, or that the perpetrator who was convicted must have been arrested. A structured model of these events and entities is needed for a deeper understanding of language. This talk describes unsupervised approaches to learning such rich knowledge.

## Exemplar-SVMs for Visual Ob ject Detection, Label Transfer and Image Retrieval

– Invited applications paper

**Abstract: **Today's state-of-the-art visual object detection systems are based on three key components: 1) sophisticated features (to encode various visual invariances), 2) a powerful classifier (to build a discriminative object class model), and 3) lots of data (to use in large-scale hard-negative mining). While conventional wisdom tends to attribute the success of such methods to the ability of the classifier to generalize across the positive class instances, here we report on empirical findings suggesting that this might not necessarily be the case. We have experimented with a very simple idea: to learn a separate classifier for each positive object instance in the dataset. In this setup, no generalization across the positive instances is possible by definition, and yet, surprisingly, we did not observe any drastic drop in performance compared to the standard, category-based approaches.

## Capturing topical content with frequency and exclusivity

– Accepted

**Abstract: **Recent work in text analysis commonly describes topics in terms of their most frequent words, but the exclusivity of words to topics is equally important for communicating their content.We introduce Hierarchical Poisson Convolution (HPC), a model which infers regularized estimates of the differential use of words across topics as well as word frequency within topics. HPC uses known hierarchical structure on human labeled topics to make focused comparisons of differential usage within each branch of the tree. We develop a parallelized Hamiltonian Monte Carlo sampler that allows for fast and scalable computation.

## TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple Ratings

– Accepted

**Abstract: **This paper revisits the problem of analyzing multiple ratings given by different judges. Different from previous work that focuses on distilling the true labels from noisy crowdsourcing ratings, we emphasize gaining diagnostic insights into our in-house well-trained judges. We generalize the well-known DawidSkene model (Dawid & Skene, 1979) to a spectrum of probabilistic models under the same “TrueLabel + Confusion” paradigm, and show that our proposed hierarchical Bayesian model, called HybridConfusion, consistently outperforms DawidSkene on both synthetic and real-world data sets.

## Robust Multiple Manifold Structure Learning

– Accepted

**Abstract: **We present a robust multiple manifold structure learning (RMMSL) scheme to robustly estimate data structures under the multiple low intrinsic dimensional manifolds assumption. In the local learning stage, RMMSL efficiently estimates local tangent space by weighted low-rank matrix factorization. In the global learning stage, we propose a robust manifold clustering method based on local structure learning results. The proposed clustering method is designed to get the flattest manifolds clusters by introducing a novel curved-level similarity function. Our approach is evaluated and compared to state-of-the-art methods on synthetic data, handwritten digit images, human motion capture data and motorbike videos. We demonstrate the effectiveness of the proposed approach, which yields higher clustering accuracy, and produces promising results for challenging tasks of human motion segmentation and motion flow learning from videos.

## Two Manifold Problems with Applications to Nonlinear System Identification

– Accepted

**Abstract: ** Recently, there has been much interest in spectral approaches to learning manifolds—so-called kernel eigenmap methods. These methods have had some successes, but their applicability is limited because they are not robust to noise. To address this limitation, we look at two-manifold problems, in which we simultaneously reconstruct two related manifolds, each representing a different view of the same data. By solving these interconnected learning problems together, two-manifold algorithms are able to succeed where a non-integrated approach would fail: each view allows us to suppress noise in the other, reducing bias. We propose a class of algorithms for two-manifold problems, based on spectral decomposition of cross-covariance operators in Hilbert space and discuss when two-manifold problems are useful. Finally, we demonstrate that solving a two-manifold problem can aid in learning a nonlinear dynamical system from limited data.

## On the Difficulty of Nearest Neighbor Search

– Accepted

**Abstract: ** Fast approximate nearest neighbor search in large databases is becoming popular. Several powerful learning-based formulations have been proposed recently. However, not much attention has been paid to a more fundamental question: how difficult is (approximate) nearest neighbor search in a given data set? More broadly, which data properties affect the nearest neighbor search and how? This paper introduces the first concrete measure called Relative Contrast that can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed metric spaces. To further justify why relative contrast is an important and effective measure, we present a theoretical analysis to prove how relative contrast determines/affects the performance/complexity of Locality Sensitive Hashing, a popular hashing based approximate nearest neighbor search method. Finally, relative contrast also provides an explanation for a family of heuristic hashing algorithms based on PCA with good practical performance.

## Learning Force Control Policies for Compliant Robotic Manipulation

– Invited applications paper

**Abstract: **Developing robots capable of fine manipulation skills is of major importance in order to build truly assistive robots. These robots need to be compliant in their actuation and control in order to operate safely in human environments. Manipulation tasks imply complex contact interactions with the external world, and involve reasoning about the forces and torques to be applied. Planning under contact conditions is usually impractical due to computational complexity, and a lack of precise dynamics models of the environment. We present an approach to acquiring manipulation skills on compliant robots through reinforcement learning. The initial position control policy for manipulation is initialized through kinesthetic demonstration. This policy is augmented with a force/torque profile to be controlled in combination with the position trajectories. The Policy Improvement with Path Integrals (PI^2) algorithm is used to learn these force/torque profiles by optimizing a cost function that measures task success. We introduce a policy representation that ensures trajectory smoothness during exploration and learning. Our approach is demonstrated on the Barrett WAM robot arm equipped with a 6-DOF force/torque sensor on two different manipulation tasks: opening a door with a lever door handle, and picking up a pen off the table. We show that the learnt force control policies allow successful, robust execution of the tasks.

## Estimation of Simultaneously Sparse and Low Rank Matrices

– Accepted

**Abstract: **The paper introduces a penalized matrix estimation procedure aiming at solutions which are sparse and low-rank at the same time. Such structures arise in the context of social networks or protein interactions where underlying graphs have adjacency matrices which are block-diagonal in the appropriate basis. We introduce a convex mixed penalty which involves *ℓ_1*-norm and trace norm simultaneously. We obtain an oracle inequality which indicates how the two effects interact according to the nature of the target matrix. We bound generalization error in the link prediction problem. We also develop proximal descent strategies to solve the the optimization problem efficiently and evaluate performance on synthetic and real data sets.

## Online Structured Prediction via Coactive Learning

– Accepted

**Abstract: **We propose Coactive Learning as a model of interaction between a learning system and a human user, where both have the common goal of providing results of maximum utility to the user. At each step, the system (e.g. search engine) receives a context (e.g. query) and predicts an object (e.g. ranking). The user responds by correcting the system if necessary, providing a slightly improved – but not necessarily optimal – object as feedback. We argue that such feedback can be inferred from observable user behavior, specifically clicks in web search. Evaluating predictions by their cardinal utility to the user, we propose efficient learning algorithms that have O(1/sqrt{T}) average regret, even though the learning algorithm never observes cardinal utility values. We demonstrate the applicability of our model and learning algorithms on a movie recommendation task, as well as ranking for web search.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Using CCA to improve CCA: A new spectral method for estimating vector models of words

– Accepted

**Abstract: **Unlabeled data is often used to learn representations which can be used to supplement baseline features in a supervised learner. For example, for text applications where the words lie in a very high dimensional space (the size of the vocabulary), one can learn a low rank “dictionary” by an eigen-decomposition of the word co-occurrence matrix (e.g. using PCA or CCA). In this paper, we present a new spectral method based on CCA to learn an *eigenword* dictionary. Our improved procedure computes two set of CCAs, the first one between the left and right contexts of the given word and the second one between the projections resulting from this CCA and the word itself. We prove theoretically that this two-step procedure has lower sample complexity than the simple single step procedure and also illustrate the empirical efficacy of our approach and the richness of representations learned by our Two Step CCA (TSCCA) procedure on the tasks of POS tagging and sentiment classification.

## A Discrete Optimization Approach for Supervised Ranking with an Application to Reverse-Engineering Quality Ratings

– Not for proceedings

**Abstract: **We present a new methodology based on mixed integer optimization (MIO) for supervised ranking tasks. Other methods for supervised ranking approximate ranking quality measures by convex functions in order to accommodate extremely large problems, at the expense of exact solutions. As our MIO approach provides exact modeling for ranking problems, our solutions are benchmarks for the other non-exact methods. We report computational results that demonstrate significant advantages for MIO methods over current state-of-the-art. We also use our technique for a new application: reverse-engineering quality rankings. A good or bad product quality rating can make or break an organization, and in order to invest wisely in product development, organizations are starting to use intelligent approaches to reverse-engineer the rating models. We present experiments on data from a major quality rating company, and provide new methods for evaluating the solution. In addition, we provide an approach to use the reverse-engineered model to achieve a top ranked product in a cost-effective way.

discussion+video
~~pdf~~

## Bounded Planning in Passive POMDPs

– Accepted

**Abstract: **In Passive POMDPs actions do not affect the world state, but still incur costs. When the agent is bounded by information-processing constraints, it can only keep an approximation of the belief. We present a variational principle for the problem of maintaining the information which is most useful for minimizing the cost, and introduce an efficient and simple algorithm for finding an optimum.

## Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss

– Accepted

**Abstract: **We carefully study how well minimizing convex surrogate loss functions, corresponds to minimizing the misclassification error rate for the problem of binary classification with linear predictors. In particular, we show that amongst all convex surrogate losses, the hinge loss gives essentially the best possible bound, of all convex loss functions, for the misclassification error rate of the resulting linear predictor in terms of the best possible margin error rate. We also provide lower bounds for specific convex surrogates that show how different commonly used losses qualitatively differ from each other.

## Bayesian Efficient Multiple Kernel Learning

– Accepted

**Abstract: **Multiple kernel learning algorithms are proposed to combine kernels in order to obtain a better similarity measure or to integrate feature representations coming from different data sources. Most of the previous research on such methods is focused on the computational efficiency issue. However, it is still not feasible to combine many kernels using existing Bayesian approaches due to their high time complexity. We propose a fully conjugate Bayesian formulation and derive a deterministic variational approximation, which allows us to combine hundreds or thousands of kernels very efficiently. We briefly explain how the proposed method can be extended for multiclass learning and semi-supervised learning. Experiments with large numbers of kernels on benchmark data sets show that our inference method is quite fast, requiring less than a minute. On one bioinformatics and three image recognition data sets, our method outperforms previously reported results with better generalization performance.

## Bayesian Nonexhaustive Learning for Online Discovery and Modeling of Emerging Classes

– Accepted

**Abstract: **In this study we present a framework for online inference in the presence of a nonexhaustively defined set of classes that incorporates supervised classification with class discovery and modeling. A Dirichlet process prior (DPP) model defined over class distributions ensures that both known and unknown class distributions originate according to a common base distribution. In an attempt to automatically discover potentially interesting class formations, the prior model is coupled with a suitably chosen data model, and sequential Monte Carlo sampling is used to perform online inference. Our approach takes into account the rapidly accumulating nature of samples representing emerging classes, which is most evident in the biodetection application considered in this study, where a new class of pathogen may suddenly appear, and the rapid increase in the number of samples originating from this class indicates the onset of an outbreak.

## Exact Soft Confidence-Weighted Learning

– Accepted

**Abstract: **In this paper, we propose a new Soft Confidence-Weighted (SCW) online learning scheme, which enables the conventional confidence-weighted learning method to handle non-separable cases. Unlike the previous confidence-weighted learning algorithms, the proposed soft confidence-weighted learning method enjoys all the four salient properties: (i) large margin training, (ii) confidence weighting, (iii) capability to handle non-separable data, and (iv) adaptive margin. Our experimental results show that SCW performs significantly better than the original CW algorithm. When comparing with the state-of-the-art AROW algorithm, we found that SCW in general achieves better or at least comparable predictive performance, but enjoys considerably better efficiency performance (i.e., producing less number of updates and spending less time cost).

## Distributed Tree Kernels

– Accepted

**Abstract: **In this paper, we propose the distributed tree kernels (DTK) as a novel method to reduce time and space complexity of tree kernels. Using a linear complexity algorithm to compute vectors for trees, we embed feature spaces of tree fragments in low-dimensional spaces where the kernel computation is directly done with dot product. We show that DTKs are faster, correlate with tree kernels, and obtain a statistically similar performance in two natural language processing tasks.

## Multiple Kernel Learning from Noisy Labels by Stochastic Programming

– Accepted

**Abstract: **We study the problem of multiple kernel learning from noisy labels. This is in contrast to most of the previous studies on multiple kernel learning that mainly focus on developing efficient algorithms and assume perfectly labeled training examples. Directly applying the existing multiple kernel learning algorithms to noisily labeled examples often leads to suboptimal performance due to the incorrect class assignments. We address this challenge by casting multiple kernel learning from noisy labels into a stochastic programming problem, and presenting a minimax formulation. We develop an efficient algorithm for solving the related convex-concave optimization problem with a fast convergence rate of *O(1/T)* where *T* is the number of iterations. Empirical studies on UCI data sets verify both the effectiveness of the proposed framework and the efficiency of the proposed optimization algorithm.

## Improved Nystrom Low-rank Decomposition with Priors

– Accepted

**Abstract: **Low-rank matrix decomposition has gained great popularity recently in scaling up kernel methods to large amount of data. However, some limitations could prevent them from working effectively in certain domains. For example, many existing approaches are intrinsically unsupervised, which does not incorporate side information (e.g., class labels) to produce task specific decomposition; also, they typically work “transductively” and the factorization does not generalize to new samples, causing lot of inconvenience for updated environment. To solve these problems, in this paper we propose an “inductive”-flavored method for low-rank kernel decomposition with priors or side information. We achieve this by generalizing the Nystróm method in a novel way. On the one hand, our approach employs a highly flexible, nonparametric structure that allows us to generalize the low-rank factors to arbitrarily new samples; on the other hand, it has linear time and space complexities, which can be orders of magnitudes faster than existing approaches and renders great efficiency in learning a low-rank kernel decomposition. Empirical results demonstrate the efficacy and efficiency of the proposed method.

## Active Learning for Matching Problems

– Accepted

**Abstract: **Effective learning of user preferences is critical to easing user burden in various types of matching problems. Equally important is active query selection to further reduce the amount of preference information users must provide. We address the problem of active learning of user preferences for matching problems, introducing a novel method for determining probabilistic matchings, and developing several new active learning strategies that are sensitive to the specific matching objective. Experiments with real-world data sets spanning diverse domains demonstrate that matching-sensitive active learning outperforms standard techniques.

## Ensemble Methods for Convex Regression with Applications to Geometric Programming Based Circuit Design

– Accepted

**Abstract: **Convex regression is a promising area for bridging statistical estimation and deterministic convex optimization. We develop a new piecewise linear convex regression method that uses the Convex Adaptive Partitioning (CAP) estimator in an ensemble setting, Ensemble Convex Adaptive Partitioning (E-CAP). The ensembles alleviate some problems associated with convex piecewise linear estimators, such as instability when used to approximate constraints or objective functions for optimization, while maintaining desirable properties, such as consistency and O(n log(n)^2) computational complexity. We empirically demonstrate that E-CAP outperforms existing convex regression methods both when used for prediction and optimization. We then apply E-CAP to device modeling and constraint approximation for geometric programming based circuit design.

## Groupwise Constrained Reconstruction for Subspace Clustering

– Accepted

**Abstract: **Recent proposed subspace clustering methods first compute a self reconstruction matrix for dataset, then converted it to an affinity matrix, before input to a spectral clustering method to obtain the final clustering result. Their success is largely based on the subspace independence assumption, which, however, does not always hold for the applications with increasing number of clusters such as face clustering. In this paper, we proposes a novel reconstruction based subspace clustering method without making the subspace independence assumption. In our model, certain properties of the reconstruction matrix are explicitly characterized using the latent cluster indicators, and the affinity matrix input to the spectral clustering is built from the posterior of the cluster indicators. Evaluation on both synthetic and real-world datasets show that our method can outperform the state-of-the-arts.

## Stability of matrix factorization for collaborative filtering

– Accepted

**Abstract: **We study the stability vis a vis adversarial noise of matrix factorization algorithm for matrix completion. In particular, our results include: (I) we bound the gap between the solution matrix of the factorization method and the ground truth in terms of root mean square error; (II) we treat the matrix factorization as a subspace fitting problem and analyze the difference between the solution subspace and the ground truth; (III) we analyze the prediction error of individual users based on the subspace stability. We apply these results to the problem of collaborative filtering under manipulator attack, which leads to useful insights and guidelines for collaborative filtering system design.

## Adaptive Regularization for Similarity Measures

– Accepted

**Abstract: **Algorithms for learning distributions over weight-vectors, such as AROW were recently shown empirically to achieve state-of-the-art performance at various problems, with strong theoretical guaranties. Extending these algorithms to matrix models poses a challenge since the number of free parameters in the covariance of the distribution scales as *n^4* with the dimension *n* of the matrix. We describe, analyze and experiment with two new algorithms for learning distribution of matrix models. Our first algorithm maintains a diagonal covariance over the parameters and is able to handle large covariance matrices. The second algorithm factores the covariance capturing some inter-features correlation while keeping the number of parameters linear in the size of the original matrix. We analyze the diagonal algorithm in the mistake bound model and show the superior precision of our approach over other algorithms in two tasks: retrieving similar images, and ranking similar documents. The second algorithms is shown to attain faster convergence rate.

## Linear Off-Policy Actor-Critic

– Accepted

**Abstract: **This paper presents the first off-policy actor-critic reinforcement learning algorithm with a per-time-step complexity that scales linearly with the number of learned parameters. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Modeling Latent Variable Uncertainty for Loss-based Learning

– Accepted

**Abstract: **We consider the problem of parameter estimation using weakly supervised datasets, where a training sample consists of the input and a partially specified annotation (called the output). In addition, the missing information in the annotation is modeled using latent variables. Traditional methods, such as expectation-maximization, overburden a single distribution with two separate tasks: (i) modeling the uncertainty in the latent variables during training; and (ii) making accurate predictions for the output and the latent variables during testing. We propose a novel framework that separates the demands of the two tasks using two distributions: (i) a conditional distribution to model the uncertainty of the latent variables for a given input-output pair; and (ii) a delta distribution to predict the output and the latent variables for a given input. During learning, we encourage agreement between the two distributions by minimizing a loss-based dissimilarity coefficient. Our approach generalizes latent SVM in two important ways: (i) it models the uncertainty over latent variables instead of relying on a pointwise estimate; and (ii) it allows the use of loss functions that depend on latent variables, which greatly increases its applicability. We demonstrate the efficacy of our approach on two challenging problems—object detection and action detection—using publicly available datasets.

## Dimensionality Reduction by Local Discriminative Gaussians

– Accepted

**Abstract: **We present local discriminative Gaussian (LDG) dimensionality reduction, a supervised linear dimensionality reduction technique that acts locally to each training point in order to find a mapping where similar data can be discriminated from dissimilar data. We focus on the classification setting; however, our algorithm can be applied whenever training data is accompanied with similarity or dissimilarity constraints. Our experiments show that LDG is superior to other state-of-the-art linear dimensionality reduction techniques when the number of features in the original data is large. We also adapt LDG to the transfer learning setting, and show that it achieves good performance when the test data distribution differs from that of the training data.

## Learning to Label Aerial Images from Noisy Data

– Accepted

**Abstract: **When training a system to label images, the amount of labeled training data tends to be a limiting factor. We consider the task of learning to label aerial images from existing maps. These provide abundant labels, but the labels are often incomplete and sometimes poorly registered. We propose two robust loss functions for dealing with these kinds of label noise and use the loss functions to train a deep neural network on two challenging aerial image datasets. The robust loss functions lead to big improvements in performance and our best system substantially outperforms the best published results on the task we consider.

## The Most Persistent Soft-Clique in a Set of Sampled Graphs

– Accepted

**Abstract: **When searching for characteristic subpatterns in potentially noisy graph data, it appears self-evident that having multiple observations would be better than having just one. However, it turns out that the inconsistencies introduced when different graph instances have different edge sets pose a serious challenge. In this work we address this challenge for the problem of finding maximum weighted cliques. We introduce the concept of most persistent soft-clique. This is subset of vertices, that 1) is almost fully or at least densely connected, 2) occurs in all or almost all graph instances, and 3) has the maximum weight. We present a measure of clique-ness, that essentially counts the number of edge missing to make a subset of vertices into a clique. With this measure, we show that the problem of finding the most persistent soft-clique problem can be cast either as: a) a max-min two person game optimization problem, or b) a min-min soft margin optimization problem. Both formulations lead to the same solution when using a partial Lagrangian method to solve the optimization problems. By experiments on synthetic data and on real social network data, we show that the proposed method is able to reliably find soft cliques in graph data, even if that is distorted by random noise or unreliable observations.

## Learning Efficient Structured Sparse Models

– Accepted

**Abstract: **We present a comprehensive framework for structured sparse coding and modeling extending the recent ideas of using learnable fast regressors to approximate exact sparse codes. For this purpose, we develop a novel block-coordinate proximal splitting method for the iterative solution of hierarchical sparse coding problems, and show an efficient feed forward architecture derived from its iteration. This architecture faithfully approximates the exact structured sparse codes with a fraction of the complexity of the standard optimization methods. We also show that by using different training objective functions, learnable sparse encoders are no longer restricted to be mere approximants of the exact sparse code for a pre-given dictionary, as in earlier formulations, but can be rather used as full-featured sparse encoders or even modelers. A simple implementation shows several orders of magnitude speedup compared to the state-of-the-art at minimal performance degradation, making the proposed framework suitable for real time and large-scale applications.

## PAC Subset Selection in Stochastic Multi-armed Bandits

– Accepted

**Abstract: **We consider the problem of selecting, from among the arms of a stochastic n-armed bandit, a subset of size m of those arms with the highest expected rewards, based on efficiently sampling the arms. This “subset selection” problem finds application in a variety of areas. Kalyanakrishnan & Stone (2010) frame this problem under a PAC setting (denoting it “Explore-m”) and analyze corresponding sampling algorithms both formally and experimentally. Whereas their formal analysis is restricted to the worst case sample complexity of algorithms, in this paper, we design and analyze an algorithm (“LUCB”) with improved expected sample complexity. Interestingly LUCB bears a close resemblance to the well-known UCB algorithm for regret minimization. We obtain a sample complexity bound for LUCB that matches the best existing bound for single-arm selection (that is, when m = 1). We also provide a lower bound on the worst case sample complexity of PAC algorithms for Explore-m.

## Nonparametric variational inference

– Accepted

**Abstract: **Variational methods are widely used for approximate posterior inference. However, their use is typically limited to families of distributions that enjoy particular conjugacy properties. To circumvent this limitation, we propose a family of variational approximations inspired by nonparametric kernel density estimation. The locations of these kernels and their bandwidth are treated as variational parameters and optimized to improve an approximate lower bound on the marginal likelihood of the data. Using multiple kernels allows the approximation to capture multiple modes of the posterior, unlike most other variational approximations. We demonstrate the efficacy of the nonparametric approximation with a hierarchical logistic regression model and a nonlinear matrix factorization model. We obtain predictive performance as good as or better than more specialized variational methods and sample-based approximations. The method is easy to apply to more general graphical models for which standard variational methods are difficult to derive.

## The Convexity and Design of Composite Multiclass Losses

– Accepted

**Abstract: **We consider composite loss functions for multiclass prediction comprising a proper (i.e., Fisher-consistent) loss over probability distributions and an inverse link function. We establish conditions for their (strong) convexity and explore their implications. We also show how the separation of concerns afforded by using this composite representation allows for the design of families of losses with the same Bayes risk.

## Finding Botnets Using Minimal Graph Clusterings

– Accepted

**Abstract: **We study the problem of identifying botnets and the IP addresses which they comprise, based on the observation of a fraction of the global email spam traffic. Observed mailing campaigns constitute evidence for joint botnet membership, they are represented by cliques in the graph of all messages. No evidence against an association of nodes is ever available. We reduce the problem of identifying botnets to a problem of finding a minimal clustering of the graph of messages. We directly model the distribution of clusterings given the input graph; this avoids potential errors caused by distributional assumptions of a generative model. We report on a case study in which we evaluate the model by its ability to predict the spam campaign that a given IP address is going to participate in.

## Learning the Experts for Online Sequence Prediction

– Accepted

**Abstract: **Online sequence prediction is the problem of predicting the next element of a sequence given previous elements. This problem has been extensively studied in the context of individual sequence prediction, where no prior assumptions are made on the origin of the sequence. Individual sequence prediction algorithms work quite well for long sequences, where the algorithm has enough time to learn the temporal structure of the sequence. However, they might give poor predictions for short sequences. A possible remedy is to rely on the general model of prediction with expert advice, where the learner has access to a set of *r* experts, each of which makes its own predictions on the sequence. It is well known that it is possible to predict almost as well as the best expert if the sequence length is order of * log (r)*. But, without firm prior knowledge on the problem, it is not clear how to choose a small set of *good* experts. In this paper we describe and analyze a new algorithm that learns a good set of experts using a training set of previously observed sequences. We demonstrate the merits of our approach by experimenting with the task of click prediction on the web.

## Efficient Active Algorithms for Hierarchical Clustering

– Accepted

**Abstract: **Advances in sensing technologies and the growth of the internet have resulted in an explosion in the size of modern datasets, while storage and processing power continue to lag behind. This motivates the need for algorithms that are efficient, both in terms of the number of measurements needed and running time. To combat the challenges associated with large datasets, we propose a general framework for active hierarchical clustering that repeatedly runs an off-the-shelf clustering algorithm on small subsets of the data and comes with guarantees on performance, measurement complexity and runtime complexity. We instantiate this framework with a simple spectral clustering algorithm and provide concrete results on its performance, showing that, under some assumptions, this algorithm recovers all clusters of size Omega(log n) using O(n log^2 n) similarities and runs in O(n log^3 n) time for a dataset of n objects. Through extensive experimentation we also demonstrate that this framework is practically alluring.

## Copula Mixture Model for Dependency-seeking Clustering

– Accepted

**Abstract: **We introduce a copula mixture model to perform dependency-seeking clustering when co-occurring samples from different data sources are available. The model takes advantage of the great flexibility offered by the copulas framework to extend mixtures of Canonical Correlation Analysis to multivariate data with arbitrary continuous marginal densities. We formulate our model as a non-parametric Bayesian mixture, while providing efficient MCMC inference. Experiments on synthetic and real data demonstrate that the increased flexibility of the copula mixture significantly improves the clustering and the interpretability of the results.

## The Landmark Selection Method for Multiple Output Prediction

– Accepted

**Abstract: **Conditional modeling x → y is a central problem in machine learning. A substantial research effort is devoted to such modeling when x is high dimensional. We consider, instead, the case of a high dimensional y, where x is either low dimensional or high dimensional. Our approach is based on selecting a small subset y_L of the dimensions of y, and proceed by modeling (i) x → y_L and (ii) y_L → y. Composing these two models, we obtain a conditional model x → y that possesses convenient statistical properties. Classification and regression experiments on multiple datasets show that this model outperforms the one vs. all approach as well as several sophisticated multiple output prediction methods.

## Subgraph Matching Kernels for Attributed Graphs

– Accepted

**Abstract: **We propose graph kernels based on subgraph matchings, i.e. structure-preserving bijections between subgraphs. While recently proposed kernels based on common subgraphs (Wale et al., 2008; Shervashidze et al., 2009) in general can not be applied to attributed graphs, our approach allows to rate mappings of subgraphs by a ﬂexible scoring scheme comparing vertex and edge attributes by kernels. We show that subgraph matching kernels generalize several known kernels. To compute the kernel we propose a graph-theoretical algorithm inspired by a classical relation between common subgraphs of two graphs and cliques in their product graph observed by Levi (1973). Encouraging experimental results on a classiﬁcation task of real-world graphs are presented.

## Adaptive Canonical Correlation Analysis Based On Matrix Manifolds

– Accepted

**Abstract: **In this paper, we formulate the Canonical Correlation Analysis (CCA) problem on matrix manifolds. This framework provides a natural way for dealing with matrix constraints and tools for building efficient algorithms even in an adaptive setting. Finally, an adaptive CCA algorithm is proposed and applied to a change detection problem in EEG signals.

## Batch Active Learning via Coordinated Matching

– Accepted

**Abstract: **Most prior work on active learning of classifiers has focused on sequentially selecting one unlabeled example at a time to be labeled in order to reduce the overall labeling effort. In many scenarios, however, it is desirable to label an entire batch of examples at once, for example, when labels can be acquired in parallel. This motivates us to study batch active learning, which iteratively selects batches of *k>1* examples to be labeled. We propose a novel batch active learning method that leverages the availability of high-quality and efficient sequential active-learning policies by attempting to approximate their behavior when applied for *k* steps. Specifically, our algorithm first uses Monte-Carlo simulation to estimate the distribution of unlabeled examples selected by a sequential policy over *k* step executions. The algorithm then attempts to select a set of *k* examples that best matches this distribution, leading to a combinatorial optimization problem that we term “bounded coordinated matching'. While we show this problem is NP-hard in general, we give an efficient greedy solution, which inherits approximation bounds from supermodular minimization theory. Our experimental results on eight benchmark datasets show that the proposed approach is highly effective

## Hybrid Batch Bayesian Optimization

– Accepted

**Abstract: **Bayesian Optimization aims at optimizing an unknown non-convex/concave function that is costly to evaluate. We are interested in application scenarios where concurrent function evaluations are possible. Under such a setting, BO could choose to either sequentially evaluate the function, one input at a time and wait for the output of the function before making the next selection, or evaluate the function at a batch of multiple inputs at once. These two different settings are commonly referred to as the *sequential* and *batch* settings of Bayesian Optimization. In general, the sequential setting leads to better optimization performance as each function evaluation is selected with more information, whereas the batch setting has an advantage in terms of the total experimental time (the number of iterations). In this work, our goal is to combine the strength of both settings. Specifically, we systematically analyze Bayesian optimization using Gaussian process as the posterior estimator and provide a hybrid algorithm that, based on the current state, dynamically switches between a sequential policy and a batch policy with *variable* batch sizes. We provide theoretical justification for our algorithm and present experimental results on eight benchmark BO problems. The results show that our method achieves substantial speedup (up to *% 78*) compared to a pure sequential policy, without suffering any significant performance loss.

## Efficient and Practical Stochastic Subgradient Descent for Nuclear Norm Regularization

– Accepted

**Abstract: **We describe novel subgradient methods for a broad class of matrix optimization problems involving nuclear norm regularization. Unlike existing approaches, our method executes very cheap iterations by combining low-rank stochastic subgradients with efficient incremental SVD updates, made possible by highly optimized and parallelizable dense linear algebra operations on small matrices. Our practical algorithms always maintain a low-rank factorization of iterates that can be conveniently held in memory and efficiently multiplied to generate predictions in matrix completion settings. Empirical comparisons confirm that our approach is highly competitive with several recently proposed state-of-the-art solvers for such problems.

## Gap Filling in the Plant Kingdom—Trait Prediction Using Hierarchical Probabilistic Matrix Factorization

– Accepted

**Abstract: **Plant traits are a key to understand and predict the adaptation of ecosystems to environmental changes, which motivates the TRY project aiming at constructing a global database for plant traits and becoming a standard resource for the ecological community. Despite its unprecedented coverage, a large percentage of missing data substantially constrains joint trait analysis. Meanwhile, the trait data are characterized by the hierarchical phylogenetic structure of the plant kingdom. While factorization based matrix completion techniques have been widely used to address the missing data problem, traditional matrix factorization methods are unable to leverage the phylogenetic structure. We propose hierarchical probabilistic matrix factorization (HPMF), which effectively uses hierarchical phylogenetic information for trait prediction. We demonstrate HPMF's high accuracy, effectiveness of incorporating hierarchical structure and ability to capture trait correlation through experiments.

## Sparse Support Vector Infinite Push

– Accepted

**Abstract: **In this paper, we address the problem of embedded feature selection for ranking on top of the list problems. We pose this problem as a regularized empirical risk minimization with *p*-norm push loss function (*p=∞*) and sparsity inducing regularizers. We leverage the issues related to this challenging optimization problem by considering an alternating direction method of multipliers algorithm which is built upon proximal operators of the loss function and the regularizer. Our main technical contribution is thus to provide a numerical scheme for computing the infinite push loss function proximal operator. Experimental results on toy, DNA microarray and BCI problems show how our novel algorithm compares favorably to competitors for ranking on top while using fewer variables in the scoring function.

## A Dantzig Selector Approach to Temporal Difference Learning

– Accepted

**Abstract: **LSTD is one of the most popular reinforcement learning algorithms for value function approximation. Whenever the number of samples is larger than the number of features, LSTD must be paired with some form of regularization. In particular, L1-regularization methods tends to perform feature selection by promoting sparsity and thus they are particularly suited in high–dimensional problems. Nonetheless, since LSTD is not a simple regression algorithm but it solves a fixed–point problem, the integration with L1-regularization is not straightforward and it might come with some drawbacks (see e.g., the P-matrix assumption for LASSO-TD). In this paper we introduce a novel algorithm obtained by integrating LSTD with the Dantzig Selector. In particular, we investigate the performance of the algorithm and its relationship with existing regularized approaches, showing how it overcomes some of the drawbacks of existing solutions.

## Scaling Up Coordinate Descent Algorithms for Large *ℓ_1* Regularization Problems

– Accepted

**Abstract: **We present a generic framework for parallel coordinate descent (CD) algorithms that has as special cases the original sequential algorithms of Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm of Bradley et al. We introduce two novel parallel algorithms that are also special cases—Thread-Greedy CD and Coloring-Based CD—and give performance measurements for an OpenMP implementation of these.

## Cross-Domain Multitask Learning with Latent Probit Models

– Accepted

**Abstract: **Learning multiple tasks across heterogeneous domains is a challenging problem since the feature space may not be the same for different tasks. We assume the data in multiple tasks are generated from a latent common domain via sparse domain transforms and propose a latent probit model (LPM) to jointly learn the domain transforms, and the shared probit classifier in the common domain. To learn meaningful task relatedness and avoid over-fitting in classification, we introduce sparsity in the domain transforms matrices, as well as in the common classifier. We derive theoretical bounds for the estimation error of the classifier in terms of the sparsity of domain transforms. An expectation-maximization algorithm is derived for learning the LPM. The effectiveness of the approach is demonstrated on several real datasets.

## Structured Learning from Partial Annotations

– Accepted

**Abstract: **Structured learning is appropriate when predicting structured outputs such as trees, graphs, or sequences. Most prior work requires the training set to consist of complete trees, graphs or sequences. Specifying such detailed ground truth can be tedious or infeasible for large outputs. Our main contribution is a large margin formulation that makes structured learning from only partially annotated data possible. The resulting optimization problem is non-convex, yet can be efficiently solve by concave-convex procedure (CCCP) with novel speedup strategies. We apply our method to a challenging tracking-by-assignment problem of a variable number of divisible objects. On this benchmark, using only 25% of a full annotation we achieve a performance comparable to a model learned with a full annotation. Finally, we offer a unifying perspective of previous work using the hinge, ramp, or max loss for structured learning, followed by an empirical comparison on their practical performance.

## Maximum Margin Output Coding

– Accepted

**Abstract: **In this paper we study output coding for multi-label prediction. For a multi-label output coding to be discriminative, it is important that codewords for different label vectors are significantly different from each other. In the meantime, unlike in traditional coding theory, codewords in output coding are to be predicted from the input, so it is also critical to have a predictable label encoding. To find output codes that are both discriminative and predictable, we first propose a max-margin formulation that naturally captures these two properties. We then convert it to a metric learning formulation, but with an exponentially large number of constraints as commonly encountered in structured prediction problems. Without a label structure for tractable inference, we use overgenerating (i.e., relaxation) techniques combined with the cutting plane method for optimization. In our empirical study, the proposed output coding scheme outperforms a variety of existing multi-label prediction methods for image, text and music classification.

## Sequential Nonparametric Regression

– Accepted

**Abstract: **We present algorithms for nonparametric regression in settings where the data are obtained sequentially. While traditional estimators select bandwidths that depend upon the sample size, for sequential data the effective sample size is dynamically changing. We propose a linear time algorithm that adjusts the bandwidth for each new data point, and show that the estimator achieves the optimal minimax rate of convergence. We also propose the use of online expert mixing algorithms to adapt to unknown smoothness of the regression function. We provide simulations that confirm the theoretical results, and demonstrate the effectiveness of the methods.

## An Infinite Latent Attribute Model for Network Data

– Accepted

**Abstract: **Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a “flat” clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.

## On Local Regret

– Accepted

**Abstract: **Online learning typically aims to perform nearly as well as the best hypothesis in hindsight. For some hypothesis classes, though, even finding the best hypothesis offline is challenging. In such offline cases, local search techniques are often employed and only local optimality guaranteed. For online decision-making with such hypothesis classes, we introduce local regret, a generalization of regret that aims to perform nearly as well as only nearby hypotheses. We then present a general algorithm that can minimize local regret for arbitrary locality graphs. We also show that certain forms of structure in the graph can be exploited to drastically simplify learning. These algorithms are then demonstrated on a diverse set of online problems (some previously unexplored): online disjunct learning, online Max-SAT, and online decision tree learning.

## Smoothness and Structure Learning by Proxy

– Accepted

**Abstract: **As data sets grow in size, the ability of learning methods to find structure in them is increasingly hampered by the time needed to search the large spaces of possibilities and generate a score for each that takes all of the observed data into account. For instance, Bayesian networks, the model chosen in this paper, have a super-exponentially large search space for a fixed number of variables. One possible method to alleviate this problem is to use a proxy, such as a Gaussian Process regressor, in place of the true scoring function, training it on a selection of sampled networks. We prove here that the use of such a proxy is well-founded, as we can bound the smoothness of a commonly-used scoring function for Bayesian network structure learning. We show here that, compared to an identical search strategy using the network’s exact scores, our proxy-based search is able to get equivalent or better scores on a number of data sets in a fraction of the time.

## A fast and simple algorithm for training neural probabilistic language models

– Accepted

**Abstract: **Neural probabilistic language models (NPLMs) have recently superseded smoothed n-gram models as the best-performing model class for language modelling. Unfortunately, the adoption of NPLMs is held back by their notoriously long training times, which can be measured in weeks even for moderately-sized datasets. These are a consequence of the models being explicitly normalized, which leads to having to consider all words in the vocabulary when computing the log-likelihood gradients. We propose a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions. We investigate the behaviour of the algorithm on the Penn Treebank corpus and show that it reduces the training times by more than an order of magnitude without affecting the quality of the resulting models. The algorithm is also more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well. We demonstrate the scalability of the proposed approach by training several neural language models on a 47M-word corpus with a 80K-word vocabulary, obtaining state-of-the-art results in the Microsoft Research Sentence Completion Challenge.

## Incorporating Causal Prior Knowledge as Path-Constraints in Bayesian Networks and Maximal Ancestral Graphs

– Accepted

**Abstract: **We consider the incorporation of causal knowledge about the presence or absence of (possibly indirect) causal relations into a causal model. Such causal relations correspond to directed paths in a causal model. This type of knowledge naturally arises from experimental data, among others. Specifically, we consider the formalisms of Causal Bayesian Networks and Maximal Ancestral Graphs and their Markov equivalence classes: Partially Directed Acyclic Graphs and Partially Oriented Ancestral Graphs. We introduce sound and complete procedures which are able to incorporate causal prior knowledge in such models. In simulated experiments, we show that often considering even a few causal facts leads to a significant number of new inferences. In a case study, we also show how to use real experimental data to infer causal knowledge and incorporate it into a real biological causal network.

## High-Dimensional Covariance Decomposition into Sparse Markov and Independence Domains

– Accepted

**Abstract: **In this paper, we present a novel framework incorporating a combination of sparse models in different domains. We posit the observed data as generated from a linear combination of a sparse Gaussian Markov model (with a sparse precision matrix) and a sparse Gaussian independence model (with a sparse covariance matrix). We provide efficient methods for decomposition of the data into two domains, viz Markov and independence domains. We characterize a set of sufficient conditions for identifiability and model consistency. Our decomposition method is based on a simple modification of the popular *ℓ_1*-penalized maximum-likelihood estimator (*ℓ_1*-MLE), and is easily implementable. We establish that our estimator is consistent in both the domains, i.e., it successfully recovers the supports of both Markov and independence models, when the number of samples *n* scales as *n = Ω(d^2 log p)*, where *p* is the number of variables and *d* is the maximum node degree in the Markov model. Our conditions for recovery are comparable to those of *ℓ_1*-MLE for consistent estimation of a sparse Markov model, and thus, we guarantee successful high-dimensional estimation of a richer class of models under comparable conditions.

## Latent Collaborative Retrieval

– Accepted

**Abstract: **Retrieval tasks typically require a ranking of items given a query. Collaborative filtering tasks, on the other hand, learn models comparing users with items. In this paper we study the joint problem of recommending items to a user with respect to a given query, which is a surprisingly common task. This setup differs from the standard collaborative filtering one in that we are given a query × user × item tensor for training instead of the more traditional user × item matrix. Compared to document retrieval we do have a query, but we may or may not have content features (we will consider both cases) and we can also take account of the user’s profile. We introduce a factorized model for this new task that optimizes the top ranked items returned for the given query and user. We report empirical results where it outperforms several baselines.

## Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty

– Accepted

**Abstract: **We consider Markov decision processes under parameter uncertainty. Previous studies all restrict to the case that uncertainties among different states are uncoupled, which leads to conservative solutions. In contrast, we introduce an intuitive concept, termed 'Lightning Does not Strike Twice,' to model coupled uncertain parameters. Specifically, we require that the system can deviate from its nominal parameters only a bounded number of times. We give probabilistic guarantees indicating that this model represents real life situations and devise tractable algorithms for computing optimal control policies using this concept.

## On causal and anticausal learning

– Accepted

**Abstract: **We consider the problem of function estimation in the case where an underlying causal model can be identified. This has implications for popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some approaches for a given problem, and rule out others. In particular, we formulate a hypothesis for when semi-supervised learning can help, and corroborate it with empirical results.

## Compact Hyperplane Hashing with Bilinear Functions

– Accepted

**Abstract: **Hyperplane hashing aims at rapidly searching nearest points to a hyperplane, and has shown practical impact in scaling up active learning with SVMs. Unfortunately, the existing randomized methods need long hash codes and a number of hash tables to achieve reasonable search accuracy. Thus, they suffer from reduced search speed and large memory overhead. To this end, this paper proposes a novel hyperplane hashing technique which yields compact hash codes. The key idea is the bilinear form of the proposed hash functions, which leads to higher collision probability than the existing hyperplane hash functions when using random projections. To further increase the performance, we propose a learning based framework in which bilinear functions are directly learned from the data. This yields compact yet discriminative codes, and also increases the search performance over the random projection based solutions. Large-scale active learning experiments carried out on two datasets with up to one million samples demonstrate the overall superiority of the proposed approach.

## Continuous Inverse Optimal Control with Locally Optimal Examples

– Accepted

**Abstract: **Inverse optimal control, also known as inverse reinforcement learning, is the problem of recovering an unknown reward function in a Markov decision process from expert demonstrations of the optimal policy. We introduce a probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical. By using a local approximation of the reward function, our method can also drop the assumption that the demonstrations are globally optimal, requiring only local optimality. This allows it to learn from examples that are unsuitable for prior methods.

## Convex Multitask Learning with Flexible Task Clusters

– Accepted

**Abstract: **Traditionally, multitask learning (MTL) assumes that all the tasks are related. This can lead to negative transfer when tasks are indeed incoherent. Recently, a number of approaches have been proposed that alleviate this problem by discovering the underlying task clusters or relationships. However, they are limited to modeling these relationships at the task level,which may be restrictive in some applications. In this paper, we propose a novel MTL formulation that captures task relationships at the feature-level. Depending on the interactions among tasks and features, the proposed method construct different task clusters for different features, without even the need of pre-specifying the number of clusters. Computationally, the proposed formulation is strongly convex, and can be efficiently solved by accelerated proximal methods. Experiments are performed on a number of synthetic and real-world data sets. Under various degrees of task relationships, the accuracy of the proposed method is consistently among the best. Moreover, the feature-specific task clusters obtained agree with the known/plausible task structures of the data.

## A Hierarchical Dirichlet Process Model with Multiple Levels of Clustering for Human EEG Seizure Modeling

– Accepted

**Abstract: **Driven by the multi-level structure of human intracranial electroencephalogram (iEEG) recordings of epileptic seizures, we introduce a new variant of a hierarchical Dirichlet Process—the multi-level clustering hierarchical Dirichlet Process (MLC-HDP)—that simultaneously clusters datasets on multiple levels. Our seizure dataset contains brain activity recorded in typically more than a hundred individual channels for each seizure of each patient. The MLC-HDP model clusters over channels-types, seizure-types, and patient-types simultaneously. We describe this model and its implementation in detail. We also present the results of a simulation study comparing the MLC-HDP to a similar model, the Nested Dirichlet Process and finally demonstrate the MLC-HDP's use in modeling seizures across multiple patients. We find the MLC-HDP's clustering to be comparable to independent human physician clusterings. To our knowledge, the MLC-HDP model is the first in the epilepsy literature capable of clustering seizures within and between patients.

## Levy Measure Decompositions for the Beta and Gamma Processes

– Accepted

**Abstract: **We develop new representations for the Levy measures of the beta and gamma processes. These representations are manifested in terms of an infinite sum of well-behaved (proper) beta and gamma distributions. Further, we demonstrate how these infinite sums may be truncated in practice, and explicitly characterize truncation errors. We also perform an analysis of the characteristics of posterior distributions, based on the proposed decompositions. The decompositions provide new insights into the beta and gamma processes, and we demonstrate how the proposed representation unifies some properties of the two. This paper is meant to provide a rigorous foundation for and new perspectives on L´evy processes, as these are of increasing importance in machine learning.

## Building high-level features using large scale unsupervised learning

– Accepted

**Abstract: **We consider the challenge of building feature detectors for high-level concepts from only unlabeled data. For example, we would like to understand if it is possible to learn a face detector using only unlabeled images downloaded from the Internet. To answer this question, we trained a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (which has 10 million images, each image has 200x200 pixels). On contrary to what appears to be a widely-held negative belief, our experimental results reveal that it is possible to achieve a face detector via only unlabeled data. Control experiments show that the feature detector is robust not only to translation but also to scaling and 3D rotation. Also via recognition and visualization, we find that the same network is sensitive to other high-level concepts such as cat faces and human bodies.

## Near-Optimal BRL using Optimistic Local Transitions

– Accepted

**Abstract: **Model-based Bayesian Reinforcement Learning (BRL) allows a found formalization of the problem of acting optimally while facing an unknown environment, i.e., avoiding the exploration-exploitation dilemma. However, algorithms explicitly addressing BRL suffer from such a combinatorial explosion that a large body of work relies on heuristic algorithms. This paper introduces BOLT, a simple and (almost) deterministic heuristic algorithm for BRL which is optimistic about the transition function. We analyze BOLT's sample complexity, and show that under certain parameters, the algorithm is near-optimal in the Bayesian sense with high probability. Then, experimental results highlight the key differences of this method compared to previous work.

## A Unified Robust Classification Model

– Accepted

**Abstract: **A wide variety of machine learning algorithms such as support vector machine (SVM), minimax probability machine (MPM), and Fisher discriminant analysis (FDA), exist for binary classification. The purpose of this paper is to provide a unified classification model that includes the above models through a robust optimization approach. This unified model has several benefits. One is that the extensions and improvements intended for SVM become applicable to MPM and FDA, and vice versa. Another benefit is to provide theoretical results to above learning methods at once by dealing with the unified model. We give a statistical interpretation of the unified classification model and propose a non-convex optimization algorithm that can be applied to non-convex variants of existing learning methods.

## Manifold Relevance Determination

– Accepted

**Abstract: **In this paper we present a fully Bayesian latent variable model which exploits conditional non- linear (in)-dependence structures to learn an efficient latent representation. The model is capable of learning from extremely high-dimensional data such as directly modelling high resolution images. The latent representation is factorized to represent shared and private information from multiple views of the data. Bayesian techniques allow us to automatically estimate the dimensionality of the latent spaces. We demonstrate the model by prediction of human pose in an ambiguous setting. Our Bayesian representation allows us to perform disambiguation in a principled manner by including priors which incorporate the dynamics structure of the data. We demonstrate the ability of the model to capture structure underlying extremely high dimensional spaces by learning a low-dimensional representation of a set of facial images under different illumination conditions. The model correctly automatically creates a factorized representation where the lighting variance is represented in a separate latent space from the variance associated with different faces. We show that the model is capable of generating morphed faces and images from novel light directions.

## Residual Components Analysis

– Accepted

**Abstract: **Probabilistic principal component analysis (PPCA) seeks a low dimensional representation of a data set in the presence of independent spherical Gaussian noise, * Σ = σ^2I*. The maximum likelihood solution for the model is an eigenvalue problem on the sample covariance matrix. In this paper we consider the situation where the data variance is already partially explained by other factors, e.g. conditional dependencies between the covariates, or temporal correlations leaving some residual variance. We decompose the residual variance into its components through a generalised eigenvalue problem, which we call residual component analysis (RCA). We explore a range of new algorithms that arise from the framework, including one that factorises the covariance of a Gaussian density into a low-rank and a sparse-inverse component. We illustrate the ideas on the recovery of a protein-signaling network, a gene expression time-series data set and the recovery of the human skeleton from motion capture 3-D cloud data.

## Clustering to Maximize the Ratio of Split to Diameter

– Accepted

**Abstract: **Given a weighted and complete graph G = (V, E), V denotes the set of n objects to be clustered, and the weight d(u, v) associated with an edge (u, v) belonging to E denotes the dissimilarity between objects u and v. The diameter of a cluster is the maximum dissimilarity between pairs of objects in the cluster, and the split of a cluster is the minimum dissimilarity between objects within the cluster and objects outside the cluster. In this paper, we propose a new criterion for measuring the goodness of clusters:?the ratio of the minimum split to the maximum diameter, and the objective is to maximize the ratio. For k = 2, we present an exact algorithm. For k >= 3, we prove that the problem is NP-hard and present a factor of 2 approximation algorithm on the precondition that the weights associated with edges of G satisfy the triangle inequality. The worst-case runtime of both algorithms is O(n3). We compare the proposed algorithms with the Normalized Cut by applying them to image segmentation. The experimental results on both natural and synthetic images demonstrate the effectiveness of the proposed algorithms.

## A Graphical Model Formulation of Collaborative Filtering Neighbourhood Methods with Fast Maximum Entropy Training

– Accepted

**Abstract: **Item neighbourhood methods for collaborative filtering learn a weighted graph over the set of items, where each item is connected to those it is most similar to. The prediction of a user's rating on an item is then given by that rating of neighbouring items, weighted by their similarity. This paper presents a new neighbourhood approach which we call item fields, whereby an undirected graphical model is formed over the item graph. The resulting prediction rule is a simple generalization of the classical approaches, which takes into account non-local information in the graph, allowing its best results to be obtained when using drastically fewer edges than other neighbourhood approaches. A fast approximate maximum entropy training method based on the Bethe approximation is presented which utilizes a novel decomposition into tractable sub-problems. When using precomputed sufficient statistics on the Movielens dataset, our method outperforms maximum likelihood approaches by two orders of magnitude.

## On-Line Portfolio Selection with Moving Average Reversion

– Accepted

**Abstract: **On-line portfolio selection has attracted increasing interests in machine learning and AI communities recently. Empirical evidences show that stock's high and low prices are temporary and stock price relatives are likely to follow the mean reversion phenomenon. While the existing mean reversion strategies are shown to achieve good empirical performance on many real datasets, they often make the *single-period mean reversion* assumption, which is not always satisfied in some real datasets, leading to poor performance when the assumption does not hold. To overcome the limitation, this article proposes a *multiple-period mean reversion*, or so-called “Moving Average Reversion' (MAR), and a new on-line portfolio selection strategy named “On-Line Moving Average Reversion” (OLMAR), which exploits MAR by applying powerful online learning techniques. From our empirical results, we found that OLMAR can overcome the drawback of existing mean reversion algorithms and achieve significantly better results, especially on the datasets where the existing mean reversion algorithms failed. In addition to superior trading performance, OLMAR also runs extremely fast, further supporting its practical applicability to a wide range of applications.

## Improved Information Gain Estimates for Decision Tree Induction

– Accepted

**Abstract: **Ensembles of classification and regression trees remain popular machine learning methods because they define flexible non-parametric models that predict well and are computationally efficient both during training and testing. During induction of decision trees one aims to find predicates that are maximally informative about the prediction target. To select good predicates most approaches estimate an information-theoretic scoring function, the information gain, both for classification and regression problems. We point out that the common estimation procedures are biased and show that by replacing them with improved estimators of the discrete and the differential entropy we can obtain better decision trees. In effect our modifications yield improved predictive performance and are simple to implement in any decision tree code.

## Influence Maximization in Continuous Time Diffusion Networks

– Accepted

**Abstract: **The problem of finding the optimal set of source nodes in a diffusion network that maximizes the spread of information, influence, and diseases in a limited amount of time depends dramatically on the underlying temporal dynamics of the network. However, this still remains largely unexplored to date. To this end, given a network and its temporal dynamics, we first describe how continuous time Markov chains allow us to analytically compute the average total number of nodes reached by a diffusion process starting in a set of source nodes. We then show that selecting the set of most influential source nodes in the continuous time influence maximization problem is NP-hard and develop an efficient approximation algorithm with provable near-optimal performance. Experiments on synthetic and real diffusion networks show that our algorithm outperforms other state of the art algorithms by at least 20% and is robust across different network topologies.

## On the Size of the Online Kernel Sparsification Dictionary

– Accepted

**Abstract: **We analyze the size of the dictionary constructed from online kernel sparsification, using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Using this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we show that for bounded kernels, the size of the dictionary always grows sub-linearly in the number of data points, and, as a consequence, the kernel linear regressor constructed from the resulting dictionary is consistent.

## Multi-level Lasso for Sparse Multi-task Regression

– Accepted

**Abstract: **We present a flexible formulation for variable selection in multi-task regression to allow for discrepancies in the estimated sparsity patterns accross the multiple tasks, while leveraging the common structure among them. Our approach is based on an intuitive decomposition of the regression coefficients into a product between a component that is common to all tasks and another component that captures task-specificity. This decomposition yields the Multi-level Lasso objective that can be solved efficiently via alternating optimization. The analysis of the “orthonormal design” case reveals some interesting insights on the nature of the shrinkage performed by our method, compared to that of related work. Theoretical guarantees are provided on the consistency of Multi-level Lasso. Simulations and empirical study of micro-array data further demonstrate the value of our framework.

## Fast Computation of Subpath Kernel for Trees

– Accepted

**Abstract: **The kernel method is a potential approach to analyzing structured data such as sequences, trees, and graphs; however, unordered trees have not been investigated extensively. Kimura et al. (2011) proposed a kernel function for unordered trees on the basis of their subpaths, which are vertical substructures of trees responsible for hierarchical information in them. Their kernel exhibits practically good performance in terms of accuracy and speed; however, linear-time computation is not guaranteed theoretically, unlike the case of the other unordered tree kernel proposed by Vishwanathan and Smola (2003). In this paper, we propose a theoretically guaranteed linear-time kernel computation algorithm that is practically fast, and we present an efficient prediction algorithm whose running time depends only on the size of the input tree. Experimental results show that the proposed algorithms are quite efficient in practice.

## Total Variation and Euler's Elastica for Supervised Learning

– Accepted

**Abstract: **In recent years, total variation (TV) and Euler's elastica (EE) have been successfully applied to image processing tasks such as denoising and inpainting. This paper investigates how to extend TV and EE to the supervised learning settings on high dimensional data. The supervised learning problem can be formulated as an energy functional minimization under Tikhonov regularization scheme, where the energy is composed of a squared loss and a total variation smoothing (or Euler's elastica smoothing). Its solution via variational principles leads to an Euler-Lagrange PDE. However, the PDE is always high-dimensional and cannot be directly solved by common methods. Instead, radial basis functions are utilized to approximate the target function, reducing the problem to finding the linear coefficients of basis functions. We apply the proposed methods to supervised learning tasks (including binary classification, multi-class classification, and regression) on benchmark data sets. Extensive experiments have demonstrated promising results of the proposed methods.

## Learning the Dependence Graph of Time Series with Latent Factors

– Accepted

**Abstract: **This paper considers the problem of learning, from samples, the dependency structure of a system of linear stochastic differential equations, when some of the variables are latent. We observe the time evolution of some variables, and never observe other variables; from this, we would like to find the dependency structure of the observed variables – separating out the spurious interactions caused by the latent variables' time series. We develop a new convex optimization based method to do so in the case when the number of latent variables is smaller than the number of observed ones. For the case when the dependency structure between the observed variables is sparse, we theoretically establish a high-dimensional scaling result for structure recovery. We verify our theoretical result with both synthetic and real data (from the stock market).

## A Generalized Loop Correction Method for Approximate Inference in Graphical Models

– Accepted

**Abstract: **Belief Propagation (BP) is one of the most popular methods for inference in probabilistic graphical models. BP is guaranteed to return the correct answer for tree structures, but can be incorrect or non-convergent for loopy graphical models. Recently, several new approximate inference algorithms based on cavity distribution have been proposed. These methods can account for the effect of loops by incorporating the dependency between BP messages. Alternatively, region-based approximations (that lead to methods such as Generalized Belief Propagation) improve upon BP by considering interactions within small clusters of variables, thus taking small loops within these clusters into account. This paper introduces an approach, Generalized Loop Correction (GLC), that benefits from both of these types of loop correction. We show how GLC relates to these two families of inference methods, then provide empirical evidence that GLC works effectively in general, and can be significantly more accurate than both correction schemes.

## Consistent Covariance Selection From Data With Missing Values

– Accepted

**Abstract: ** Data sets with missing values arise in many practical problems and domains. However, correct statistical analysis of these data sets is difficult. A popular likelihood approach to statistical inference from partially observed data is the expectation maximization (EM) algorithm, which leads to non-convex optimization and estimates that are difficult to analyze theoretically. We study a simple two step procedure for covariance selection, which is tractable in high-dimensions and does not require imputation of the missing values. We provide rates of convergence for this estimator in the spectral norm, Frobenius norm and element-wise *ℓ_∞* norm. Simulation studies show that this estimator compares favorably with the EM algorithm. Our results have important practical consequences as they show that standard tools for covariance selection can be used when data contains missing values, without resorting to the iterative EM algorithm that can be slow to converge in practice for large problems.

## Is margin preserved after random projection?

– Accepted

**Abstract: **Random projections have been applied in many machine learning algorithms. However, whether margin is preserved after random projection is non-trivial and not well studied. In this paper we analyse margin distortion after random projection, and give the conditions of margin preservation for binary classification problems. We also extend our analysis to margin for multiclass problems, and provide theoretical bounds on multiclass margin on the projected data.

## A Bayesian Approach to Approximate Joint Diagonalization of Square Matrices

– Accepted

**Abstract: **We present a fully Bayesian approach to simultaneously approximate diagonalization of several square matrices which are not necessary symmetric. A Gibbs sampler has been derived for simulating the common eigenvectors and the eigenvalues for these matrices. Several data are used to demonstrate the performance of the proposed Gibbs sampler and we then provide comparisons to several other joint diagonalization algorithms, which shows that the Gibbs sampler achieves the state-of-the-art performance. As a byproduct, the output of the Gibbs sampler could be used to estimate the log marginal likelihood by the Bayesian information criterion (BIC) which correctly located the number of common eigenvectors. We then applied the Gibbs sampler to the blind sources separation problem, the common principal component analysis and the common spatial pattern analysis.

## Predicting accurate probabilities with a ranking loss

– Accepted

**Abstract: **In many real-world applications of machine learning classifiers, it is essential to predict the probability of an example belonging to a particular class. This paper proposes a simple technique for predicting probabilities based on optimizing a ranking loss, followed by isotonic regression. This semi-parametric technique offers both good ranking and regression performance, and models a richer set of probability distributions than statistical workhorses such as logistic regression. We provide experimental results that show the effectiveness of this technique on real-world applications of probability prediction.

## Learning with Augmented Features for Heterogeneous Domain Adaptation

– Accepted

**Abstract: **We propose a new learning method for heterogeneous domain adaptation (HDA), in which the data from the source domain and the target domain are represented by heterogeneous features with different dimensions. Using two different projection matrices, we first transform the data from two domains into a common subspace in order to measure the similarity between the data from two domains. We then propose two new feature mapping functions to augment the transformed data with their original features and zeros. The existing learning methods (e.g., SVM and SVR) can be readily incorporated with our newly proposed augmented feature representations to effectively utilize the data from both domains for HDA. Using the hinge loss function in SVM as an example, we introduce the detailed objective function in our method called Heterogeneous Feature Augmentation (HFA) for a linear case and also describe its kernelization in order to efficiently cope with the data with very high dimensions. Moreover, we also develop an alternating optimization algorithm to effectively solve the nontrivial optimization problem in our HFA method. Comprehensive experiments on two benchmark datasets clearly demonstrate that our HFA outperforms the existing HDA methods.

## Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data

– Accepted

**Abstract: **We describe a nonparametric topic model for labeled data. The model uses a mixture of random measures (MRM) as a base distribution of the Dirichlet process (DP) of the HDP framework, so we call it the DP-MRM. To model labeled data, we define a DP distributed random measure for each label, and the resulting model generates an unbounded number of topics for each label. We apply DP-MRM on single-labeled and multi-labeled corpora of documents and compare the performance on label prediction with LDA-SVM and Labeled-LDA. We further enhance the model by incorporating ddCRP and modeling multi-labeled images for image segmentation and object labeling, comparing the performance with nCuts and rddCRP.

## Evaluating Bayesian and L1 Approaches for Sparse Unsupervised Learning

– Accepted

**Abstract: ** The use of *L_1* regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of *L_1* methods, in this paper we find that *L_1* regularisation often dramatically under-performs in terms of predictive performance when compared with other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop *L_1* minimising factor models, Bayesian variants of “*L_1*”, and Bayesian models with a stronger *L_0*-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods outperform *L_1* minimisation, even on a computational budget. We thus highlight the need to re-assess the wide use of *L_1* methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.

## Collaborative Topic Regression with Social Matrix Factorization for Recommendation Systems

– Accepted

**Abstract: **Social network websites, such as Facebook, YouTube, Lastfm etc, have become a popular platform for users to connect with each other and share content or opinions. They provide rich information to study the influence of user's social circle in their decision process. In this paper, we are interested in examining the effectiveness of social network information to predict the user's ratings of items. We propose a novel hierarchical Bayesian model which jointly incorporates topic modeling and probabilistic matrix factorization of social networks. A major advantage of our model is to automatically infer useful latent topics and social information as well as their importance to collaborative filtering from the training data. Empirical experiments on two large-scale datasets show that our algorithm provides a more effective recommendation system than the state-of-the art approaches. Our results also reveal interesting insight that the social circles have more influence on people's decisions about the usefulness of information (e.g., bookmarking preference on Delicious) than personal taste (e.g., music preference on Lastfm).

## LPQP for MAP: Putting LP Solvers to Better Use

– Accepted

**Abstract: **MAP inference for general energy functions remains a challenging problem. While most efforts are channeled towards improving the linear programming (LP) based relaxation, this work is motivated by the quadratic programming (QP) relaxation. We propose a novel MAP relaxation that penalizes the Kullback-Leibler divergence between the LP pairwise auxiliary variables, and QP equivalent terms given by the product of the unaries. We develop two efficient algorithms based on variants of this relaxation. The algorithms minimize the non-convex objective using belief propagation and dual decomposition as building blocks. Experiments on synthetic and real-world data show that the solutions returned by our algorithms substantially improve over the LP relaxation.

## Clustering by Low-Rank Doubly Stochastic Matrix Decomposition

– Accepted

**Abstract: **Clustering analysis by nonnegative low-rank approximations has achieved remarkable progress in the past decade. However, most approximation approaches in this direction are still restricted to matrix factorization. We propose a new low-rank learning method to improve the clustering performance, which is beyond matrix factorization. The approximation is based on a two-step bipartite random walk through virtual cluster nodes, where the approximation is formed by only cluster assigning probabilities. Minimizing the approximation error measured by Kullback-Leibler divergence is equivalent to maximizing the likelihood of a discriminative model, which endows our method with a solid probabilistic interpretation. The optimization is implemented by a relaxed Majorization-Minimization algorithm that is advantageous in finding good local minima. Furthermore, we point out that the regularized algorithm with Dirichlet prior only serves as initialization. Experimental results show that the new method has strong performance in clustering purity for various datasets, especially for large-scale manifold data.

## Dependent Hierarchical Normalized Random Measures for Dynamic Topic Modeling

– Accepted

**Abstract: **We develop dependent hierarchical normalized random measures and apply them to dynamic topic modeling. The dependency arises via superposition, subsampling and point transition on the underlying Poisson processes of these measures. The measures used include normalised generalised Gamma processes that demonstrate power law properties, unlike Dirichlet processes used previously in dynamic topic modeling. Inference for the model includes adapting a recently developed slice sampler to directly manipulate the underlying Poisson process. Experiments performed on news, blogs, academic and Twitter collections demonstrate the technique gives superior perplexity over a number of previous models.

## State-Space Inference for Non-Linear Latent Force Models with Application to Satellite Orbit Prediction

– Accepted

**Abstract: **Latent force models (LFMs) are flexible models that combine mechanistic modelling principles (i.e., physical models) with non-parametric data-driven components. Several key applications of LFMs need non-linearities, which results in analytically intractable inference. In this work we show how non-linear LFMs can be represented as non-linear white noise driven state-space models and present an efficient non-linear Kalman filtering and smoothing based method for approximate state and parameter inference. We illustrate the performance of the proposed methodology via two simulated examples, and apply it to a real-world problem of long-term prediction of GPS satellite orbits.

## Sparse Additive Functional and Kernel CCA

– Accepted

**Abstract: **Canonical Correlation Analysis (CCA) is a classical tool for finding correlations among the components of two random vectors. In recent years, CCA has been widely applied to the analysis of genomic data, where it is common for researchers to perform multiple assays on a single set of patient samples. Recent work has proposed sparse variants of CCA to address the high dimensionality of such data. However, classical and sparse CCA are based on linear models, and are thus limited in their ability to find general correlations. In this paper, we present two approaches to high-dimensional nonparametric CCA, building on recent developments in high-dimensional nonparametric regression. We present estimation procedures for both approaches, and analyze their theoretical properties in the high-dimensional setting. We demonstrate the effectiveness of these procedures in discovering nonlinear correlations via extensive simulations, as well as through experiments with genomic data.

## The Kernelized Stochastic Batch Perceptron

– Accepted

**Abstract: **We present a novel approach for training kernel Support Vector Machines, establish learning runtime guarantees for our method that are better then those of any other known kernelized SVM optimization approach, and show that our method works well in practice compared to existing alternatives.

## Fast classification using sparse decision DAGs

– Accepted

**Abstract: **In this paper we propose an algorithm that builds sparse decision DAGs (directed acyclic graphs) out of a list of base classifiers provided by an external learning method such as AdaBoost. The basic idea is to cast the DAG design task as a Markov decision process. Each instance can decide to use or to skip each base classifier, based on the current state of the classifier being built. The result is a sparse decision DAG where the base classifiers are selected in a data-dependent way. The method has a single hyperparameter with a clear semantics of controlling the accuracy/speed trade-off. The algorithm is competitive with state-of-the-art cascade detectors on three object-detection benchmarks, and it clearly outperforms them in the regime of low number of base classifiers. Unlike cascades, it is also readily applicable for multi-class classification. Using the multi-class setup, we show on a benchmark web page ranking data set that we can significantly improve the decision speed without harming the performance of the ranker.

## A Combinatorial Algebraic Approach for the Identifiability of Low-Rank Matrix Completion

– Accepted

**Abstract: **In this paper, we review the problem of matrix completion and expose its intimate relations with algebraic geometry and combinatorial graph theory. We present the first necessary and sufficient combinatorial conditions for matrices of arbitrary rank to be identifiable from a set of matrix entries, yielding theoretical constraints and new algorithms for the problem of matrix completion. We conclude with algorithmically evaluating the tightness of the given conditions and algorithms for practically relevant matrix sizes, showing that the algebraic-combinatoric approach can lead to improvements over state-of-the-art matrix completion methods.

## Rethinking Collapsed Variational Bayes Inference for LDA

– Accepted

**Abstract: **We provide a unified view for existing inference algorithms of latent Dirichlet allocation by using the alpha-divergence projection. In particular, we explain the collapsed variational Bayes inference with zero-order information, called the CVB0 inference, in terms of the local alpha-divergence projection. We show that the CVB0 inference is composed of two different divergence projections: alpha=1 and -1.

## Exact Maximum Margin Structure Learning of Bayesian Networks

– Accepted

**Abstract: **Recently, there has been much interest in finding globally optimal Bayesian network structures. %, which is NP-hard in general. These techniques were developed for generative scores and can not be directly extended to discriminative scores, as desired for classification. In this paper, we propose an exact method for finding network structures maximizing the probabilistic soft margin, a successfully applied discriminative score. Our method is based on branch-and-bound techniques within a linear programming framework and maintains an any-time solution, together with worst-case suboptimality bounds. We introduce a set of order constraints for enforcing the network structure to be acyclic, which allows a compact problem representation and the use of general-purpose optimization techniques. In classification experiments, our methods clearly outperform generatively trained network structures and compete with support vector machines.

## AOSO-LogitBoost: Adaptive One-Vs-One LogitBoost for Multi-Class Problem

– Accepted

**Abstract: **This paper is dedicated to the improvement of model learning in multi-class LogitBoost for classification. Motivated by statistical view, LogitBoost can be seen as additive tree regression. Important facts in such a setting are 1) coupled classifier output as sum-to-zero constraint and 2) dense Hessian matrix arising in tree node split gain and node values fitting. On the one hand, the setting is too complicated for a tractable model learning algorithm; On the other hand, too aggressive simplification of the setting may lead to degraded performance. For example, the original LogitBoost is outperformed by ABC-LogitBoost due to the later's more careful treatment for the above two key points in problem settings. In this paper we propose improved methods to address the challenge: we adopt 1) vector tree (i.e. node value is vector) that enforces sum-to-zero constraint and 2) adaptive block coordinate descent exploiting dense Hessian when computing tree split gain and node values. Higher classification accuracy and faster convergence rate are observed for a range of public data sets when comparing to both original and ABC LogitBoost.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Hypothesis testing using pairwise distances and associated kernels

– Accepted

**Abstract: **We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on one hand, the energy distances and distance covariances from the statistics literature; on the other, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. The equivalence holds when energy distances are computed with semimetrics of negative type, in which case a kernel may be defined such that the RKHS distance between distributions corresponds exactly to the energy distance. We determine the class of probability distributions for which kernels induced by semimetrics are characteristic (that is, for which embeddings of the distributions to an RKHS are injective). Finally, we investigate the performance of this family of kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.

## Monte Carlo Bayesian Reinforcement Learning

– Accepted

**Abstract: **Bayesian reinforcement learning (BRL) encodes prior knowledge of the world in a model and represents uncertainty in model parameters by maintaining a posterior distribution over them. This paper proposes a simple and general approach to BRL. The idea is to sample a priori a finite set of hypotheses for the model parameter values and form a discrete partially observable Markov decision process (POMDP) whose state space is a cross product of the state space for the reinforcement learning task and the sampled model parameter space. The POMDP does not require conjugate distributions for belief representation, as earlier works do, and can be solved relatively easily with point-based approximation algorithms. Our approach naturally handles both fully and partially observable worlds. Theoretical and experimental results show that the discrete POMDP approximates the underlying BRL problem well with guaranteed performance.

## A Topic Model for Melodic Sequences

– Accepted

**Abstract: **We examine the problem of learning a probabilistic model for melody directly from musical sequences belonging to the same genre. This is a challenging task as one needs to tackle not only the complex statistical dependencies that characterize a music genre, but also the element of novelty, as each music piece is a unique realization of a musical form. To address this problem we introduce the Variable-gram Topic Model, which couples the latent topic formalism with a systematic model for contextual information. We evaluate the model using a next-step prediction task. Additionally, we present a novel way of model evaluation, where we directly compare model samples with data sequences using the Maximum Mean Discrepancy (MMD) of string kernels, to assess how close is the model distribution to the data distribution. We show that the model has the highest performance under both evaluation measures when compared to LDA, the Topic Bigram and related non-topic models.

## Output Space Search for Structured Prediction

– Accepted

**Abstract: **We consider a framework for structured prediction based on search in the space of complete structured outputs. Given a structured input, an output is produced by running a time-bounded search procedure, guided by a learned cost function, and then returning the least cost output uncovered during the search. This framework can be instantiated for a wide range of search spaces and search procedures, and easily incorporates arbitrary structured-prediction loss functions. In this paper, we make two main technical contributions within this framework. First, we describe a novel approach to automatically defining an effective search space over structured outputs, which is able to leverage the availability of powerful classification learning algorithms. In particular, we define the limited-discrepancy search space and relate the quality of that space to the quality of learned classifiers. Second, we give a generic cost function learning approach that can be instantiated for a wide range of search procedures. The key idea is to learn a cost function that attempts to mimic the behavior of conducting searches guided by the true loss function. Our experiments on six benchmark domains demonstrate that using our framework with only a small amount of search is sufficient for significantly improving on state-of-the-art structured-prediction performance.

## How To Grade a Test Without Knowing the Answers — A Bayesian Graphical Model for Adaptive Crowdsourcing and Aptitude Testing

– Accepted

**Abstract: **We propose a new probabilistic graphical model that jointly models the difficulties of questions, the abilities of participants and the correct answers to questions in aptitude testing and crowdsourcing settings. We devise an active learning/adaptive testing scheme based on a greedy minimization of expected model entropy, which allows a more efficient resource allocation by dynamically choosing the next question to be asked based on the previous responses. We present experimental results that confirm the ability of our model to infer the required parameters and demonstrate that the adaptive testing scheme requires fewer questions to obtain the same accuracy as a static test scenario.

## Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization

– Accepted

**Abstract: **Can one parallelize complex exploration–exploitation tradeoffs? As an example, consider the problem of optimal high-throughput experimental design, where we wish to sequentially design batches of experiments in order to simultaneously learn a surrogate function mapping stimulus to response, as well as identifying the maximum of the function. We formalize the task as a multi-armed bandit problem, where the unknown payoff function is sampled from a Gaussian process (GP), and instead of a single arm, in each round we pull a batch of several arms in parallel. We develop GP-BUCB, a principled algorithm for choosing batches, based on the GP-UCB algorithm for sequential GP optimization. We prove that, perhaps surprisingly, the cumulative regret of the parallel algorithm, compared to the sequential approach, only increases by a constant factor independent of the batch size B, for any fixed B. Our results provide rigorous theoretical support for exploiting parallelism in Bayesian global optimization. We demonstrate the effectiveness of our approach on two real-world applications.

## A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound

– Accepted

**Abstract: **In this work, we develop a simple algorithm for semi-supervised regression. The key idea is to use the top eigenfunctions of integral operator derived from both labeled and unlabeled examples as the basis functions and learn the prediction function by a simple linear regression. We show that under appropriate assumptions about the integral operator, this approach is able to achieve an improved regression error bound better than existing bounds of supervised learning. We also verify the effectiveness of the proposed algorithm by an empirical study.

## Bayesian Optimal Active Search and Surveying

– Accepted

**Abstract: **We consider two active binary-classification problems with atypical objectives. In the first, active search, our goal is to actively uncover as many members of a given class as possible. In the second, active surveying, our goal is to actively query points to ultimately predict the class proportion of a given class. Numerous real-world problems can be framed in these terms, and in either case typical model-based concerns such as generalization error are only of secondary importance. We approach these problems via Bayesian decision theory; after choosing natural utility functions, we derive the optimal policies. We provide three contributions. In addition to introducing the active surveying problem, we extend previous work on active search in two ways. First, we prove a novel theoretical result, that less-myopic approximations to the optimal policy can outperform more-myopic approximations by any arbitrary degree. We then derive bounds that for certain models allow us to reduce (in practice dramatically) the exponential search space required by a naïve implementation of the optimal policy, enabling further lookahead while still ensuring the optimal decisions are always made.

## Incorporating Domain Knowledge in Matching Problems via Harmonic Analysis

– Accepted

**Abstract: **Matching one set of objects to another is a ubiquitous task in machine learning and computer vision that often reduces to some form of the quadratic assignment problem (QAP). The QAP is known to be notoriously hard, both in theory and in practice. Here, we investigate if this difficulty can be mitigated when some additional piece of information is available: (a) that all QAP instances of interest come from the same application, and (b) the correct solution for a set of such QAP instances is given. We propose a new approach to accelerating the solution of QAPs based on *learning* parameters for a modified objective function from prior QAP instances. A key feature of our approach is that it takes advantage of the algebraic structure of permutations, in conjunction with special methods for optimizing functions over the symmetric group \m{\Sb_n} in Fourier space. Experiments show that in practical domains the new method can significantly outperform existing approaches.

## Lognormal and Gamma Mixed Negative Binomial Regression

– Accepted

**Abstract: **In regression analysis of counts, a lack of simple and efficient algorithms for posterior computation has made Bayesian approaches appear unattractive and thus underdeveloped. We propose a lognormal and gamma mixed negative binomial (NB) regression model for counts, and present efficient closed-form Bayesian inference; unlike conventional Poisson models, the proposed approach has two free parameters to include two different kinds of dispersion, and allows the incorporation of prior information, such as sparsity in the regression coefficients. By placing a gamma distribution prior on the NB dispersion parameter r, and connecting a lognormal distribution prior with the logit of the NB probability parameter p, efficient Gibbs sampling and variational Bayes inference are both developed. The closed-form updates are obtained by exploiting conditional conjugacy via both a compound Poisson representation and a Polya-Gamma distribution based data augmentation approach. The proposed Bayesian inference can be implemented routinely, while being easily generalizable to more complex settings involving multivariate dependence structures. The algorithms are illustrated using real examples.

## Greedy Algorithms for Sparse Reinforcement Learning

– Accepted

**Abstract: **Feature selection and regularization are becoming increasingly prominent tools in the efforts of the reinforcement learning (RL) community to expand the reach and applicability of RL. One approach to the problem of feature selection is to impose a sparsity-inducing form of regularization on the learning method. Recent work on *L_1* regularization has adapted techniques from the supervised learning literature for use with RL. Another approach that has received renewed attention in the supervised learning community is that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the *L_1* regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. This paper considers variants of orthogonal matching pursuit (OMP) applied to reinforcement learning. The resulting algorithms are analyzed and compared experimentally with existing *L_1* regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant, OMP-BRM, provides promising theoretical guarantees under certain assumptions on the feature dictionary. Another variant, OMP-TD, empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.

## Variance Function Estimation in High-dimensions

– Accepted

**Abstract: ** We consider the high-dimensional heteroscedastic regression model, where the mean and the log variance are modeled as a linear combination of input variables. Existing literature on high-dimensional linear regression models has largely ignored non-constant error variances, even though they commonly occur in a variety of applications ranging from biostatistics to finance. In this paper we study a class of non-convex penalized pseudolikelihood estimators for both the mean and variance parameters. We show that the Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer (HIPPO) achieves the oracle property, that is, we prove that the rates of convergence are the same as if the true model was known. We demonstrate numerical properties of the procedure on a simulation study and real world data.

## Conditional Sparse Coding and Grouped Multivariate Regression

– Accepted

**Abstract: **We study the problem of multivariate regression where the data are naturally grouped, and a regression matrix is to be estimated for each group. We propose an approach in which a dictionary of low rank parameter matrices is estimated across groups, and a sparse linear combination of the dictionary elements is estimated to form a model within each group. We refer to the method as conditional sparse coding since it is a coding procedure for the response vectors Y conditioned on the covariate vectors X. This approach captures the shared information across the groups while adapting to the structure within each group. It exploits the same intuition behind sparse coding that has been successfully developed in computer vision and computational neuroscience. We propose an algorithm for conditional sparse coding, analyze its theoretical properties in terms of predictive accuracy, and present the results of simulation experiments that compare the new technique to reduced rank regression.

## Apprenticeship Learning for Model Parameters of Partially Observable Environments

– Accepted

**Abstract: **We consider apprentice learning — i.e., having an agent learn a task by observing an expert demonstrating the task — in a partially observable environment when the model of the environment is uncertain. This setting is useful in applications where the explicit modeling of the environment is difficult, such as a dialogue system. We show that we can extract information about the environment model by inferring action selection process behind the demonstration, under the assumption that the expert is choosing optimal actions based on knowledge of the true model of the target environment. We show that our proposed algorithms can estimate the parameters of the environment model with much shorter demonstration compared to learning the model only from the reaction from the environment.

## Modeling Images using Transformed Indian Buffet Processes

– Accepted

**Abstract: **Latent feature models are attractive for image modeling; images generally contain multiple objects. However, many latent feature models ignore that objects can appear at different locations, or require pre-segmentation of images. While the transformed Indian buffet process (tIBP) provides a method for modeling transformation-invariant features in simple, unsegmented binary images, in its current form it is inappropriate for real images because of computational constraints and modeling assumptions. We combine the tIBP with likelihoods appropriate for real images. We also develop an efficient inference scheme using the cross-correlation between images and features that is both theoretically and empirically faster than existing inference techniques. We demonstrate that, using our method, we are able to discover reasonable components and achieve effective image reconstruction in natural images.

## Learning Object Arrangements in 3D Scenes using Human Context

– Accepted

**Abstract: **We consider the problem of learning object arrangements in a 3D scene. The key idea here is to learn how objects relate to human skeletons based their affordances, ease of use and reachability. We design appropriate density functions based on 3D spatial features to capture this. We then learn the distribution of human poses in a scene using Dirichlet processes. This allows our algorithm to reason about arrangement of the objects in the room. This is in contrast to previous approaches that learn context by learning object - object context such as co-occurrence relationships. We tested our algorithm extensively on 20 different rooms with a total of 47 objects. Our algorithm predicted correct placements with an error of 1.6 meters from the ground truth and received a score of 4.3/5 in arranging five real scenes compared to 3.7 of the best baseline.

## Cross Language Text Classification via Subspace Co-regularized Multi-view Learning

– Accepted

**Abstract: **In many multilingual text classification problems, the documents in different languages often share the same set of categories. To reduce the labeling cost of training a classification model for each individual language, it is important to transfer knowledge gained from the labeled data in one language to another language by conducting cross language classification. In this paper we develop a novel subspace co-regularized multi-view learning method for cross language text classification. This method is built on parallel corpora produced by machine translation. It jointly minimizes the training error of each classifier in each language while penalizing the distance between the subspace representations of parallel documents. Our empirical study on a large set of cross language text classification tasks shows the proposed method consistently outperforms the alternative inductive methods, domain adaptation methods, and multi-view learning methods.

## Plug-in martingales for testing exchangeability on-line

– Accepted

**Abstract: **A standard assumption in machine learning is the exchangeability of data; in other words, the examples are assumed to be generated from the same probability distribution independently. This paper is devoted to testing the assumption of exchangeability on-line: the examples arrive one by one, and after receiving each example we would like to have a valid measure of the degree to which the assumption of exchangeability has been falsified. Such measure are provided by exchangeability martingales. We extend known techniques for constructing exchangeability martingales and show that our new method is competitive to martingales introduced before. Finally we investigate the performance of our testing method on two benchmark datasets, USPS and Statlog Satellite data; for the former, the known techniques give satisfactory results, but for the latter our new more flexible method becomes necessary.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## An Iterative Locally Linear Embedding Algorithm

– Accepted

**Abstract: **Local Linear embedding(LLE) is a popular dimension reduction method. In this paper, we first show LLE with nonnegative constraint is equivalent to the widely used Laplacian embedding. We further propose to iterate the two steps in LLE repeatedly to improve the results. Thirdly, we relax the kNN constraint of LLE and present a sparse similarity learning algorithm. The final Iterative LLE combines these three improvements. Extensive experiment results show that iterative LLE algorithm significantly improve both classification and clustering results.

## Gaussian Process Quantile Regression using Expectation Propagation

– Accepted

**Abstract: **Direct quantile regression involves estimating a given quantile of a response variable as a function of input variables. We present a new framework for direct quantile regression where a Gaussian process model is learned, minimising the expected tilted loss function. The integration required in learning is not analytically tractable so to speed up the learning we employ the Expectation Propagation algorithm. We describe how this work relates to other quantile regression methods and apply the method on both synthetic and real data sets.

## Latent Multi-group Membership Graph Model

– Accepted

**Abstract: **We develop the Latent Multi-group Membership Graph (LMMG) model, a model of networks with rich node feature structure. In the LMMG model, each node belongs to multiple groups and each latent group models the occurrence of links as well as the node feature structure. The LMMG can be used to summarize the network structure, to predict links between the nodes, and to predict missing features of a node. We derive efficient inference and learning algorithms and evaluate the predictive performance of the LMMG on several social and document network datasets.

## Exponential Regret Bounds for Gaussian Process Bandits with Deterministic Observations

– Accepted

**Abstract: **This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations. The analysis uses a branch and bound algorithm that is related to the UCB algorithm of (Srinivas et al, 2010). For GPs with Gaussian observation noise, with variance strictly greater than zero, Srinivas et al proved that the regret vanishes at the approximate rate of *O(1/√t)*, where t is the number of observations. To complement their result, we attack the deterministic case and attain a much faster exponential convergence rate. Under some regularity assumptions, we show that the regret decreases asymptotically according to *O(e^{-τ t/(ln t)^{d/4}})* with high probability. Here, d is the dimension of the search space and tau is a constant that depends on the behaviour of the objective function near its global maximum.

## Estimating the Hessian by Back-propagating Curvature

– Accepted

**Abstract: **In this work we develop Curvature Propagation (CP), a general technique for efficiently computing unbiased approximations of the Hessian of any function that is computed using a computational graph. At the cost of roughly two gradient evaluations, CP can give a rank-1 approximation of the whole Hessian, and can be repeatedly applied to give increasingly precise unbiased estimates of any or all of the entries of the Hessian. Of particular interest is the diagonal of the Hessian, for which no general approach is known to exist that is both efficient and accurate. We show in experiments that CP turns out to work well in practice, giving very accurate estimates of the Hessian of neural networks, for example, with a relatively small amount of work. We also apply CP to Score Matching, where a diagonal of a Hessian plays an integral role in the Score Matching objective, and where it is usually computed exactly using inefficient algorithms which do not scale to larger and more complex models.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Feature Selection via Probabilistic Outputs

– Accepted

**Abstract: **This paper investigates two feature-scoring criteria that make use of estimated class probabilities: one method proposed by *shen* and a complementary approach proposed below. We develop a theoretical framework to analyze each criterion and show that both estimate the spread (across all values of a given feature) of the probability that an example belongs to the positive class. Based on our analysis, we predict when each scoring technique will be advantageous over the other and give empirical results validating those predictions.

## Hierarchical Exploration for Accelerating Contextual Bandits

– Accepted

**Abstract: **Contextual bandit learning is a popular approach to optimizing recommender systems, but can be slow to converge in practice due to the need for explorating a large feature space. In this paper, we propose a coarse-to-fine hierarchical approach for encoding prior knowl- 016 edge which drastically reduces the amount of exploration required. Intuitively, user preferences can be reasonably embedded in a coarse low-dimensional feature space that can be explored efficiently, requiring exploration in the high-dimensional space only as necessary. We introduce a bandit algorithm for exploring within this coarse-to-fine spectrum, and prove performance guarantees that depend on how well the coarser feature space captures the user’s preferences. We demonstrate substantial improvement over conventional bandit algorithms through extensive simulation as well as a live user study in the setting of personalized news recommendation.

## Fast Training of Nonlinear Embedding Algorithms

– Accepted

**Abstract: **Stochastic neighbor embedding and related nonlinear manifold learning algorithms achieve high-quality low-dimensional representations of similarity data, but are notoriously slow to train. We propose a generic formulation of embedding algorithms that includes SNE and other existing algorithms, and study their relation with spectral methods and graph Laplacians. This allows us to define several partial-Hessian optimization strategies, characterize their global and local convergence, and evaluate them empirically. We achieve up to two orders of magnitude speedup over existing training methods with a strategy that adds nearly no overhead to the gradient and yet is simple, scalable and applicable to several existing and future embedding algorithms.

## Fast Prediction of New Feature Utility

– Accepted

**Abstract: **We study the new feature utility prediction problem: statistically testing whether adding a new feature to the data representation can improve predictive accuracy on a supervised learning task. In many applications, identifying new informative features is the primary pathway for improving performance. However, evaluating every potential feature by re-training the predictor with it can be costly computationally, logistically and financially. The paper describes an efficient, learner-independent technique for estimating new feature utility without re-training based on the current predictor's outputs. The method is obtained by deriving a connection between loss reduction potential and the new feature's correlation with the loss gradient of the current predictor. This leads to a simple yet powerful hypothesis testing procedure, for which we prove consistency. Our theoretical analysis is accompanied by empirical evaluation on standard benchmarks and a large-scale industrial dataset.

## Robust Classification with Adiabatic Quantum Optimization

– Accepted

**Abstract: **We propose a non-convex training objective for robust binary classification of data sets in which label noise is present. The design is guided by the intention of solving the resulting problem by adiabatic quantum optimization. Two choices are prompted by the engineering constraints of existing quantum hardware: training problems are formulated as quadratic unconstrained binary optimization; and model parameters are represented as binary expansions of low bit-depth. In the present work we validate this approach by using a heuristic classical solver as a stand-in for quantum hardware. Testing on several popular data sets and comparing with a number of existing convex and non-convex losses solved by convex optimization, we find substantial advantages in robustness as measured by test error under increasing label noise. Robustness is enabled by the non-convexity of our hardware-compatible loss function, which we name * q-loss*. We emphasize that we do not claim any theoretical superiority of

*q*-loss over other non-convex losses. In fact if solving them to optimality was possible, they could be just as good or better, but

*q*-loss has one important property that all others are lacking—namely that it is compatible with quantum hardware.

## Agglomerative Bregman Clustering

– Accepted

**Abstract: **This manuscript develops the theory of agglomerative clustering with Bregman divergences. Geometric smoothing techniques are developed to deal with degenerate clusters. To allow for cluster models based on exponential families with overcomplete representations, Bregman divergences are developed for nondifferentiable convex functions.

## Isoelastic Agents and Wealth Updates in Machine Learning Markets

– Accepted

**Abstract: **Recently, prediction markets have shown considerable promise for developing flexible mechanisms for machine learning. In this paper agents with isoelastic utilities are considered, and it is shown that the costs associated with homogeneous markets of agents with isoelastic utilities produce equilibrium prices corresponding to alpha-mixtures, with a particular form of mixing component relating to each agent's wealth. We also demonstrate that wealth accumulation for logarithmic and other isoelastic agents (through payoffs on prediction of training targets) can implement both Bayesian model updates and mixture weight updates by imposing different market payoff structures. An efficient variational algorithm is given for market equilibrium computation. We demonstrate that inhomogeneous markets of agents with isoelastic utilities outperform state of the art aggregate classifiers such as random forests, as well as single classifiers (neural networks, decision trees) on a number of machine learning benchmarks, and show that isoelastic combination methods are generally better than using logarithmic agents.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Quasi-Newton Methods: A New Direction

– Accepted

**Abstract: **Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors.

## No-Regret Learning in Extensive-Form Games with Imperfect Recall

– Accepted

**Abstract: **Counterfactual Regret Minimization (CFR) is an efficient no-regret learning algorithm for decision problems modeled as extensive games. CFR's regret bounds depend on the requirement of perfect recall: players always remember information that was revealed to them and the order in which it was revealed. In games without perfect recall, however, CFR's guarantees do not apply. In this paper, we present the first regret bound for CFR when applied to a general class of games with imperfect recall. In addition, we show that CFR applied to any abstraction belonging to our general class results in a regret bound not just for the abstract game, but for the full game as well. We verify our theory and show how imperfect recall can be used to trade a small increase in regret for a significant reduction in memory in three domains: die-roll poker, phantom tic-tac-toe, and Bluff.

## Information-theoretic Semi-supervised Metric Learning via Entropy Regularization

– Accepted

**Abstract: **We propose a general information-theoretic approach called Seraph for semi-supervised metric learning that does not rely upon the manifold assumption. It maximizes/minimizes entropies of labeled/unlabeled data pairs in the supervised/unsupervised part, which allows these two parts to be integrated in a natural and meaningful way. Furthermore, it is equipped with the hyper-sparsity: Given a certain probabilistic model parameterized by the learned metric, the posterior distribution and the resultant projection matrix are both `sparse'. Consequently, the metric learned by Seraph possesses high discriminability even under a noisy environment. The optimization problem of Seraph can be solved efficiently and stably by an EM-like scheme, where the E-Step is analytical and the M-Step is convex and `smooth'. Experiments demonstrate that Seraph compares favorably with many well-known metric learning methods.

## Fast approximation of matrix coherence and statistical leverage

– Accepted

**Abstract: **The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities have been of interest in recently-popular problems such as matrix completion and Nystrom-based low-rank matrix approximation; in large-scale statistical data analysis applications more generally; and since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n >> d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs in O(nd log n) time, as opposed to the O(nd2) time required by the naıve algorithm that involves computing an orthogonal basis for the range of A. This resolves an open question from (Drineas et al., 2006b) and (Mohri & Talwalkar, 2011); and our result leads to immediate improvements in coreset-based L2-regression, the estimation of the coherence of a matrix, and several related low-rank matrix problems. Interestingly, to achieve our result we judiciously apply random projections on both sides of A.

## Predicting Manhole Events in New York City

– Not for proceedings

**Abstract: **We present a knowledge discovery and data mining process developed as part of the Columbia/Con Edison project on manhole event prediction. This process can assist with real-world prioritization problems that involve raw data in the form of noisy documents requiring significant amounts of pre-processing. The documents are linked to a set of instances to be ranked according to prediction criteria. In the case of manhole event prediction, which is a new application for machine learning, the goal is to rank the electrical grid structures in New York City (manholes and service boxes) according to their vulnerability to serious manhole events such as fires, explosions and smoking manholes. Our ranking results are currently used to help prioritize repair work on the Manhattan, Brooklyn, and Bronx electrical grids.

discussion+video
~~pdf~~

## Artist Agent: A Reinforcement Learning Approach to Automatic Stroke Generation in Oriental Ink Painting

– Accepted

**Abstract: **Oriental ink painting, called Sumi-e, is one of the most appealing painting styles that has attracted artists around the world. Major challenges in computer-based Sumi-e simulation are to abstract complex scene information and draw smooth and natural brush strokes. To automatically find such strokes, we propose to model the brush as a reinforcement learning agent, and learn desired brush-trajectories by maximizing the sum of rewards in the policy search framework. We also provide elaborate design of actions, states, and rewards tailored for a Sumi-e agent. The effectiveness of our proposed approach is demonstrated through simulated Sumi-e experiments.

## On multi-view feature learning

– Accepted

**Abstract: ** Sparse coding is a common approach to learning local features for object recognition. Recently, there has been an increasing interest in learning features from spatio-temporal, binocular, or other multi-observation data, where the goal is to encode the relationship between images rather than the content of a single image. We discuss the role of multiplicative interactions and of squaring non-linearities in learning such relations. In particular, we show that training a sparse coding model whose filter responses are multiplied or squared amounts to jointly diagonalizing a set of matrices that encode image transformations. Inference amounts to detecting rotations in the shared eigenspaces. Our analysis helps explain recent experimental results showing that Fourier features and circular Fourier features emerge when training complex cell models on translating or rotating images. It also shows how learning about transformations makes it possible to learn invariant features.

## Fast Bounded Online Gradient Descent Algorithms for Scalable Kernel-Based Online Learning

– Accepted

**Abstract: **Although kernel-based online learning has shown promising performance in a number of studies, its main shortcoming is the unbounded number of support vectors, making it non-scalable and unsuitable for large-scale datasets. In this work, we study the problem of bounded kernel-based online learning that aims to constrain the number of support vectors by a predefined budget. Although several algorithms have been proposed, they are mostly based on the Perceptron algorithm and only provide bounds for the number of classification mistakes. To address this limitation, we propose a framework for bounded kernel-based online learning based on online gradient descent approach. We propose two efficient algorithms of bounded online gradient descent (BOGD) for bounded kernel-based online learning: (i) BOGD using uniform sampling and (ii) BOGD++ using non-uniform sampling. We present theoretical analysis of regret bound for both algorithms, show the promising empirical performance of the proposed algorithms by comparing them to the state-of-the-art algorithms for bounded kernel-based online learning.

## A Hybrid Algorithm for Convex Semidefinite Optimization

– Accepted

**Abstract: **We present a hybrid algorithm for optimizing a convex, smooth function over the cone of positive semidefinite matrices. Our algorithm converges to the global optimal solution and can be used to solve general large-scale semidefinite programs and hence can be readily applied to a variety of machine learning problems. We show experimental results on three machine learning problems (matrix completion, metric learning, and sparse PCA) . Our approach outperforms state-of-the-art algorithms.

## A Complete Analysis of the l_1,p Group-Lasso

– Accepted

**Abstract: **The Group-Lasso is a well-known tool for joint regularization in machine learning methods. While the l_{1,2} and the l_{1,∞} version have been studied in detail and efficient algorithms exist, there are still open questions regarding other l_{1,p} variants. We characterize conditions for solutions of the l_{1,p} Group-Lasso for all p-norms with 1 <= p <= ∞, and we present a unified active set algorithm. For all p-norms, a highly efficient projected gradient algorithm is presented. This new algorithm enables us to compare the prediction performance of many variants of the Group-Lasso in a multi-task learning setting, where the aim is to solve many learning problems in parallel which are coupled via the Group-Lasso constraint. We conduct large-scale experiments on synthetic data and on two real-world data sets. In accordance with theoretical characterizations of the different norms we observe that the weak-coupling norms with p between 1.5 and 2 consistently outperform the strong-coupling norms with p >> 2.

## Convergence Rates of Biased Stochastic Optimization for Learning Sparse Ising Models

– Accepted

**Abstract: **We study the convergence rate of stochastic optimization of exact (NP-hard) objectives, for which only biased estimates of the gradient are available. We motivate this problem in the context of learning the structure and parameters of Ising models. We first provide a convergence-rate analysis of deterministic errors for forward-backward splitting (FBS). We then extend our analysis to biased stochastic errors, by first characterizing a family of samplers and providing a high probability bound that allows understanding not only FBS, but also proximal gradient (PG) methods. We derive some interesting conclusions: FBS requires only a logarithmically increasing number of random samples in order to converge (although at a very low rate); the required number of random samples is the same for the deterministic and the biased stochastic setting for FBS and basic PG; accelerated PG is not guaranteed to converge in the biased stochastic setting.

## Decoupling Exploration and Exploitation in Multi-Armed Bandits

– Accepted

**Abstract: **We consider a multi-armed bandit problem where the decision maker can explore and exploit different arms at every round. The exploited arm adds to the decision maker's cumulative reward (without necessarily observing the reward) while the explored arm reveals its value. We devise algorithms for this setup and show that the dependence on the number of arms can be much better than the standard square root of the number of arms dependence, according to the behavior of the arms' reward sequences. For the important case of piecewise stationary stochastic bandits, we show a significant improvement over existing algorithms. Our algorithms are based on a non-uniform sampling policy, which we show is essential to the success of any algorithm in the adversarial setup. We finally show some simulation results on an ultra-wide band channel selection inspired setting indicating the applicability of our algorithms.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Learning to Identify Regular Expressions that Describe Email Campaigns

– Accepted

**Abstract: **This paper addresses the problem of inferring a regular expression from a given set of strings that resembles, as closely as possible, the regular expression that a human expert would have written to identify the language. This is motivated by our goal of automating the task of postmasters of an email service who use regular expressions to describe and blacklist email spam campaigns. Training data contains batches of messages and corresponding regular expressions that an expert postmaster feels confident to blacklist. We model this task as a learning problem with structured output spaces and an appropriate loss function, derive a decoder and the resulting optimization problem, and a report on a case study conducted with an email service.

## Deep Mixtures of Factor Analysers

– Accepted

**Abstract: **An efficient way to learn deep density models that have many layers of latent variables is to learn one layer at a time using a model that has only one layer of latent variables. After learning each layer, samples from the posterior distributions for that layer are used as training data for learning the next layer. This approach is commonly used with Restricted Boltzmann Machines, which are *undirected* graphical models with a single hidden layer, but it can also be used with Mixtures of Factor Analysers (MFAs) which are *directed* graphical models. In this paper, we present a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs). Even though a DMFA can be converted to an equivalent shallow MFA by multiplying together the factor loading matrices at different levels, learning and inference are much more efficient in a DMFA and the sharing of each lower-level factor loading matrix by many different higher level MFAs prevents overfitting. We demonstrate empirically that DMFAs learn better density models than both MFAs and two types of Restricted Boltzmann Machines on a wide variety of datasets.

## Revisiting k-means: New Algorithms via Bayesian Nonparametrics

– Accepted

**Abstract: **Bayesian models offer great flexibility for clustering applications—Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for shared clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-means and mixtures of Gaussians, we show that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like clustering objective that includes a penalty for the number of clusters. We generalize this analysis to the case of clustering multiple data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We also discuss further extensions that highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that does not fix the number of clusters in the graph.

## Gaussian Process Regression Networks

– Accepted

**Abstract: **We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a 1000 dimensional gene expression dataset.

## Analysis of Kernel Mean Matching under Covariate Shift

– Accepted

**Abstract: **In real supervised learning scenarios, it is not uncommon that the training and test sample follow different probability distributions, thus rendering the necessity to correct the sampling bias. Focusing on a particular covariate shift problem, we derive high probability confidence bounds for the kernel mean matching (KMM) estimator, whose convergence rate turns out to depend on some regularity measure of the regression function and also on some capacity measure of the kernel. By comparing KMM with the natural plug-in estimator, we establish the superiority of the former hence provide concrete evidence/understanding to the effectiveness of KMM under covariate shift.

## Tighter Variational Representations of f-Divergences via Restriction to Probability Measures

– Accepted

**Abstract: **We show that the variational representations for f-divergences currently used in the literature can be tightened. This has implications to a number of methods recently proposed based on this representation. As an example application we use our tighter representation to derive a general f-divergence estimator based on two i.i.d. samples and derive the dual program for this estimator that performs well empirically. We also point out a connection between our estimator and MMD.

## Training Restricted Boltzmann Machines on Word Observations

– Accepted

**Abstract: **The restricted Boltzmann machine (RBM) is a flexible tool for modeling complex data, however there have been significant computational difficulties in using RBMs to model high-dimensional multinomial observations. In natural language processing applications, words are naturally modeled by K-ary discrete distributions, where K is determined by the vocabulary size and can easily be in the hundreds of thousands. The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue by employing a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K. We demonstrate the success of our approach by training RBMs on hundreds of millions of word n-grams using larger vocabularies than previously feasible and using the learned features to improve performance on chunking and sentiment classification tasks, achieving state-of-the-art results on the latter.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Bayesian Watermark Attacks

– Accepted

**Abstract: **This paper presents an application of statistical machine learning to the field of watermarking. We develop a new attack model on spread-spectrum watermarking systems, using Bayesian statistics. Our model jointly infers the watermark signal and embedded message bitstream, directly from the watermarked signal. No access to the watermark decoder is required. We develop an efficient Markov chain Monte Carlo sampler for updating the model parameters from their conjugate full conditional posteriors. We also provide a variational Bayesian solution, which further increases the convergence speed of the algorithm. Experiments with synthetic and real image signals demonstrate that the attack model is able to correctly infer a large part of the message bitstream, while at the same time obtaining a very accurate estimate of the watermark signal.

## Max-Margin Nonparametric Latent Feature Models for Link Prediction

– Accepted

**Abstract: **Recent work on probabilistic latent feature models have shown great promise in predicting unseen links in social network and relational data. We present a max-margin nonparametric latent feature model, which unites the ideas of max-margin learning and Bayesian nonparametrics to discover discriminative latent features for link prediction and automatically infer the unknown latent social dimension. By minimizing a hinge-loss using the linear expectation operator, we can perform posterior inference efficiently without dealing with a highly nonlinear link likelihood function; by using a fully-Bayesian formulation, we can avoid tuning regularization constants. Experimental results on real datasets appear to demonstrate the benefits inherited from max-margin learning and fully-Bayesian nonparametric inference.

## Discriminative Probabilistic Prototype Learning

– Accepted

**Abstract: **In this paper we propose a simple yet powerful method for learning representations in supervised learning scenarios where each original input datapoint is described by a set of vectors and their associated outputs may be given by soft labels indicating, for example, class probabilities. We represent an input datapoint as a mixture of probabilities over the corresponding set of feature vectors where each probability indicates how likely each vector is to belong to an unknown prototype pattern. We propose a probabilistic model that parameterizes these prototype patterns in terms of hidden variables and therefore it can be trained with conventional approaches based on likelihood maximization. More importantly, both the model parameters and the prototype patterns can be learned from data in a discriminative way. We show that our model can be seen as a probabilistic generalization of learning vector quantization (LVQ). We apply our method to the problems of shape classification, hyperspectral imaging classification and people's work class categorization, showing the superior performance of our method compared to the standard prototype-based classification approach and other competitive benchmark methods.

## Sparse-GEV: Sparse Latent Space Model for Multivariate Extreme Value Time Serie Modeling

– Accepted

**Abstract: **In many applications of time series models, such as climate analysis or social media analysis, we are often interested in extreme events, such as heatwave, wind gust, and burst of topics. These time series data usually exhibit a heavy-tailed distribution rather than a normal distribution. This poses great challenges to existing approaches due to the significantly different assumptions on the data distributions and the lack of sufficient past data on extreme events. In this paper, we propose the sparse-GEV model, a latent state model based on the theory of extreme value modeling to automatically learn sparse temporal dependence and make predictions. Our model is theoretically significant because it is among the first models to learn *sparse* temporal dependencies between multivariate extreme value time series. We demonstrate the superior performance of our algorithm compared with state-of-art methods, including Granger causality, copula approach, and transfer entropy, on one synthetic dataset, one climate dataset and one Twitter dataset.

## Factorized Asymptotic Bayesian Hidden Markov Models

– Accepted

**Abstract: **This paper addresses the issue of model selection for hidden Markov models (HMMs). We generalize factorized asymptotic Bayesian inference (FAB), which has been recently developed for model selection on independent hidden variables (i.e., mixture models), for time-dependent hidden variables. As with FAB in mixture models, FAB for HMMs is derived as an iterative lower bound maximization algorithm of a factorized information criterion (FIC). It inherits, from FAB for mixture models, several desirable properties for learning HMMs, such as asymptotic consistency of FIC with marginal log-likelihood, a shrinkage effect for hidden state selection, monotonic increase of the lower FIC bound through the iterative optimization. Further, it does not have a tunable hyper-parameter, and thus its model selection process can be fully automated. Experimental results shows that FAB outperforms states-of-the-art variational Bayesian HMM and non-parametric Bayesian HMM in terms of model selection accuracy and computational efficiency.

## PAC-Bayesian Generalization Bound on Confusion Matrix for Multi-Class Classification

– Accepted

**Abstract: **In this paper, we propose a PAC-Bayes bound for the generalization risk of the Gibbs classifier in the multi-class classification framework. The novelty of our work is the critical use of the confusion matrix of a classifier as an error measure; this puts our contribution in the line of work aiming at dealing with performance measure that are richer than mere scalar criterion such as the misclassification rate. Thanks to very recent and beautiful results on matrix concentration inequalities, we derive two bounds showing that the true confusion risk of the Gibbs classifier is upper-bounded by its empirical risk plus a term depending on the number of training examples in each class. To the best of our knowledge, this is the first PAC-Bayes bounds based on confusion matrices.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Semi-Supervised Learning of Class Balance under Class-Prior Change by Distribution Matching

– Accepted

**Abstract: **In real-world classification problems, the class balance in the training dataset does not necessarily reflect that of the test dataset, which can cause significant estimation bias. If the class ratio of the test dataset is known, instance re-weighting or resampling allows systematical bias correction. However, learning the class ratio of the test dataset is challenging when no labeled data is available from the test domain. In this paper, we propose to estimate the class ratio in the test dataset by matching probability distributions of training and test input data. We demonstrate the utility of the proposed approach through experiments.

## A Proximal-Gradient Homotopy Method for the L1-Regularized Least-Squares Problem

– Accepted

**Abstract: **We consider the *ℓ_1*-regularized least-squares problem for sparse recovery and compressed sensing. Since the objective function is not strongly convex, standard proximal gradient methods only achieve sublinear convergence. We propose a homotopy continuation strategy, which employs a proximal gradient method to solve the problem with a sequence of decreasing regularization parameters. It is shown that under common assumptions in compressed sensing, the proposed method ensures that all iterates along the homotopy solution path are sparse, and the objective function is effectively strongly convex along the solution path. This observation allows us to obtain a global geometric convergence rate for the procedure. Empirical results are presented to support our theoretical analysis.

## Comparison-Based Learning with Rank Nets

– Accepted

**Abstract: **We consider the problem of search through comparisons, where a user is presented with two candidate objects and reveals which is closer to her intended target. We study adaptive strategies for finding the target, that require knowledge of rank relationships but not actual distances between objects. We propose a new strategy based on rank nets, and show that for target distributions with a bounded *doubling constant*, it finds the target in a number of comparisons close to the entropy of the target distribution and, hence, of the optimum. We extend these results to the case of noisy oracles, and compare this strategy to prior art over multiple datasets.

## Semi-Supervised Collective Classification via Hybrid Label Regularization

– Accepted

**Abstract: **Many classification problems involve data instances that are interlinked with each other, such as webpages connected by hyperlinks. Techniques for 'collective classification' (CC) often increase accuracy for such data graphs, but usually require a fully-labeled training graph. In contrast, we examine how to improve the semi-supervised learning of CC models when given only a sparsely-labeled graph, a common situation. We first describe how to use novel combinations of classifiers to exploit the different characteristics of the relational features vs. the non-relational features. We also extend the ideas of 'label regularization' to such hybrid classifiers, enabling them to leverage the unlabeled data to bias the learning process. We find that these techniques, which are efficient and easy to implement, significantly increase accuracy on three real datasets. In addition, our results explain conflicting findings from prior related studies.

## Shortest path distance in random k-nearest neighbor graphs

– Accepted

**Abstract: **Consider a weighted or unweighted k-nearest neighbor graph that has been built on n data points drawn randomly according to some density p on R^d. We study the convergence of the shortest path distance in such graphs as the sample size tends to infinity. We prove that for unweighted kNN graphs, this distance converges to an unpleasant distance function on the underlying space whose properties are detrimental to machine learning. We also study the behavior of the shortest path distance in weighted kNN graphs.

## A Split-Merge Framework for Comparing Clusterings

– Accepted

**Abstract: **Clustering evaluation measures are frequently used to evaluate the performance of algorithms. However, most measures are not suitable for this task mainly because they are not normalized properly and ignore some information in the inherent structure of clusterings such as dependency and consistency. We model two clusterings as a bipartite graph and propose a general component-based decomposition formula based on the components of the graph. Under this formula, most existing measures can be reinterpreted. In order to capture dependency and satisfy consistency in the component, we further propose a split-merge framework for comparing clusterings of different data sets. The proposed framework is conditionally normalized, flexible to utilize previous measures, and extensible to take into account data point information, such as feature vectors and pairwise distances. Moreover, experimental results on a real world data set demonstrate that one instantiated measure of the framework outperforms other measures.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Compositional Planning Using Optimal Option Models

– Accepted

**Abstract: **In this paper we introduce a framework for option model composition. Option models are temporal abstractions that, like macro-operators in classical planning, jump directly from a start state to an end state. Prior work has focused on constructing option models from primitive actions, by intra-option model learning; or on using option models to construct a value function, by inter-option planning. We present a unified view of intra- and inter-option model learning, based on a major generalisation of the Bellman equation. Our fundamental operation is the recursive composition of option models into other option models. This key idea enables compositional planning over many levels of abstraction. We illustrate our framework using a dynamic programming algorithm that simultaneously constructs optimal option models for multiple subgoals, and also searches over those option models to provide rapid progress towards other subgoals.

## Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation

– Accepted

**Abstract: **We study the problem of unsupervised domain adaptation, which aims to adapt classifiers trained on a labeled source domain to an unlabeled target domain. Many existing approaches first learn domain-invariant features and then construct classifiers with them. We propose a novel approach that jointly learn the both. Specifically, the proposed method identifies a feature space where data in the source and the target domains are similarly distributed. More importantly, our method learns a discriminative feature space, optimizing an information-theoretic metric as an proxy to the expected misclassification error on the target domain. We show how this optimization can be effectively carried out with simple gradient-based methods and how hyperparameters can be cross validated without demanding any labeled data from the target domain. Empirical studies on the benchmark tasks of visual object recognition and sentiment analysis validate our modeling assumptions and demonstrate significant improvement of our method over competing ones in classification accuracies.

## Flexible Modeling of Latent Task Structures in Multitask Learning

– Accepted

**Abstract: **When learning multiple related tasks with limited training data per task, it is natural to share parameters across tasks to improve generalization performance. Imposing the correct notion of task relatedness is critical to achieve this goal. However, one rarely knows the correct relatedness notion a priori. We present a generative model that is based on the weight vector of each task being drawn from a nonparametric mixture of nonparametric factor analyzers. The nonparametric aspect of the model makes it broad enough to subsume many types of latent task structures previously considered in the literature as special cases, and also allows it to learn more general task structures, addressing their individual shortcomings. This brings in considerable flexibility as compared to the commonly used multitask learning models that are based on some a priori fixed notion of task relatedness. We also present an efficient variational inference algorithm for our model. Experimental results on synthetic and real-world datasets, on both regression and classification problems, establish the efficacy of the proposed method.

## An Efficient Approach to Sparse Linear Discriminant Analysis

– Accepted

**Abstract: **We present the Group-Lasso Optimal Scoring Solver (GLOSS), a novel approach to the formulation and the resolution of sparse Linear Discriminant Analysis (LDA). Our formulation, which is based on penalized Optimal Scoring, preserves an exact equivalence with sparse LDA, contrary to the multi-class approaches based on the regression of class indicator that have been proposed so far. Additionally, the versatility of the implementation allows to impose some particular structure on the within-class covariance matrix. Computationally, the optimization algorithm considers two nested convex problems, the main one being a linear regression regularized by a quadratic penalty implementing the group-lasso penalty. As group-Lasso selects the same features in all discriminant directions, it generates extremely parsimonious models without compromising the prediction performances. Moreover, the resulting sparse discriminant directions are amenable to low-dimensional representations. Our algorithm is highly efficient for medium to large number of variables, and is thus particularly well suited to the analysis of gene expression data.

## The Greedy Miser: Learning under Test-time Budgets

– Accepted

**Abstract: **As machine learning algorithms enter applications in industrial settings, there is increased interest in controlling their cpu-time during testing. The cpu-time consists of the running time of the algorithm and the extraction time of the features. The latter can vary drastically when the feature set is diverse. In this paper, we propose an algorithm, the Greedy Miser, that incorporates the feature extraction cost during training to explicitly minimize the cpu-time during testing. The algorithm is a straightforward extension of stage-wise regression and is equally suitable for regression or multi-class classification. Compared to prior work, it is significantly more cost-effective and scales to larger data sets.

## Canonical Trends: Detecting Trend Setters in Web Data

– Accepted

**Abstract: **The web offers large amounts of information most of which is just copied or rephrased, a phenomenon that is often called trend. A central problem in the context of web data mining is to detect those web sources that publish information which will give rise to a trend first. Here we present a simple and efficient method for finding trends dominating a pool of web sources and identifying those web sources that publish the information relevant to this trend before other websites do so. We validate our approach on real data collected from the most influential technology news feeds.

## A convex relaxation for weakly supervised classifiers

– Accepted

**Abstract: **This paper introduces a general multi-class approach to weakly supervised classification. Inferring the labels and learning the parameters of the model is usually done jointly through a block-coordinate descent algorithm such as expectation-maximization (EM), which may lead to local minima. To avoid this problem, we propose a cost function based on a convex relaxation of the soft-max loss. We then propose an algorithm specifically designed to efficiently solve the corresponding semidefinite program (SDP). Empirically, our method compares favorably to standard ones on different datasets for multiple instance learning and semi-supervised learning as well as on clustering tasks.

## Demand-Driven Clustering in Relational Domains for Predicting Adverse Drug Events

– Accepted

**Abstract: **Learning from electronic medical records (EMR) is challenging due to their relational nature and the need to capture the uncertain dependence between a patient's past and future health status. Statistical relational learning (SRL), which combines relational and probabilistic representations, is a natural fit for analyzing EMRs. SRL is less adept at handling the inherent latent structure, such as connections between related medications or diseases, in EMRs. One solution is to capture the latent structure via a relational clustering of objects. We propose a novel approach that, instead of pre-clustering the objects, performs a demand-driven clustering during the learning process. We evaluate our algorithm on three real-world tasks where the goal is to use EMRs to predict whether a patient will have an adverse reaction to a medication. We find that the proposed approach is more accurate than performing no clustering, pre-clustering and using expert-constructed medical heterarchies.

## A Binary Classification Framework for Two-Stage Multiple Kernel Learning

– Accepted

**Abstract: **With the advent of kernel methods, automating the task of specifying a suitable kernel has become increasingly important. In this context, the Multiple Kernel Learning (MKL) problem of finding a combination of pre-specified base kernels that is suitable for the task at hand has received significant attention from researchers. In this paper we show that Multiple Kernel Learning can be framed as a standard binary classification problem with additional constraints that ensure the positive definiteness of the learned kernel. Framing MKL in this way has the distinct advantage that it makes it easy to leverage the extensive research in binary classification to develop better performing and more scalable MKL algorithms that are conceptually simpler, and, arguably, more accessible to practitioners. Experiments on nine data sets from different domains show that, despite their simplicity, the proposed techniques can significantly outperform current leading MKL approaches.

## Learning Invariant Representations with Local Transformations

– Accepted

**Abstract: **The difficulty of developing feature learning algorithms that are robust to the novel transformations (e.g., scale, rotation, or translation) has been a challenge in many applications (e.g., object recognition problems). In this paper, we address this important problem of transformation invariant feature learning by introducing the transformation matrices into the energy function of the restricted Boltzmann machines. Speciﬁcally, the proposed transformation-invariant restricted Boltzmann machines not only learn the diverse patterns by explicitly transforming the weight matrix, but it also achieves the invariance of the feature representation via probabilistic max pooling of hidden units over the set of transformations. Furthermore, we show that our transformation-invariant feature learning framework is not limited to this speciﬁc algorithm, but can be also extended to many unsupervised learning methods, such as an autoencoder or sparse coding. To validate, we evaluate our algorithm on several benchmark image databases such as MNIST variation, CIFAR-10, and STL-10 as well as the customized digit datasets with signiﬁcant transformations, and show very competitive classiﬁcation performance to the state-of-the-art. Besides the image data, we apply the method for phone classiﬁcation tasks on TIMIT database to show the wide applicability of our proposed algorithms to other domains, achieving state-of-the-art performance.

## Consistent Multilabel Ranking through Univariate Losses

– Accepted

**Abstract: **We consider the problem of rank loss minimization in the setting of multilabel classification, which is commonly tackled by means of convex surrogate losses defined on pairs of labels. Very recently, this approach was put into question by the negative result showing that any such loss function is necessarily inconsistent. In this paper, we show a positive result which is arguably surprising in light of the previous one: despite being simpler, common convex surrogates for binary classification, defined on single labels instead of label pairs, are consistent for rank loss minimization. Instead of directly proving convergence, we give a much stronger result by deriving regret bounds and convergence rates. The proposed losses suggest efficient and scalable algorithms, which are tested experimentally.

## On the Equivalence between Herding and Conditional Gradient Algorithms

– Accepted

**Abstract: **We show the equivalence of the herding procedure of Welling (2009) with a standard convex optimization algorithm–namely a conditional gradient algorithm minimizing a quadratic moment discrepancy. This link enables us to harness convergence results from convex optimization as well as suggest faster alternatives for the task of approximating integrals in a reproducing kernel Hilbert space. We study the behavior of the different variants with numerical simulations. The experiments indicate that while we can improve on the task of approximating integrals, the original herding algorithm tends to approach more often the maximum entropy distribution, shedding more light on the learning bias behind herding.

## Variational Bayesian Inference with Stochastic Search

– Accepted

**Abstract: **Mean-field variational inference is an approximate posterior inference method for Bayesian models. It approximates the full posterior distribution of a model's variables with a factorized set of distributions by maximizing a lower bound on the marginal likelihood. This requires the ability to integrate the log joint likelihood of the model with respect to the factorized approximation. Often not all integrals are closed-form, which is traditionally handled using lower bound approximations. We present an algorithm based on stochastic optimization that allows for direct optimization of the variational lower bound in all models. This method uses control variates for variance reduction of the stochastic search gradient, in which existing lower bounds can play an important role. We demonstrate the approach on logistic regression and the hierarchical Dirichlet process.

## Small-sample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering

– Accepted

**Abstract: **Functional neuroimaging can measure the brain’s response to an external stimulus. It is used to perform brain mapping: identifying from these observations the brain regions involved. This problem can be cast into a linear supervised learning task where the neuroimaging data are used as predictors for the stimulus. Brain mapping is then seen as a support recovery problem. On functional MRI (fMRI) data, this problem is particularly challenging as i) the number of samples is small due to limited acquisition time and ii) the variables are strongly correlated. We propose to overcome these difficulties using sparse regression models over new variables obtained by clustering of the original variables. The use of randomization techniques, e.g. bootstrap samples, and hierarchical clustering of the variables improves the recovery properties of sparse methods. We demonstrate the benefit of our approach on an extensive simulation study as well as two publicly available fMRI datasets.

## Joint Optimization and Variable Selection of High-dimensional Gaussian Processes

– Accepted

**Abstract: **Maximizing high-dimensional, non-convex functions through noisy observations is a notoriously hard problem, but one that arises in many applications. In this paper, we tackle this challenge by modeling the unknown function as a sample from a high-dimensional Gaussian process (GP) distribution. Assuming that the unknown function only depends on few relevant variables, we show that it is possible to perform joint variable selection and GP optimization. We provide strong performance guarantees for our algorithm, bounding the sample complexity of variable selection, and as well as providing cumulative regret bounds. We further provide empirical evidence on the effectiveness of our algorithm on several benchmark optimization problems.

## Large-Scale Feature Learning With Spike-and-Slab Sparse Coding

– Accepted

**Abstract: **We consider the problem of object recogni- tion with a large number of classes. In or- der to scale existing feature learning algo- rithms to this setting, we introduce a new feature learning and extraction procedure based on a factor model we call spike-and- slab sparse coding (S3C). Prior work on this model has not prioritized the ability to ex- ploit parallel architectures and scale to the enormous problem sizes needed for object recognition. We present an inference proce- dure appropriate for use with GPUs which allows us to dramatically increase both the training set size and the amount of latent factors. We demonstrate that this approach improves upon the supervised learning ca- pabilities of both sparse coding and the ss- RBM on the CIFAR-10 dataset. We use the CIFAR-100 dataset to demonstrate that our method scales to large numbers of classes bet- ter than previous methods. Finally, we use our method to win the NIPS 2011 Workshop on Challenges In Learning Hierarchical Mod- els’ Transfer Learning Challenge.

## Anytime Marginal MAP Inference

– Accepted

**Abstract: **This paper presents a new anytime algorithm for the marginal MAP problem in graphical models. The algorithm is described in detail, its complexity and convergence rate are studied, and relations to previous theoretical results for the problem are discussed. It is shown that the algorithm runs in polynomial-time if the underlying graph of the model has bounded tree-width, and that it provides guarantees to the lower and upper bounds obtained within a fixed amount of computational resources. Experiments with both real and synthetic generated models highlight its main characteristics and show that it compares favorably against Park and Darwiche's systematic search, particularly in the case of problems with many MAP variables and moderate tree-width.

## Distributed Parameter Estimation via Pseudo-likelihood

– Accepted

**Abstract: **Estimating statistical models within sensor networks requires distributed algorithms, in which both data and computation are distributed across the nodes of the network. We propose a general approach for distributed learning based on combining local estimators defined by pseudo-likelihood components, encompassing a number of combination methods, and provide both theoretical and experimental analysis. We show that simple linear combination or max-voting methods, when combined with second-order information, are statistically competitive with more advanced and costly joint optimization. Our algorithms have many attractive properties including low communication and computational cost and “any-time” behavior.

## The Nonparametric Metadata Dependent Relational Model

– Accepted

**Abstract: **We introduce the nonparametric metadata dependent relational (NMDR) model, a Bayesian nonparametric extension of previous stochastic block models. The NMDR allows the entities associated with each node to have mixed membership in an unbounded collection of latent communities. Learned regression models allow these memberships to depend on, and be predicted from, arbitrary node metadata. We develop efficient MCMC algorithms for learning NMDR models from (partially observed) node relationships. Retrospective MCMC methods allow our sampler to work directly with the infinite stick-breaking representation of the NMDR, avoiding the need for finite truncations. Our results demonstrate recovery of useful latent communities from real-world social and ecological networks, and the usefulness of metadata in link prediction tasks.

## Deep Lambertian Networks

– Accepted

**Abstract: **Visual perception is a challenging problem in part due to illumination variations. A possible solution is to first estimate an illumination invariant representation before using it for recognition. The object albedo and surface normals are examples of such representation. In this paper, we introduce a multilayer generative model where the latent variables include the albedo, surface normals, and the light source. Combining Deep Belief Nets with the Lambertian reflectance assumption, our model can learn good priors over the albedo from 2D images. Illumination variations can be explained by changing only the lighting latent variable in our model. By transferring learned knowledge from similar objects, albedo and surface normals estimation from a *single* image is possible in our model. Experiments demonstrate that our model is able to generalize as well as improve over standard baselines in *one-shot* face recognition.

## Convergence of the EM Algorithm for Gaussian Mixtures with Unbalanced Mixing Coefficients

– Accepted

**Abstract: **The speed of convergence of the Expectation Maximization (EM) algorithm for Gaussian mixture model fitting is known to be dependent on the amount of overlap among the mixture components. In this paper, we study the impact of mixing coefficients on the convergence of EM. We show that when the mixture components exhibit some overlap, the convergence of EM becomes slower as the dynamic range among the mixing coefficients increases. We propose a deterministic anti-annealing algorithm, that significantly improves the speed of convergence of EM for such mixtures with unbalanced mixing coefficients. The proposed algorithm is compared against other standard optimization techniques like LBFGS, Conjugate Gradient, and the traditional EM algorithm. Finally, we propose a similar deterministic anti-annealing based algorithm for the Dirichlet process mixture model fitting and demonstrate its advantages over the conventional variational Bayesian approach.

## A Joint Model of Language and Perception for Grounded Attribute Learning

– Accepted

**Abstract: **As robots become more ubiquitous and capable, it becomes ever more important to enable untrained users to easily interact with them. Recently, this has led to study of the language grounding problem, where the goal is to extract representations of the meanings of natural language tied to perception and actuation in the physical world. In this paper, we present an approach for joint learning of language and perception models for grounded attribute induction. Our perception model includes attribute classifiers, for example to detect object color and shape, and the language model is based on a probabilistic categorial grammar that enables the construction of rich, compositional meaning representations. The approach is evaluated on the task of interpreting sentences that describe sets of objects in a physical workspace. We demonstrate accurate task performance and effective latent-variable concept induction in physical grounded scenes.

## Learning Parameterized Skills

– Accepted

**Abstract: **We introduce a method for constructing a single skill capable of solving any tasks drawn from a distribution of parameterized reinforcement learning problems. The method draws example tasks from a distribution of interest and uses the corresponding learned policies to estimate the geometry of the lower-dimensional manifold on which the skill policies lie. This manifold models how policy parameters change as the skill parameters are varied. We then apply non-linear regression to construct a parameterized skill by predicting policy parameters from task parameters. We evaluate our method on an underactuated simulated robotic arm tasked with learning to accurately throw darts at any target location around it.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Safe Exploration in Markov Decision Processes

– Accepted

**Abstract: ** In unknown or partially unknown environments exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption. While ergodicity is a more generally applicable property, it is simplest to state for discrete state spaces, and its essence is that any state is reachable from any other state within some finite amount of time with non-zero probability by following a suitable policy. This assumption allows for exploration algorithms that operate by favoring visiting previously unvisited (or rarely visited) states. For most physical systems this assumption is completely impractical as the systems would break before any reasonable exploration has taken place, i.e., most physical systems don't satisfy the ergodicity assumption. In this paper we address the need for *safe* exploration methods in Markov decision processes. We first propose a general formulation of safety that can be thought of as forcing ergodicity. We show that imposing the safety constraint exactly is NP-hard. We present an efficient approximation algorithm for guaranteed safe, but potentially suboptimal, exploration. At the core of our approach is an optimization formulation in which the constraints restrict attention to guaranteed safe policies. The objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus, which we can replicate in the objective. Our experiments show that our method is able to safely explore state-spaces in which classical exploration methods get stuck.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Improved Estimation in Time Varying Models

– Accepted

**Abstract: **Locally adapted parametrizations of a model (such as locally weighted regression) are expressive but often suffer from high variance. We describe an approach for reducing the variance, based on the idea of estimating simultaneously a transformed space for the model, as well as locally adapted parameterizations in this new space. We present a new problem formulation that captures this idea and illustrate it in the important context of time-varying models. We develop an algorithm for learning a set of bases for approximating a time-varying sparse network; each learned basis constitutes an archetypal sparse network structure. We also provide an extension for learning task-specific bases. We present empirical results on synthetic data sets, as well as on a BCI EEG classification task.

## Poisoning Attacks against Support Vector Machines

– Accepted

**Abstract: **We investigate a family of poisoning attacks against Support Vector Machines (SVM). Such attacks amount to injecting specially crafted training data so as to increase the test error of the trained SVM. Central to the motivation for these attacks is the fact that most learning algorithms assume that their training data comes from a natural or well-behaved distribution. However, this iassumption does not generally hold in security-sensitive settings. As we demonstrate in this contribution, an intelligent adversary can to some extent predict the change of the SVM decision function in response to malicious input and use this ability to construct malicious data points. The proposed attack uses a gradient ascent strategy in which the gradient is computed based on properties of the SVM's optimal solution. The gradient ascent method can be easily kernelized and enables the attack to be constructed in the input space even for non-linear kernels. We experimentally demonstrate that our gradient ascent procedure reliably identifies good local maxima of the non-convex validation error surface and inflicts a significant damage on the test error of the trained classifier.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Regularizers versus Losses for Nonlinear Dimensionality Reduction: A Factored View with New Convex Relaxations

– Accepted

**Abstract: **We demonstrate that almost all non-parametric dimensionality reduction methods can be expressed by a simple procedure: regularized loss minimization plus singular value truncation. By distinguishing the role of the loss and regularizer in such a process, we recover a factored perspective that reveals some gaps in the current literature. Beyond identifying a useful new loss for manifold unfolding, a key contribution is to derive new convex regularizers that combine distance maximization with rank reduction. These regularizers can be applied to any loss.

## Utilizing Static Analysis and Code Generation to Accelerate Neural Networks

– Accepted

**Abstract: **As datasets continue to grow, neural network (NN) applications are becoming increasingly limited by both the amount of available computational power and the ease of developing high-performance applications. Researchers often must have expert systems knowledge to make their algorithms run efficiently. Although available computing power approximately doubles every two years, algorithm efficiency is not able to keep pace due to the use of general purpose compilers, which are not able to fully optimize specialized application domains. Within the domain of NNs, we have the added knowledge that network architecture remains constant during training, meaning the architecture's data structure can be statically optimized by a compiler. In this paper, we present SONNC, a compiler for NNs that utilizes static analysis to generate optimized parallel code. We show that SONNC's use of static optimizations make it able to outperform optimized MATLAB code by up to 242X. Additionally, we show that use of SONNC significantly reduces code complexity when using structurally sparse networks.

## Similarity Learning for Provably Accurate Sparse Linear Classification

– Accepted

**Abstract: **In recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest for optimizing distance and similarity functions. Most of the state of the art focus on learning Mahalanobis distances (requiring to fulfill a constraint of positive semi-definiteness) for use in a local k-NN algorithm. However, no theoretical link is established between the learned metrics and their performance in classification. In this paper, we make use of the formal framework of good similarities introduced by Balcan et al. to propose an algorithm for learning a non PSD linear similarity optimized in a nonlinear feature space, tailored to a use in a global linear classifier. We show that our approach has the property of stability, allowing us to derive a generalization bound on the classification error. Experiments performed on various datasets confirm the effectiveness of our approach compared to state-of-the-art methods and provide evidence that (i) it is fast, (ii) robust to overfitting and (iii) produces very sparse models.

## Variational Inference in Non-negative Factorial Hidden Markov Models for Efficient Audio Source Separation

– Accepted

**Abstract: **The last decade has seen substantial work on the use of non-negative matrix factorization (NMF) and its probabilistic counterparts for audio source separation. Although able to capture audio spectral structure well, these models neglect the non-stationarity and temporal dynamics that are important properties of audio. The recently proposed non-negative factorial hidden Markov model (N-FHMM) introduces a temporal dimension and improves source separation performance. However, the factorial nature of this model makes the complexity of inference exponential in the number of sound sources. Here, we present a Bayesian variant of the N-FHMM suited to an efficient variational inference algorithm, whose complexity is linear in the number of sound sources. Our algorithm performs comparably to exact inference in the original NFHMM but is significantly faster. In typical configurations of the N-FHMM, our method achieves around a 30x increase in speed.

## Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection

– Not for proceedings

**Abstract: **We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: “what are the implicit statistical assumptions of feature selection criteria based on mutual information?”. To answer this, we adopt a different strategy than is usual in the feature selection literature—instead of trying to deﬁne a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature ‘relevancy’ and ‘redundancy’, our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimization problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximizers of the conditional likelihood. A large empirical study provides strong evidence to favor certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples.

discussion+video
~~pdf~~

## Communications Inspired Linear Discriminant Analysis

– Accepted

**Abstract: **We study the problem of supervised linear dimensionality reduction, taking an information-theoretic viewpoint. The linear projection matrix is designed by maximizing the mutual information between the projected signal and the class label (based on a Shannon entropy measure). By harnessing a recent theoretical result on the gradient of mutual information, the above optimization problem can be solved directly using gradient descent, without requiring simplification of the objective function. Theoretical analysis and empirical comparison are made between the proposed method and two closely related methods (Linear Discriminant Analysis and Information Discriminant Analysis), and comparisons are also made with a method in which Renyi entropy is used to define the mutual information (in this case the gradient may be computed simply, under a special parameter setting). Relative to these alternative approaches, the proposed method achieves promising results on real datasets.

## Sparse stochastic inference for latent Dirichlet allocation

– Accepted

**Abstract: **We present a hybrid algorithm for Bayesian topic models that combines the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. The resulting method scales to a corpus of 1.2 million books comprising 33 billion words with thousands of topics on one CPU. This approach reduces the bias of variational inference and generalizes to many Bayesian hidden-variable models.

## Stochastic Smoothing for Nonsmooth Minimizations: Accelerating SGD by Exploiting Structure

– Accepted

**Abstract: **In this work we consider the stochastic minimization of nonsmooth convex loss functions, a central problem in machine learning. We propose a novel algorithm called Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD), which exploits the structure of common nonsmooth loss functions to achieve optimal convergence rates for a class of problems including SVMs. It is the first stochastic algorithm that can achieve the optimal O(1/t) rate for minimizing nonsmooth loss functions. The fast rates are confirmed by empirical comparisons, in which ANSGD significantly outperforms previous subgradient descent algorithms including SGD.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## A Convex Feature Learning Formulation for Latent Task Structure Discovery

– Accepted

**Abstract: **This paper considers the multi-task learning problem and in the setting where some relevant features could be shared across few related tasks. Most of the existing methods assume the extent to which the given tasks are related or share a common feature space to be known apriori. In real-world applications however, it is desirable to automatically discover the groups of related tasks that share a feature space. In this paper we aim at searching the exponentially large space of all possible groups of tasks that may share a feature space. The main contribution is a convex formulation that employs a graph-based regularizer and simultaneously discovers few groups of related tasks, having close-by task parameters, as well as the feature space shared within each group. The regularizer encodes an important structure among the groups of tasks leading to an efficient algorithm for solving it: if there is no feature space under which a group of tasks has close-by task parameters, then there does not exist such a feature space for any of its supersets. An efficient active set algorithm that exploits this simplification and performs a clever search in the exponentially large space is presented. The algorithm is guaranteed to solve the proposed formulation (within some precision) in a time polynomial in the number of groups of related tasks discovered. Empirical results on benchmark datasets show that the proposed formulation achieves good generalization and outperforms state-of-the-art multi-task learning algorithms in some cases.

## Efficient Decomposed Learning for Structured Prediction

– Accepted

**Abstract: **Structured prediction is the cornerstone of several machine learning applications. Unfortunately, in structured prediction settings with expressive inter-variable interactions, inference and hence exact learning is often intractable. We present a new way, Decomposed Learning (DecL), for performing efficient learning over structured output spaces. In DecL, we restrict the inference step to a limited part of the output space. We use characterizations based on the structure, target parameters, and gold labels to guarantee that DecL with limited inference is equivalent to exact learning. We also show that in real world settings, where our theoretical assumptions may not hold exactly, DecL-based algorithms are significantly more efficient and provide accuracies close to exact learning.

## Path Integral Policy Improvement with Covariance Matrix Adaptation

– Accepted

**Abstract: **There has been a recent focus in reinforcement learning (RL) on addressing continuous state and action problems by optimizing parameterized policies. PI2 is a recent example of this approach. It combines a derivation from first principles of stochastic optimal control with tools from statistical estimation theory. In this paper, we consider PI2 as a member of the wider family of methods which share the concept of probability-weighted averaging to iteratively update parameters to optimize a cost function. We compare PI2 to other members of the same family – Cross-Entropy Methods and CMA-ES – at the conceptual level and in terms of performance. The comparison suggests the derivation of a novel algorithm which we call PI2-CMA for “Path Integral Policy Improvement with Covariance Matrix Adaptation”. PI2-CMA's main advantage is that it determines the magnitude of the exploration noise automatically.

## Optimizing F-measure: A Tale of Two Approaches

– Accepted

**Abstract: **F-measures are popular performance metrics, particularly for tasks with imbalanced data sets. Algorithms for learning to maximize F-measures follow two approaches: the empirical utility maximization (EUM) approach learns a classifier having optimal performance on training data, while the decision-theoretic approach learns a probabilistic model and then predict labels with maximum expected F-measure. In this paper, we investigate the theoretical justifications and connections for these two approaches, and we study the conditions under which one approach is preferable to the other using synthetic and real datasets. Given accurate models, our results suggest that the two approaches are asymptotically equivalent given large training and test sets. Nevertheless, the EUM approach appears to be more robust against model misspecification, and given a good model, the decision-theoretic approach appears to be better for handling rare classes and for a common domain adaptation scenario.

## Efficient Euclidean Projections onto the Intersection of Norm Balls

– Accepted

**Abstract: **Using sparse-inducing norms to learn robust models has received increasing attention from many fields for its attractive properties. Projection-based methods have been widely applied to learning tasks constrained by such norms. As a key building block of these methods, an efficient operator for Euclidean projection onto the intersection of *ℓ_1* and *ℓ_{1,q}* norm balls *(q=2* or ∞) is proposed in this paper. We prove that the projection can be reduced to finding the root of an auxiliary function which is piecewise smooth and monotonic. Hence, a bisection algorithm is sufficient to solve the problem. We show that the time complexity of our solution is *O(n+g log g)* for *q=2* and *O(n log n)* for *q=∞*, where *n* is the dimensionality of the vector to be projected and *g* is the number of disjoint groups; we confirm this complexity by experimentation. Empirical study reveals that our method achieves significantly better performance than classical methods in terms of running time and memory usage. We further show that embedded with our efficient projection operator, projection-based algorithms can solve regression problems with composite norm constraints more efficiently than other methods and give superior accuracy.

## Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

– Accepted

**Abstract: **Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log (T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal O1/T) rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be Ω(log (T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Clustering using Max-norm Constrained Optimization

– Accepted

**Abstract: **We suggest using the max-norm as a convex surrogate constraint for clustering. We show how this yields a better exact cluster recovery guarantee than previously suggested nuclear-norm relaxation, and study the effectiveness of our method, and other related convex relaxations, compared to other approaches.

## Submodular Inference of Diffusion Networks from Multiple Trees

– Accepted

**Abstract: **Diffusion and propagation of information, influence and diseases take place over increasingly larger networks. We observe when a node copies information, makes a decision or becomes infected but networks are often hidden or unobserved. Since networks are highly dynamic, changing and growing rapidly, we only observe a relatively small set of cascades before a network changes significantly. Scalable network inference based on a small cascade set is then necessary for understanding the rapidly evolving dynamics that govern diffusion. In this article, we develop a scalable approximation algorithm with provable near-optimal performance which achieves a high accuracy in such scenario, solving an open problem first introduced by Gomez-Rodriguez et al (2010). Experiments on synthetic and real diffusion data show that our algorithm in practice achieves an optimal trade-off between accuracy and running time.

## Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds

– Accepted

**Abstract: **Approximate dynamic programming is a popular method for solving large Markov decision processes. This paper describes a new class of approximate dynamic programming (ADP) methods—distributionally robust ADP—that address the curse of dimensionality by minimizing a pessimistic bound on the policy loss. This approach turns ADP into an optimization problem, for which we derive new mathematical program formulations and analyze itsp properties. DRADP improves on the theoretical guarantees of existing ADP methods—it guarantees convergence and L_1 norm-based error bounds. The empirical evaluation of DRADP shows that the theoretical guarantees translate well into good performance on benchmark problems.

## Modelling transition dynamics in MDPs with RKHS embeddings

– Accepted

**Abstract: **We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as *embeddings* in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We are able to provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the recent NPDP method. Our approach achieves better performance in all experiments.

## Approximate Principal Direction Trees

– Accepted

**Abstract: **We introduce a new spatial data structure for high dimensional data called the *approximate principal direction tree* (APD tree) that adapts to the intrinsic dimension of the data. Our algorithm ensures vector-quantization accuracy similar to that of computationally-expensive PCA trees with similar time-complexity to that of lower-accuracy RP trees. APD trees use a small number of power-method iterations to find splitting planes for recursively partitioning the data. As such they provide a natural trade-off between the running-time and accuracy achieved by RP and PCA trees. Our theoretical results establish a) strong performance guarantees regardless of the convergence rate of the power-method and b) that *O(log d)* iterations suffice to establish the guarantee of PCA trees when the intrinsic dimension is *d*. We demonstrate this trade-off and the efficacy of our data structure on both the CPU and GPU.

## Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation

– Accepted

**Abstract: **Precision-recall (PR) curves and the areas under them are widely used to summarize machine learning results, especially for data sets exhibiting class skew. They are often used analogously to ROC curves and the area under ROC curves. It is already known that PR curves vary as class skew varies. What was not recognized before this paper is that there is a region of PR space that is completely unachievable, and the size of this region varies only with the skew. This paper precisely characterizes the size of that region and discusses its implications for empirical evaluation methodology in machine learning.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

## Randomized Smoothing for (Parallel) Stochastic Optimization

– Not for proceedings

**Abstract: **By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates for stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based convergence guarantees for non-smooth optimization. A combination of our techniques with recent work on decentralized optimization yields order-optimal parallel stochastic optimization algorithms. We give applications of our results to several statistical machine learning problems, providing experimental results demonstrating the effectiveness of our algorithms.

discussion+video
~~pdf~~

## Marginalized Denoising Autoencoders for Domain Adaptation

– Accepted

**Abstract: **Glorot et al. (2011) have successfully used Stacked Denoising Autoencoders (SDAs) (Vincent et al., 2008) to learn new representations for domain adaptation, resulting in record accuracy levels on well-known benchmark datasets for sentiment analysis. The representations are learned by reconstructing input data from partial corruption. In this paper, we introduce a variation, marginalized SDA (mSDA). In contrast to the original SDA, mSDA requires no training through back-propagation as we explicitly marginalize out the feature corruption and solve for the parameters in closed form. mSDA learns representations that lead to comparable accuracy levels, but can be implemented in only 20 lines of MATLAB and reduces the computation time on large data sets from several days to mere minutes.

## Copula-based Kernel Dependency Measures

– Accepted

**Abstract: **The paper presents a new copula based method for measuring dependence between random variables. Our approach extends the Maximum Mean Discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example in feature selection. The estimator is consistent, robust to outliers, and uses rank statistics only. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of experiments in feature selection and low-dimensional embedding of distributions using real and toy datasets.

## Group Sparse Additive Models

– Accepted

**Abstract: **We consider the problem of sparse variable selection in nonparametric additive models, with the prior knowledge of the structure among the covariates to encourage those variables within a group to be selected jointly. Previous works either study the group sparsity in the parametric setting (e.g., group lasso), or address the variable selection problem in the nonparametric setting without exploiting the structural information (e.g., sparse additive models (SpAM)). In this paper, we present a new method, called group sparse additive models (GroupSpAM), which can handle group sparsity in nonparametric additive models. We generalize the l1/l2 norm to Hilbert spaces as the sparsity-inducing penalty in GroupSpAM. Moreover, we derive a novel thresholding condition for identifying the functional sparsity at the group level, and propose an efficient block coordinate descent algorithm for constructing the estimate. We demonstrate by simulation that GroupSpAM substantially outperforms competing methods in terms of support recovery and prediction accuracy in additive models, and also conduct a comparative experiment on a real breast cancer dataset.

## Policy Gradients with Variance Related Risk Criteria

– Accepted

**Abstract: **Managing risk in dynamic decision problems is of cardinal importance in many fields such as finance and process control. The most common approach to defining risk is through various variance related criteria such as the Sharpe Ratio or the standard deviation adjusted reward. It is known that optimizing many of the variance related risk criteria is NP-hard. In this paper we devise a framework for local policy gradient style algorithms for reinforcement learning for variance related criteria. Our starting point is a new formula for the variance of the cost-to-go in episodic tasks. Using this formula we develop policy gradient algorithms for criteria that involve both the expected cost and the variance of the cost. We prove the convergence of these algorithms to local minima and demonstrate their applicability in a portfolio planning problem.

## Efficient Structured Prediction with Latent Variables for General Graphical Models

– Accepted

**Abstract: **In this paper we propose a unified framework for structured prediction with hidden variables which includes hidden conditional random fields and latent structural support vector machines as special cases. We describe an approximation for this general structured prediction formulation using duality, which is based on a local entropy approximation. We then derive an efficient message passing algorithm that is guaranteed to converge, and demonstrate its effectiveness in the tasks of image segmentation as well as 3D indoor scene understanding from single images, showing that our approach is superior to latent structural support vector machines and hidden conditional random fields.

## On the Partition Function and Random Maximum A-Posteriori Perturbations

– Accepted

**Abstract: **n this paper we relate the partition function to the max-statistics of random variables. In particular, we provide a novel framework for approximating and bounding the partition function using MAP inference on randomly perturbed models. As a result, we can directly use efficient MAP solvers such as graph-cuts to evaluate the corresponding partition function. We show that our method excels in the typical “high signal - high coupling” regime that results in ragged energy landscapes difficult for alternative approaches.

## Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis

– Accepted

**Abstract: **Tensor decomposition is a powerful computational tool for multiway data analysis. Many popular tensor decomposition approaches—such as the Tucker decomposition and CANDECOMP/PARAFAC (CP)—amount to multi-linear factorization. They are insufficient to model (i) complex interactions between data entities, (ii) various data types (eg missing data and binary data), and (iii) noisy observations and outliers. To address these issues, we propose tensor-variate latent nonparametric Bayesian models, coupled with efficient inference methods, for multiway data analysis. We name these models InfTucker. Using these InfTucker, we conduct Tucker decomposition in an infinite feature space. Unlike classical tensor decomposition models, our new approaches handle both continuous and binary data in a probabilistic framework. Unlike previous Bayesian models on matrices and tensors, our models are based on latent Gaussian or *t* processes with nonlinear covariance functions. To efficiently learn the InfTucker from data, we develop a variational inference technique on tensors. Compared with classical implementation, the new technique reduces both time and space complexities by several orders of magnitude. Our experimental results on chemometrics and social network datasets demonstrate that our new models achieved significantly higher prediction accuracy than the most state-of-art tensor decomposition approaches.

## Inferring Latent Structure From Mixed Real and Categorical Relational Data

– Accepted

**Abstract: **We consider analysis of relational data (a matrix), in which the rows correspond to subjects (e.g., people) and the columns correspond to attributes. The elements of the matrix may be a mix of real and categorical. Each subject and attribute is characterized by a latent binary feature vector, and an inferred matrix maps each row-column pair of binary feature vectors to an observed matrix element. The latent binary features of the rows are modeled via a multivariate Gaussian distribution with low-rank covariance matrix, and the Gaussian random variables are mapped to latent binary features via a probit link. The same type construction is applied jointly to the columns. The model infers latent, low-dimensional binary features associated with each row and each column, as well correlation structure between all rows and between all columns. The Bayesian construction is successfully applied to real-world data, demonstrating an ability to infer meaningful low-dimensional structure from high-dimensional relational data.

## Bayesian Conditional Cointegration

– Accepted

**Abstract: **Cointegration is an important topic for time-series, and describes a relationship between two series in which a linear combination is stationary. Classically, the test for cointegration is based on a two stage process in which first the linear relation between the series is estimated by Ordinary Least Squares. Subsequently a unit root test is performed on the residuals. A well-known deficiency of this classical approach is that is can lead to erroneous conclusions about the presence of cointegration. As an alternative, we present a coherent framework for estimating whether cointegration exists using Bayesian inference which is empirically superior to the classical approach. Finally, we apply our technique to model segmented cointegration in which cointegration may exist only for limited time. In contrast to previous approaches our model makes no restriction on the number of possible cointegration segments.

## Online Alternating Direction Method

– Accepted

**Abstract: **Online optimization has emerged as powerful tool in large scale optimization. In this paper, we introduce efficient online algorithms based on the alternating directions method (ADM). We introduce a new proof technique for ADM in the batch setting, which yields a linear rate of convergence of ADM and forms the basis of regret analysis in the online setting. We consider two scenarios in the online setting, based on whether the solution needs to lie in the feasible set or not. In both settings, we establish regret bounds for both the objective function as well as constraint violation for general and strongly convex functions. Preliminary results are presented to illustrate the performance of the proposed algorithms.

## On the Sample Complexity of Reinforcement Learning with a Generative Model

– Accepted

**Abstract: **We consider the problem of learning the optimal action-value function in the discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample-complexity of model-based value iteration algorithm in the presence of the generative model, which indicates that for an MDP with N state-action pairs and the discount factor γ∈[0,1) only O(N log (N/δ)/((1-γ)^3ε^2)) samples are required to find an ε-optimal estimation of the action-value function with the probability 1-δ. We also prove a matching lower bound of Θ (N log (N/δ)/((1-γ)^3ε^2)) on the sample complexity of estimating the optimal action-value function by every RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-)value function in which the upper bound matches the lower bound of RL in terms of N, ε, *δ* and 1-γ. Also, both our lower bound and our upper bound significantly improve on the state-of-the-art in terms of 1/(1-γ).

## Convergence Rates for Differentially Private Statistical Estimation

– Accepted

**Abstract: **Differential privacy is a cryptographically-motivated privacy definition which has gained significant attention over the past few years. Differentially private solutions enforce privacy by adding random noise to the data or a function computed on the data, and the challenge in designing such algorithms is to optimize the privacy-accuracy-sample size tradeoff. This work studies differentially-private statistical estimation, and shows upper and lower bounds on the convergence rates of differentially private approximations to statistical estimators. Our results reveal a connection between differential privacy and the notion of B-robustness in robust statistics, by showing that unless an estimator is B-robust, we cannot approximate it well with differential privacy over a large class of distributions. We then provide an upper bound on the convergence rate of a differentially private approximation to a B-robust estimator with a bounded range. We show that the bounded range condition is necessary if we wish to ensure a strict form of differential privacy.

## Learning Task Grouping and Overlap in Multi-task Learning

– Accepted

**Abstract: **In the paradigm of multi-task learning, mul- tiple related prediction tasks are learned jointly, sharing information across the tasks. We propose a framework for multi-task learn- ing that enables one to selectively share the information across the tasks. We assume that each task parameter vector is a linear combi- nation of a finite number of underlying basis tasks. The coefficients of the linear combina- tion are sparse in nature and the overlap in the sparsity patterns of two tasks controls the amount of sharing across these. Our model is based on on the assumption that task pa- rameters within a group lie in a low dimen- sional subspace but allows the tasks in differ- ent groups to overlap with each other in one or more bases. Experimental results on four datasets show that our approach outperforms competing methods.

## High Dimensional Semiparametric Gaussian Copula Graphical Models

– Accepted

**Abstract: **In this paper, we propose a semiparametric approach named nonparanormal SKEPTIC for efficiently and robustly estimating high dimensional undirected graphical models. To achieve modeling flexibility, we consider Gaussian Copula graphical models (or the nonparanormal) as proposed by Liu et al. (2009). To achieve estimation robustness, we exploit nonparametric rank-based correlation coefficient estimators, including Spearman's rho and Kendall's tau. In high dimensional settings, we prove that the nonparanormal SKEPTIC achieves the optimal parametric rate of convergence in both graph and parameter estimation. This celebrating result suggests that the Gaussian copula graphical models can be used as a safe replacement of the popular Gaussian graphical models, even when the data are truly Gaussian. Besides theoretical analysis, we also conduct thorough numerical simulations to compare different estimators for their graph recovery performance under both ideal and noisy settings. The proposed methods are then applied on a large-scale genomic dataset to illustrate their empirical usefulness.

## Discovering Support and Affiliated Features from Very High Dimensions

– Accepted

**Abstract: **In this paper, a novel learning paradigm is presented to automatically identify groups of informative and correlated features from very high dimensions. Specifically, we explicitly incorporate correlation measures as constraints and then propose an efficient embedded feature selection method using recently developed cutting plane strategy. The benefits of the proposed algorithm are two-folds. First, it can identify the optimal discriminative and uncorrelated feature subset to the output labels, denoted here as Support Features, which brings about significant improvements in prediction performance over other state of the art feature selection methods considered in the paper. Second, during the learning process, the underlying group structures of correlated features associated with each support feature, denoted as Affiliated Features, can also be discovered without any additional cost. These affiliated features serve to improve the interpretations on the learning tasks. Extensive empirical studies on both synthetic and very high dimensional real-world datasets verify the validity and efficiency of the proposed method.

## Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

– Accepted

**Abstract: **Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely-accepted formal definition of an online algorithm's ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm's actions. We define the alternative notion of pol- icy regret, which attempts to provide a more meaningful way to measure an online algorithm's performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee a sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary's memory is unbounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.

## Statistical linear estimation with penalized estimators: an application to reinforcement learning

– Accepted

**Abstract: **Motivated by value function estimation in reinforcement learning, we study statistical linear inverse problems, i.e., problems where the coefficients of a linear system to be solved are observed in noise. We consider penalized estimators, where performance is evaluated using a matrix-weighted two-norm of the defect of the estimator measured with respect to the true, unknown coefficients. Two objective functions are considered depending whether the error of the defect measured with respect to the noisy coefficients is squared or unsquared. We propose simple, yet novel and theoretically well-founded data-dependent choices for the regularization parameters for both cases that avoid data-splitting. A distinguishing feature of our analysis is that we derive deterministic error bounds in terms of the error of the coefficients, thus allowing the complete separation of the analysis of the stochastic properties of these errors. We show that our results lead to new insights and bounds for linear value function estimation in reinforcement learning.

## Large Scale Variational Bayesian Inference for Structured Scale Mixture Models

– Accepted

**Abstract: **Natural image statistics exhibit hierarchical dependencies across multiple scales. Representing such prior knowledge in non-factorial latent tree models can boost performance of image denoising, inpainting, deconvolution or reconstruction substantially, beyond standard factorial “sparse” methodology. We derive a large scale approximate Bayesian inference algorithm for linear models with non-factorial (latent tree-structured) scale mixture priors. Experimental results on a range of denoising and inpainting problems demonstrate substantially improved performance compared to MAP estimation or to inference with factorial priors.

## Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring

– Accepted

**Abstract: **In this paper we address the following question: “Can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small mini-batch of data-items for every sample we generate?”. An algorithm based on the Langevin equation with stochastic gradients (SGLD) was previously proposed to solve this, but its mixing rate was slow. By leveraging the Bayesian Central Limit Theorem, we extend the SGLD algorithm so that at high mixing rates it will sample from a normal approximation of the posterior, while for slow mixing rates it will mimic the behavior of SGLD with a pre-conditioner matrix. As a bonus, the proposed algorithm is reminiscent of Fisher scoring (with stochastic gradients) and as such an efficient optimizer during burn-in.

## An adaptive algorithm for finite stochastic partial monitoring

– Accepted

**Abstract: **We present a new anytime algorithm that achieves near-optimal regret for any instance of finite stochastic partial monitoring. In particular, the new algorithm achieves the minimax regret, within logarithmic factors, for both “easy” and “hard” problems. For easy problems, it additionally achieves logarithmic individual regret. Most importantly, the algorithm is adaptive in the sense that if the opponent strategy is in an “easy region” of the strategy space then the regret grows as if the problem was easy. As an implication, we show that under some reasonable additional assumptions, the algorithm enjoys an *O(√T)* regret in Dynamic Pricing, proven to be hard by Bartok et al. (2011).

## The Big Data Bootstrap

– Accepted

**Abstract: **The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. As an alternative, we present BLB, a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing the quality of estimators. BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability, statistical efficiency, and favorable theoretical properties of the bootstrap. We provide a theoretical analysis elucidating the properties of BLB, as well as an extensive empirical investigation, including a study of its statistical correctness, its large-scale implementation and performance, selection of hyperparameters, and performance on real data.

## Predicting Consumer Behavior in Commerce Search

– Accepted

**Abstract: **Traditional approaches to ranking in web search follow the paradigm of rank-by-score: a learned function gives each query-URL combination an absolute score and URLs are ranked according to this score. This paradigm ensures that if the score of one URL is better than another then one will always be ranked higher than the other. Scoring contradicts prior work in behavioral economics that showed that users' preferences between two items depend not only on the items but also on the presented alternatives. Thus, for the same query, users' preference between items A and B depends on the presence/absence of item C. We propose a new model of ranking, the Random Shopper Model, that allows and explains such behavior. In this model, each feature is viewed as a Markov chain over the items to be ranked, and the goal is to find a weighting of the features that best reflects their importance. We show that our model can be learned under the empirical risk minimization framework, and give an efficient learning algorithm. Experiments on commerce search logs demonstrate that our algorithm outperforms scoring-based approaches including regression and listwise ranking.

## Conditional mean embeddings as regressors

– Accepted

**Abstract: ** We demonstrate an equivalence between reproducing kernel Hilbert space (RKHS) embeddings of conditional distributions and vector-valued regressors. This connection introduces a natural regularized loss function which the RKHS embeddings minimise, providing an intuitive understanding of the embeddings and a solid justification for their use. Furthermore, the equivalence allows the application of vector-valued regression methods and results to the problem of learning conditional distributions. Using this link we derive a sparse version of the embedding by considering alternative formulations. Further, by applying convergence results for vector-valued regression to the embedding problem we derive minimax convergence rates which are approximately *O(log (n)/n)* – compared to current state of the art rates of *O(n^{-1/4})* – and are valid under milder and more intuitive assumptions. These minimax lower rates coincide with upper rates up to a logarithmic factor and show that the embedding method achieves nearly optimal rates. We study our sparse embedding algorithm in a reinforcement learning task where the algorithm shows significant improvement in sparsity over a Cholesky decomposition.

## A Generative Process for Contractive Auto-Encoders

– Accepted

**Abstract: **The contractive auto-encoder learns a representation of the input data that captures the local manifold structure around each data point, through the leading singular vectors of the Jacobian of the transformation from input to representation. The corresponding singular values specify how much local variation is plausible in directions associated with the corresponding singular vectors, while remaining in a high-density region of the input space. This paper proposes a procedure for generating samples that are consistent with the local structure captured by a contractive auto-encoder. The associated stochastic process defines a distribution from which one can sample, and which experimentally appears to converge quickly and mix well between modes, compared to Restricted Boltzmann Machines and Deep Belief Networks. The intuitions behind this procedure can also be used to train the second layer of contraction that pools lower-level features and learns to be invariant to the local directions of variation discovered in the first layer. We show that this can help learn and represent invariances present in the data and improve classification error.

## Local Loss Optimization in Operator Models: A New Insight into Spectral Learning

– Accepted

**Abstract: **This paper re-visits the spectral method for learning latent variable models defined in terms of observable operators. We give a new perspective on the method, showing that operators can be recovered by minimizing a loss defined on a finite subspace of the domain. A non-convex optimization similar to the spectral method is derived. We also propose a regularized convex relaxation of this optimization. We show that in practice the availabilty of a continuous regularization parameter (in contrast with the discrete number of states in the original method) allows a better trade-off between accuracy and model complexity. We also prove that in general, a randomized strategy for choosing the local loss will succeed with high probability. We also prove that in general, a randomized strategy for choosing the local loss will succeed with high probability.

## Robust PCA in High-dimension: A Deterministic Approach

– Accepted

**Abstract: **We consider principal component analysis for contaminated data-set in the high dimensional regime, where the number of observations is comparable or more than the dimensionality of each observation. We propose a *deterministic* high-dimensional robust PCA algorithm which inherits all theoretical properties of its *randomized* counterpart, i.e., it is tractable, robust to contaminated points, easily kernelizable, asymptotic consistent and achieves maximal robustness - a breakdown point of 50%. More importantly, the proposed method exhibits significantly better computational efficiency, which makes it suitable for large-scale real applications.

## Complexity Analysis of the Lasso Regularization Path

– Accepted

**Abstract: **The regularization path of the Lasso can be shown to be piecewise linear, making it possible to “follow” and explicitly compute the entire path. We analyze in this paper this popular strategy, and prove that its worst case complexity is exponential in the number of variables. We then oppose this pessimistic result to an (optimistic) approximate analysis: We show that an approximate path with at most *O(1/√ϵ)* linear segments can always be obtained, where every point on the path is guaranteed to be optimal up to a relative *ϵ*-duality gap. We complete our theoretical analysis with a practical algorithm to compute these approximate paths.

## Projection-free Online Learning

– Accepted

**Abstract: **We present efficient online learning algorithms that eschew projections in favor of linear optimizations using the Frank-Wolfe technique. We obtain regret bounds that vary from the optimal *Õ(√T)* for stochastic online smooth convex optimization to *Õ(T^{3/4})* for general online convex optimization. Besides the computational advantage, other desirable features of our algorithms are that they are parameter-free in the stochastic case and produce sparse decisions.

## Machine Learning that Matters

– Accepted

**Abstract: **Much of current machine learning (ML) research has lost its connection to problems of import to the larger world of science and society. From this perspective, there exist glaring limitations in the data sets we investigate, the metrics we employ for evaluation, and the degree to which results are communicated back to their originating domains. What changes are needed to how we conduct research to increase the impact that ML has? We present six Impact Challenges to explicitly focus the field’s energy and attention, and we discuss existing obstacles that must be addressed. We aim to inspire ongoing discussion and focus on ML that matters.

## Scene parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers

– Accepted

**Abstract: **Scene parsing consists in labeling each pixel in an image with the category of the object it belongs to. We propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features. In parallel to feature extraction, a tree of segments is computed from a graph of pixel dissimilarities. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average 'purity' of the class distributions, hence maximizing the overall likelihood that each segment will contain a single object. The system yields record accuracies on the the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170 classes) and near-record accuracy on Stanford Background Dataset (8 classes), while being an order of magnitude faster than competing approaches, producing a 320x240 image labeling in less than 1 second, including feature extraction.

## Linear Regression with Limited Observation

– Accepted

**Abstract: **We consider the most common variants of linear regression, including Ridge, Lasso and Support-vector regression, in a setting where the learner is allowed to observe only a fixed number of attributes of each example at training time. We present simple and efficient algorithms for these problems: for Lasso and Ridge regression they need the same number of attributes (up to constants) as do full-information algorithms, for reaching a certain accuracy. For Support-vector regression, we require exponentially less attributes compared to the state of the art. By that, we resolve an open problem recently posed by (Cesa-Bianchi et al., 2010). Experiments show the theoretical bounds to be justified by superior performance compared to the state of the art.

## An Online Boosting Algorithm with Theoretical Justifications

– Accepted

**Abstract: **We study the task of online boosting, which aims to combine online weak learners into an online strong learner. While boosting in the batch setting has a sound theoretical foundation, not much theory is known for the online setting. In this paper, we propose an online boosting algorithm and we provide a theoretical guarantee. In addition, we also perform experiments on real-world datasets and the results show that our algorithm compares favorably with existing algorithms.

## Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription

– Accepted

**Abstract: **We investigate the problem of modeling symbolic sequences of polyphonic music in a completely general piano-roll representation. We introduce a probabilistic model based on distribution estimators conditioned on a recurrent neural network that is able to discover temporal dependencies in high-dimensional sequences. Our approach outperforms many traditional models of polyphonic music on a variety of realistic datasets. We show how our musical language model can serve as a symbolic prior to improve the accuracy of polyphonic transcription.

## Approximate Modified Policy Iteration

– Accepted

**Abstract: **Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three approximate MPI (AMPI) algorithms that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide an error propagation analysis for AMPI that unifies those for approximate policy and value iteration. We also provide a finite-sample analysis for the classification-based implementation of AMPI (CBMPI), which is more general (and somehow contains) than the analysis of the other presented AMPI algorithms. An interesting observation is that the MPI's parameter allows us to control the balance of errors (in value function approximation and in estimating the greedy policy) in the final performance of the CBMPI algorithm.

## Nonparametric Link Prediction in Dynamic Networks

– Accepted

**Abstract: **We propose a non-parametric link prediction algorithm for a sequence of graph snapshots over time. The model predicts links based on the features of its endpoints, as well as those of the *local neighborhood* around the endpoints. This allows for different *types* of neighborhoods in a graph, each with its own dynamics (e.g, growing or shrinking communities). We prove the consistency of our estimator, and give a fast implementation based on locality-sensitive hashing. Experiments with simulated as well as five real-world dynamic graphs show that we outperform the state of the art, especially when sharp fluctuations or non-linearities are present.

## Agnostic System Identification for Model-Based Reinforcement Learning

– Accepted

**Abstract: **A fundamental problem in control is to learn a model of a system from observations that is useful for controller synthesis. To provide good performance guarantees, existing methods must assume that the real system is in the class of models considered during learning. We present an iterative method with strong guarantees even in the agnostic case where the system is not in the class. In particular, we show that any no-regret online learning algorithm can be used to obtain a near-optimal policy, provided some model achieves low training error and access to a good exploration distribution. Our approach applies to both discrete and continuous domains. We demonstrate its efficacy and scalability on a challenging helicopter domain from the literature.

discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)