Hashing with Graphs
Wei Liu, Jun Wang, Sanjiv Kumar, Shih-Fu Chang
Abstract:Hashing is becoming increasingly popular for efficient nearest neighbor search in massive databases. However, learning short codes that yield good search performance is still a challenge. Moreover, in many cases real-world data lives on a low-dimensional manifold, which should be taken into account to capture meaningful nearest neighbors. In this paper, we propose a novel graph-based hashing method which automatically discovers the neighborhood structure inherent in the data to learn appropriate compact codes. To make such an approach computationally feasible, we utilize Anchor Graphs to obtain tractable low-rank adjacency matrices. Our formulation allows constant time hashing of a new data point by extrapolating graph Laplacian eigenvectors to eigenfunctions. Finally, we describe a hierarchical threshold learning procedure in which each eigenfunction yields multiple bits, leading to higher search accuracy. Experimental comparison with the other state-of-the-art methods on two large datasets demonstrates the efficacy of the proposed method.
[download] [bibTeX] [discuss]
Efficient Sparse Modeling with Automatic Feature Grouping
Wenliang Zhong, James Kwok
Abstract:The grouping of features is highly beneficial in learning with high-dimensional data. It reduces the variance in the estimation and improves the stability of feature selection, leading to improved generalization. Moreover, it can also help in data understanding and interpretation. OSCAR is a recent sparse modeling tool that achieves this by using a $\ell_1$-regularizer and a pairwise $\ell_\infty$-regularizer. However, its optimization is computationally expensive. In this paper, we propose an efficient solver based on the accelerated gradient methods. We show that its key projection step can be solved by a simple iterative group merging algorithm. It is highly efficient and reduces the empirical time complexity from $O(d^3 \sim d^5)$ for the existing solvers to just $O(d)$, where $d$ is the number of features. Experimental results on toy and real-world data sets demonstrate that OSCAR is a competitive sparse modeling approach with the added ability of automatic feature grouping.
[download] [bibTeX] [discuss]
Multi-Label Classification on Tree- and DAG-Structured Hierarchies
Wei Bi, James Kwok
Abstract:Many real-world applications involve multi-label classification, in which the labels are organized in the form of a tree or directed acyclic graph (DAG). However, current research efforts typically ignore the label dependencies or can only exploit the dependencies in tree-structured hierarchies. In this paper, we present a novel hierarchical multi-label classification algorithm which can be used on both tree- and DAG-structured hierarchies. The key idea is to formulate the search for the optimal consistent multi-label as the finding of the best subgraph in a tree/DAG. Using a simple greedy strategy, the proposed algorithm is computationally efficient, easy to implement, does not suffer from the problem of insufficient/skewed training data in classifier training, and can be readily used on large hierarchies. Theoretical results guarantee the optimality of the obtained solution. Experiments are performed on a large number of functional genomics data sets. The proposed method consistently outperforms the state-of-the-art method on both tree- and DAG-structured hierarchies.
[download] [bibTeX] [discuss]
A Graph-based Framework for Multi-Task Multi-View Learning
Jingrui He, Rick Lawrence
Abstract:Many real-world problems exhibit dual-heterogeneity. A single learning task might have features in multiple views (i.e., feature heterogeneity); multiple learning tasks might be related with each other through one or more shared views (i.e., task heterogeneity). Existing multi-task learning or multi-view learning algorithms only capture one type of heterogeneity. In this paper, we introduce Multi-Task Multi-View (M^2TV) learning for such complicated learning problems with both feature heterogeneity and task heterogeneity. We propose a graph-based framework (GraM^2) to take full advantage of the dual-heterogeneous nature. Our framework has a natural connection to Reproducing Kernel Hilbert Space (RKHS). Furthermore, we propose an iterative algorithm (IteM^2) for GraM^2 framework, and analyze its optimality, convergence and time complexity. Experimental results on various real data sets demonstrate its effectiveness.
[download] [bibTeX] [discuss]
GoDec: Randomized Low-rank & Sparse Matrix Decomposition in Noisy Case
Tianyi Zhou, Dacheng Tao, University of Technology
Abstract:Low-rank and sparse structures have been profoundly studied in matrix completion and compressed sensing. In this paper, we develop ``Go Decomposition'' (GoDec) to efficiently and robustly estimate the low-rank part $L$ and the sparse part $S$ of a matrix $X=L+S+G$ with noise $G$. GoDec alternatively assigns the low-rank approximation of $X-S$ to $L$ and the sparse approximation of $X-L$ to $S$. The algorithm can be significantly accelerated by bilateral random projections (BRP). We also propose GoDec for matrix completion as an important variant. We prove that the objective value $\|X-L-S\|_F^2$ converges to a local minimum, while $L$ and $S$ linearly converge to local optimums. Theoretically, we analyze the influence of $L$, $S$ and $G$ to the asymptotic/convergence speeds in order to discover the robustness of GoDec. Empirical studies suggest the efficiency, robustness and effectiveness of GoDec comparing with representative matrix decomposition and completion tools, e.g., Robust PCA and OptSpace.
[download] [bibTeX] [discuss]
Unimodal Bandits
Jia Yuan Yu, Shie Mannor
Abstract:We consider multiarmed bandit problems where the expected reward is unimodal over partially ordered arms. In particular, the arms may belong to a continuous interval or correspond to vertices in a graph, where the graph structure represents similarity in rewards. The unimodality assumption has an important advantage: we can determine if a given arm is optimal by sampling the possible directions around it. This property allows us to quickly and efficiently find the optimal arm and detect abrupt changes in the reward distributions. For the case of bandits on graphs, we incur a regret proportional to the maximal degree and the diameter of the graph, instead of the total number of vertices.
[download] [bibTeX] [discuss]
Learning Output Kernels with Block Coordinate Descent
Francesco Dinuzzo, Cheng Soon Ong, Peter Gehler, Gianluigi Pillonetto
Abstract:We propose a method to learn simultaneously a vector-valued function and a kernel between its components. The obtained kernel can be used both to improve learning performances and to reveal structures in the output space which may be important in their own right. Our method is based on the solution of a suitable regularization problem over a reproducing kernel Hilbert space (RKHS) of vector-valued functions. Although the regularized risk functional is non-convex, we show that it is invex, implying that all local minimizers are global minimizers. We derive a block-wise coordinate descent method that efficiently exploits the structure of the objective functional. Then, we empirically demonstrate that the proposed method can improve classification accuracy. Finally, we provide a visual interpretation of the learned kernel matrix for some well known datasets.
[download] [bibTeX] [discuss]
Vector-valued Manifold Regularization
Ha Quang Minh, Vikas Sindhwani
Abstract:We consider the general problem of learning an unknown functional dependency, f : X->Y, between a structured input space X and a structured output space Y, from labeled and unlabeled examples. We formulate this problem in terms of data-dependent regularization in Vector-valued Reproducing Kernel Hilbert Spaces (Micchelli & Pontil, 2005) which elegantly extend familiar scalar-valued kernel methods to the general setting where Y has a Hilbert space structure. Our methods provide a natural extension of Manifold Regularization (Belkin et al., 2006) algorithms to also exploit output inter-dependencies while enforcing smoothness with respect to input data geometry. We propose a class of matrix-valued kernels which allow efficient implementations of our algorithms via the use of numerical solvers for Sylvester matrix equations. On multilabel image annotation and text classification problems, we find favorable empirical comparisons against several competing alternatives.
[download] [bibTeX] [discuss]
On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution
Masashi Sugiyama, Makoto Yamada, Manabu Kimura, Hirotaka Hachiya
Abstract:Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve non-convex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.
[download] [bibTeX] [discuss]
On tracking portfolios with certainty equivalents on a generalization of Markowitz model: the Fool, the Wise and the Adaptive
Richard Nock, Brice Magdalou, Eric Briys, Frank Nielsen
Abstract:Portfolio allocation theory has been heavily influenced by a major contribution of Harry Markowitz in the early fifties: the mean-variance approach. While there has been a continuous line of works in on-line learning portfolios over the past decades, very few works have really tried to cope with Markowitz model. A major drawback of the mean-variance approach is that it is approximation-free only when stock returns obey a Gaussian distribution, an assumption known not to hold in real data. In this paper, we first alleviate this assumption, and rigorously lift the mean-variance model to a more general mean-divergence model in which stock returns are allowed to obey any exponential family of distributions. We then devise a general on-line learning algorithm in this setting. We prove for this algorithm the first lower bounds on the most relevant quantity to be optimized in the framework of Markowitz model: the certainty equivalents. Experiments on four real-world stock markets display its ability to track portfolios whose cumulated returns exceed those of the best stock by orders of magnitude.
[download] [bibTeX] [discuss]
Multiple Instance Learning with Manifold Bags
Boris Babenko, Nakul Verma, Piotr Dollar, Serge Belongie
Abstract:In many machine learning applications, labeling every instance of data is burdensome. Multiple Instance Learning (MIL), in which training data is provided in the form of labeled bags rather than labeled instances, is one approach for a more relaxed form of supervised learning. Though much progress has been made in analyzing MIL problems, existing work considers bags that have a finite number of instances. In this paper we argue that in many applications of MIL (e.g. image, audio, e.t.c.) the bags are better modeled as low dimensional manifolds in high dimensional feature space. We show that the geometric structure of such manifold bags affects PAC-learnability. We discuss how a learning algorithm that is designed for finite sized bags can be adapted to learn from manifold bags. Furthermore, we propose a simple heuristic that reduces the memory requirements of such algorithms. Our experiments on real-world data validate our analysis and show that our approach works well.
[download] [bibTeX] [discuss]
Eigenvalue Sensitive Feature Selection
Yi Jiang, Jiangtao Ren
Abstract:In recent years, some spectral feature selection methods are proposed to choose those features with high power of preserving sample similarity. However, when there exist lots of irrelevant or noisy features in data, the similarity matrix constructed from all the un-weighted features may be not reliable, which then misleads existing spectral feature selection methods to select 'wrong' features. To solve this problem, we propose that feature importance should be evaluated according to their impacts on similarity matrix, which means features with high impacts on similarity matrix should be chosen as important ones. Since graph Laplacian\cite{luxbury2007} is defined on the similarity matrix, then the impact of each feature on similarity matrix can be reflected on the change of graph Laplacian, especially on its eigen-system. Based on this point of view, we propose an Eigenvalue Sensitive Criteria (EVSC) for feature selection, which aims at seeking those features with high impact on graph Laplacian's eigenvalues. Empirical analysis demonstrates our proposed method outperforms some traditional spectral feature selection methods.
[download] [bibTeX] [discuss]
Large Scale Text Classification using Semi-supervised Multinomial Naive Bayes
Jiang Su, Jelber Sayyad Shirab, Stan Matwin
Abstract:Numerous semi-supervised learning methods have been proposed to augment Multinomial Naive Bayes (MNB) using unlabeled documents, but their use in practice is often limited due to implementation difficulty, inconsistent prediction performance, or high computational cost. In this paper, we propose a new, very simple semi-supervised extension of MNB, called Semi-supervised Frequency Estimate (SFE). Our experiments show that it consistently improves MNB with additional data (labeled or unlabeled) in terms of AUC and accuracy, which is not the case when combining MNB with Expectation Maximization (EM). We attribute this to the fact that SFE consistently produces better conditional log likelihood values than both EM+MNB and MNB in labeled training data.
[download] [bibTeX] [discuss]
Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines
KyungHyun Cho, Tapani Raiko, Alexander Ilin
Abstract:Boltzmann machines are often used as building blocks in greedy learning of deep networks. However, training even a simplified model, known as restricted Boltzmann machine (RBM), can be extremely laborious: Traditional learning algorithms often converge only with the right choice of the learning rate scheduling and the scale of the initial weights. They are also sensitive to specific data representation: An equivalent RBM can be obtained by flipping some bits and changing the weights and biases accordingly, but traditional learning rules are not invariant to such transformations. Without careful tuning of these training settings, traditional algorithms can easily get stuck at plateaus or even diverge. In this work, we present an enhanced gradient which is derived such that it is invariant to bit-flipping transformations. We also propose a way to automatically adjust the learning rate by maximizing a local likelihood estimate. Our experiments confirm that the proposed improvements yield more stable training of RBMs.
[download] [bibTeX] [discuss]
Dynamic Tree Block Coordinate Ascent
Daniel Tarlow, Dhruv Batra, Pushmeet Kohli, Vladimir Kolmogorov
Abstract:This paper proposes a novel Linear Programming (LP) based algorithm, called Dynamic Tree-Block Coordinate Ascent (DTBCA), for performing maximum a posteriori (MAP) inference in probabilistic graphical models. Unlike traditional message passing algorithms, which operate uniformly on the whole factor graph, our method dynamically chooses regions of the factor graph on which to focus message-passing efforts. We propose two criteria for selecting regions, including an efficiently computable upper-bound on the increase in the objective possible by passing messages in any particular region. This bound is derived from the theory of primal-dual methods from combinatorial optimization, and the forest that maximizes the bounds can be chosen efficiently using a maximum-spanning-tree-like algorithm. Experimental results show that our dynamic schedules significantly speed up state-of-the- art LP-based message-passing algorithms on a wide variety of real-world problems.
[download] [bibTeX] [discuss]
Implementing regularization implicitly via approximate eigenvector computation
Michael Mahoney, Lorenzo Orecchia
Abstract:Regularization is a powerful technique for extracting useful information from noisy data. Typically, it is implemented by adding some sort of norm constraint to an objective function and then exactly optimizing the modified objective function. This procedure often leads to optimization problems that are computationally more expensive than the original problem, a fact that is clearly problematic if one is interested in large-scale applications. On the other hand, a large body of empirical work has demonstrated that heuristics, and in some cases approximation algorithms, developed to speed up computations sometimes have the side-effect of performing regularization implicitly. Thus, we consider the question: What is the regularized optimization objective that an approximation algorithm is exactly optimizing? We address this question in the context of computing approximations to the smallest nontrivial eigenvector of a graph Laplacian; and we consider three random-walk-based procedures: one based on the heat kernel of the graph, one based on computing the the PageRank vector associated with the graph, and one based on a truncated lazy random walk. In each case, we provide a precise characterization of the manner in which the approximation method can be viewed as implicitly computing the exact solution to a regularized problem. Interestingly, the regularization is not on the usual vector form of the optimization problem, but instead it is on a related semidefinite program.
[download] [bibTeX] [discuss]
Learning Mallows Models with Pairwise Preferences
Tyler Lu, Craig Boutilier
Abstract:Learning preference distributions is a key problem in many areas (e.g., recommender systems, IR, social choice). However, existing methods require restrictive data models for evidence about user preferences. We relax these restrictions by considering as data arbitrary pairwise comparisons---the fundamental building blocks of ordinal rankings. We develop the first algorithms for learning Mallows models (and mixtures) with pairwise comparisons. At the heart is a new algorithm, the generalized repeated insertion model (GRIM), for sampling from arbitrary ranking distributions. We develop approximate samplers that are exact for many important special cases---and have provable bounds with pairwise evidence---and derive algorithms for evaluating log-likelihood, learning Mallows mixtures, and non-parametric estimation. Experiments on large, real-world datasets show the effectiveness of our approach.
[download] [bibTeX] [discuss]
Surrogate losses and regret bounds for cost-sensitive classification with example-dependent costs
Clayton Scott
Abstract:We study surrogate losses in the context of cost-sensitive classification with example-dependent costs, a problem also known as regression level set estimation. We give sufficient conditions on the surrogate loss for the existence of a surrogate regret bound. Such bounds imply that as the surrogate risk tends to its optimal value, so too does the expected misclassification cost. Our sufficient conditions encompass example-dependent versions of the hinge, exponential, and other common losses. These results provide theoretical justification for some previously proposed surrogate-based algorithms, and suggests others that have not yet been developed.
[download] [bibTeX] [discuss]
Efficient Rule Ensemble Learning using Hierarchical Kernels
Pratik Jawanpuria, Saketha Nath Jagarlapudi, Ganesh Ramakrishnan
Abstract:This paper addresses the problem of Rule Ensemble Learning (REL), where the goal is simultaneous discovery of a small set of simple rules and their optimal weights that lead to good generalization. Rules are assumed to be conjunctions of basic propositions concerning the values taken by the input features. From the perspectives of interpretability as well as generalization, it is highly desirable to construct rule ensembles with low training error, having rules that are i) simple, {\em i.e.}, involve few conjunctions and ii) few in number. We propose to explore the (exponentially) large feature space of all possible conjunctions optimally and efficiently by employing the recently introduced Hierarchical Kernel Learning (HKL) framework. The regularizer employed in the HKL formulation can be interpreted as a potential for discouraging selection of rules involving large number of conjunctions -- justifying its suitability for constructing rule ensembles. Simulation results show that the proposed approach improves over state-of-the-art REL algorithms in terms of generalization and indeed learns simple rules. Although this is encouraging, it can be shown that HKL selects a conjunction only if all its subsets are selected, and this is highly undesirable. We propose a novel convex formulation which alleviates this problem and generalizes the HKL framework. The main technical contribution of this paper is an efficient mirror-descent based active set algorithm for solving the new formulation. Empirical evaluations on REL problems illustrate the utility of generalized HKL.
[download] [bibTeX] [discuss]
An Augmented Lagrangian Approach to Constrained MAP Inference
Andre Martins, Mario Figueiredo, pedro Aguiar, Noah Smith, Eric Xing
Abstract:We propose a new fast algorithm for approximate MAP inference on factor graphs, which combines augmented Lagrangian optimization with the dual decomposition method. Each slave subproblem is given a quadratic penalty, which pushes toward faster consensus than in previous subgradient approaches. Our algorithm is provably convergent, parallelizable, and suitable for fine decompositions of the graph. We show how it can efficiently handle problems with (possibly global) structural constraints via simple sort operations. Experiments on synthetic and real-world data show that our approach compares favorably with the state-of-the-art.
[download] [bibTeX] [discuss]
Time Series Clustering: Complex is Simpler!
Lei Li, B. Aditya Prakash
Abstract:Given a motion capture sequence, how to identify the category of the motion? Classifying human motions is a critical task in motion editing and synthesizing, for which manual labeling is clearly inefficient for large databases. Here we study the general problem of time series clustering. We propose a novel method of clustering time series that can (a) learn joint temporal dynamics in the data; (b) handle time lags; and (c) produce interpretable features. We achieve this by developing complex-valued linear dynamical systems (CLDS), which include real-valued Kalman filters as a special case; our advantage is that the transition matrix is simpler (just diagonal), and the transmission one easier to interpret. We then present Complex-Fit, a novel EM algorithm to learn the parameters for the general model and its special case for clustering. Our approach produces significant improvement in clustering quality, 1.5 to 5 times better than well-known competitors on real motion capture sequences.
[download] [bibTeX] [discuss]
Max-margin Learning for Lower Linear Envelope Potentials in Binary Markov Random Fields
Stephen Gould
Abstract:The standard approach to max-margin parameter learning for Markov random fields (MRFs) involves incrementally adding the most violated constraints during each iteration of the algorithm. This requires exact MAP inference, which is intractable for many classes of MRF. In this paper, we propose an exact MAP inference algorithm for binary MRFs containing a class of higher-order models, known as lower linear envelope potentials. Our algorithm is polynomial in the number of variables and number of linear envelope functions. With tractable inference in hand, we show how the parameters and corresponding feature vectors can be represented in a max-margin framework for efficiently learning lower linear envelope potentials.
[download] [bibTeX] [discuss]
BCDNPKL: Scalable Non-Parametric Kernel Learning Using Block Coordinate Descent
En-Liang Hu, Bo Wang, SongCan Chen
Abstract:Most existing approaches for non-parametric kernel learning (NPKL) suffer from expensive computation, which would limit their applications to large-scale problems. To address the scalability problem of NPKL, we propose a novel algorithm called BCDNPKL, which is very efficient and scalable. Superior to most existing approaches, BCDNPKL keeps away from semidefinite programming (SDP) and eigen-decomposition, which benefits from two findings: 1) The original SDP framework of NPKL can be reduced into a far smaller-sized counterpart which is corresponding to the sub-kernel (referred to as boundary kernel) learning; 2) The sub-kernel learning can be efficiently solved by using the proposed block coordinate descent (BCD) technique. We provide a formal proof of global convergence for the proposed BCDNPKL algorithm. The extensive experiments verify the scalability and effectiveness of BCDNPKL, compared with the state-of-the-art algorithms.
[download] [bibTeX] [discuss]
Learning Discriminative Fisher Kernels
Laurens Van der Maaten
Abstract:Fisher kernels provide a commonly used vectorial representation of structured objects. The paper presents a technique that exploits label information to improve the object representation of Fisher kernels by employing ideas from metric learning. In particular, the new technique trains a generative model in such a way that the distance between the log-likelihood gradients induced by two objects with the same label is as small as possible, and the distance between the gradients induced by two objects with different labels is as large as possible. We illustrate the strong performance of classifiers trained on the resulting object representations on problems in handwriting recognition, speech recognition, facial expression analysis, and bio-informatics.
[download] [bibTeX] [discuss]
Online AUC Maximization
Peilin Zhao, Steven Hoi, Rong Jin, Tianbao Yang
Abstract:Most studies of online learning measure the performance of a learner by classification accuracy, which is inappropriate for applications where the data are unevenly distributed among different classes. We address this limitation by developing online learning algorithm for maximizing Area Under the ROC curve (AUC), a metric that is widely used for measuring the classification performance for imbalanced data distributions. The key challenge of online AUC maximization is that it needs to optimize the pairwise loss between two instances from different classes. This is in contrast to the classical setup of online learning where the overall loss is a sum of losses over individual training examples. We address this challenge by exploiting the reservoir sampling technique, and present two algorithms for online AUC maximization with theoretic performance guarantee. Extensive experimental studies confirm the effectiveness and the efficiency of the proposed algorithms for maximizing AUC.
[download] [bibTeX] [discuss]
Beat the Mean Bandit
Yisong Yue, Thorsten Joachims
Abstract:The Dueling Bandits Problem is an online learning framework in which actions are restricted to noisy comparisons between pairs of strategies (also known as bandits). It models settings where absolute rewards are difficult to elicit but pairwise preferences are readily available. In this paper, we extend the Dueling Bandits Problem to a relaxed setting where preference magnitudes can violate transitivity. We present the first algorithm for this more general Dueling Bandits Problem and provide theoretical guarantees in both the online and the PAC settings. Furthermore, we show that the new algorithm has stronger guarantees than existing results even in the original Dueling Bandits Problem, which we validate empirically.
[download] [bibTeX] [discuss]
Ultra-Fast Optimization Algorithm for Sparse Multi Kernel Learning
Francesco Orabona, Luo Jie
Abstract:Many state-of-the-art approaches for Multi Kernel Learning (MKL) struggle at finding a compromise between performance, sparsity of the solution and speed of the optimization process. In this paper we look at the MKL problem at the same time from a learning and optimization point of view. So, instead of designing a regularizer and then struggling to find an efficient method to minimize it, we design the regularizer while keeping the optimization algorithm in mind. Hence, we introduce a novel MKL formulation, which mixes elements of p-norm and elastic-net kind of regularization. We also propose a fast stochastic gradient descent method that solves the novel MKL formulation. We show theoretically and empirically that our method has 1) state-of-the-art performance on many classification tasks; 2) exact sparse solutions with a tunable level of sparsity; 3) a convergence rate bound that depends only logarithmically on the number of kernels used, and is independent of the sparsity required; 4) independence on the particular convex loss function used.
[download] [bibTeX] [discuss]
On optimization methods for deep learning
Quoc Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Ng
Abstract:The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with linesearch can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between L-BFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of L-BFGS with locally connected networks and convolutional neural networks. Using L-BFGS, our convolutional network model achieves 0.69\% on the standard MNIST dataset. This is a state-of-the-art result on MNIST among algorithms that do not use distortions or pretraining.
[download] [bibTeX] [discuss]
Multiclass Classification with Bandit Feedback using Adaptive Regularization
Koby Crammer, Claudio Gentile
Abstract:We present a new multiclass algorithm in the bandit framework, where after making a prediction, the learning algorithm receives only partial feedback, i.e., a single bit of right-or-wrong, rather then the true label. Our algorithm is based on the second-order Perceptron, and uses upper-confidence bounds to trade off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where instances are chosen adversarially, while the labels are chosen according to a linear probabilistic model, which is also chosen adversarially. From the theoretical viewpoint, we show a regret of O(\sqrt{T}\log(T)), which improves over the current best bounds of O(T^{2/3}) in the fully adversarial setting. We evaluate our algorithm on nine real-world text classification problems, obtaining state-of-the-art results, even compared with non-bandit online algorithms, especially when label noise is introduced.
[download] [bibTeX] [discuss]
ABC-EP: Expectation Propagation for Likelihood-free Bayesian Computation
Simon Barthelmé, Nicolas Chopin
Abstract:Many statistical models of interest to the natural and social sciences have no tractable likelihood function. Until recently, Bayesian inference for such models was thought infeasible. Pritchard et al. (1999) introduced an algorithm known as ABC, for Approximate Bayesian Computation, that enables Bayesian computation in such models. Despite steady progress since this first breakthrough, such as the adaptation of MCMC and Sequential Monte Carlo techniques to likelihood-free inference, state-of-the art methods remain notoriously hard to use and require enormous computation times. Among other issues, one faces the difficult task of finding appropriate summary statistics for the model, and tuning the algorithm can be time-consuming when little prior information is available. We show that Expectation Propagation, a widely successful approximate inference technique, can be adapted to the likelihood-free context. The resulting algorithm does not require summary statistics, is an order of magnitude faster than existing techniques, and remains usable when prior information is vague.
[download] [bibTeX] [discuss]
Integrating Partial Model Knowledge in Model Free RL Algorithms
Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract:In reinforcement learning an agent uses online feedback from the environment and prior knowledge in order to adaptively select an effective policy. Model free approaches address this task by directly mapping external and internal states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel algorithm which combines them into a single algorithm, which switches between a model based and a model free mode, depending on the current environmental state and on the status of the agent's knowledge. We prove that such an approach leads to improved performance whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach and suggest its efficacy in boosting policy gradient learning.
[download] [bibTeX] [discuss]
Fast Newton-type Methods for Total Variation Regularization
Álvaro Barbero, Suvrit Sra
Abstract:Numerous applications in statistics, signal processing, and machine learning regularize using Total Variation (TV) penalties. We study anisotropic (l1-based) TV and also a related l2-norm variant. We consider for both variants associated (1D) proximity operators, which lead to challenging optimization problems. We solve these problems by developing Newton-type methods that outperform the state-of-the-art algorithms. More importantly, our 1D-TV algorithms serve as building blocks for solving the harder task of computing 2- (and higher)-dimensional TV proximity. We illustrate the computational benefits of our methods by applying them to several applications: (i) image denoising; (ii) image deconvolution (by plugging in our TV solvers into publicly available software); and (iii) four variants of fused-lasso. The results show large speedups--and to support our claims, we provide software accompanying this paper.
[download] [bibTeX] [discuss]
Parallel Coordinate Descent for L1-Regularized Loss Minimization
Joseph Bradley, Aapo Kyrola, Daniel Bickson, Carlos Guestrin
Abstract:We propose Shotgun, a parallel coordinate descent algorithm for minimizing L1-regularized losses. Though coordinate descent seems inherently sequential, we prove convergence bounds for Shotgun which predict linear speedups, up to a problem-dependent limit. We present a comprehensive empirical study of Shotgun for Lasso and sparse logistic regression. Our theoretical predictions on the potential for parallelism closely match behavior on real data. Shotgun outperforms other published solvers on a range of large problems, proving to be one of the most scalable algorithms for L1.
[download] [bibTeX] [discuss]
Approximate Dynamic Programming for Storage Problems
Lauren Hannah, David Dunson
Abstract:Storage problems are an important subclass of stochastic control problems. This paper presents a new method, approximate dynamic programming for storage, to solve storage problems with continuous, convex decision sets. Unlike other solution procedures, ADPS allows math programming to be used to make decisions each time period, even in the presence of large state variables. We test ADPS on the day ahead wind commitment problem with storage.
[download] [bibTeX] [discuss]
Online Submodular Minimization for Combinatorial Structures
Stefanie Jegelka, Jeff Bilmes
Abstract:Most results for online decision problems with structured concepts, such as trees or cuts, assume linear costs. In many settings, however, nonlinear costs are more realistic. Owing to their non-separability, these lead to much harder optimization problems. Going beyond linearity, we address online approximation algorithms for structured concepts that allow the cost to be submodular, i.e., nonseparable. In particular, we show regret bounds for three Hannan-consistent strategies that capture different settings. Our results also tighten a regret bound for unconstrained online submodular minimization.
[download] [bibTeX] [discuss]
The Hierarchical Beta Process for Convolutional Factor Analysis and Deep Learning
Bo Chen, Gungor Polatkan, Guillermo Sapiro, David Dunson, Lawrence Carin
Abstract:A convolutional factor-analysis model is developed, with the number of filters (factors) inferred via the beta process (BP) and hierarchical BP, for single-task and multi-task learning, respectively. The computation of the model parameters is implemented within a Bayesian setting, employing Gibbs sampling; we explicitly exploit the convolutional nature of the expansion to accelerate computations. The model is used in a multi-level (deep) analysis of general data, with specific results presented for image-processing data sets, e.g., classification.
[download] [bibTeX] [discuss]
Topic Modeling with Nonparametric Markov Tree
Haojun Chen, David Dunson, Lawrence Carin
Abstract:A new hierarchical tree-based topic model is developed, based on nonparametric Bayesian techniques. The model has two unique attributes: (i) a child node in the tree may have more than one parent, with the goal of eliminating redundant sub-topics deep in the tree; and (ii) parsimonious sub-topics are manifested, by removing redundant usage of words at multiple scales. The depth and width of the tree are unbounded within the prior, with a retrospective sampler employed to adaptively infer the appropriate tree size based upon the corpus under study. Excellent quantitative results are manifested on five standard data sets, and the inferred tree structure is also found to be highly interpretable.
[download] [bibTeX] [discuss]
Relational Active Learning for Joint Collective Classification Models
Ankit Kuwadekar, Jennifer Neville
Abstract:In many network domains, labeled data may be costly to acquire---indicating a need for {\em relational active learning} methods. Recent work has demonstrated that relational model performance can be improved by taking network structure into account when choosing instances to label. However, in collective inference settings, {\em both} model estimation {\em and} prediction can be improved by acquiring a node's label---since relational models estimate a joint distribution over labels in the network and collective classification methods propagate information from labeled training data during prediction. This conflates improvement in learning with improvement in inference, since labeling nodes can reduce inference error without improving the overall quality of the learned model. Here, we use {\em across-network} classification to separate the effects on learning and prediction, and focus on reduction of learning error. When label propagation is used for learning, we find that labeling based on prediction {\em certainty} is more effective than labeling based on {\em uncertainty}. As such, we propose a novel active learning method that combines a network-based {\em certainty} metric with semi-supervised learning and relational resampling. We evaluate our approach on synthetic and real-world networks and show faster learning compared to several baselines, including the network based method of Bilgic et al. 2010.
[download] [bibTeX] [discuss]
A Co-training Approach for Multi-view Spectral Clustering
Abhishek Kumar, Hal Daume III, University of Maryland
Abstract:We propose a spectral clustering algorithm for the multi-view setting where we have access to multiple views of the data, each of which can be independently used for clustering. Our spectral clustering algorithm has a flavor of co-training, which is already a widely used idea in semi-supervised learning. We work on the assumption that the true underlying clustering would assign a point to the same cluster irrespective of the view. Hence, we constrain our approach to only search for the clusterings that agree across the views. Our algorithm does not have any hyperparameters to set, which is a major advantage in unsupervised learning. We empirically compare with a number of baseline methods on synthetic and real-world datasets to show the efficacy of the proposed algorithm.
[download] [bibTeX] [discuss]
Adaptive Kernel Approximation for Large-Scale Non-Linear SVM Prediction
Michele Cossalter, Rong Yan, Lu Zheng
Abstract:The applicability of non-linear support vector machines (SVMs) has been limited in large-scale data collections because of their linear prediction complexity to the size of support vectors. We propose an efficient prediction algorithm with performance guarantee for non-linear SVMs, termed AdaptSVM. It can selectively collapse the kernel function computation to a reduced set of support vectors, compensated by an additional correction term that can be easily computed on-line. It also allows adaptive fall-back to original kernel computation based on its estimated variance and maximum error tolerance. In addition to theoretical analysis, we empirically evaluate on multiple large-scale datasets to show that the proposed algorithm can speed up the prediction process up to 10000 times with only <0.5 accuracy loss.
[download] [bibTeX] [discuss]
Risk-Based Generalizations of f-divergences
Darío García-García, Ulrike von Luxburg, Raúl Santos-Rodríguez
Abstract:We derive a generalized notion of f-divergences, called (f,l)-divergences. We show that this generalization enjoys many of the nice properties of f-divergences, although it is a richer family. It also provides alternative definitions of standard divergences in terms of surrogate risks. As a first practical application of this theory, we derive a new estimator for the Kulback-Leibler divergence that we use for clustering sets of vectors.
[download] [bibTeX] [discuss]
Learning Multi-View Neighborhood Preserving Projections
Novi Quadrianto, Christoph Lampert
Abstract:We address the problem of metric learning for multi-view data, namely the construction of embedding projections from data in different representations into a shared feature space, such that the Euclidean distance in this space provides a meaningful within-view as well as between-view similarity. Our motivation stems from the problem of cross-media retrieval tasks, where the availability of a joint Euclidean distance function is a prerequisite to allow fast, in particular hashing-based, nearest neighbor queries. We formulate an objective function that expresses the intuitive concept that matching samples are mapped closely together in the output space, whereas non-matching samples are pushed apart, no matter in which view they are available. The resulting optimization problem is not convex, but it can be decomposed explicitly into a convex and a concave part, thereby allowing efficient optimization using the convex-concave procedure. Experiments on an image retrieval task show that nearest-neighbor based cross-view retrieval is indeed possible, and the proposed technique improves the retrieval accuracy over baseline techniques.
[download] [bibTeX] [discuss]
Minimax Learning Rates for Bipartite Ranking and Plug-in Rules
Sylvain Robbiano, Stéphan Clémençon
Abstract:While it is now well-known in the standard binary classication setup, that, under suitable margin assumptions and complexity conditions on the regression function, fast or even super-fast rates (i.e. rates faster than n^(-1/2) or even faster than n^-1) can be achieved by plug-in classiers, no result of this nature has been proved yet in the context of bipartite ranking, though akin to that of classication. It is the main purpose of the present paper to investigate this issue. Viewing bipartite ranking as a nested continuous collection of cost-sensitive classication problems, we exhibit a global low noise condition under which certain plug-in ranking rules can be shown to achieve fast (but not super-fast) rates, establishing thus minimax upper bounds for the excess of ranking risk.
[download] [bibTeX] [discuss]
Task Space Retrieval Using Inverse Feedback Control
Nikolay Jetchev, Marc Toussaint
Abstract:Learning complex skills by repeating and generalizing expert behavior is a fundamental problem in robotics. A common approach is learning from demonstration: given examples of correct motions, learn a policy mapping state to action consistent with the training data. However, the usual approaches do not answer the question of what are appropriate representations to generate motions for specific tasks. Inspired by Inverse Optimal Control, we present a novel method to learn latent costs, imitate and generalize demonstrated behavior, and discover a task relevant motion representation: Task Space Retrieval Using Inverse Feedback Control (TRIC). We use the learned latent costs to create motion with a feedback controller. We tested our method on robot grasping of objects, a challenging high-dimensional task. TRIC learns the important control dimensions for the grasping task from a few example movements and is able to robustly approach and grasp objects in new situations.
[download] [bibTeX] [discuss]
Bayesian CCA via Group Sparsity
Seppo Virtanen, Arto Klami, Samuel Kaski
Abstract:Bayesian treatments of Canonical Correlation Analysis (CCA) -type latent variable models have been recently proposed for coping with overfitting in small sample sizes, as well as for producing factorizations of the data sources into correlated and non-shared effects. However, all of the current implementations of Bayesian CCA and its extensions are computationally inefficient for high-dimensional data and, as shown in this paper, break down completely for high-dimensional sources with low sample count. Furthermore, they cannot reliably separate the correlated effects from non-shared ones. We propose a new Bayesian CCA variant that is computationally efficient and works for high-dimensional data, while also learning the factorization more accurately. The improvements are gained by introducing a group sparsity assumption and an improved variational approximation. The method is demonstrated to work well on multi-label prediction tasks and in analyzing brain correlates of naturalistic audio stimulation.
[download] [bibTeX] [discuss]
PILCO: A Model-Based and Data-Efficient Approach to Policy Search
Marc Deisenroth, Carl Rasmussen
Abstract:In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
[download] [bibTeX] [discuss]
Suboptimal Solution Path Algorithm for Support Vector Machine
Masayuki Karasuyama, Ichiro Takeuchi
Abstract:We consider a suboptimal solution path algorithm for the Support Vector Machine. The solution path algorithm is known as an effective tool for solving a sequence of a parametrized optimization problems in machine learning. However, the algorithm needs to keep strict optimality conditions satisfied everywhere on the path. This requirement narrows the applicability of the path algorithm and adversely affects its computational efficiency. In our algorithm, user can specify tolerances to the optimality and control the trade-off between accuracy of the solution and the computational cost. We also show that our suboptimal solutions can be interpreted as the solution of a perturbed optimization problem from the original one, provide some theoretical analyses of our algorithm based on a novel interpretation. The experimental results demonstrate the effectiveness of our algorithm in terms of efficiency and accuracy.
[download] [bibTeX] [discuss]
Incremental Basis Construction from Temporal Difference Error
Yi Sun, Faustino Gomez, Mark Ring, Jürgen Schmidhuber
Abstract:In many reinforcement-learning (RL) systems, the value function is approximated as a linear combination of a fixed set of basis functions. Performance can be improved by adding to this set. Previous approaches construct a series of basis functions that in sufficient number can eventually represent the value function. In contrast, we show that there is a single, ideal basis function, which can directly represent the value function. Its addition to the set immediately reduces the error to zero---without changing existing weights. Moreover, this ideal basis function is simply the value function that results from replacing the MDP's reward function with its Bellman error. This result suggests a novel method for improving value-function estimation: a primary reinforcement learner estimates its value function using its present basis functions; it then sends its TD error to a secondary learner, which interprets that error as a reward function and estimates the corresponding value function; the resulting value function then becomes the primary learner's new basis function. We present both batch and online versions in combination with incremental basis projection, and demonstrate that the performance is superior to existing methods, especially in the case of large discount factors.
[download] [bibTeX] [discuss]
Predicting Legislative Roll Calls from Text
Sean Gerrish, David Blei
Abstract:We develop several predictive models linking legislative sentiment to legislative text. Our models, which draw on ideas from ideal point estimation and topic models, predict voting patterns based on the contents of bills and infer the political leanings of legislators. With supervised topics, we provide an exploratory window into how the language of the law is correlated with political support. We also derive approximate posterior inference algorithms based on variational methods. Across 12 years of legislative data, we predict specific voting patterns with high accuracy.
[download] [bibTeX] [discuss]
On Bayesian PCA: Automatic Dimensionality Selection and Analytic Solution
Shinichi Nakajima, Masashi Sugiyama, Derin Babacan
Abstract:In probabilistic PCA, the fully Bayesian estimation is computationally intractable. To cope with this problem, two types of approximation schemes were introduced: the partially Bayesian PCA (PB-PCA) where only the latent variables are integrated out, and the variational Bayesian PCA (VB-PCA) where the loading vectors are also integrated out. The VB-PCA was proposed as an improved variant of PB-PCA for enabling automatic dimensionality selection (ADS). In this paper, we investigate whether VB-PCA is really the best choice from the viewpoints of computational efficiency and ADS. We first show that ADS is not the unique feature of VB-PCA---PB-PCA is also actually equipped with ADS. We further show that PB-PCA is more advantageous in computational efficiency than VB-PCA because the global solution of PB-PCA can be computed analytically. However, we also show the negative fact that PB-PCA results in a trivial solution in the empirical Bayesian framework. We next consider a simplified variant of VB-PCA, where the latent variables and loading vectors are assumed to be mutually independent (while the ordinary VB-PCA only requires matrix-wise independence). We show that this simplified VB-PCA is the most advantageous in practice because its empirical Bayes solution experimentally works as well as the original VB-PCA, and its global optimal solution can be computed efficiently in a closed form.
[download] [bibTeX] [discuss]
Learning Linear Functions with Quadratic and Linear Multiplicative Updates
Tom Bylander
Abstract:We analyze variations of multiplicative updates for learning linear functions online. These can be described as substituting exponentiation in the Exponentiated Gradient (EG) algorithm with quadratic and linear functions. Both kinds of updates substitute exponentiation with simpler operations and reduce dependence on the parameter that specifies the sum of the weights during learning. In particular, the linear multiplicative update places no restrictions on the sum of the weights, and, under a wide range of conditions, achieves worst-case behavior close to the EG algorithm. We perform our analysis for square loss and absolute loss, and for regression and classification. We also describe some experiments showing that the performance of our algorithms are comparable to EG and the $p$-norm algorithm.
[download] [bibTeX] [discuss]
Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach
Xavier Glorot, Antoine Bordes, Yoshua Bengio
Abstract:The exponential increase in the availability of online reviews and recommendations makes sentiment classification an interesting topic in academic and industrial research. Reviews can span so many different domains that it is difficult to gather annotated training data for all of them. Hence, this paper studies the problem of domain adaptation for sentiment classifiers, hereby a system is trained on labeled reviews from one source domain but is meant to be deployed on another. We propose a deep learning approach which learns to extract a meaningful representation for each review in an unsupervised fashion. Sentiment classifiers trained with this high-level feature representation clearly outperform state-of-the-art methods on a benchmark composed of reviews of 4 types of Amazon products. Furthermore, this method scales well and allowed us to successfully perform domain adaptation on a larger industrial-strength dataset of 22 domains.
[download] [bibTeX] [discuss]
Learning with Whom to Share in Multi-task Feature Learning
Zhuoliang Kang, Kristen Grauman, Fei Sha
Abstract:In multi-task learning (MTL), multiple tasks are learnt jointly. A major assumption for this paradigm is that all those tasks are indeed related so that the joint training is appropriate and beneficial. In this paper, we study the problem of multi-task learning of shared feature representations among tasks, while simultaneously determining ``with whom'' each task should share. We formulate the problem as a mixed integer programming and provide an alternating minimization technique to solve the optimization problem of jointly identifying grouping structures and parameters. The algorithm monotonically decreases the objective function and converges to a local optimum. Compared to the standard MTL paradigm where all tasks are in a single group, our algorithm improves its performance with statistical significance for three out of the four datasets we have studied. We also demonstrate its advantage over other task grouping techniques investigated in literature.
[download] [bibTeX] [discuss]
Boosting on a Budget: Sampling for Feature-Efficient Prediction
Lev Reyzin
Abstract:In this paper, we tackle the problem of feature-efficient prediction: classification using a limited number of features per test example. We show that modifying an ensemble classifier such as AdaBoost, by sampling hypotheses from its final weighted predictor, is well-suited for this task. We further consider an extension of this problem, where the costs of examining the various features can differ from one another, and we give an algorithm for this more general setting. We prove the correctness of our algorithms and derive bounds for the number of samples needed for given error rates. We also experimentally verify the effectiveness of our methods.
[download] [bibTeX] [discuss]
Speeding-Up Hoeffding-Based Regression Trees With Options
Elena Ikonomovska, Joăo Gama, Bernard Zenko, Saso Dzeroski
Abstract:Data streams are ubiquitous and have in the last two decades become an important research topic. For their predictive non-parametric analysis, Hoeffding-based trees are often a method of choice, offering a possibility of any-time predictions. However, one of their main problems is the delay in learning progress due to the existence of equally discriminative attributes. Options are a natural way to deal with this problem. Option trees build upon regular trees by adding splitting options in the internal nodes. As such they are known to improve accuracy, stability and reduce ambiguity. In this paper, we present on-line option trees for faster learning on numerical data streams. Our results show that options improve the any-time performance of ordinary on-line regression trees, while preserving the interpretable structure of trees and without significantly increasing the computational complexity of the algorithm.
[download] [bibTeX] [discuss]
Linear Regression under Fixed-Rank Constraints: A Riemannian Approach
Gilles Meyer, Silvčre Bonnabel, Rodolphe Sepulchre, University of Ličge
Abstract:In this paper, we tackle the problem of learning a linear regression model whose parameter is a fixed-rank matrix. We study the Riemannian manifold geometry of the set of fixed-rank matrices and develop efficient line-search algorithms. The proposed algorithms have many applications, scale to high-dimensional problems, enjoy local convergence properties and confer a geometric basis to recent contributions on learning fixed-rank matrices. Numerical experiments on benchmarks suggest that the proposed algorithms compete with the state-of-the-art, and that manifold optimization offers a versatile framework for the design of rank-constrained machine learning algorithms.
[download] [bibTeX] [discuss]
Cauchy Graph Embedding
Dijun Luo, Chris Ding, Feiping Nie, Heng Huang
Abstract:Laplacian embedding provides a low-dimensional representation for the nodes of a graph where the edge weights denote pairwise similarity among the node objects. It is commonly assumed that the Laplacian embedding results preserve the local topology of the original data on the low-dimensional projected subspaces, i.e., for any pair of graph nodes with large similarity, they should be embedded closely in the embedded space. However, in this paper, we will show that the Laplacian embedding often cannot preserve local topology well as we expected. To enhance the local topology preserving property in graph embedding, we propose a novel Cauchy graph embedding which preserves the similarity relationships of the original data in the embedded space via a new objective. Consequentially the machine learning tasks (such as k Nearest Neighbor type classifications) can be easily conducted on the embedded data with better performance. The experimental results on both synthetic and real world benchmark data sets demonstrate the usefulness of this new type of embedding.
[download] [bibTeX] [discuss]
Uncovering the Temporal Dynamics of Diffusion Networks
Manuel Gomez Rodriguez, David Balduzzi, Bernhard Schölkopf
Abstract:Time plays an essential role in the diffusion of information, influence and disease over networks. In many cases we only observe when a node copies information, makes a decision or becomes infected -- but the connectivity, transmission rates between nodes and transmission sources are unknown. Inferring the underlying dynamics is of outstanding interest since it enables forecasting, influencing and retarding infections, broadly construed. To this end, we model diffusion processes as discrete networks of continuous temporal processes occurring at different rates. Given cascade data -- observed infection times of nodes -- we infer the edges of the global diffusion network and estimate the transmission rates of each edge that best explain the observed data. The optimization problem is convex. The model naturally (without heuristics) imposes sparse solutions and requires no parameter tuning. The problem decouples into a collection of independent smaller problems, thus scaling easily to networks on the order of hundreds of thousands of nodes. Experiments on real and synthetic data show that our algorithm both recovers the edges of diffusion networks and accurately estimates their transmission rates from cascade data.
[download] [bibTeX] [discuss]
Multiclass Boosting with Hinge Loss based on Output Coding
Tianshi Gao, Daphne Koller
Abstract:Multiclass classification is an important and fundamental problem in machine learning. A popular family of multiclass classification methods belongs to reducing multiclass to binary based on output coding. Several multiclass boosting algorithms have been proposed to learn the coding matrix and the associated binary classifiers in a problem-dependent way. These algorithms can be unified under a sum-of-exponential loss function defined in the domain of margins. Instead, multiclass SVM uses another type of loss function based on hinge loss. In this paper, we present a new output-coding-based multiclass boosting algorithm using the multiclass hinge loss, which we call HingeBoost.OC. HingeBoost.OC is tested on various real world datasets and shows better performance than the existing multiclass boosting algorithm AdaBoost.ERP, one-vs-one, one-vs-all, ECOC and multiclass SVM in a majority of different cases.
[download] [bibTeX] [discuss]
Approximation Bounds for Inference using Cooperative Cuts
Stefanie Jegelka, Jeff Bilmes
Abstract:We analyze a family of probability distributions that are characterized by an embedded combinatorial structure. This family includes models having arbitrary treewidth and arbitrary sized factors. Unlike general models with such freedom, where the most probable explanation (MPE) problem is inapproximable, the combinatorial structure within our model, in particular the indirect use of submodularity, leads to several MPE algorithms that all have approximation guarantees.
[download] [bibTeX] [discuss]
Brier Curves: a New Cost-Based Visualisation of Classifier Performance
Jose Hernandez-Orallo, Peter Flach, Cčsar Ferri
Abstract:It is often necessary to evaluate classifier performance over a range of operating conditions, rather than as a point estimate. This is typically assessed through the construction of curvesover a space, visualising how one or two performance metrics vary with the operating condition. For binary classifiers in particular, cost space is a natural way of showing this range of performance, visualising loss against operating condition. However, the curves which have been traditionally drawn in cost space, known as cost curves, show the optimal loss, and hence assume knowledge of the optimal decision threshold for a given operating condition. Clearly, this leads to an optimistic assessment of classifier performance. In this paper we propose a more natural way of visualising classifier performance in cost space, which is to plot probabilistic loss on the y-axis, i.e., the loss arising from the probability estimates. This new curve provides new ways of understanding classifier performance and new tools to compare classifiers. In addition, we show that the area under this curve is exactly the Brier score, one of the most popular performance metrics for probabilistic classifiers.
[download] [bibTeX] [discuss]
OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning
Arvind Sujeeth, HyoukJoong Lee, Kevin Brown, Tiark Rompf, Hassan Chafi, Michael Wu, Anand Atreya, Martin Odersky, Kunle Olukotun
Abstract:As the size of datasets continues to grow, machine learning applications are becoming increasingly limited by the amount of available computational power. Taking advantage of modern hardware requires using multiple parallel programming models targeted at different devices (e.g. CPUs and GPUs). However, programming these devices to run efficiently and correctly is difficult, error-prone, and results in software that is harder to read and maintain. We present OptiML, a domain-specific language (DSL) for machine learning. OptiML is an implicitly parallel, expressive and high performance alternative to MATLAB and C++. OptiML performs domain-specific analyses and optimizations and automatically generates CUDA code for GPUs. We show that OptiML outperforms explicitly parallelized MATLAB code in nearly all cases.
[download] [bibTeX] [discuss]
Infinite SVM: a Dirichlet Process Mixture of Large-margin Kernel Machines
Jun Zhu, Ning Chen, Eric Xing
Abstract:We present Infinite SVM (iSVM), a Dirichlet process mixture of large-margin kernel machines for multi-way classification. An iSVM enjoys the advantages of both Bayesian nonparametrics in handling the unknown number of mixing components, and large-margin kernel machines in robustly capturing local nonlinearity of complex data. We develop an efficient variational learning algorithm for posterior inference of iSVM, and we demonstrate the advantages of iSVM over Dirichlet process mixture of generalized linear models and other benchmarks on both synthetic and real Flickr image classification datasets.
[download] [bibTeX] [discuss]
Piecewise Bounds for Estimating Bernoulli-Logistic Latent Gaussian Models
Benjamin Marlin, Mohammad Khan, Kevin Murphy
Abstract:Bernoulli-logistic latent Gaussian models (bLGMs) are a useful model class, but accurate parameter estimation is complicated by the fact that the marginal likelihood contains an intractable logistic-Gaussian integral. In this work, we propose the use of fixed piecewise linear and quadratic upper bounds to the logistic-log-partition (LLP) function as a way of circumventing this intractable integral. We describe a framework for approximately computing minimax optimal piecewise quadratic bounds, as well a generalized expectation maximization algorithm based on using piecewise bounds to estimate bLGMs. We prove a theoretical result relating the maximum error in the LLP bound to the maximum error in the marginal likelihood estimate. Finally, we present empirical results showing that piecewise bounds can be significantly more accurate than previously proposed variational bounds.
[download] [bibTeX] [discuss]
Access to Unlabeled Data can Speed up Prediction Time
Ruth Urner, Shai Shalev-Shwartz, Shai Ben-David
Abstract:Semi-supervised learning (SSL) addresses the problem of training a classifier using a small number of labeled examples and many unlabeled examples. Most previous work on SSL focused on how availability of unlabeled data can improve the accuracy of the learned classifiers. In this work we study how unlabeled data can be beneficial for constructing faster classifiers. We propose an SSL algorithmic framework which can utilize unlabeled examples for learning classifiers from a predefined set of fast classifiers. We formally analyze conditions under which our algorithmic paradigm obtains significant improvements by the use of unlabeled data. As a side benefit of our analysis we propose a novel quantitative measure of the so-called cluster assumption. We demonstrate the potential merits of our approach by conducting experiments on the MNIST data set, showing that, when a sufficiently large unlabeled sample is available, a fast classifier can be learned from much fewer labeled examples than without such a sample.
[download] [bibTeX] [discuss]
From PAC-Bayes Bounds to Quadratic Programs for Majority Votes
Jean-Francis Roy, Francois Laviolette, Mario Marchand
Abstract:We propose to construct a weighted majority vote on a set of basis functions by minimizing a risk bound (called the C-bound) that depends on the first two moments of the margin of the Q-convex combination realized on the training data. This bound minimization algorithm turns out to be a quadratic program that can be efficiently solved. A first version of the algorithm is designed for the supervised inductive setting and turns out to be competitive with AdaBoost, MDBoost and the SVM. The second version of the algorithm, designed for the transductive setting, competes well with TSVM. We also propose a new PAC-Bayes theorem that bounds the difference between the "true" value of the C-bound and its empirical estimate and that, unexpectedly, contains no KL-divergence.
[download] [bibTeX] [discuss]
A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance
Peter Flach, Jose Hernandez-Orallo, Cčsar Ferri
Abstract:The area under the ROC curve (AUC), a well-known measure of ranking performance, is also often used as a measure of classification performance, aggregating over decision thresholds as well as class and cost skews. However, David Hand has recently argued that AUC is fundamentally incoherent as a measure of aggregated classifier performance and proposed an alternative measure. Specifically, Hand derives a linear relationship between AUC and expected minimum loss, where the expectation is taken over a distribution of the misclassification cost parameter that depends on the model under consideration. Replacing this distribution with a Beta(2;2) distribution, Hand derives his alternative measure H. In this paper we offer an alternative, coherent interpretation of AUC as linearly related to expected loss. We use a distribution over cost parameter and a distribution over data points, both uniform and hence model independent. Should one wish to consider only optimal thresholds, we demonstrate that a simple and more intuitive alternative to Hands H measure is already available in the form of the area under the cost curve.
[download] [bibTeX] [discuss]
Adaptively Learning the Crowd Kernel
Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, Adam Kalai
Abstract:We introduce an algorithm that, given n objects, learns a similarity matrix over all n^2 pairs, from crowdsourced data *alone*. The algorithm samples responses to adaptively chosen triplet-based relative-similarity queries. Each query has the form "is object a more similar to b or to c?" and is chosen to be maximally informative given the preceding responses. The output is an embedding of the objects into Euclidean space (like MDS); we refer to this as the "crowd kernel." SVMs reveal that the crowd kernel captures prominent and subtle features across a number of domains, such as "is striped" among neckties and "vowel vs. consonant" among letters.
[download] [bibTeX] [discuss]
Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Ng
Abstract:Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice-versa. Our methods are validated on the CUAVE and AVLetters datasets with an audio-visual speech classification task, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.
[download] [bibTeX] [discuss]
On the Robustness of Kernel Density M-Estimators
JooSeuk Kim, Clayton Scott
Abstract:We analyze a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. The KDE based on a Gaussian kernel is interpreted as a sample mean in the associated reproducing kernel Hilbert space (RKHS). This mean is estimated robustly through the use of a robust loss, yielding the so-called robust kernel density estimator (RKDE). This robust sample mean can be found via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Our contributions are summarized as follows. First, we present a representer theorem for the RKDE, which gives an insight into the robustness of the RKDE. Second, we provide necessary and sufficient conditions for kernel IRWLS to converge to the global minimizer, in the Gaussian RKHS, of the objective function defining the RKDE. Third, characterize and provide a method for computing the influence function associated with the RKDE. Fourth, we illustrate the robustness of the RKDE through experiments on several data sets.
[download] [bibTeX] [discuss]
Beam Search based MAP Estimates for the Indian Buffet Process
Piyush Rai, Hal Daume III, University of Maryland
Abstract:Nonparametric latent feature models offer a flexible way to discover the latent features underlying the data, without having to a priori specify their number. The Indian Buffet Process (IBP) is a popular example of such a model. Inference in IBP based models, however, remains a challenge. Sampling techniques such as MCMC can be computationally expensive and can take a long time to converge to the stationary distribution. Variational techniques, although faster than sampling, can be difficult to design, and can still remain slow on large data. In many problems, however, we only seek a maximum a posteriori (MAP) estimate of the latent feature assignment matrix. For such cases, we show that techniques such as beam search can give fast, approximate MAP estimates in the IBP based models. If samples from the posterior are desired, these MAP estimates can also serve as sensible initializers for MCMC based algorithms. Experimental results on a variety of datasets suggest that our algorithms can be a computationally viable alternative to Gibbs sampling, the particle filter, and variational inference based approaches for the IBP, and also perform better than other heuristics such as greedy search.
[download] [bibTeX] [discuss]
Optimal Distributed Online Prediction
Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract:Online prediction methods are typically studied as serial algorithms running on a single processor. In this paper, we present the distributed mini-batch (DMB) framework, a method of converting a serial gradient-based online algorithm into a distributed algorithm, and prove an asymptotically optimal regret bound for smooth convex loss functions and stochastic examples. Our analysis explicitly takes into account communication latencies between computing nodes in a network. We also present robust variants, which are resilient to failures and node heterogeneity in an synchronous distributed environment. Our method can also be used for distributed stochastic optimization, attaining an asymptotically linear speedup. Finally, we empirically demonstrate the merits of our approach on large-scale online prediction problems.
[download] [bibTeX] [discuss]
Message Passing Algorithms for the Dirichlet Diffusion Tree
David Knowles, Jurgen Van Gael, Zoubin Ghahramani
Abstract:We demonstrate efficient approximate inference for the Dirichlet Diffusion Tree, a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven't seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.
[download] [bibTeX] [discuss]
Structure Learning in Ergodic Factored MDPs without Knowledge of the Transition Function's In-Degree
Doran Chakraborty, Peter Stone
Abstract:This paper introduces Learn Structure and Exploit RMax (LSE-RMax), a novel model based structure learning algorithm for ergodic factored-state MDPs. Given a planning horizon that satisfies a condition, LSE-RMax provably guarantees a return very close to the optimal return, with a high certainty, without requiring any prior knowledge of the in-degree of the transition function as input. LSE-RMax is fully implemented with a thorough analysis of its sample complexity. We also present empirical results demonstrating its effectiveness compared to prior approaches to the problem.
[download] [bibTeX] [discuss]
Tree preserving embedding
Albert Shieh, Tatsunori Hashimoto, Edo Airoldi
Abstract:Visualization techniques for complex data are a workhorse of modern scientific pursuits. The goal of visualization is to embed high dimensional data in a low dimensional space, while preserving structure in the data relevant to exploratory data analysis, such as the existence of clusters. However, existing visualization methods often either entirely fail to preserve clusters in embeddings due to the crowding problem or can only preserve clusters at a single resolution. Here, we develop a new approach to visualization, tree preserving embedding (TPE). Our approach takes advantage of the topological notion of connectedness to provably preserve clusters at all resolutions. Our performance guarantee holds for finite samples, which makes TPE a robust method for applications. Our approach suggests new strategies for robust data visualization in practice.
[download] [bibTeX] [discuss]
Tree-Structured Infinite Sparse Factor Model
XianXing Zhang, David Dunson, Lawrence Carin
Abstract:A new tree-structured multiplicative gamma process (TMGP) is developed, for inferring the depth of a tree-based factor-analysis model. This new model is coupled with the nested Chinese restaurant process, to nonparametrically infer the depth and width (structure) of the tree. In addition to developing the model, theoretical properties of the TMGP are addressed, and a novel MCMC sampler is developed. The structure of the inferred tree is used to learn relationships between high-dimensional data, and the model is also applied to compressive sensing and interpolation of incomplete images.
[download] [bibTeX] [discuss]
Preserving Personalized Pagerank in Subgraphs
Andrea Vattani, Deepayan Chakrabarti, Maxim Gurevich
Abstract:Choosing a subgraph that can concisely represent a large real-world graph is useful in many scenarios. The usual strategy employed is to sample nodes so that the induced subgraph matches the original graphs degree distribution, clustering coefficient, etc., but no attempt is made to preserve fine-grained relationships between nodes, which are vital for applications like clustering, classification, and ranking. In this work, we model such relationships via the notion of Personalized PageRank Value (PPV). We show that induced subgraphs output by current sampling methods do not preserve PPVs, and propose algorithms for creating PPV-preserving subgraphs on any given subset of graph nodes. Experiments on three large real-world graphs show that the subgraphs created by our method improve upon induced subgraphs in terms of preserving PPVs, clustering accuracy, and maintaining basic graph properties.
[download] [bibTeX] [discuss]
A Three-Way Model for Collective Learning on Multi-Relational Data
Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel
Abstract:Relational learning is becoming increasingly important in many areas of application. Here, we present a novel approach to relational learning based on the factorization of a three-way tensor. We show that unlike other tensor approaches, our method is able to perform collective learning via the latent components of the model and provide an efficient algorithm to compute the factorization. We substantiate our theoretical considerations regarding the collective learning capabilities of our model by the means of experiments on both a new dataset and a dataset commonly used in entity resolution. Furthermore, we show on common benchmark datasets that our approach achieves better or on-par results, if compared to current state-of-the-art relational learning solutions, while it is significantly faster to compute.
[download] [bibTeX] [discuss]
Variational Inference for Policy Search in changing situations
Gerhard Neumann
Abstract:Many policy search algorithms minimize the Kullback-Leibler (KL) divergence to a certain target distribution in order to fit their policy. The commonly used KL-divergence forces the resulting policy to be 'reward-attracted'. The policy tries to reproduce all positively rewarded experience while negative experience is neglected. However, the KL-divergence is not symmetric and we can also minimize the the reversed KL-divergence, which is typically used in variational inference. The policy now becomes 'cost-averse'. It tries to avoid reproducing any negatively-rewarded experience while maximizing exploration. Due to this 'cost-averseness' of the policy, Variational Inference for Policy Search (VIP) has several interesting properties. It requires no kernel-bandwith nor exploration rate, such settings are determined automatically by the inference. The algorithm meets the performance of state-of-the-art methods while being applicable to simultaneously learning in multiple situations. We concentrate on using VIP for policy search in robotics. We apply our algorithm to learn dynamic counterbalancing of different kinds of pushes with a human-like 4-link robot.
[download] [bibTeX] [discuss]
Learning Scoring Functions with Order-Preserving Losses and Standardized Supervision
David Buffoni, Clément Calauzenes, Patrick Gallinari, Nicolas Usunier
Abstract:We address the problem of designing surrogate losses for learning scoring functions in the context of label ranking. We extend to ranking problems a notion of order preserving losses previously introduced for multiclass classi?cation, and show that these losses lead to consistent formulations with respect to a family of ranking evaluation metrics. An order-preserving loss can be tailored for a given evaluation metric by appropriately setting some weights depending on this metric and the observed supervision. These weights, called the standard form of the supervision, do not always exist, but we show that previous consistency results for ranking were proved in special cases where they do. We then evaluate a new pairwise loss consistent with the (Normalized) Discounted Cumulative Gain on benchmark datasets.
[download] [bibTeX] [discuss]
Contractive Auto-Encoders: Explicit Invariance During Feature Extraction
Salah RIFAI, Pascal Vincent, Xavier Muller, Xavier Glorot, Yoshua Bengio
Abstract:We present in this paper a novel approach for training deterministic auto-encoders. We show that by adding a well chosen penalty term to the classical reconstruction cost function, we can achieve results that equal or surpass those attained by other regularized auto-encoders as well as denoising auto-encoders on a range of datasets. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. We show that this penalty term results in a localized space contraction which in turn yields robust features on the activation layer. Furthermore, we show how this penalty term is related to both regularized auto-encoders and denoising auto-encoders and how it can be seen as a link between deterministic and non-deterministic auto-encoders. We find empirically that this penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold. Finally, we show that by using the learned features to initialize an MLP, we achieve state of the art classification error on a range of datasets, surpassing other methods of pre-training.
[download] [bibTeX] [discuss]
Variational Heteroscedastic Gaussian Process Regression
Miguel Lazaro-Gredilla, Michalis Titsias
Abstract:Standard Gaussian processes (GPs) model observations' noise as constant throughout input space. This is often a too restrictive assumption, but one that is needed for GP inference to be tractable. In this work we present a non-standard variational approximation that allows accurate inference in heteroscedastic GPs (i.e., under input-dependent noise conditions). Computational cost is roughly twice that of the standard GP, and also scales as O(n^3). Accuracy is verified by comparing with the golden standard MCMC and its effectiveness is illustrated on several synthetic and real datasets of diverse characteristics. An application to volatility forecasting is also considered.
[download] [bibTeX] [discuss]
Dynamic Egocentric Models for Citation Networks
Duy Vu, Arthur Asuncion, David Hunter, Padhraic Smyth
Abstract:The analysis of the formation and evolution of networks over time is of fundamental importance to social science, biology, and many other fields. While longitudinal network data sets are increasingly being recorded at the granularity of individual time-stamped events, most studies only focus on collapsed cross-sectional snapshots of the network. In this paper, we introduce a dynamic egocentric framework that models continuous-time network data using multivariate counting processes. For inference, an efficient partial likelihood approach is used, allowing our methods to scale to large networks. We apply our techniques to various citation networks and demonstrate the predictive power and interpretability of the learned statistical models.
[download] [bibTeX] [discuss]
The Constrained Weight Space SVM: Learning with Ranked Features
Kevin Small, Byron Wallace, Carla Brodley, Thomas Trikalinos
Abstract:Applying supervised learning methods to new classification tasks requires domain experts to label sufficient training data for the classifier to achieve acceptable performance. It is desirable to mitigate this annotation effort. To this end, a pertinent observation is that instance labels are often an indirect form of supervision; it may be more efficient to impart domain knowledge directly to the model in the form of labeled features. We present a novel algorithm for exploiting such domain knowledge which we call the Constrained Weight Space SVM (CW-SVM). In addition to exploiting binary labeled features, our approach allows domain experts to provide ranked features, and, more generally, to express arbitrary expected relationships between sets of features. Our empirical results show that the CW-SVM outperforms both baseline supervised learning strategies and previously proposed methods for learning with labeled features.
[download] [bibTeX] [discuss]
Robust Matrix Completion and Corrupted Columns
Yudong Chen, Huan Xu, Constantine Caramanis, Sujay Sanghavi, Dept. of Electrical and Computer Engineering
Abstract:This paper considers the problem of matrix completion, when some number of the columns are arbitrarily corrupted. It is well-known that standard algorithms for matrix completion can return arbitrarily poor results, if even a single column is corrupted. What can be done if a large number, or even a constant fraction of columns are corrupted? In this paper, we study this very problem, and develop an robust and efficient algorithm for its solution. One direct application comes from robust collaborative filtering. Here, some number of users are so-called manipulators, and try to skew the predictions of the algorithm. Significantly, our results hold {\it without any assumptions on the observed entries of the manipulated columns}.
[download] [bibTeX] [discuss]
Online Discovery of Feature Dependencies
Alborz Geramifard, Finale Doshi, Joshua Redding, Nicholas Roy, Jonathan How
Abstract:Online representational expansion techniques have improved the learning speed of existing reinforcement learning (RL) algorithms in low dimensional domains, yet existing online expansion methods do not scale well to high dimensional problems. We conjecture that one of the main difficulties limiting this scaling is that features defined over the full-dimensional state space often generalize poorly. Hence, we introduce incremental Feature Dependency Discovery (iFDD) as a computationally-inexpensive method for representational expansion that can be combined with any online, value-based RL method that uses binary features. Unlike other online expansion techniques, iFDD creates new features in low dimensional subspaces of the full state space where feedback errors persist. We provide convergence and computational complexity guarantees for iFDD, as well as showing empirically that iFDD scales well to high dimensional multi-agent planning domains with hundreds of millions of state-action pairs.
[download] [bibTeX] [discuss]
Minimum Probability Flow Learning
Jascha Sohl-Dickstein, Peter Battaglino, Michael DeWeese
Abstract:Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function and its derivatives. Here we propose a new parameter estimation technique that does not require computing an intractable normalization factor or sampling from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the KL divergence between the data distribution and the distribution produced by running the dynamics for an infinitesimal time. Score matching, minimum velocity learning, and certain forms of contrastive divergence are shown to be special cases of this learning technique. We demonstrate parameter estimation in Ising models, deep belief networks and an independent component analysis model of natural scenes. In the Ising model case, current state of the art techniques are outperformed by at least an order of magnitude in learning time, with lower error in recovered coupling parameters.
[download] [bibTeX] [discuss]
Infinite Dynamic Bayesian Networks
Finale Doshi, David Wingate, Josh Tenenbaum, Nicholas Roy
Abstract:We present the infinite dynamic Bayesian network model (iDBN), a nonparametric, factored state-space model that generalizes dynamic Bayesian networks (DBNs). The iDBN can infer every aspect of a DBN: the number of hidden factors, the number of values each factor can take, and (arbitrarily complex) connections and conditionals between factors and observations. In this way, the iDBN generalizes other nonparametric state space models, which until now generally focused on binary hidden nodes and more restricted connection structures. We show how this new prior allows us to find interesting structure in benchmark tests and on two real-world datasets involving weather data and neural information flow networks.
[download] [bibTeX] [discuss]
The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization
Adam Coates, Andrew Ng
Abstract: While vector quantization (VQ) has been applied widely to generate features for visual recognition problems, much recent work has focused on more powerful methods. In particular, sparse coding has emerged as a strong alternative to traditional VQ approaches and has been shown to achieve consistently higher performance on benchmark datasets. Both approaches can be split into a training phase, where the system learns a dictionary of basis functions from unlabeled data, and an encoding phase, where the dictionary is used to extract features from new inputs. In this work, we investigate the reasons for the success of sparse coding over VQ by decoupling these phases, allowing us to separate out the contributions of the training and encoding in a controlled way. Through extensive experiments on CIFAR, NORB and Caltech 101 datasets, we compare sparse coding and several other training and encoding schemes, including a form of VQ paired with a soft threshold activation function. Our results show not only that we can use fast VQ algorithms for training without penalty, but that we can just as well use randomly chosen exemplars from the training set. Rather than spend resources on training, we find it is more important to choose a good encoder---which can often be as simple as a feed forward non-linearity. Among our results, we demonstrate state-of-the-art performance on both CIFAR and NORB.
[download] [bibTeX] [discuss]
Learning attentional policies for tracking and recognition in video with deep networks
Loris Bazzani, Nando Freitas, Hugo Larochelle, Vittorio Murino, Jo-Anne Ting
Abstract:We propose a novel attentional model for simultaneous object tracking and recognition that is driven by gaze data. Motivated by theories of the human perceptual system, the model consists of two interacting pathways: ventral and dorsal. The ventral pathway models object appearance and classification using deep (factored)-restricted Boltzmann machines. At each point in time, the observations consist of retinal images, with decaying resolution toward the periphery of the gaze. The dorsal pathway models the location, orientation, scale and speed of the attended object. The posterior distribution of these states is estimated with particle filtering. Deeper in the dorsal pathway, we encounter an attentional mechanism that learns to control gazes so as to minimize tracking uncertainty. The approach is modular (with each module easily replaceable with more sophisticated algorithms), straightforward to implement, practically efficient, and works well in simple video sequences.
[download] [bibTeX] [discuss]
Large-Scale Learning of Embeddings with Reconstruction Sampling
Yann Dauphin, Xavier Glorot, Yoshua Bengio
Abstract:In this paper, we present a novel method to speed up the learning of embeddings for large-scale learning tasks involving very sparse data, as is typically the case for Natural Language Processing tasks. Our speed-up method has been developed in the context of Denoising Auto-encoders, which are trained in a purely unsupervised way to capture the input distribution, and learn embeddings for words and text similar to earlier neural language models. The main contribution is a new method to approximate reconstruction error by a sampling procedure. We show how this approximation can be made to obtain an unbiased estimator of the training criterion, and we show how it can be leveraged to make learning much more computationally efficient. We demonstrate the effectiveness of this method on the Amazon and RCV1 NLP datasets. Instead of reducing vocabulary size to make learning practical, our method allows us to train using very large vocabularies. In particular, reconstruction sampling requires 22x less training time on the full Amazon dataset.
[download] [bibTeX] [discuss]
Automatic Feature Decomposition for Single View Co-training
Minmin Chen, Kilian Weinberger, Yixin Chen
Abstract:One of the most successful semi-supervised learning approaches is co-training for multi-view data. In co-training, one trains two classifiers, one for each view, and uses the most confident predictions of the unlabeled data for the two classifiers to ``teach each other''. In this paper, we extend co-training to learning scenarios without an explicit multi-view representation. Inspired by a theoretical analysis of Balcan et. al (2004), we introduce a novel algorithm that splits the feature space during learning, explicitly to encourage co-training to be successful. We demonstrate the efficacy of our proposed method in a weakly-supervised setting on the challenging Caltech-256 object recognition task, where we improve significantly over previous results by (Bergamo & Torresani, 2010) in almost all training-set size settings.
[download] [bibTeX] [discuss]
Mapping kernels for trees
Kilho Shin, Marco Cuturi, Tetsuji Kuboyama
Abstract:We propose a comprehensive survey of tree kernels through the lens of the mapping kernels framework. We argue that most existing tree kernels, as well as many more that are presented for the first time in this paper, fall into a typology of kernels whose seemingly intricate computation can be efficiently factorized to yield polynomial time algorithms. Despite this fact, we argue that a naive implementation of such kernels remains prohibitively expensive to compute. We propose an approach whereby some computations for smaller trees are cached, which speeds up considerably the computation of all these tree kernels. We provide experimental evidence of this fact as well as preliminary results on the performance of these kernels.
[download] [bibTeX] [discuss]
Size-constrained Submodular Minimization through Minimum Norm Base
Kiyohito Nagano, Yoshinobu Kawahara, Kazuyuki Aihara
Abstract:A number of combinatorial optimization problems in machine learning can be described as the problem of minimizing a submodular function. It is known that the unconstrained submodular minimization problem can be solved in strongly polynomial time. However, additional constraints make the problem intractable in many settings. In this paper, we discuss the submodular minimization under a size constraint, which is NP-hard, and generalizes the densest subgraph problem and the uniform graph partitioning problem. Because of NP-hardness, it is difficult to compute an optimal solution even for a prescribed size constraint. In our approach, we do not give approximation algorithms. Instead, the proposed algorithm computes optimal solutions for some of possible size constraints in polynomial time. Our algorithm utilizes the basic polyhedral theory associated with submodular functions. Additionally, we evaluate the performance of the proposed algorithm through computational experiments.
[download] [bibTeX] [discuss]
Locally Linear Support Vector Machines
Lubor Ladicky, Philip Torr
Abstract:Linear support vector machines ({\sc svm}s) have become popular for solving classification tasks due to their fast and simple online application to large scale data sets. However, many problems are not linearly separable. For these problems kernel-based {\sc svm}s are often used, but unlike their linear variant they suffer from various drawbacks in terms of computational and memory efficiency. Their response can be represented only as a function of the set of support vectors, which has been experimentally shown to grow linearly with the size of the training set. In this paper we propose a novel locally linear {\sc svm} classifier with smooth decision boundary and bounded curvature. We show how the functions defining the classifier can be approximated using local codings and show how this model can be optimized in an online fashion by performing stochastic gradient descent with the same convergence guarantees as standard gradient descent method for linear {\sc svm}. Our method achieves comparable performance to the state-of-the-art whilst being significantly faster than competing kernel {\sc svm}s. We generalise this model to locally finite dimensional kernel {\sc svm}.
[download] [bibTeX] [discuss]
Clustering Partially Observed Graphs via Convex Optimization
Ali Jalali, Yudong Chen, Sujay Sanghavi, Huan Xu
Abstract:This paper considers the problem of clustering a partially observed unweighted graph -- i.e. one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse across clusters. We take a novel yet natural approach to this problem, by focusing on finding the clustering that minimizes the number of "disagreements" - i.e. the sum of the number of (observed) missing edges within clusters, and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We show that our algorithm succeeds under certain natural assumptions on the optimal clustering and its disagreements. Our results significantly strengthen existing matrix splitting results for the special case of our clustering problem. Our results directly enhance solutions to the problem of Correlation Clustering with partial observations
[download] [bibTeX] [discuss]
On the Use of Variational Inference for Learning Discrete Graphical Models
Eunho Yang, Pradeep Ravikumar
Abstract:We study the general class of estimators for graphical model structure based on optimizing $\ell_1$-regularized approximate log-likelihood, where the approximate likelihood uses tractable variational approximations of the partition function. We provide a message-passing algorithm that \emph{directly} computes the $\ell_1$ regularized approximate MLE. Further, in the case of certain reweighted entropy approximations to the partition function, we show that surprisingly the $\ell_1$ regularized approximate MLE estimator has a \emph{closed-form}, so that we would no longer need to run through many iterations of approximate inference and message-passing. Lastly, we analyze this general class of estimators for graph structure recovery, or its \emph{sparsistency}, and show that it is indeed sparsistent under certain conditions.
[download] [bibTeX] [discuss]
Generating Text with Recurrent Neural Networks
Ilya Sutskever, James Martens, Geoffrey Hinton
Abstract:Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems. In this paper we demonstrate the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so we introduce a new RNN variant that uses multiplicative (or ``gated'') connections which allow the current input character to determine the transition matrix from one hidden state vector to the next. After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, we were able to surpass the performance of the best previous single method for character-level language modeling -- a hierarchical non-parametric sequence model. To our knowledge this represents the largest recurrent neural network application to date.
[download] [bibTeX] [discuss]
Probabilistic Matrix Addition
Amrudin Agovic, Arindam Banerjee, Snigdhansu Chatterje
Abstract:We introduce Probabilistic Matrix Addition (PMA) for modeling real-valued data matrices by simultaneously capturing covariance structure among rows and among columns. PMA additively combines two latent matrices drawn from two Gaussian Processes respectively over rows and columns. The resulting joint distribution over the observed matrix does not factorize over entries, rows, or columns, and can thus capture intricate dependencies in the matrix. Exact inference in PMA is possible, but involves inversion of large matrices, and can be computationally prohibitive. Efficient approximate inference is possible due to the sparse dependency structure among latent variables. We propose two families of approximate inference algorithms for PMA based on Gibbs sampling and MAP inference. We demonstrate the effectiveness of PMA for missing value prediction and multi-label classification problems.
[download] [bibTeX] [discuss]
Learning Recurrent Neural Networks with Hessian-Free Optimization
James Martens, Ilya Sutskever
Abstract: In this work we resolve the long-outstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain long-term data dependencies. Utilizing recent advances in the Hessian-free optimization approach \citep{hf}, together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems. First, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely long-term dependencies), and second, on three natural and highly complex real-world sequence datasets where we find that our method significantly outperforms the previous state-of-the-art method for training neural sequence models: the Long Short-term Memory approach of \citet{lstm}. Additionally, we offer a new interpretation of the generalized Gauss-Newton matrix of \citet{schraudolph} which is used within the HF approach of Martens.
[download] [bibTeX] [discuss]
Sparse Additive Generative Models of Text
Jacob Eisenstein, Amr Ahmed, Eric Xing
Abstract:Generative models of text typically associate a multinomial with every class label or topic. Even in simple models this requires the estimation of thousands of parameters; in multifaceted latent variable models, standard approaches require additional latent ``switching'' variables for every token, complicating inference. In this paper, we propose an alternative generative model for text. The central idea is that each class label or latent topic is endowed with a model of the deviation in log-frequency from a constant background distribution. This approach has two key advantages: we can enforce sparsity to prevent overfitting, and we can combine generative facets through simple addition in log space, avoiding the need for latent switching variables. We demonstrate the applicability of this idea to a range of scenarios: classification, topic modeling, and more complex multifaceted generative models.
[download] [bibTeX] [discuss]
Classification-based Policy Iteration with a Critic
Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Bruno Scherrer
Abstract:In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates, and as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on the LSTD method. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.
[download] [bibTeX] [discuss]
Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection
Abhimanyu Das, David Kempe
Abstract:We study the problem of selecting a subset of $k$ random variables from a large set, in order to obtain the best linear prediction of another variable of interest. This problem can be viewed in the context of both feature selection and sparse approximation. We analyze the performance of widely used greedy heuristics, using insights from the maximization of submodular functions and spectral analysis. We introduce the submodularity ratio as a key quantity to help understand why greedy algorithms perform well even when the variables are highly correlated. Using our techniques, we obtain the strongest known approximation guarantees for this problem, both in terms of the submodularity ratio and the smallest $k$-sparse eigenvalue of the covariance matrix. We also analyze greedy algorithms for the dictionary selection problem, and significantly improve the previously known guarantees. Our theoretical analysis is complemented by experiments on real-world and synthetic data sets; the experiments show that the submodular ratio is a stronger predictor of the performance of greedy algorithms than other spectral parameters.
[download] [bibTeX] [discuss]
A Spectral Algorithm for Latent Tree Graphical Models
Ankur Parikh, Le Song, Eric Xing
Abstract:Latent variable models are powerful tools for probabilistic modeling, and have been successfully applied to various domains, such as speech analysis and bioinformatics. However, parameter learning algorithms for latent variable models have predominantly relied on local search heuristics such as expectation maximization (EM). We propose a fast, local-minimum-free spectral algorithm for learning latent variable models with arbitrary tree topologies, and show that the joint distribution of the observed variables can be reconstructed from the marginals of triples of observed variables irrespective of the maximum degree of the tree. We demonstrate the performance of our spectral algorithm on synthetic and real datasets; for large training sizes, our algorithm performs comparable to or better than EM while being orders of magnitude faster.
[download] [bibTeX] [discuss]
A Unified Probabilistic Model for Global and Local Unsupervised Feature Selection
Yue Guan, Jennifer Dy, Michael Jordan
Abstract:Existing algorithms for joint clustering and feature selection can be categorized as either global or local approaches. Global methods select a single cluster-independent subset of features, whereas local methods select cluster-specific subsets of features. In this paper, we present a unified probabilistic model that can perform both global and local feature selection for clustering. Our approach is based on a hierarchical beta-Bernoulli prior combined with a Dirichlet process mixture model. We obtain global or local feature selection by adjusting the variance of the beta prior. We provide a variational inference algorithm for our model. In addition to simultaneously learning the clusters and features, this Bayesian formulation allows us to learn both the number of clusters and the number of features to retain. Experiments on synthetic and real data show that our unified model can find global and local features and cluster data as well as competing methods of each type.
[download] [bibTeX] [discuss]
Towards Making Unlabeled Data Never Hurt
Yu-Feng Li, Zhi-Hua Zhou
Abstract:It is usually expected that, when labeled data are limited, the learning performance can be improved by exploiting unlabeled data. In many cases, however, the performances of current semi-supervised learning approaches may be even worse than purely using the limited labeled data.It is desired to have \textit{safe} semi-supervised learning approaches which never degenerate learning performance by using unlabeled data. In this paper, we focus on semi-supervised support vector machines (S3VMs) and propose S4VMs, i.e., safe S3VMs. Unlike S3VMs which typically aim at approaching an optimal low-density separator, S4VMs try to exploit the candidate low-density separators simultaneously to reduce the risk of identifying a poor separator with unlabeled data. We describe two implementations of S4VMs, and our comprehensive experiments show that the overall performance of S4VMs are highly competitive to S3VMs, while in contrast to S3VMs which degenerate performance in many cases, S4VMs are never significantly inferior to inductive SVMs.
[download] [bibTeX] [discuss]
On Random Weights and Unsupervised Feature Learning
Andrew Saxe, pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, Andrew Ng
Abstract:Recently two anomalous results in the literature have shown that certain feature learning architectures can yield useful features for object recognition tasks even with untrained, random weights. In this paper we pose the question: why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this we demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the time-consuming learning process. We then show that a surprising fraction of the performance of certain state-of-the-art methods can be attributed to the architecture alone.
[download] [bibTeX] [discuss]
Doubly Robust Policy Evaluation and Learning
Miroslav Dudik, John Langford, Lihong Li
Abstract: We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the \emph{doubly robust} technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have \emph{either} a good (but not necessarily consistent) model of rewards \emph{or} a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.
[download] [bibTeX] [discuss]
Learning Deep Energy Models
Jiquan Ngiam, Zhenghao Chen, pang Wei Koh, Andrew Ng
Abstract:Deep generative models with multiple hidden layers have been shown to be able to learn meaningful and compact representations of data. In this work we propose deep energy models, a class of models that use a deep feedforward neural network to model the energy landscape that defines a probabilistic model. We are able to efficiently train all layers of our model at the same time, allowing the lower layers of the model to adapt to the training of the higher layers, producing better generative models. We evaluate the generative performance of our models on natural images and demonstrate that joint training of multiple layers yields qualitative and quantitative improvements over greedy layerwise training. We further generalize our models beyond the commonly used sigmoidal neural networks and show how a deep extension of the product of Student-t distributions model achieves good generative performance. Finally, we introduce a discriminative extension of our model and demonstrate that it outperforms other fully-connected models on object recognition on the NORB dataset.
[download] [bibTeX] [discuss]
Bipartite Ranking through Minimization of Univariate Loss
Wojciech Kotlowski, Krzysztof Dembczynski, Eyke Huellermeier
Abstract:Minimization of the rank loss or, equivalently, maximization of the AUC in bipartite ranking calls for minimizing the number of disagreements between pairs of instances. Since the complexity of this problem is inherently quadratic in the number of training examples, it is tempting to ask how much is actually lost by minimizing a simple univariate loss function, as done by standard classification methods, as a surrogate. In this paper, we first note that minimization of 0/1 loss is not an option, as it may yield an arbitrarily high rank loss. We show, however, that better results can be achieved by means of a weighted (cost-sensitive) version of 0/1 loss. Yet, the real gain is obtained through margin-based loss functions, for which we are able to derive proper bounds, not only for rank risk but, more importantly, also for rank regret. The paper is completed with an experimental study in which we address specific questions raised by our theoretical analysis.
[download] [bibTeX] [discuss]
Manifold Identification of Dual Averaging Methods for Regularized Stochastic Online Learning
Sangkyun Lee, Stephen Wright
Abstract:Iterative methods that take steps in approximate subgradient directions have proved to be useful for stochastic learning problems over large or streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, whose purpose is to induce structure (for example, sparsity) in the solution, the solution often lies on a low-dimensional manifold along which the regularizer is smooth. This paper shows that a regularized dual averaging algorithm can identify this manifold with high probability. This observation motivates an algorithmic strategy in which, once a near-optimal manifold is identified, we switch to an algorithm that searches only in this manifold, which typically has much lower intrinsic dimension than the full space, thus converging quickly to a near-optimal point with the desired structure. Computational results are presented to illustrate these claims.
[download] [bibTeX] [discuss]
Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions
Alekh Agarwal, Sahand Negahban, Martin Wainwright
Abstract:We analyze a class of estimators based on a convex relaxation for solving high-dimensional matrix decomposition problems. The observations are the noisy realizations of the sum of an (approximately) low rank matrix $\Theta^\star$ with a second matrix $\Gamma^\star$ endowed with a complementary form of low-dimensional structure. We derive a general theorem that gives upper bounds on the Frobenius norm error for an estimate of the pair $(\Theta^\star, \Gamma^\star)$ obtained by solving a convex optimization problem. We then specialize our general result to two cases that have been studied in the context of robust PCA: low rank plus sparse structure, and low rank plus a column sparse structure. Our theory yields Frobenius norm error bounds for both deterministic and stochastic noise matrices, and in the latter case, they are minimax optimal. The sharpness of our theoretical predictions is also confirmed in numerical simulations.
[download] [bibTeX] [discuss]
Unsupervised Models of Images by Spike-and-Slab RBMs
Aarron Courville, James Bergstra, Yoshua Bengio
Abstract:The spike and slab Restricted Boltzmann Machine (RBM) is defined by having both a real valued ``slab'' variable and a binary ``spike'' variable associated with each unit in the hidden layer. In this paper we generalize and extend the spike and slab RBM to include non-zero means of the conditional distribution over the observed variables conditional on the binary spike variables. We also introduce a term, quadratic in the observed data that we exploit to guarantee the all conditionals associated with the model are well defined -- a guarantee that was absent in the original spike and slab RBM. The inclusion of these generalizations improves the performance of the spike and slab RBM as a feature learner and achieves competitive performance on the CIFAR-10 image classification task. The spike and slab model, when trained in a convolutional configuration, can generate sensible samples that demonstrate that the model has capture the broad statistical structure of natural images.
[download] [bibTeX] [discuss]
Approximating Correlated Equilibria using Relaxations on the Marginal Polytope
Hetunandan Kamisetty, Eric Xing, Christopher Langmead
Abstract:In game theory, a Correlated Equilibrium (CE) is an equilibrium concept that generalizes the more well-known Nash Equilibrium. If the game is represented as a graphical game, the computational complexity of computing an optimum CE is exponential in the tree-width of the graph. In settings where this exact computation is not feasible, it is desirable to approximate the properties of the CE, such as its expected social utility and marginal probabilities. We study outer relaxations of this problem that yield approximate marginal strategies for the players under a variety of utility functions. Results on simulated games and in a real problem involving drug design indicate that our approximations can be highly accurate and can be successfully used when exact computation of CE is infeasible.
[download] [bibTeX] [discuss]
Active Learning from Crowds
Yan Yan, Romer Rosales, Glenn Fung, Jennifer Dy
Abstract:Obtaining labels is expensive or time-consuming, but unlabeled data is often abundant and easy to obtain. Many learning task can profit from intelligently choosing unlabeled instances to be labeled by an oracle also known as active learning, instead of simply labeling all the data or randomly selecting data to be labeled. Supervised learning traditionally relies on an oracle playing the role of a teacher. In the multiple annotator paradigm, an oracle, who knows the ground truth, no longer exists; instead, multiple labelers, with varying expertise, are available for querying. This paradigm posits new challenges to the active learning scenario. We can ask which data sample should be labeled next and which annotator should we query to benefit our learning model the most. In this paper, we develop a probabilistic model for learning from multiple annotators that can also learn the annotator expertise even when their expertise may not be consistently accurate (or inaccurate) across the task domain. In addition, we provide an optimization formulation that allows us to simultaneously learn the most uncertain sample and the annotator/s to query the labels from for active learning. Our active learning approach combines both intelligently selecting samples to label and learning from expertise among multiple labelers to improve learning performance.
[download] [bibTeX] [discuss]
Computational Rationalization: The Inverse Equilibrium Problem
Kevin Waugh, Brian Ziebart, Drew Bagnell
Abstract:Modeling the purposeful behavior of imperfect agents from a small number of observations is a challenging task. When restricted to the single-agent decision-theoretic setting, inverse optimal control techniques assume that observed behavior is an approximately optimal solution to an unknown decision problem. These techniques learn a utility function that explains the example behavior and can then be used to accurately predict or imitate future behavior in similar observed or unobserved situations. In this work, we consider similar tasks in competitive and cooperative multi-agent domains. Here, unlike single-agent settings, a player cannot myopically maximize its reward --- it must speculate on how the other agents may act to influence the game's outcome. Employing the game-theoretic notion of regret and the principle of maximum entropy, we introduce a technique for predicting and generalizing behavior, as well as recovering a reward function in these domains.
[download] [bibTeX] [discuss]
Finite-Sample Analysis of Lasso-TD
Mohammad Ghavamzadeh, Alessandro Lazaric, Remi Munos, Matthew Hoffman
Abstract:In this paper, we analyze the performance of Lasso-TD, a modification of LSTD in which the projection operator is defined as a Lasso problem. We first show that Lasso-TD is guaranteed to have a unique fixed point and its algorithmic implementation coincides with the recently presented LARS-TD and LC-TD methods. We then derive two bounds on the prediction error of Lasso-TD in the Markov design setting, i.e., when the performance is evaluated on the same states used by the method. The first bound makes no assumption, but has a slow rate w.r.t. the number of samples. The second bound is under an assumption on the empirical Gram matrix, called the compatibility condition, but has an improved rate and directly relates the prediction error to the sparsity of the value function in the feature space at hand.
[download] [bibTeX] [discuss]
Generalized Value Functions for Large Action Sets
Jason Pazis, Ron Parr
Abstract:The majority of value function approximation based reinforcement learning algorithms available today, focus on approximating the state (V) or state-action (Q) value function and efficient action selection comes as an afterthought. On the other hand, real-world problems tend to have large action spaces, where evaluating every possible action becomes impractical. This mismatch presents a major obstacle in successfully applying reinforcement learning to real-world problems. In this paper we present a unified view of V and Q functions and arrive at a new space-efficient representation, where action selection can be done exponentially faster, without the use of a model. We then describe how to calculate this new value function efficiently via approximate linear programming and provide experimental results that demonstrate the effectiveness of the proposed approach.
[download] [bibTeX] [discuss]
k-DPPs: Fixed-Size Determinantal Point Processes
Alex Kulesza, Ben Taskar
Abstract:Determinantal point processes (DPPs) have recently been proposed as models for set selection problems where diversity is preferred. For example, they can be used to select diverse sets of sentences to form document summaries, or to find multiple non-overlapping human poses in an image. However, DPPs conflate the modeling of two distinct characteristics: the size of the set, and its content. For many realistic tasks, the size of the desired set is known up front; e.g., in search we may want to show the user exactly ten results. In these situations the effort spent by DPPs modeling set size is not only wasteful, but actually introduces unwanted bias into the modeling of content. Instead, we propose the k-DPP, a conditional DPP that models only sets of cardinality k. In exchange for this restriction, k-DPPs offer greater expressiveness and control over content, and simplified integration into applications like search. We derive algorithms for efficiently normalizing, sampling, and marginalizing k-DPPs, and propose an experts-style algorithm for learning combinations of k-DPPs. We demonstrate the usefulness of the model on an image search task, where k-DPPs significantly outperform MMR as judged by human annotators.
[download] [bibTeX] [discuss]
On Autoencoders and Score Matching for Energy Based Models
Kevin Swersky, Marc'Aurelio Ranzato, David Buchman, Benjamin Marlin, Nando Freitas
Abstract:We consider estimation methods for the class of continuous-data energy based models (EBMs). Our main result shows that estimating the parameters of an EBM using score matching when the conditional distribution over the visible units is Gaussian corresponds to training a particular form of regularized autoencoder. We show how different Gaussian EBMs lead to different autoencoder architectures, providing deep links between these two families of models. We compare the score matching estimator for the mPoT model, a particular Gaussian EBM, to several other training methods on a variety of tasks including image denoising and unsupervised feature extraction. We show that the regularization function induced by score matching leads to superior classification performance relative to a standard autoencoder. We also show that score matching yields classification results that are indistinguishable from better-known stochastic approximation maximum likelihood estimators.
[download] [bibTeX] [discuss]
Generalized Boosting Algorithms for Convex Optimization
Alexander Grubb, Drew Bagnell
Abstract:Boosting is a popular way to derive powerful learners from simpler hypothesis classes. Following previous work (Mason et al., 1999; Friedman, 2000) on general boosting frameworks, we analyze gradient-based descent algorithms for boosting with respect to any convex objective and introduce a new measure of weak learner performance into this setting which generalizes existing work. We present the first weak to strong learning guarantees for the existing gradient boosting work for smooth convex objectives, and also demonstrate that this work fails for non-smooth objectives. To address this issue, we present new algorithms which extend this boosting approach to arbitrary convex loss functions and give corresponding weak to strong convergence results. In addition, we demonstrate experimental results that support our analysis and demonstrate the need for the new algorithms we present.
[download] [bibTeX] [discuss]