## ICML 2008 abstracts

paper ID: 111

### Preconditioned Temporal Difference Learning

* Hengshuai Yao and Zhi-Qiang Liu*

This paper extends many of the recent popular reinforcement learning (RL) algorithms to a generalized framework that includes least-squares temporal difference (LSTD) learning, least-squares policy evaluation (LSPE) and a variant of incremental LSTD (iLSTD). The basis of this extension is a preconditioning technique that tries to solve a stochastic model equation. This paper also studies three significant issues of the new framework: it presents a new rule of step-size that can be computed online, provides an iterative way to apply preconditioning, and reduces the complexity of related algorithms to near that of temporal difference (TD) learning.

[Full paper] [Discussion]

paper ID: 113

### The GroupLASSO for Generalized Linear Models: Uniqueness of Solutions and Efficient Algorithms

* Volker Roth and Bernd Fischer*

The GroupLASSO method for finding important explanatory factors suffers from the potential non-uniqueness of solutions and also from high computational costs. We formulate conditions for the uniqueness of GroupLASSO solutions which lead to an easily implementable test procedure. In addition to merely detecting ambiguities in solutions, this testing procedure identifies all potentially active groups. These results are used to derive an efficient algorithm that can deal with input dimensions in the millions and can approximate the solution path efficiently. The derived methods are applied to large-scale learning problems where they exhibit excellent performance. We show that the proposed testing procedure helps to avoid misinterpretations of GroupLASSO solutions.

[Full paper] [Discussion]

paper ID: 121

### Autonomous Geometric Precision Error Estimation in Low-level Computer Vision Tasks

* Andrés Corrada-Emmanuel and Howard Schultz*

Errors in map-making tasks using computer vision are sparse. We demonstrate this by considering the construction of digital elevation models that employ stereo matching algorithms to triangulate real-world points. This sparsity, coupled with a geometric theory of errors recently developed by the authors, allows for autonomous agents to calculate their own precision independently of ground truth. We connect these developments with recent advances in the mathematics of sparse signal reconstruction or compressed sensing. The theory presented here extends the autonomy of 3-D model reconstructions discovered in the 1990s to their errors.

[Full paper] [Discussion]

paper ID: 125

### A Worst-Case Comparison Between Temporal Difference and Residual Gradient with Linear Function Approximation

* Lihong Li*

Residual gradient (RG) was proposed as an alternative to TD(0) for policy evaluation when function approximation is used, but there exists little formal analysis comparing them except in very limited cases. This paper employs techniques from online learning of linear functions and provides a worst-case analysis to compare these two types of algorithms when linear function approximation is used. No statistical assumptions are made on the sequence of observations. In particular, our results suggest that RG may result in smaller temporal differences, while TD(0) is more likely to yield smaller prediction errors. These phenomena can be observed even in two simple Markov chain examples.
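
The two update rules compared in this abstract can be sketched as follows. This is a minimal illustration under the textbook definitions of TD(0) and residual gradient with linear function approximation, not code from the paper; the function names, the toy feature vectors, and the step size are assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def td0_update(w, phi, phi_next, reward, gamma, alpha):
    """One TD(0) step: move w along phi(s), scaled by the temporal difference."""
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)
    return [wi + alpha * delta * f for wi, f in zip(w, phi)]


def rg_update(w, phi, phi_next, reward, gamma, alpha):
    """One residual-gradient step: gradient descent on 0.5 * delta**2."""
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)
    return [wi + alpha * delta * (f - gamma * fn)
            for wi, f, fn in zip(w, phi, phi_next)]


# Toy transition (illustrative values only): s has features [1, 0], s' has [0, 1].
w0 = [0.0, 0.0]
phi, phi_next = [1.0, 0.0], [0.0, 1.0]
w_td = td0_update(w0, phi, phi_next, reward=1.0, gamma=0.9, alpha=0.5)
w_rg = rg_update(w0, phi, phi_next, reward=1.0, gamma=0.9, alpha=0.5)
```

The only difference is the update direction: TD(0) uses the features of the current state, while RG also subtracts the discounted features of the successor state, which is what makes it a true gradient of the squared temporal difference.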

[Full paper] [Discussion]

paper ID: 129

### Dirichlet Component Analysis: Feature Extraction for Compositional Data

* Hua-Yan Wang, Qiang Yang, Hong Qin, and Hongbin Zha*

We consider feature extraction (dimensionality reduction) for compositional data, where the data vectors are constrained to be positive and constant-sum. In real-world problems, the data components (variables) usually have complicated "correlations" while their total number is huge. Such a scenario demands feature extraction. That is, we shall de-correlate the components and reduce their dimensionality. Traditional techniques such as Principal Component Analysis (PCA) are not suitable for these problems due to unique statistical properties and the need to satisfy the constraints in compositional data. This paper presents a novel approach to feature extraction for compositional data. Our method first identifies a family of dimensionality reduction projections that preserve all relevant constraints, and then finds the optimal projection that maximizes the estimated Dirichlet precision on projected data. It reduces the compositional data to a given lower dimensionality while the components in the lower-dimensional space are de-correlated as much as possible. We develop the theoretical foundation of our approach, and validate its effectiveness on some synthetic and real-world datasets.

[Full paper] [Discussion]

paper ID: 130

### Adaptive p-Posterior Mixture-Model Kernels for Multiple Instance Learning

* Hua-Yan Wang, Qiang Yang, and Hongbin Zha*

In multiple instance learning (MIL), how the instances determine the bag-labels is an essential issue, both algorithmically and intrinsically. In this paper, we show that the mechanism of how the instances determine the bag-labels is different for different application domains, and does not necessarily obey the traditional assumptions of MIL. We therefore propose an adaptive framework for MIL that adapts to different application domains by learning the domain-specific mechanisms merely from labeled bags. Our approach is especially attractive when we encounter novel application domains, for which the mechanisms may be different and unknown. Specifically, we exploit mixture models to represent the composition of each bag and an adaptable kernel function to represent the relationship between the bags. We validate on synthetic MIL datasets that the kernel function automatically adapts to different mechanisms of how the instances determine the bag-labels. We also compare our approach with state-of-the-art MIL techniques on real-world benchmark datasets.

[Full paper] [Discussion]

paper ID: 145

### Pairwise Constraint Propagation by Semidefinite Programming for Semi-Supervised Classification

* Zhenguo Li, Jianzhuang Liu, and Xiaoou Tang*

We consider the general problem of learning from pairwise constraints and unlabeled data. The pairwise constraints specify whether two objects belong to the same class or not, known as the must-link constraints and the cannot-link constraints. We propose to learn a mapping that is smooth over the data graph and maps the data onto a unit hypersphere, where two must-link objects are mapped to the same point while two cannot-link objects are mapped to be orthogonal. We show that such a mapping can be achieved by formulating a semidefinite programming problem, which is convex and can be solved globally. Our approach can effectively propagate pairwise constraints to the whole data set. It can be directly applied to multi-class classification and can handle data labels, pairwise constraints, or a mixture of them in a unified framework. Promising experimental results are presented for classification tasks on a variety of synthetic and real data sets.

[Full paper] [Discussion]

paper ID: 150

### Cost-Sensitive Multi-class Classification from Probability Estimates

* Deirdre O'Brien, Maya Gupta, and Robert Gray*

For two-class classification, it is common to classify by setting a threshold on class probability estimates, where the threshold is determined by ROC curve analysis. An analog for multi-class classification is learning a new class partitioning of the multiclass probability simplex to minimize empirical misclassification costs. We analyze the interplay between systematic errors in the class probability estimates and cost matrices for multi-class classification. We explore the effect on the class partitioning of five different transformations of the cost matrix. Experiments on benchmark datasets with naive Bayes and quadratic discriminant analysis show the effectiveness of learning a new partition matrix compared to previously proposed methods.
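
For context, the standard plug-in rule that such a learned partition is measured against can be sketched as below: predict the class with the smallest expected cost under the estimated probabilities. This is the textbook baseline, not the partition-learning method of the paper; the function name, the cost-matrix convention (`cost[i][j]` is the cost of predicting `j` when the truth is `i`), and the numbers are assumptions for illustration.

```python
def min_expected_cost_class(probs, cost):
    """Return (best class, expected cost of each prediction).

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting j when the true class is i (assumed convention)
    """
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    best = min(range(n), key=expected.__getitem__)
    return best, expected


probs = [0.6, 0.3, 0.1]

# With 0-1 costs the rule reduces to picking the most probable class.
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
plain_choice, _ = min_expected_cost_class(probs, zero_one)

# Illustrative asymmetric costs: missing class 2 is ten times as expensive.
asymmetric = [[0, 1, 1], [1, 0, 1], [10, 10, 0]]
costly_choice, _ = min_expected_cost_class(probs, asymmetric)
```

With the asymmetric matrix the rule picks class 2 even though it is the least probable, which shows how sensitive the decision regions on the simplex are to both the cost matrix and any systematic error in `probs`.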

[Full paper] [Discussion]

paper ID: 151

### Fast Gaussian Process Methods for Point Process Intensity Estimation

* John Cunningham, Krishna Shenoy, and Maneesh Sahani*

Point processes are difficult to analyze because they provide only a sparse and noisy observation of the intensity function driving the process. Gaussian Processes offer an attractive theoretical framework by which to infer optimal estimates of these underlying intensity functions. The result of this inference is a continuous function defined across time that is typically more amenable to analytical efforts. However, a naive implementation of this intensity estimation will become computationally infeasible in any problem of reasonable size, both in memory and run-time requirements. We demonstrate problem specific methods for a class of renewal processes that eliminate the memory burden and reduce the solve time by orders of magnitude.

[Full paper] [Discussion]

paper ID: 158

### Localized Multiple Kernel Learning

* Mehmet Gonen and Ethem Alpaydin*

Recently, instead of selecting a single kernel, multiple kernel learning (MKL) has been proposed which uses a convex combination of kernels, where the weight of each kernel is optimized during training. However, MKL assigns the same weight to a kernel over the whole input space. In this paper, we develop a localized multiple kernel learning (LMKL) algorithm using a gating model for selecting the appropriate kernel function locally. The localizing gating model and the kernel-based classifier are coupled and their optimization is done in a joint manner. Empirical results on ten benchmark and two bioinformatics data sets validate the applicability of our approach. LMKL achieves statistically similar accuracy results compared with MKL by storing fewer support vectors. LMKL can also combine multiple copies of the same kernel function localized in different parts. For example, LMKL with multiple linear kernels gives better accuracy results than using a single linear kernel on bioinformatics data sets.

[Full paper] [Discussion]

paper ID: 160

### Causal Modelling Combining Instantaneous and Lagged Effects: an Identifiable Model Based on Non-Gaussianity

* Aapo Hyvarinen, Shohei Shimizu, and Patrik Hoyer*

Causal analysis of continuous-valued variables typically uses either autoregressive models or linear Gaussian Bayesian networks with instantaneous effects. Estimation of Gaussian Bayesian networks poses serious identifiability problems, which is why it was recently proposed to use non-Gaussian models. Here, we show how to combine the non-Gaussian instantaneous model with autoregressive models. We show that such a non-Gaussian model is identifiable without prior knowledge of network structure, and we propose an estimation method shown to be consistent. This approach also points out how neglecting instantaneous effects can lead to completely wrong estimates of the autoregressive coefficients.

[Full paper] [Discussion]

paper ID: 163

### Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization

* Haiping Lu, Konstantinos Plataniotis, and Anastasios Venetsanopoulos*

Tensorial data are frequently encountered in various machine learning tasks today and dimensionality reduction is one of their most important applications. This paper extends the classical principal component analysis (PCA) to its multilinear version by proposing a novel dimensionality reduction algorithm for tensorial data, named uncorrelated multilinear PCA (UMPCA). UMPCA seeks a tensor-to-vector projection that captures most of the variation in the original tensorial input while producing uncorrelated features through successive variance maximization. We evaluate the proposed algorithm on a second-order tensorial problem, face recognition, and the experimental results show its superiority, especially in low-dimensional spaces, through the comparison with three other PCA-based algorithms.

[Full paper] [Discussion]

paper ID: 166

### A Dual Coordinate Descent Method for Large-scale Linear SVM

* Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan*

In many applications, data appear with a huge number of instances as well as features. The linear support vector machine (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an epsilon-accurate solution in O(log (1/epsilon)) iterations. Experiments indicate that our method is much faster than state-of-the-art solvers such as Pegasos, Tron, svmperf, and a recent primal coordinate descent implementation.
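
A dual coordinate descent loop for the L1-loss (hinge) case can be sketched roughly as follows. This is a simplified illustration of the general technique, assuming no bias term, a fixed sweep order, and a fixed number of epochs; the function name, the defaults, and the toy data are assumptions, and the paper's implementation details (such as shrinking or randomized ordering) are not reproduced here.

```python
def dual_cd_l1_svm(X, y, C=1.0, epochs=20):
    """Dual coordinate descent for a linear SVM with hinge loss (no bias term).

    Maintains w = sum_i alpha_i * y_i * x_i so that each coordinate update
    costs O(number of features) instead of O(number of instances).
    """
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    q_diag = [sum(v * v for v in x) for x in X]  # Q_ii = x_i . x_i
    for _ in range(epochs):
        for i in range(n):
            if q_diag[i] == 0.0:
                continue  # an all-zero instance cannot change w
            grad = y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) - 1.0
            # Projected gradient with respect to the box constraint 0 <= alpha_i <= C.
            if alpha[i] == 0.0:
                pg = min(grad, 0.0)
            elif alpha[i] == C:
                pg = max(grad, 0.0)
            else:
                pg = grad
            if pg == 0.0:
                continue  # this coordinate is already optimal
            new_alpha = min(max(alpha[i] - grad / q_diag[i], 0.0), C)
            step = (new_alpha - alpha[i]) * y[i]
            w = [wj + step * xj for wj, xj in zip(w, X[i])]
            alpha[i] = new_alpha
    return w, alpha


# Toy data, separable by a hyperplane through the origin (illustrative only).
X = [[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, alpha = dual_cd_l1_svm(X, y, C=1.0)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x in X]
```

The key design point is that `w` is updated incrementally alongside `alpha`, so the gradient of one dual coordinate needs only a single inner product with the corresponding instance.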

[Full paper] [Discussion]

paper ID: 167

### Listwise Approach to Learning to Rank - Theory and Algorithm

* Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li*

This paper aims to conduct a comprehensive study on the listwise approach to learning to rank. The listwise approach learns a ranking function by taking individual lists as instances and minimizing a loss function defined on two lists (one is the predicted result and the other is the ground truth). Existing work on the approach mainly focused on the development of new algorithms; methods such as RankCosine and ListNet have been proposed and better performances by them have also been observed. Unfortunately, the underlying theory has not been sufficiently studied so far. To amend the problem, this paper proposes conducting theoretical analysis of learning to rank algorithms through investigation on the properties of the loss functions, including consistency, soundness, continuity, differentiability, convexity, and efficiency. A sufficient condition on consistency for ranking is given, which seems to be the first such result obtained in related research. The paper then conducts analysis on three loss functions: likelihood loss, cosine loss, and cross entropy loss. The latter two were used in RankCosine and ListNet respectively. The use of likelihood loss leads to the development of a new listwise method called ListMLE, whose loss function offers better properties. Experimental results have also verified the correctness of the theoretical results obtained in the paper.
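
A likelihood loss of the kind mentioned here is commonly written as the negative log-likelihood of the ground-truth permutation under a Plackett-Luce model of the predicted scores. The sketch below follows that common formulation; it is an assumption that this matches the paper's exact definition, and the function name and example scores are illustrative.

```python
import math


def likelihood_loss(scores, ranking):
    """Negative Plackett-Luce log-likelihood of `ranking` given `scores`.

    scores  : predicted score for each item
    ranking : item indices in ground-truth order, best item first
    """
    ordered = [scores[i] for i in ranking]
    loss = 0.0
    for k in range(len(ordered)):
        tail = ordered[k:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_norm - ordered[k]
    return loss


tie_loss = likelihood_loss([0.0, 0.0], [0, 1])
agree_loss = likelihood_loss([3.0, 1.0, 0.0], [0, 1, 2])
reverse_loss = likelihood_loss([3.0, 1.0, 0.0], [2, 1, 0])
```

The loss is smallest when the scores sort the items in the ground-truth order, and each term is a log-sum-exp minus a linear term, so the loss is convex in the scores.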

[Full paper] [Discussion]

paper ID: 168

### Efficient MultiClass Maximum Margin Clustering

* Bin Zhao, Fei Wang, and Changshui Zhang*

This paper presents a cutting plane algorithm for multiclass maximum margin clustering (MMC). The proposed algorithm constructs a nested sequence of successively tighter relaxations of the original MMC problem, and each optimization problem in this sequence could be efficiently solved using the constrained concave-convex procedure (CCCP). Experimental evaluations on several real world datasets show that our algorithm converges much faster than existing MMC methods with guaranteed accuracy, and can thus handle much larger datasets efficiently.

[Full paper] [Discussion]

paper ID: 172

### Spectral Clustering with Inconsistent Advice

* Tom Coleman, James Saunderson, and Anthony Wirth*

Clustering with advice (often known as constrained clustering) has been a recent focus of the data mining community. Success has been achieved incorporating advice into the k-means framework, as well as spectral clustering. Although the theory community has explored inconsistent advice, it has not yet been incorporated into spectral clustering. Extending work of De Bie and Cristianini, we set out a framework for finding minimum normalized cuts, subject to inconsistent advice. Our results suggest that the framework will be successful in many situations.

[Full paper] [Discussion]

paper ID: 178

### Nearest Hyperdisk Methods for High-Dimensional Classification

* Hakan Cevikalp, Bill Triggs, and Robi Polikar*

In high-dimensional classification problems it is infeasible to include enough training samples to cover the class regions densely. Irregularities in the resulting sparse sample distributions cause local classifiers such as Nearest Neighbors (NN) and kernel methods to have irregular decision boundaries. One solution is to "fill in the holes" by building a convex model of the region spanned by the training samples of each class and classifying examples based on their distances to these approximate models. Methods of this kind based on affine and convex hulls and bounding hyperspheres have already been studied. Here we propose a method based on the bounding hyperdisk of each class -- the intersection of the affine hull and the smallest bounding hypersphere of its training samples. We argue that in many cases hyperdisks are preferable to affine and convex hulls and hyperspheres: they bound the classes more tightly than affine hulls or hyperspheres while avoiding much of the sample overfitting and computational complexity that is inherent in high-dimensional convex hulls. We show that the hyperdisk method can be kernelized to provide nonlinear classifiers based on non-Euclidean distance metrics. Experiments on several classification problems show promising results.

[Full paper] [Discussion]

paper ID: 179

### Query-Level Stability and Generalization in Learning to Rank

* Yanyan Lan, Tie-Yan Liu, Tao Qin, Zhiming Ma, and Hang Li*

This paper is concerned with the generalization ability of learning to rank algorithms for information retrieval (IR). We point out that the key for addressing the learning problem is to look at it from the viewpoint of query, and we give a formulation of learning to rank for IR based on the consideration. We define a number of new concepts within the framework, including query-level loss, query-level risk, and query-level stability. We then analyze the generalization ability of learning to rank algorithms by giving query-level generalization bounds to them using query-level stability as a tool. Such an analysis is very helpful for us to derive more advanced algorithms for IR. We apply the proposed theory to the existing algorithms of Ranking SVM and IRSVM. Experimental results on the two algorithms verify the correctness of the theoretical analysis.

[Full paper] [Discussion]

paper ID: 180

### Local Likelihood Modeling of Temporal Text Streams

* Guy Lebanon and Yang Zhao*

Temporal text data is often generated by a time-changing process or distribution. Such a drift in the underlying distribution cannot be captured by stationary likelihood techniques. We consider the application of local likelihood methods to generative and conditional modeling of temporal document sequences. We examine the asymptotic bias and variance and present an experimental study using the RCV1 dataset containing a temporal sequence of Reuters news stories.

[Full paper] [Discussion]

paper ID: 182

### Inverting the Viterbi Algorithm: an Abstract Framework for Structure Design

* Michael Schnall-Levin, Leonid Chindelevitch, and Bonnie Berger*

Probabilistic grammatical formalisms such as hidden Markov models (HMMs) and stochastic context-free grammars (SCFGs) have been extensively studied and widely applied in a number of fields. Here, we introduce a new algorithmic problem on HMMs and SCFGs that arises naturally from protein and RNA design, and which has not been previously studied. The problem can be viewed as an inverse to the one solved by the Viterbi algorithm on HMMs or by the CKY algorithm on SCFGs. We study this problem theoretically and obtain the first algorithmic results. We prove that the problem is NP-complete, even for a 3-letter emission alphabet, via a reduction from 3-SAT, a result that has implications for the hardness of RNA secondary structure design. We then develop a number of approaches for making the problem tractable. In particular, for HMMs we develop a branch-and-bound algorithm, which can be shown to have fixed-parameter tractable worst-case running time, exponential in the number of states of the HMM but linear in the length of the structure. We also show how to cast the problem as a Mixed Integer Linear Program.
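
For reference, the forward problem that this work inverts is Viterbi decoding: given an HMM and an observation sequence, find the most probable state path. The sketch below is a plain textbook version; the toy model (state names, probabilities, observations) is made up for illustration and has nothing to do with the paper's protein or RNA design setting.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for `obs` and its joint probability."""
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy two-state HMM (all numbers are illustrative assumptions).
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

The inverse problem studied in the abstract starts from a desired state path (the structure) and asks for an emitted sequence whose Viterbi decoding is that path.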

[Full paper] [Discussion]

paper ID: 196

### Estimating Local Optimums in EM Algorithm over Gaussian Mixture Model

* Zhenjie Zhang, Bing Tian Dai, and Anthony K.H. Tung*

The EM algorithm is a very popular method for estimating the parameters of a Gaussian mixture model from a large observation set. However, in most cases, the EM algorithm is not guaranteed to converge to the global optimum. Instead, it stops at some local optimum, which can be much worse than the global optimum. Therefore, it is usually necessary to run the EM algorithm multiple times with different initial configurations and return the best solution. To improve the efficiency of this scheme, we propose a new method that estimates an upper bound on the log-likelihood of the local optimum, based on the current configuration after the latest EM iteration. This is accomplished by first deriving a region bounding the possible locations of the local optimum, followed by an upper-bound estimate of the maximum likelihood. With this estimate, we can terminate an EM run if the estimated local optimum is definitely worse than the best solution seen so far. Extensive experiments show that our method can effectively and efficiently accelerate the conventional EM algorithm.

[Full paper] [Discussion]

paper ID: 197

### Efficiently Learning Linear-Linear Exponential Family Predictive Representations of State

* David Wingate and Satinder Singh*

Exponential Family PSR (EFPSR) models capture stochastic dynamical systems by representing state as the parameters of an exponential family distribution over a short-term window of future observations. They are appealing from a learning perspective because they are fully observed (meaning expressions for maximum likelihood do not involve hidden quantities), but are still expressive enough to both capture existing models (such as POMDPs and linear dynamical systems) and predict new models. While learning algorithms based on maximizing exact likelihood exist, they are not computationally feasible. We present a new, computationally efficient, learning algorithm based on an approximate likelihood function. The algorithm can be interpreted as attempting to induce stationary distributions of observations, features and states which match their empirically observed counterparts. The approximate likelihood, and the idea of matching stationary distributions, may have application in other models.

[Full paper] [Discussion]

paper ID: 202

### Learning to Classify with Missing and Corrupted Features

* Ofer Dekel and Ohad Shamir*

After a classifier is trained using a machine learning algorithm and put to use in a real world system, it often faces noise which did not appear in the training data. Particularly, some subset of features may be missing or may become corrupted. We present two novel machine learning techniques that are robust to this type of classification-time noise. First, we solve an approximation to the learning problem using linear programming. We analyze the tightness of our approximation and prove statistical risk bounds for this approach. Second, we define the online-learning variant of our problem, address this variant using a modified Perceptron, and obtain a statistical learning algorithm using an online-to-batch technique. We conclude with a set of experiments that demonstrate the effectiveness of our algorithms.

[Full paper] [Discussion]

paper ID: 209

### Multi-Task Compressive Sensing with Dirichlet Process Priors

* Yuting Qi, Dehong Liu, David Dunson, and Lawrence Carin*

Compressive sensing (CS) is an emerging field that, under appropriate conditions, can significantly reduce the number of measurements required for a given signal. In many applications, one is interested in multiple signals that may be measured in multiple CS-type measurements, where each signal corresponds to a sensing "task". In this paper we propose a novel multi-task compressive sensing framework based on a Bayesian formalism, where a Dirichlet process (DP) prior is employed, yielding a principled means of simultaneously inferring the appropriate sharing mechanisms as well as CS inversion for each task. A variational Bayesian (VB) inference algorithm is employed to estimate the full posterior on the model parameters.

[Full paper] [Discussion]

paper ID: 215

### Fast Solvers and Efficient Implementations for Distance Metric Learning

* Kilian Weinberger and Lawrence Saul*

In this paper we study how to improve nearest neighbor classification by learning a Mahalanobis distance metric. We build on a recently proposed framework for distance metric learning known as large margin nearest neighbor (LMNN) classification. Within this framework, we focus specifically on the challenges in scalability and adaptability posed by large data sets. Our paper makes three contributions. First, we describe a highly efficient solver for the particular instance of semidefinite programming that arises in LMNN classification; our solver can handle problems with billions of large margin constraints in a few hours. Second, we show how to reduce both training and testing times using metric ball trees; the speedups from ball trees are further magnified by learning low dimensional representations of the input space. Third, we show how to learn different Mahalanobis distance metrics in different parts of the input space. For large data sets, these mixtures of locally adaptive metrics lead to even lower error rates.
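
The object being learned here, a Mahalanobis metric, can be illustrated with a few lines of code. The sketch below parameterizes the metric as M = LᵀL, which is one standard way to keep it positive semidefinite; the function name and the example matrices are assumptions for illustration, and nothing here reproduces the LMNN solver itself.

```python
def mahalanobis_sq(x, y, L):
    """Squared Mahalanobis distance (x - y)^T L^T L (x - y).

    L is given as a list of rows; with fewer rows than input dimensions it
    also acts as a projection to a lower-dimensional space.
    """
    diff = [a - b for a, b in zip(x, y)]
    projected = [sum(l * d for l, d in zip(row, diff)) for row in L]
    return sum(p * p for p in projected)


x, y = [1.0, 1.0], [0.0, 0.0]

# Full-rank example: stretch the first coordinate by a factor of two.
d_full = mahalanobis_sq(x, y, [[2.0, 0.0], [0.0, 1.0]])

# Rank-one example: keep only the first coordinate (a low-dimensional metric).
d_low = mahalanobis_sq(x, y, [[1.0, 0.0]])
```

Using a rectangular `L` corresponds to the low-dimensional representations mentioned in the abstract: distances are computed after projecting, so nearest-neighbor search operates in the smaller space.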

[Full paper] [Discussion]

paper ID: 216

### Nu-Support Vector Machine as Conditional Value-at-Risk Minimization

* Akiko Takeda and Masashi Sugiyama*

The nu-support vector classification (nu-SVC) algorithm was shown to work well and provide intuitive interpretations, e.g., the parameter nu roughly specifies the fraction of support vectors. Although nu corresponds to a fraction, it cannot take the entire range between 0 and 1 in its original form. This problem was settled by a non-convex extension of nu-SVC and the extended method was experimentally shown to generalize better than original nu-SVC. However, its good generalization performance and convergence properties of the optimization algorithm have not been studied yet. In this paper, we provide new theoretical insights into these issues and propose a novel nu-SVC algorithm that has guaranteed generalization performance and convergence properties.

[Full paper] [Discussion]

paper ID: 229

### Manifold Alignment using Procrustes Analysis

* Chang Wang and Sridhar Mahadevan*

In this paper we introduce a novel approach to manifold alignment, based on Procrustes analysis. Our approach differs from "semi-supervised alignment" in that it results in a mapping that is defined everywhere - when used with a suitable dimensionality reduction method - rather than just on the training data points. We describe and evaluate our approach both theoretically and experimentally, providing results showing useful knowledge transfer from one domain to another. Novel applications of our method including cross-lingual information retrieval and transfer learning in Markov decision processes are presented.

[Full paper] [Discussion]

paper ID: 236

### A Decoupled Approach to Exemplar-based Unsupervised Learning

* Sebastian Nowozin and Gökhan Bakir*

A recent trend in exemplar based unsupervised learning is to formulate the learning problem as a convex optimization problem. Convexity is achieved by restricting the set of possible prototypes to training exemplars. In particular, this has been done for clustering, vector quantization and mixture model density estimation. In this paper we propose a novel algorithm that is theoretically and practically superior to these convex formulations. This is possible by posing the unsupervised learning problem as a single convex "master problem" with non-convex subproblems. We show that for the above learning tasks the subproblems are extremely well-behaved and can be solved efficiently.

[Full paper] [Discussion]

paper ID: 237

### Laplace Maximum Margin Markov Networks

* Jun Zhu, Eric Xing, and Bo Zhang*

Learning sparse Markov networks based on the maximum margin principle remains an open problem in structured prediction. In this paper, we propose the Laplace max-margin Markov network (LapM³N), and a general class of Bayesian M³N (BM³N) of which the LapM³N is a special case and enjoys a sparse representation. The BM³N is built on a novel *Structured Maximum Entropy Discrimination* (SMED) formalism, which offers a general framework for combining Bayesian learning and max-margin learning of log-linear models for structured prediction, and it subsumes the unsparsified M³N as a special case. We present an efficient iterative learning algorithm based on variational approximation and existing convex optimization methods employed in M³N. We show that our method outperforms competing ones on both synthetic and real OCR data.

[Full paper] [Discussion]

paper ID: 241

### Gaussian Process Product Models for Nonparametric Nonstationarity

* Ryan Adams and Oliver Stegle*

Stationarity is often an unrealistic prior assumption for Gaussian process regression. One solution is to predefine an explicit nonstationary covariance function, but such covariance functions can be difficult to specify and require detailed prior knowledge of the nonstationarity. We propose the Gaussian process product model (GPPM) which models data as the pointwise product of two latent Gaussian processes to nonparametrically infer nonstationary variations of amplitude. This approach differs from other nonparametric approaches to covariance function inference in that it operates on the outputs rather than the inputs, resulting in a significant reduction in computational cost and required data for inference, while improving scalability to high-dimensional input spaces. We present an approximate inference scheme using Expectation Propagation. This variational approximation yields convenient GP hyperparameter selection and compact approximate predictive distributions.

[Full paper] [Discussion]

paper ID: 242

### Prediction with Expert Advice for the Brier Game

* Vladimir Vovk and Fedor Zhdanov*

We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches. The theoretical performance guarantee turns out to be rather tight on these data sets, especially in the case of the more extensive tennis data.
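
The loss function of the Brier game is the squared Euclidean distance between the predicted probability vector and the indicator vector of the realized outcome. The sketch below shows only that loss, under this standard definition; the function name and example forecasts are assumptions, and the aggregating algorithm and substitution function from the paper are not reproduced.

```python
def brier_loss(prediction, outcome):
    """Brier loss of a probability forecast.

    prediction : list of probabilities, one per possible outcome
    outcome    : index of the outcome that actually occurred
    """
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(prediction))


# A match with two possible results, outcome 0 occurs (illustrative values).
uninformed = brier_loss([0.5, 0.5], 0)
perfect = brier_loss([1.0, 0.0], 0)
worst = brier_loss([0.0, 1.0], 0)
```

The loss ranges from 0 for a confident correct forecast to 2 for a confident wrong one, so a bound on cumulative loss relative to the best expert has a natural scale.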

[Full paper] [Discussion]

paper ID: 254

### Stability of Transductive Regression Algorithms

* Corinna Cortes, Mehryar Mohri, Dmitry Pechyony, and Ashish Rastogi*

This paper uses the notion of algorithmic stability to derive novel generalization bounds for several families of transductive regression algorithms, both by using convexity and closed-form solutions. Our analysis helps compare the stability of these algorithms. It suggests that several existing algorithms might not be stable but prescribes a technique to make them stable. It also reports the results of experiments with local transductive regression demonstrating the benefit of our stability bounds for model selection, in particular for determining the radius of the local neighborhood used by the algorithm.

[Full paper] [Discussion]

paper ID: 257

### Learning All Optimal Policies with Multiple Criteria

* Leon Barrett and Srinivas Narayanan*

We describe an algorithm for learning in the presence of multiple criteria. Our technique generalizes previous approaches in that it can learn optimal policies for any linear preference assignment over the multiple reward criteria. The algorithm can be viewed as an extension to standard reinforcement learning for MDPs where instead of repeatedly backing up maximal expected rewards, we back up the set of expected rewards that are maximal for some set of linear preferences (given by a weight vector, w). We present the algorithm, along with a proof of correctness showing that our solution gives the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm for a specific weight vector.

[Full paper] [Discussion]

paper ID: 258

### Random Classification Noise Defeats All Convex Potential Boosters

* Philip M. Long and Rocco A. Servedio*

A broad class of boosting algorithms can be interpreted as performing coordinate-wise gradient descent to minimize some potential function of the margins of a data set. This class includes AdaBoost, LogitBoost, and other widely used and well-studied boosters. In this paper we show that for a broad class of convex potential functions, any such boosting algorithm is highly susceptible to random classification noise. We do this by showing that for any such booster and any nonzero random classification noise rate R, there is a simple data set of examples which is efficiently learnable by such a booster if there is no noise, but which cannot be learned to accuracy better than 1/2 if there is random classification noise at rate R. This negative result is in contrast with known branching program based boosters which do not fall into the convex potential function framework and which can provably learn to high accuracy in the presence of random classification noise.

[Full paper] [Discussion]

paper ID: 259

### Non-Parametric Policy Gradients: A Unified Treatment of Propositional and Relational Domains

* Kristian Kersting and Kurt Driessens*

Policy gradient approaches are a powerful instrument for learning how to interact with the environment. Existing approaches have focused on propositional and continuous domains only. Without extensive feature engineering, it is difficult -- if not impossible -- to apply them within structured domains, in which e.g. there is a varying number of objects and relations among them. In this paper, we describe a non-parametric policy gradient approach -- called NPPG -- that overcomes this limitation. The key idea is to apply Friedman's gradient boosting: policies are represented as a weighted sum of regression models grown in a stage-wise optimization. Employing off-the-shelf regression learners, NPPG can deal with propositional, continuous, and relational domains in a unified way. Our experimental results show that it can even improve on established results.

[Full paper] [Discussion]

paper ID: 260

### On Partial Optimality in Multi-label MRFs

* Pushmeet Kohli, Alexander Shekhovtsov, Carsten Rother, Vladimir Kolmogorov, and Philip Torr*

We consider the problem of optimizing multi-label MRFs, which is in general NP-hard and ubiquitous in low-level computer vision. One approach for its solution is to formulate it as an integer programming problem and relax the integrality constraints. The approach we consider in this paper is to first convert the multi-label MRF into an equivalent binary-label MRF and then to relax it. Our key contribution is a theoretical study of this new relaxation. We also show how this approach can be used in combination with recently developed optimization techniques based on roof-duality which have the desired property that a partial (or sometimes the complete) optimal solution of the binary MRF can be found. This property enables us to localize (restrict) the range of labels where the optimal label for any random variable of the multi-label MRF lies. In many cases these localizations lead to a partially optimal solution of the multi-label MRF. Further, running standard MRF solvers, e.g. TRW-S, on this restricted energy is much faster than running them on the original unrestricted energy. We demonstrate the use of our methods on challenging computer vision problems. Our experimental results show that methods derived from our study outperform competing methods for minimizing multi-label MRFs.

[Full paper] [Discussion]

paper ID: 264

### Learning Diverse Rankings with Multi-Armed Bandits

* Filip Radlinski, Robert Kleinberg, and Thorsten Joachims*

Algorithms for learning to rank Web documents usually assume a document's relevance is independent of other documents. This leads to learned ranking functions that produce rankings with redundant results. In contrast, user studies have shown that diversity at high ranks is often preferred. We present two new learning algorithms that directly learn a diverse ranking of documents based on users' clicking behavior. We show that these algorithms minimize abandonment, or alternatively, maximize the probability that a relevant document is found in the top k positions of a ranking. We show that one of our algorithms asymptotically achieves the best possible payoff obtainable in polynomial time even as users' interests change. The other performs better empirically when user interests are static, and is still theoretically near-optimal in that case.

[Full paper] [Discussion]

paper ID: 266

### SVM Optimization: Inverse Dependence on Training Set Size

* Shai Shalev-Shwartz and Nathan Srebro*

We discuss how the runtime of SVM optimization should decrease as the size of the training data increases. We present theoretical and empirical results demonstrating how a simple subgradient descent approach indeed displays such behavior, at least for linear kernels.

[Full paper] [Discussion]

paper ID: 270

### A Least Squares Formulation for Canonical Correlation Analysis

* Liang Sun, Shuiwang Ji, and Jieping Ye*

Canonical Correlation Analysis (CCA) is a well-known technique for finding the correlations between two sets of multi-dimensional variables. It projects both sets of variables into a lower-dimensional space in which they are maximally correlated. CCA is commonly applied for supervised dimensionality reduction, in which one of the multi-dimensional variables is derived from the class label. It has been shown that CCA can be formulated as a least squares problem in the binary-class case. However, their relationship in the more general setting remains unclear. In this paper, we show that, under a mild condition which tends to hold for high-dimensional data, CCA in multi-label classifications can be formulated as a least squares problem. Based on this equivalence relationship, we propose several CCA extensions including sparse CCA using 1-norm regularization. Experiments on multi-label data sets confirm the established equivalence relationship. Results also demonstrate the effectiveness of the proposed CCA extensions.

[Full paper] [Discussion]

paper ID: 272

### Learning from Incomplete Data with Infinite Imputations

* Uwe Dick, Peter Haider, and Tobias Scheffer*

We address the problem of learning decision functions from training data in which some attribute values are unobserved. This problem can arise, for instance, when training data is aggregated from multiple sources, and some sources record only a subset of attributes. We derive a joint optimization problem for the final classifier in which the distribution governing the missing values is a free parameter. We show that the optimal solution concentrates the density mass on finitely many atoms, and provide a corresponding algorithm for learning from incomplete data. We report on empirical results on benchmark data, and on the email spam application that motivates the problem setting.

[Full paper] [Discussion]

paper ID: 277

### Nonextensive Entropic Kernels

* Andre F. T. Martins, Mario A. T. Figueiredo, Pedro M. Q. Aguiar, Noah A. Smith, and Eric P. Xing*

Positive definite kernels on probability measures have been recently applied in structured data classification problems. Some of these kernels are related to classic information theoretic quantities, such as mutual information and the Jensen-Shannon divergence. Meanwhile, driven by recent advances in Tsallis statistics, nonextensive generalizations of Shannon’s information theory have been proposed. This paper bridges these two trends. We introduce the Jensen-Tsallis q-difference, a generalization of the Jensen-Shannon divergence. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, Jensen-Shannon, and linear kernels as particular cases. We illustrate the performance of these kernels on text categorization tasks.

[Full paper] [Discussion]

paper ID: 278

### A Distance Model for Rhythms

* Jean-Francois Paiement, Yves Grandvalet, Samy Bengio, and Douglas Eck*

Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
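To make the notion of Hamming distances between rhythm subsequences concrete, here is a minimal sketch. It assumes a rhythm is encoded as a sequence of symbols on a fixed time grid (here 1 for a note onset and 0 otherwise) and is split into equal-length subsequences; this encoding and the helper names are illustrative assumptions, not the representation used in the paper.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_subsequence_distances(rhythm, length):
    """Hamming distances between all pairs of consecutive, non-overlapping
    subsequences of the given length (any trailing remainder is ignored)."""
    parts = [
        rhythm[start:start + length]
        for start in range(0, len(rhythm) - length + 1, length)
    ]
    return {
        (i, j): hamming_distance(parts[i], parts[j])
        for i in range(len(parts))
        for j in range(i + 1, len(parts))
    }


# Hypothetical rhythm of three 4-step subsequences.
rhythm = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
distances = pairwise_subsequence_distances(rhythm, 4)
```

Here the first two subsequences are identical (distance 0), while the third differs from each of them in two positions.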

[Full paper] [Discussion]

paper ID: 279

### Training Structural SVMs when Exact Inference is Intractable

* Thomas Finley and Thorsten Joachims*

While discriminative training (e.g., CRF, structural SVM) holds much promise for machine translation, image segmentation, and clustering, the complex inference these applications require makes exact training intractable. This leads to a need for approximate training methods. Unfortunately, knowledge about how to perform efficient and effective approximate training is limited. Focusing on structural SVMs, we provide and explore algorithms for two different classes of approximate training algorithms, which we call undergenerating (e.g., greedy) and overgenerating (e.g., relaxations) algorithms. We provide a theoretical and empirical analysis of both types of approximately trained structural SVMs, focusing on fully connected pairwise Markov random fields. We find that models trained with overgenerating methods have theoretical advantages over undergenerating methods, are empirically robust relative to their undergenerating brethren, and that relaxed trained models favor non-fractional predictions from relaxed predictors.

[Full paper] [Discussion]

paper ID: 290

### Active Reinforcement Learning

* Arkady Epshteyn, Adam Vogel, and Gerald DeJong*

When the transition probabilities and rewards of a Markov Decision Process (MDP) are known, the agent can obtain the optimal policy without any interaction with the environment. However, exact transition probabilities are difficult for experts to specify. One option left to an agent is a long and potentially costly exploration of the environment. In this paper, we propose another alternative: given an initial (possibly inaccurate) specification of the MDP, the agent determines the sensitivity of the optimal policy to changes in transitions and rewards. It then focuses its exploration on the regions of space to which the optimal policy is most sensitive. We show that the proposed exploration strategy performs well on several control and planning problems.

[Full paper] [Discussion]

paper ID: 296

### Graph Transduction via Alternating Minimization

* Jun Wang, Tony Jebara, and Shih-Fu Chang*

Graph transduction methods label input data by learning a classification function that is regularized to exhibit smoothness along a graph over labeled and unlabeled samples. In practice, these algorithms are sensitive to the initial set of labels provided by the user. For instance, classification accuracy drops if the training set contains weak labels, if imbalances exist across label classes or if the labeled portion of the data is not chosen at random. This paper introduces a propagation algorithm that more reliably minimizes a cost function over both a function on the graph and a binary label matrix. The cost function generalizes prior work in graph transduction and also introduces node normalization terms for resilience to label imbalances. We demonstrate that global minimization of the function is intractable but instead provide an alternating minimization scheme that incrementally adjusts the function and the labels towards a reliable local minimum. Unlike prior methods, the resulting propagation of labels does not prematurely commit to an erroneous labeling and obtains more consistent labels. Experiments are shown for synthetic and real classification tasks including digit and text recognition. A substantial improvement in accuracy compared to state of the art semi-supervised methods is achieved. The advantages are even more dramatic when labeled instances are limited.

[Full paper] [Discussion]

paper ID: 304

### Learning to Sportscast: A Test of Grounded Language Acquisition

* David Chen and Raymond Mooney*

We present a novel commentator system that learns language from sportscasts of simulated soccer games. The system learns to parse and generate commentaries without any engineered knowledge about the English language. Training is done using only ambiguous supervision in the form of textual human commentaries and simulation states of the soccer games. The system simultaneously tries to establish correspondences between the commentaries and the simulation states as well as build a translation model. We also present a novel algorithm, Iterative Generation Strategy Learning (IGSL), for deciding which events to comment on. Human evaluations of the generated commentaries indicate they are of reasonable quality compared to human commentaries.

[Full paper] [Discussion]

paper ID: 305

### An HDP-HMM for Systems with State Persistence

* Emily Fox, Erik Sudderth, Michael Jordan, and Alan Willsky*

The hierarchical Dirichlet process hidden Markov model (HDP-HMM) is a flexible, nonparametric model which allows state spaces of unknown size to be learned from data. We demonstrate some limitations of the original HDP-HMM formulation, and propose a sticky extension which allows more robust learning of smoothly varying dynamics. Using DP mixtures, this formulation also allows learning of more complex, multimodal emission distributions. We further develop a sampling algorithm that employs a truncated approximation of the DP to jointly resample the full state sequence, greatly improving mixing rates. Via extensive experiments with synthetic data and the NIST speaker diarization database, we demonstrate the advantages of our sticky extension, and the utility of the HDP-HMM in real-world applications.

[Full paper] [Discussion]

paper ID: 311

### Fully Distributed EM for Very Large Datasets

* Jason Wolfe, Aria Haghighi, and Dan Klein*

In EM and related algorithms, E-step computations distribute easily, because data items are independent given parameters. For very large data sets, however, even storing all of the parameters in a single node for the M-step can be impractical. We present a framework which fully distributes the entire EM procedure. Each node interacts with only parameters relevant to its data, sending messages to other nodes along a junction-tree topology. We demonstrate improvements over a MapReduce approach, on two tasks: word alignment and topic modeling.

[Full paper] [Discussion]

paper ID: 312

### Grassmann Discriminant Analysis: a Unifying View on Subspace-Based Learning

* Jihun Hamm and Daniel Lee*

In this paper we propose a discriminant learning framework for problems in which data consist of linear subspaces instead of vectors. By treating subspaces as basic elements, we can make learning algorithms adapt naturally to the problems with linear invariant structures. We propose a unifying view on the subspace-based learning method by formulating the problems on the Grassmann manifold, which is the set of fixed-dimensional subspaces of a Euclidean space. Previous methods on the problem typically adopt an inconsistent strategy: feature extraction is performed in the Euclidean space while non-Euclidean dissimilarity measures are used. In our approach, we treat each subspace as a point in the Grassmann space, and perform feature extraction and classification in the same space. We show feasibility of the approach by using the Grassmann kernel functions such as the Projection kernel and the Binet-Cauchy kernel. Experiments with real image databases show that the proposed method performs well compared with state-of-the-art algorithms.

[Full paper] [Discussion]

paper ID: 317

### On-line Discovery of Temporal-Difference Networks

* Takaki Makino and Toshihisa Takagi*

We present an algorithm for on-line, incremental discovery of temporal-difference (TD) networks. The key contribution is the establishment of three criteria to expand a node in a TD network: a node is expanded when the node is well-known, independent, and has a prediction error that requires further explanation. Since none of these criteria requires centralized calculation operations, they are easily computed in a parallel and distributed manner, and are scalable to bigger problems compared to other discovery methods of predictive state representations. Through computer experiments, we demonstrate the empirical effectiveness of our algorithm.

[Full paper] [Discussion]

paper ID: 318

### A Reproducing Kernel Hilbert Space Framework for Pairwise Time Series Distances

* Zhengdong Lu, Todd K. Leen, Yonghong Huang, and Deniz Erdogmus*

A good distance measure for time series needs to properly incorporate the temporal structure, and should be applicable to sequences with unequal lengths. In this paper, we propose a distance measure as a principled solution to the two requirements. Unlike the conventional feature vector representation, our approach represents each time series with a summarizing smooth curve in a reproducing kernel Hilbert space (RKHS), and therefore translates the distance between time series into distances between curves. Moreover we propose to learn the kernel of this RKHS from a population of time series with discrete observations using Gaussian process-based non-parametric mixed-effect models. Experiments on two vastly different real-world problems show that the proposed distance measure leads to improved classification accuracy over the conventional distance measures.

[Full paper] [Discussion]

paper ID: 322

### Confidence-Weighted Linear Classification

* Mark Dredze, Koby Crammer, and Fernando Pereira*

We introduce confidence-weighted linear classifiers, a new class of algorithms that maintain confidence information about classifier parameters. Learning in this framework updates parameters by estimating weights and increasing model confidence. We investigate a new online algorithm that maintains a Gaussian distribution over weight vectors, updating the mean and variance of the model with each instance. Empirical evaluation on a range of NLP tasks shows that our algorithm improves over other state of the art online and batch methods, learns faster in the online setting, and lends itself to better classifier combination after parallel training.

[Full paper] [Discussion]

paper ID: 323

### On the Chance Accuracies of Large Collections of Classifiers

* Mark Palatucci and Andrew Carlson*

We provide a theoretical analysis of the chance accuracies of large collections of classifiers. We show that on problems with small numbers of examples, some classifier can perform well by random chance, and we derive a theorem to explicitly calculate this accuracy. We use this theorem to provide a principled feature selection criterion for sparse, high-dimensional problems. We evaluate this method on both microarray and fMRI datasets and show that it performs very close to the optimal accuracy obtained from an oracle. We also show that on the fMRI dataset this technique chooses relevant features successfully while another state-of-the-art method, the False Discovery Rate (FDR), completely fails at standard significance levels.

[Full paper] [Discussion]

paper ID: 324

### Hierarchical sampling for active learning

* Sanjoy Dasgupta and Daniel Hsu*

We present an active learning scheme that exploits cluster structure in data.

[Full paper] [Discussion]

paper ID: 327

### Efficiently Solving Convex Relaxations for MAP Estimation

* Pawan Kumar Mudigonda and Philip Torr*

The problem of obtaining the maximum a posteriori (MAP) estimate of a discrete random field is of fundamental importance in many areas of Computer Science. In this work, we build on the tree reweighted message passing (TRW) framework of Kolmogorov and Wainwright et al. TRW iteratively optimizes the Lagrangian dual of a linear programming relaxation for MAP estimation. We show how the dual formulation of TRW can be extended to include linear cycle inequalities. We then consider the inclusion of some recently proposed second order cone (SOC) constraints in the dual. We propose efficient iterative algorithms for solving the resulting duals. Similar to the method described by Kolmogorov, these methods are guaranteed to converge. We test our algorithms on a large set of synthetic data, as well as real data. Our experiments show that the additional constraints (i.e. cycle inequalities and SOC constraints) provide better results in cases where the TRW framework fails (namely MAP estimation for non-submodular energy functions).

[Full paper] [Discussion]

paper ID: 331

### Boosting with Incomplete Information

* Gholamreza Haffari, Yang Wang, Shaojun Wang, Greg Mori, and Feng Jiao*

In real-world machine learning problems, it is very common that part of the input feature vector is incomplete: either not available, missing, or corrupted. In this paper, we present a boosting approach that integrates features with incomplete information and those with complete information to form a strong classifier. By introducing hidden variables to model missing information, we form loss functions that combine fully labeled data with partially labeled data to effectively learn normalized and unnormalized models. The primal problems of the proposed optimization problems with these loss functions are provided to show their close relationships and the motivations behind them. We use auxiliary functions to bound the change of the loss functions and derive explicit parameter update rules for the learning algorithms. We demonstrate encouraging results on two real-world problems - visual object recognition in computer vision and named entity recognition in natural language processing - to show the effectiveness of the proposed boosting approach.

[Full paper] [Discussion]

paper ID: 335

### Privacy-Preserving Reinforcement Learning

* Jun Sakuma, Shigenobu Kobayashi, and Rebecca Wright*

Distributed reinforcement learning (DRL) has been studied as an approach to learn control policies through interactions between distributed agents and environments. The main emphasis of DRL has been put on the way to learn sub-optimal policies with the least or limited sharing of agents' perceptions. In this study, we introduce a new concept, privacy-preservation, into DRL. In our setting, agents' perceptions, such as states, rewards, and actions, are not only distributed but also are desired to be kept private. This can occur when agents' perceptions include private or confidential information. Conventional DRL algorithms could be applied to such problems, but do not theoretically guarantee privacy preservation. We design solutions that achieve optimal policies in standard reinforcement learning settings without requiring the agents to share their private information, by means of a well-known cryptographic primitive, secure function evaluation.

[Full paper] [Discussion]

paper ID: 337

### Estimating Labels from Label Proportions

* Novi Quadrianto, Alex Smola, Tiberio Caetano, and Quoc Viet Le*

Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like e-commerce, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice.

[Full paper] [Discussion]

paper ID: 340

### Deep Learning via Semi-Supervised Embedding

* Jason Weston, Frederic Ratle, and Ronan Collobert*

We show how nonlinear embedding algorithms popular for use with shallow semi-supervised learning techniques such as kernel methods can be applied to deep multi-layer architectures, either as a regularizer at the output layer, or on each layer of the architecture. This provides a simple alternative to existing approaches to *deep* learning whilst yielding competitive error rates compared to those methods, and existing *shallow* semi-supervised techniques.

[Full paper] [Discussion]

paper ID: 341

### Online Kernel Selection for Bayesian Reinforcement Learning

* Joseph Reisinger, Peter Stone, and Risto Miikkulainen*

Kernel-based Bayesian methods for Reinforcement Learning (RL) such as Gaussian Process Temporal Difference (GPTD) are particularly promising because they rigorously treat uncertainty in the value function and make it easy to specify prior knowledge. However, the choice of prior distribution significantly affects the empirical performance of the learning agent, and little work has been done extending existing methods for prior model selection to the online setting. This paper develops Replacing-Kernel RL, an online model selection method for GPTD using population-based search. Replacing-Kernel RL is compared to standard GPTD and tile-coding on several RL domains, and is shown to yield significantly better asymptotic performance for many different kernel families. Furthermore, the resulting kernels capture an intuitively useful notion of prior state covariance that may nevertheless be difficult to capture manually.

[Full paper] [Discussion]

paper ID: 343

### Unsupervised Rank Aggregation with Distance-Based Models

* Alexandre Klementiev, Dan Roth, and Kevin Small*

The need to meaningfully combine sets of rankings often comes up when one deals with ranked data. Although a number of heuristic and supervised learning approaches to rank aggregation exist, they require domain knowledge or supervised ranked data, both of which are expensive to acquire. In order to address these limitations, we propose a mathematical and algorithmic framework for learning to aggregate (partial) rankings without supervision. We instantiate the framework for the cases of combining permutations and combining top-k lists, and propose a novel metric for the latter. Experiments in both scenarios demonstrate the effectiveness of the proposed formalism.

[Full paper] [Discussion]

paper ID: 355

### The Projectron: a Bounded Kernel-Based Perceptron

* Francesco Orabona, Joseph Keshet, and Barbara Caputo*

We present a discriminative online algorithm with a bounded memory growth, which is based on the kernel-based Perceptron. Generally, the required memory of the kernel-based Perceptron for storing the online hypothesis is not bounded. Previous work has been focused on discarding part of the instances in order to keep the memory bounded. In the proposed algorithm the instances are not discarded, but projected onto the space spanned by the previous online hypothesis. We derive a relative mistake bound and compare our algorithm both analytically and empirically to the state-of-the-art Forgetron algorithm (Dekel et al, 2007). The first variant of our algorithm, called Projectron, outperforms the Forgetron. The second variant, called Projectron++, outperforms even the Perceptron.

[Full paper] [Discussion]

paper ID: 361

### Efficient Projections onto the L1-Ball for Learning in High Dimensions

* John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra*

We describe efficient algorithms for projecting a vector onto the L1-ball. We present two methods for projection. The first performs exact projection in O(n) time, where n is the dimension of the space. The second works on vectors k of whose elements are perturbed outside the L1-ball, projecting in O(k log(n)) time. This setting is especially useful for online learning in sparse feature spaces such as text categorization applications. We demonstrate the merits and effectiveness of our algorithms in numerous batch and online learning tasks. We show that variants of stochastic gradient projection methods augmented with our efficient projection procedures outperform state-of-the-art optimization techniques such as interior point methods. We also show that in online settings gradient updates with L1 projections outperform the EG algorithm while obtaining models with high degrees of sparsity.
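For orientation, the following is a minimal sketch of Euclidean projection onto the L1-ball using the well-known sort-and-threshold approach, which runs in O(n log n) time. It is not the O(n) method or the O(k log(n)) method described in the abstract; the function name `project_onto_l1_ball` and the `radius` parameter are assumptions made for this example.

```python
import math


def project_onto_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {w : sum(|w_i|) <= radius}."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in v]
    if sum(magnitudes) <= radius:
        return list(v)  # already inside the ball

    # Find the soft-threshold theta from the magnitudes sorted in
    # decreasing order: keep the last prefix whose smallest element
    # stays positive after subtracting the prefix's threshold.
    theta = 0.0
    running_sum = 0.0
    for count, magnitude in enumerate(sorted(magnitudes, reverse=True), start=1):
        running_sum += magnitude
        candidate = (running_sum - radius) / count
        if magnitude - candidate > 0:
            theta = candidate

    # Shrink every coordinate towards zero by theta, keeping its sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


projected = project_onto_l1_ball([3.0, -1.0, 0.5], radius=1.0)
```

The shrinkage step sets small coordinates exactly to zero, which is one way to see why projected-gradient updates of this kind tend to produce sparse models.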

[Full paper] [Discussion]

paper ID: 362

### Maximum Likelihood Rule Ensembles

* Wojciech Kotlowski, Krzysztof Dembczynski, and Roman Slowinski*

We propose a new rule induction algorithm for solving classification problems via probability estimation. The main advantage of decision rules is their simplicity and good interpretability. While the early approaches to rule induction were based on sequential covering, we follow an approach in which a single decision rule is treated as a base classifier in an ensemble. The ensemble is built by greedily minimizing the negative log-likelihood, which results in estimating the class conditional probability distribution. The introduced approach is compared with other decision rule induction algorithms such as SLIPPER, LRI and RuleFit.

[Full paper] [Discussion]

paper ID: 367

### Rank Minimization via Online Learning

* Raghu Meka, Prateek Jain, Constantine Caramanis, and Inderjit Dhillon*

Minimum rank problems arise frequently in machine learning applications and are notoriously difficult to solve due to the non-convex nature of the rank objective. In this paper, we present the first online learning approach for the problem of rank minimization of matrices over polyhedral sets. In particular, we present two online learning algorithms for rank minimization - our first algorithm is a multiplicative update method based on a generalized experts framework, while our second algorithm is a novel application of the online convex programming framework [Zinkevich, 2003]. In the latter, we flip the role of the decision maker by making the decision maker search over the constraint space instead of feasible points, as is usually the case in online convex programming. A salient feature of our online learning approach is that it allows us to give the first provable approximation guarantees for the rank minimization problem over polyhedral sets. We demonstrate the effectiveness of our methods on synthetic examples, and on the real-life application of low-rank kernel learning.

[Full paper] [Discussion]

paper ID: 371

### Topologically-Constrained Latent Variable Models

* Raquel Urtasun, David Fleet, Andreas Geiger, Jovan Popovic, Trevor Darrell, and Neil Lawrence*

In dimensionality reduction approaches, the data are typically embedded in a Euclidean latent space. However, for some data sets this is inappropriate. For example, in human motion data we expect latent spaces that are cylindrical or toroidal, which are poorly captured with a Euclidean space. In this paper, we present a range of approaches for embedding data in a non-Euclidean latent space. Our focus is the Gaussian Process latent variable model. In the context of human motion modeling this allows us to (a) learn models with interpretable latent directions enabling, for example, style/content separation, and (b) generalize beyond the data set enabling us to learn transitions between motion styles even though such transitions are not present in the data.

[Full paper] [Discussion]

paper ID: 377

### Tailoring Density Estimation via Reproducing Kernel Moment Matching

* Le Song, Xinhua Zhang, Alex Smola, Arthur Gretton, and Bernhard Schoelkopf*

Moment matching is a popular means of parametric density estimation. We extend this technique to nonparametric estimation of mixture models. Our approach works by embedding distributions into a reproducing kernel Hilbert space, and performing moment matching in that space. This allows us to tailor density estimators to a function class of interest (i.e., for which we would like to compute expectations). We show our density estimation approach is useful in applications such as message compression in graphical models, and image classification and retrieval.

[Full paper] [Discussion]

paper ID: 379

### Graph Kernels Between Point Clouds

* Francis Bach*

Point clouds are sets of points in two or three dimensions. Most kernel methods for learning on sets of points have not yet dealt with the specific geometrical invariances and practical constraints associated with point clouds in computer vision and graphics. In this paper, we present extensions of graph kernels for point clouds, which allow kernel methods to be applied to such objects as shapes, line drawings, or any three-dimensional point clouds. In order to design rich and numerically efficient kernels with as few free parameters as possible, we use kernels between covariance matrices and their factorizations on graphical models. We derive polynomial time dynamic programming recursions and present applications to recognition of handwritten digits and Chinese characters from few training examples.

[Full paper] [Discussion]

paper ID: 382

### Large Scale Manifold Transduction

* Michael Karlen, Jason Weston, Ayse Erkan, and Ronan Collobert*

We show how the regularizer of Transductive Support Vector Machines (TSVM) can be trained by stochastic gradient descent for linear models and multi-layer architectures. The resulting methods can be trained online, have vastly superior training and testing speed to existing TSVM algorithms, can encode prior knowledge in the network architecture, and obtain competitive error rates. We then go on to propose a natural generalization of the TSVM loss function that takes into account neighborhood and manifold information directly, unifying the two-stage Low Density Separation method into a single criterion, and leading to state-of-the-art results.

[Full paper] [Discussion]

paper ID: 383

### On Multi-View Active Learning and the Combination with Semi-Supervised Learning

* Wei Wang and Zhi-Hua Zhou*

*Multi-view learning* has become a hot topic during the past few years.
In this paper, we first characterize the sample complexity of multi-view
*active learning*. Under the *α-expansion* assumption, we
get an exponential improvement in the sample complexity from the usual
Õ(1/ε) to Õ(log 1/ε), requiring neither a strong assumption on the data
distribution (such as data distributed uniformly over the unit sphere in
ℜ^{d}) nor a strong assumption on the hypothesis class (such as linear
separators through the origin). We also give an upper bound of the error rate
when the *α-expansion* assumption does not hold. Then, we analyze
the combination of multi-view active learning and semi-supervised learning and
get a further improvement in the sample complexity. Finally, we study the
empirical behavior of the two paradigms, which verifies that the combination of
multi-view active learning and semi-supervised learning is efficient.

[Full paper] [Discussion]

paper ID: 390

### Bolasso: Model Consistent Lasso Estimation through the Bootstrap

* Francis Bach*

We consider the least-squares linear regression problem with regularization by
the *l*_{1}-norm, a problem usually referred to as the Lasso.
In this paper, we present a detailed asymptotic analysis of model consistency
of the Lasso. For various decays of the regularization parameter, we compute
asymptotic equivalents of the probability of correct model selection (i.e.,
variable selection). For a specific rate decay, we show that the Lasso selects
all the variables that should enter the model with probability tending to one
exponentially fast, while it selects all other variables with strictly positive
probability. We show that this property implies that if we run the Lasso for
several bootstrapped replications of a given sample, then intersecting the
supports of the Lasso bootstrap estimates leads to consistent model selection.
This novel variable selection algorithm, referred to as the Bolasso,
compares favorably to other linear regression methods on synthetic data and
datasets from the UCI machine learning repository.
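The intersection step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coordinate-descent Lasso solver, the fixed regularization value `lam`, the number of bootstrap replications, and the toy data are all assumptions chosen for the example.

```python
import random

def soft_threshold(value, threshold):
    """Soft-thresholding operator used by coordinate descent for the Lasso."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso(X, y, lam, n_sweeps=200):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, with w = 0 initially
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Correlation of column j with the partial residual (w_j removed).
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_wj = soft_threshold(rho, lam) / z
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_wj
    return w

def bolasso_support(X, y, lam, n_bootstrap=16, seed=0):
    """Intersect the Lasso supports over bootstrap replications of (X, y)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    support = set(range(p))
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        w = lasso([X[i] for i in idx], [y[i] for i in idx], lam)
        support &= {j for j in range(p) if w[j] != 0.0}
    return support

# Toy data (an assumption for illustration): only variables 0 and 1 matter.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(60)]
y = [3.0 * row[0] - 2.0 * row[1] + 0.1 * data_rng.gauss(0.0, 1.0) for row in X]
print(sorted(bolasso_support(X, y, lam=0.1)))
```

A variable survives only if every bootstrap replication selects it, which is how the intersection removes variables that the Lasso picks up with positive but not overwhelming probability.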

[Full paper] [Discussion]

paper ID: 391

### A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning

* Ronan Collobert and Jason Weston*

We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense (grammatically and semantically) using a language model. The entire network is trained jointly on all these tasks using weight-sharing, an instance of multitask learning. All the tasks use labeled data except the language model, which is learnt from unlabeled text and represents a novel way of performing semi-supervised learning for the shared tasks. We show how both multitask learning and semi-supervised learning improve the generalization of the shared tasks, resulting in a learnt model with state-of-the-art performance.

[Full paper] [Discussion]

paper ID: 392

### Learning Dissimilarities by Ranking: From SDP to QP

* Hua Ouyang and Alexander Gray*

We consider the problem of learning dissimilarities between points via formulations which preserve a specified ordering between points rather than the numerical values of the dissimilarities. Dissimilarity ranking (d-ranking) learns from instances like "A is more similar to B than C is to D" or "The distance between E and F is larger than that between G and H". Three formulations of d-ranking problems are presented and new algorithms are presented for two of them, one by semidefinite programming (SDP) and one by quadratic programming (QP). Among the novel capabilities of these approaches are out-of-sample prediction and scalability to large problems.

[Full paper] [Discussion]

paper ID: 396

### The Skew Spectrum of Graphs

* Risi Kondor and Karsten Borgwardt*

The central issue in representing graph-structured data instances in learning
algorithms is designing features which are invariant to permuting the
numbering of the vertices. We present a new system of invariant graph features
which we call the skew spectrum of graphs. The skew spectrum is based on
mapping the adjacency matrix to a function on the symmetric group and computing
bispectral invariants. The reduced form of the skew spectrum is computable in
O(n^{3}) time, and experiments show that on several benchmark datasets it can
outperform state of the art graph kernels.

[Full paper] [Discussion]

paper ID: 398

### Modified MMI/MPE: a Direct Evaluation of the Margin in Speech Recognition

* Georg Heigold, Thomas Deselaers, Ralf Schlueter, and Hermann Ney*

In this paper we show how common speech recognition training criteria such as the Minimum Phone Error criterion or the Maximum Mutual Information criterion can be extended to incorporate a margin term. Different margin-based training algorithms have been proposed to refine existing training algorithms for general machine learning problems. However, for speech recognition, some special problems have to be addressed, and all proposed approaches either lack practical applicability or require, through the inclusion of a margin term, significant changes to the underlying model, e.g. the optimization algorithm, the loss functions, or the parameterization of the model. In our approach, the conventional training criteria are modified to incorporate a margin term. This allows us to do large-margin training in speech recognition using the same efficient algorithms for accumulation and optimization and to use the same software as for conventional discriminative training. We show that the proposed criteria are equivalent to Support Vector Machines with suitable smooth loss functions, approximating the non-smooth hinge loss function or the hard error (e.g. phone error). Experimental results are given for two different tasks: the rather simple digit string recognition task Sietill, which severely suffers from overfitting, and the large vocabulary European Parliament Plenary Sessions English task, which is supposed to be dominated by the risk and for which generalization does not seem to be such an issue.

[Full paper] [Discussion]

paper ID: 399

### Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

* Saharon Rosset*

Modeling of conditional quantiles requires specification of the quantile being estimated and can thus be viewed as a parameterized predictive modeling problem. Quantile loss is typically used, and it is indeed parameterized by a quantile parameter. In this paper we show how to follow the path of cross validated solutions to regularized kernel quantile regression. Even though the bi-level optimization problem we encounter for every quantile is non-convex, the manner in which the optimal cross-validated solution evolves with the parameter of the loss function allows tracking of this solution. We prove this property, construct the resulting algorithm, and demonstrate it on data. This algorithm allows us to efficiently solve the whole family of bi-level problems.
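For readers unfamiliar with the quantile loss mentioned in the abstract above, here is a minimal sketch of the loss and of how its parameter selects a quantile. The helper names and the toy data are assumptions for illustration; the kernel regression and the path-following algorithm of the paper are not shown.

```python
def quantile_loss(residual, tau):
    """Quantile (pinball) loss for residual = y - prediction, with tau in (0, 1)."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual

def empirical_risk(constant, ys, tau):
    """Average quantile loss of predicting the same constant for every y."""
    return sum(quantile_loss(y - constant, tau) for y in ys) / len(ys)

def best_constant(ys, tau):
    """Search the data points for the constant that minimizes the empirical risk."""
    return min(ys, key=lambda c: empirical_risk(c, ys, tau))

ys = [1.0, 2.0, 3.0, 4.0, 100.0]
print(best_constant(ys, 0.5))  # → 3.0 (the median, insensitive to the outlier)
print(best_constant(ys, 0.9))  # → 100.0 (a high quantile)
```

Changing `tau` changes which quantile the minimizer tracks, which is why the abstract treats the family of problems as parameterized by the quantile.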

[Full paper] [Discussion]

paper ID: 400

### Fast Nearest Neighbor Retrieval for Bregman Divergences

* Lawrence Cayton*

We present a data structure enabling efficient nearest neighbor (NN) retrieval for Bregman divergences. The family of Bregman divergences includes many popular dissimilarity measures including KL-divergence (relative entropy), Mahalanobis distance, and Itakura-Saito divergence. These divergences present a challenge for efficient NN retrieval because they are not, in general, metrics, for which most NN data structures are designed. The data structure introduced in this work shares the same basic structure as the popular metric ball tree, but employs convexity properties of Bregman divergences in place of the triangle inequality. Experiments demonstrate speedups over brute-force search of up to several orders of magnitude.
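To make the family of divergences named in the abstract above concrete, the sketch below evaluates the standard definition D_f(x, y) = f(x) - f(y) - <∇f(y), x - y> for three generating functions. The function names and the example vectors are assumptions for illustration; the tree data structure from the paper is not shown.

```python
import math

def bregman_divergence(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y> for a strictly convex f."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_f(y), x, y))
    return f(x) - f(y) - inner

# f(x) = ||x||^2 gives the squared Euclidean distance.
def sq_norm(x):
    return sum(a * a for a in x)

def sq_norm_grad(x):
    return [2.0 * a for a in x]

# f(x) = sum x_i log x_i (negative entropy) gives the KL-divergence
# when x and y are probability vectors.
def neg_entropy(x):
    return sum(a * math.log(a) for a in x)

def neg_entropy_grad(x):
    return [math.log(a) + 1.0 for a in x]

# f(x) = -sum log x_i (Burg entropy) gives the Itakura-Saito divergence.
def burg(x):
    return -sum(math.log(a) for a in x)

def burg_grad(x):
    return [-1.0 / a for a in x]

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]
print(bregman_divergence(sq_norm, sq_norm_grad, p, q))          # squared Euclidean
print(bregman_divergence(neg_entropy, neg_entropy_grad, p, q))  # KL(p || q)
print(bregman_divergence(neg_entropy, neg_entropy_grad, q, p))  # KL(q || p), differs
print(bregman_divergence(burg, burg_grad, p, q))                # Itakura-Saito
```

The two KL values differ, which illustrates the lack of symmetry that keeps these divergences from being metrics.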

[Full paper] [Discussion]

paper ID: 402

### Accurate Max-margin Training for Structured Output Spaces

* Sunita Sarawagi and Rahul Gupta*

Tsochantaridis et al. (2005) proposed two formulations for maximum margin training of structured spaces: margin scaling and slack scaling. While margin scaling has been extensively used since it requires the same kind of MAP inference as normal structured prediction, slack scaling is believed to be more accurate and better-behaved. We present an efficient variational approximation to the slack scaling method that solves its inference bottleneck while retaining its accuracy advantage over margin scaling. We further argue that existing scaling approaches do not separate the true labeling comprehensively while generating violating constraints. We propose a new max-margin trainer PosLearn that generates violators to ensure separation at each position of a decomposable loss function. Empirical results on real datasets illustrate that PosLearn can reduce test error by up to 25%. Further, PosLearn violators can be generated more efficiently than slack violators; for many structured tasks the time required is just twice that of MAP inference.
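The difference between the two formulations named in the abstract above can be seen on a toy candidate set. The sketch below uses the standard constraint forms (margin scaling requires the score gap to be at least the loss; slack scaling divides the slack by the loss) and enumerates candidates by brute force. The labels, scores, and losses are hypothetical, and neither the variational approximation nor PosLearn is implemented here.

```python
def margin_scaling_violation(true_score, score, loss):
    """Slack needed under margin scaling: loss - (true_score - score)."""
    return loss - (true_score - score)

def slack_scaling_violation(true_score, score, loss):
    """Slack needed under slack scaling: loss * (1 - (true_score - score))."""
    return loss * (1.0 - (true_score - score))

def most_violating(candidates, true_score, violation):
    """Return the label whose constraint requires the largest slack."""
    return max(candidates, key=lambda c: violation(true_score, c[1], c[2]))[0]

# Hypothetical candidates: (label, model score, loss against the true labeling).
true_score = 5.0
candidates = [
    ("near_miss", 4.8, 1.0),  # scores almost as high as the truth, small loss
    ("far_off", 1.5, 5.0),    # already separated by 3.5, but large loss
]
print(most_violating(candidates, true_score, margin_scaling_violation))  # → far_off
print(most_violating(candidates, true_score, slack_scaling_violation))   # → near_miss
```

Margin scaling keeps pushing on a labeling that is already well separated because its loss is large, whereas slack scaling focuses on the labeling that is actually close to the truth in score.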

[Full paper] [Discussion]

paper ID: 411

### Optimized Cutting Plane Algorithm for Support Vector Machines

* Vojtech Franc and Soeren Sonnenburg*

We have developed a new Linear Support Vector Machine (SVM) training algorithm called OCAS. Its computational effort scales linearly with the sample size. In an extensive empirical evaluation OCAS significantly outperforms current state of the art SVM solvers, like SVMLight, SVMPerf and BMRM, achieving speedups of over 1,000 over SVMLight and 20 over SVMPerf on some datasets, while obtaining the same precise Support Vector solution. Even in the early optimization steps, OCAS often converges faster than the approximate methods SGD and Pegasos, which have so far prevailed in this domain. By effectively parallelizing OCAS, we were able to train on a dataset of size 15 million examples (itself about 32GB in size) in just 671 seconds --- a competing string kernel SVM required 97,484 seconds to train on 10 million examples sub-sampled from this dataset.

[Full paper] [Discussion]

paper ID: 412

### Learning to Learn Implicit Queries from Gaze Patterns

* Kai Puolamäki, Antti Ajanki, and Samuel Kaski*

In the absence of explicit queries, an alternative is to try to infer users' interests from implicit feedback signals, such as clickstreams or eye tracking. The interests, formulated as an implicit query, can then be used in further searches. We formulate this task as a probabilistic model, which can be interpreted as a kind of transfer learning and meta-learning. The probabilistic model is demonstrated to outperform an earlier kernel-based method in a small-scale information retrieval task.

[Full paper] [Discussion]

paper ID: 413

### Modeling Interleaved Hidden Processes

* Niels Landwehr*

Hidden Markov models assume that observations in time series data stem from some hidden process that can be compactly represented as a Markov chain. We generalize this model by assuming that the observed data stems from multiple hidden processes, whose outputs interleave to form the sequence of observations. Exact inference in this model is NP-hard. However, a tractable and effective inference algorithm is obtained by extending structured approximate inference methods used in factorial hidden Markov models. The proposed model is evaluated in an activity recognition domain, where multiple activities interleave and together generate a stream of sensor observations. It is shown to be more accurate than a standard hidden Markov model in this domain.

[Full paper] [Discussion]

paper ID: 415

### Discriminative Parameter Learning for Bayesian Networks

* Jiang Su, Harry Zhang, Charles X. Ling, and Stan Matwin*

Bayesian network classifiers have been widely used for
classification problems. Given a fixed Bayesian network structure,
parameter learning can take two different approaches: generative
and discriminative learning. While generative parameter learning is
more efficient, discriminative parameter learning is more effective.
In this paper, we propose a simple, efficient, and effective
discriminative parameter learning method, called
*Discriminative Frequency Estimate* (DFE), which learns
parameters by discriminatively computing frequencies from data.
Empirical studies show that the DFE algorithm integrates the
advantages of both generative and discriminative learning: it
performs as well as the state-of-the-art discriminative parameter
learning method ELR in accuracy, but is significantly more efficient.

[Full paper] [Discussion]

paper ID: 419

### Memory Bounded Inference in Topic Models

* Ryan Gomes, Max Welling, and Pietro Perona*

What type of algorithms and statistical techniques support learning from very large datasets over long stretches of time? We address this question through a memory bounded version of a variational EM algorithm that approximates inference of a topic model. The algorithm alternates two phases: "model building" and "model compression" in order to always satisfy a given memory constraint. The model building phase grows its internal representation (the number of topics) as more data arrives through Bayesian model selection. Compression is achieved by merging data-items in clumps and only caching their sufficient statistics. Empirically, the resulting algorithm is able to handle datasets that are orders of magnitude larger than the standard batch version.

[Full paper] [Discussion]

paper ID: 429

### A Semi-parametric Statistical Approach to Model-free Policy Evaluation

* Tsuyoshi Ueno, Motoaki Kawanabe, Takeshi Mori, Shin-Ichi Maeda, and Shin Ishii*

Reinforcement learning (RL) methods based on least-squares temporal difference (LSTD) have been developed recently and have shown good practical performance. However, the quality of their estimation has not been well elucidated. In this article, we discuss LSTD based policy evaluation from the new viewpoint of semiparametric statistical inference. In fact, the estimator can be obtained from a particular estimating function which guarantees its convergence to the true value asymptotically, without specifying a model of the environment. Based on these observations, we 1) analyze the asymptotic variance of an LSTD-based estimator, 2) derive the optimal estimating function with the minimum asymptotic estimation variance, and 3) derive a suboptimal estimator to reduce the computational burden in obtaining the optimal estimating function.

[Full paper] [Discussion]

paper ID: 432

### Self-taught Clustering

* Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu*

This paper focuses on a new clustering task, called
*self-taught clustering*. Self-taught clustering is an instance
of *unsupervised transfer learning*, which aims at clustering a
small collection of target unlabeled data with the help of a large
amount of *auxiliary* unlabeled data. The target and auxiliary
data can be different in topic distribution. We show that even when
the target data are not sufficient to allow effective learning of a
high quality feature representation, it is possible to learn the
useful features with the help of the auxiliary data on which the
target data can be clustered effectively. We propose a co-clustering
based self-taught clustering algorithm to tackle this problem, by
clustering the target and auxiliary data simultaneously to allow the
feature representation from the auxiliary data to influence the
target data through a common set of features. Under the new
data representation, clustering on the target data can be improved.
Our experiments on image clustering show that our algorithm can
greatly outperform several state-of-the-art clustering methods when
utilizing irrelevant unlabeled auxiliary data.

[Full paper] [Discussion]

paper ID: 437

### Active Kernel Learning

* Steven C.H. Hoi and Rong Jin*

Identifying the appropriate kernel function/matrix for a given dataset is essential to all kernel-based learning techniques. In the past, a number of kernel learning algorithms have been proposed to learn kernel functions or matrices from side information, in the form of labeled examples or pairwise constraints. However, most previous studies are limited to the "passive" kernel learning in which the side information is provided beforehand. In this paper we present a framework of "Active Kernel Learning" (AKL) that is able to actively identify the most informative pairwise constraints for kernel learning. The key challenge of active kernel learning is how to measure the informativeness of each example pair given its class label is unknown. To this end, we propose a min-max approach for active kernel learning that selects the example pairs that will lead to the largest classification margin even when the class assignments to the selected pairs are incorrect. We furthermore approximate the related optimization problem into a convex programming problem. We evaluate the effectiveness of the proposed active kernel learning algorithm by comparing it with two other implementations of active kernel learning. Empirical study with nine datasets on data clustering shows that the proposed algorithm is considerably more effective than its competitors.

[Full paper] [Discussion]

paper ID: 440

### Sequence Kernels for Predicting Protein Essentiality

* Cyril Allauzen, Mehryar Mohri, and Ameet Talwalkar*

The problem of identifying the minimal gene set required to sustain life is of crucial importance in understanding cellular mechanisms and designing therapeutic drugs. This work describes several kernel-based solutions for predicting essential genes that outperform existing models while using less training data. Our first solution is based on a semi-manually designed kernel derived from the Pfam database, which includes several Pfam domains. We then present novel and general domain-based sequence kernels that capture sequence similarity with respect to several domains made of large sets of protein sequences. We show how to deal with the large size of the problem – several thousands of domains with individual domains sometimes containing thousands of sequences – by representing and efficiently computing these kernels using automata. We report results of extensive experiments demonstrating that they compare favorably with the Pfam kernel in predicting protein essentiality, while requiring no manual tuning.

[Full paper] [Discussion]

paper ID: 448

### Optimizing Estimated Loss Reduction for Active Sampling in Rank Learning

* Pinar Donmez and Jaime Carbonell*

Learning to rank is becoming an increasingly popular research area in machine learning. The ranking problem aims to induce an ordering or preference relations among a set of instances in the input space. However, collecting labeled data is growing into a burden in many rank applications since labeling requires eliciting the relative ordering over the set of alternatives. In this paper, we propose a novel active learning framework for SVM-based and boosting-based rank learning. Our approach suggests sampling based on maximizing the estimated loss differential over unlabeled data. Experimental results on two benchmark corpora show that the proposed model substantially reduces the labeling effort, and achieves superior performance rapidly with as much as 30% relative improvement over the margin-based sampling baseline.

[Full paper] [Discussion]

paper ID: 449

### Robust Matching and Recognition using Context-Dependent Kernels

* Hichem Sahbi, Jean-Yves Audibert, Jaonary Rabarisoa, and Renaud Keriven*

The success of kernel methods including support vector machines (SVMs) strongly depends on the design of appropriate kernels. While initially kernels were designed in order to handle fixed-length data, their extension to unordered, variable-length data became more than necessary for real pattern recognition problems such as object recognition and bioinformatics. We focus in this paper on object recognition using a new type of kernel referred to as "context-dependent". Objects, seen as constellations of local features (interest points, regions, etc.), are matched by minimizing an energy function mixing (1) a fidelity term which measures the quality of feature matching, (2) a neighborhood criterion which captures the object geometry and (3) a regularization term. We will show that the fixed-point of this energy is a "context-dependent" kernel ("CDK") which also satisfies the Mercer condition. Experiments conducted on object recognition show that when plugging our kernel in SVMs, we clearly outperform SVMs with "context-free" kernels.

[Full paper] [Discussion]

paper ID: 452

### Learning for Control from Multiple Demonstrations

* Adam Coates, Pieter Abbeel, and Andrew Ng*

We consider the problem of learning to follow a desired trajectory when given a small number of demonstrations from a sub-optimal expert. We present an algorithm that (i) extracts the---initially unknown---desired trajectory from the sub-optimal expert's demonstrations and (ii) learns a local model suitable for control along the learned trajectory. We apply our algorithm to the problem of autonomous helicopter flight. In all cases, the autonomous helicopter's performance exceeds that of our expert helicopter pilot's demonstrations. Even stronger, our results significantly extend the state-of-the-art in autonomous helicopter aerobatics. In particular, our results include the first autonomous tic-tocs, loops and hurricane, vastly superior performance on previously performed aerobatic maneuvers (such as in-place flips and rolls), and a complete airshow, which requires autonomous transitions between these and various other maneuvers.

[Full paper] [Discussion]

paper ID: 455

### Bayes Optimal Classification for Decision Trees

* Siegfried Nijssen*

We present the first algorithm for exact Bayes optimal classification from the hypothesis space of decision trees satisfying leaf constraints. Our contribution is that we reduce this problem to the problem of finding a rule-based classifier with appropriate weights. We show that these rules and weights can be computed in linear time from the output of a modified frequent itemset mining algorithm, which means that we can compute the classifier in practice, despite the exponential worst-case complexity. We perform experiments in which we compare the Bayes optimal predictions with those of the maximum a posteriori hypothesis.

[Full paper] [Discussion]

paper ID: 458

### Automatic Discovery and Transfer of MAXQ Hierarchies

* Neville Mehta, Soumya Ray, Prasad Tadepalli, and Thomas Dietterich*

We present an algorithm, HI-MAT (Hierarchy Induction via Models And Trajectories), that discovers MAXQ task hierarchies by applying dynamic Bayesian network models to a successful trajectory from a source reinforcement learning task. HI-MAT discovers subtasks by analyzing the causal and temporal relationships among the actions in the trajectory. Under appropriate assumptions, HI-MAT induces hierarchies that are consistent with the observed trajectory and have compact value-function tables employing safe state abstractions. We demonstrate empirically that HI-MAT constructs compact hierarchies that are comparable to manually-engineered hierarchies and facilitate significant speedup in learning when transferred to a target task.

[Full paper] [Discussion]

paper ID: 459

### Compressed Sensing and Bayesian Experimental Design

* Matthias Seeger and Hannes Nickisch*

We relate compressed sensing (CS) with Bayesian experimental design and provide a novel efficient approximate method for the latter, based on expectation propagation. In a large comparative study about linearly measuring natural images, we show that the simple standard heuristic of measuring Wavelet coefficients top-down systematically outperforms CS methods using random measurements; the sequential projection optimisation approach of [Ji & Carin 2007] performs even worse. We also show that our own approximate Bayesian method is able to learn measurement filters on full images efficiently which outperform the Wavelet heuristic. To our knowledge, ours is the first successful attempt at "learning compressed sensing" for images of realistic size. In contrast to common CS methods, our framework is not restricted to sparse signals, but can readily be applied to other notions of signal complexity or noise models. We give concrete ideas how our method can be scaled up to large signal representations.

[Full paper] [Discussion]

paper ID: 460

### Statistical Models for Partial Membership

* Katherine Heller, Sinead Williamson, and Zoubin Ghahramani*

We present a principled Bayesian framework for modeling partial memberships of data points to clusters. Unlike a standard mixture model which assumes that each data point belongs to one and only one mixture component, or cluster, a partial membership model allows data points to have fractional membership in multiple clusters. Algorithms which assign data points partial memberships to clusters can be useful for tasks such as clustering genes based on microarray data and global positioning and orbit determination. Our Bayesian Partial Membership Model (BPM) uses exponential family distributions to model each cluster, and a product of these distributions, with weighted parameters, to model each datapoint. Here the weights correspond to the degree to which the datapoint belongs to each cluster. All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo to perform inference and learning. We discuss relationships between the BPM and Latent Dirichlet Allocation, Mixed Membership models, Exponential Family PCA, and fuzzy clustering. Lastly, we show some experimental results and discuss nonparametric extensions to our model.

[Full paper] [Discussion]

paper ID: 461

### A Quasi-Newton Approach to Nonsmooth Convex Optimization

* Jin Yu, S.V.N. Vishwanathan, Simon Guenter, and Nicol Schraudolph*

We extend the well-known BFGS quasi-Newton method and its
limited-memory variant (LBFGS) to the optimization of nonsmooth
convex objectives. This is done in a rigorous fashion by
generalizing three components of BFGS to subdifferentials: The
local quadratic model, the identification of a descent
direction, and the Wolfe line search conditions.
We apply the resulting sub(L)BFGS algorithm to
*L*_{2}-regularized risk minimization with binary
hinge loss, and its direction-finding component to
*L*_{1}-regularized risk minimization with logistic loss. In both settings our generic algorithms perform comparably to or better than their counterparts in specialized state-of-the-art solvers.

[Full paper] [Discussion]

paper ID: 470

### Predicting Diverse Subsets Using Structural SVMs

* Yisong Yue and Thorsten Joachims*

In many retrieval tasks, one important goal involves retrieving a diverse set of results (e.g., documents covering a wide range of topics for a search query). First of all, this reduces redundancy, effectively presenting more information with the presented results. Secondly, search queries are often ambiguous at some level. For example, the query “Jaguar” can refer to many different topics (such as the car or the feline). A set of documents with high topic diversity ensures that fewer users abandon the query because none of the results are relevant to them. Unlike existing approaches to learning retrieval functions, we present a method that explicitly trains to diversify results. In particular, we formulate the learning problem of predicting a diverse subset and derive a training algorithm based on structural SVMs.

[Full paper] [Discussion]

paper ID: 476

### Improved Nystrom Low-Rank Approximation and Error Analysis

* Kai Zhang, Ivor Tsang, and James Kwok*

Low-rank matrix approximation is an effective tool in alleviating the memory
and computational burdens of kernel methods, and sampling, as the mainstream of
such algorithms, has drawn considerable attention in both theory and practice.
This paper presents detailed studies on the Nystrom sampling scheme and in
particular, an error analysis that directly relates the Nystrom approximation
quality with the encoding powers of the landmark points in summarizing the
data. The resultant error bound suggests a simple and efficient sampling
scheme, the *k*-means clustering algorithm, for Nystrom low-rank approximation.
We compare it with state-of-the-art approaches that range from greedy schemes
to probabilistic sampling. Our algorithm achieves significant performance gains
in a number of supervised/unsupervised learning tasks including kernel PCA and
least squares SVM.
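The sampling scheme suggested in the abstract above, using k-means centers as landmark points for the Nystrom approximation, can be sketched on a toy problem. To stay self-contained, this example is restricted to scalar data and exactly two landmarks so that the landmark kernel matrix can be inverted in closed form; the Gaussian kernel, `gamma`, and all names are assumptions for illustration, and the two landmarks must be distinct.

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian kernel on scalars."""
    return math.exp(-gamma * (a - b) ** 2)

def two_means(xs, n_iter=20):
    """Lloyd's k-means with k = 2 on scalars, initialized at the min and max."""
    centers = [min(xs), max(xs)]
    for _ in range(n_iter):
        groups = ([], [])
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def nystrom(xs, landmarks, gamma=1.0):
    """Nystrom approximation E W^{-1} E^T for exactly two distinct landmarks."""
    c0, c1 = landmarks
    w00, w01, w11 = rbf(c0, c0, gamma), rbf(c0, c1, gamma), rbf(c1, c1, gamma)
    det = w00 * w11 - w01 * w01
    # Rows of E hold kernel values between each data point and the landmarks.
    E = [(rbf(x, c0, gamma), rbf(x, c1, gamma)) for x in xs]
    # Rows of E W^{-1}, using the closed-form inverse of the 2x2 matrix W.
    EW = [((e0 * w11 - e1 * w01) / det, (e1 * w00 - e0 * w01) / det) for e0, e1 in E]
    return [[a0 * e0 + a1 * e1 for e0, e1 in E] for a0, a1 in EW]

def max_abs_error(xs, landmarks, gamma=1.0):
    """Largest entrywise gap between the exact and approximate kernel matrices."""
    approx = nystrom(xs, landmarks, gamma)
    return max(abs(approx[i][j] - rbf(a, b, gamma))
               for i, a in enumerate(xs) for j, b in enumerate(xs))

xs = [0.0, 0.1, -0.1, 3.0, 3.1, 2.9]  # two tight clusters
landmarks = two_means(xs)
print(landmarks, max_abs_error(xs, landmarks))
```

When the clusters are tight, the centers summarize the data well and the approximation error is small, which mirrors the link between approximation quality and the encoding power of the landmarks described in the abstract.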

[Full paper] [Discussion]

paper ID: 479

### Transfer of Samples in Batch Reinforcement Learning

* Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini*

The main objective of transfer learning is to reduce the complexity of learning
the solution of a target task by effectively reusing the knowledge retained
from solving one or more source tasks. In this paper, we introduce a novel
algorithm that transfers samples (i.e., experience tuples) from
source to target tasks. Under the assumption that tasks defined on the same
environment often have similar transition models and reward functions, we
propose a method to select samples from the source tasks that are most
similar to the target task, and then to use them as input for batch
reinforcement learning algorithms. As a result, the number of samples that the
agent needs to collect from the target task to learn its solution is reduced.
We empirically show that, following the proposed approach, the transfer of
samples is effective in reducing the learning complexity, even when the source
tasks are significantly different from the target task.

[Full paper] [Discussion]

paper ID: 484

### Expectation-Maximization for Sparse and Non-Negative PCA

* Christian David Sigg and Joachim M. Buhmann*

We study the problem of finding the dominant eigenvector of the sample covariance matrix, under additional constraints on its elements: a cardinality constraint limits the number of non-zero elements, and non-negativity forces the elements to have equal sign. This problem is known as sparse and non-negative principal component analysis (PCA), and has many applications including dimensionality reduction and feature selection. Based on expectation-maximization for probabilistic PCA, we present an algorithm for any combination of these constraints. Its complexity is at most quadratic in the number of dimensions of the data. We demonstrate significant improvements in performance and computational efficiency compared to the state-of-the-art, using large data sets from biology and computer vision.

[Full paper] [Discussion]

paper ID: 487

### Reinforcement Learning with Limited Reinforcement: Using Bayes Risk for Active Learning in POMDPs

* Finale Doshi, Joelle Pineau, and Nicholas Roy*

Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains because they optimally trade off between actions that increase an agent's knowledge and actions that increase an agent's reward. Unfortunately, most POMDPs are defined with a large number of parameters which are difficult to specify only from domain knowledge. In this paper, we treat the POMDP model parameters as additional hidden state in a "model-uncertainty" POMDP and develop an approximate algorithm for planning in this larger POMDP. The approximation, coupled with model-directed queries, allows the planner to actively learn good policies. We demonstrate our approach on several standard POMDP problems.

[Full paper] [Discussion]

paper ID: 488

### Space-indexed Dynamic Programming: Learning to Follow Trajectories

* J. Zico Kolter, Adam Coates, Andrew Ng, Yi Gu, and Charles DuHadway*

We consider the task of learning to accurately follow a trajectory in a vehicle such as a car or helicopter. A number of dynamic programming algorithms, such as Differential Dynamic Programming (DDP) and Policy Search by Dynamic Programming (PSDP), can efficiently compute non-stationary policies for these tasks --- such policies in general are well-suited to trajectory following since they can easily generate different control actions at different times in order to follow the trajectory. However, a weakness of these algorithms is that their policies are time-indexed, in that they apply different policies depending on the current time. This is problematic since 1) the current time may not correspond well to where we are along the trajectory and 2) the uncertainty over future states can prevent these algorithms from finding any good policies at all. In this paper we propose a method for space-indexed dynamic programming that overcomes both these difficulties. We begin by showing how a dynamical system can be rewritten in terms of a spatial index variable (i.e., how far along the trajectory we are) rather than as a function of time. We then use these space-indexed dynamical systems to derive space-indexed versions of the DDP and PSDP algorithms. Finally, we show that these algorithms perform well on a variety of control tasks, both in simulation and on real systems.

[Full paper] [Discussion]

paper ID: 489

### Democratic Approximation of Lexicographic Preference Models

* Fusun Yaman, Thomas Walsh, Michael Littman, and Marie desJardins*

Previous algorithms for learning lexicographic preference models (LPMs) produce a "best guess" LPM that is consistent with the observations. Our approach is more democratic: we do not commit to a single LPM. Instead, we approximate the target using the votes of a collection of consistent LPMs. We present two variations of this method -- "variable voting" and "model voting" -- and empirically show that these democratic algorithms outperform the existing methods. We also introduce an intuitive yet powerful learning bias to prune some of the possible LPMs. We demonstrate how this learning bias can be used with variable and model voting and show that the learning bias improves the learning curve significantly, especially when the number of observations is small.
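The idea of letting several consistent LPMs vote can be illustrated with a small sketch. The encoding of an LPM as an ordered list of `(variable_index, preferred_value)` pairs and the plain majority vote below are assumptions made for illustration; they are not the paper's "variable voting" or "model voting" procedures, whose details are in the full paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding: an LPM is a list of (variable_index, preferred_value)
# pairs, ordered from the most important variable to the least important one.
LPM = List[Tuple[int, int]]


def lpm_prefers(model: LPM, a: Sequence[int], b: Sequence[int]) -> int:
    """Return +1 if the model prefers a, -1 if it prefers b, 0 if undecided."""
    for var, preferred in model:
        if a[var] != b[var]:
            # The first variable on which the objects differ decides.
            return 1 if a[var] == preferred else -1
    return 0


def majority_vote(models: List[LPM], a: Sequence[int], b: Sequence[int]) -> int:
    """Plain majority vote over a collection of LPMs (illustrative only)."""
    total = sum(lpm_prefers(m, a, b) for m in models)
    return (total > 0) - (total < 0)


models = [
    [(0, 1), (1, 1)],  # variable 0 matters most, value 1 preferred
    [(1, 1), (0, 1)],  # variable 1 matters most, value 1 preferred
    [(1, 1), (0, 0)],  # variable 1 matters most, value 1 preferred
]
winner = majority_vote(models, (1, 0), (0, 1))
```

Here the first model prefers `(1, 0)` while the other two prefer `(0, 1)`, so the vote goes to the second object even though no single model is committed to.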

[Full paper] [Discussion]

paper ID: 490

### The Many Faces of Optimism: a Unifying Approach

* Istvan Szita and Andras Lorincz*

The exploration-exploitation dilemma has been an intriguing and unsolved problem within the framework of reinforcement learning. Optimism in the face of uncertainty and model building play central roles in advanced exploration methods. Here, we integrate several concepts and obtain a fast and simple algorithm. We show that the proposed algorithm finds a near-optimal policy in polynomial time, and give experimental evidence that it is robust and efficient compared to its ascendants.

[Full paper] [Discussion]

paper ID: 491

### Fast Support Vector Machine Training and Classification on Graphics Processors

* Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer*

Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training, using Platt's Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic, which achieves speedups of 9-35x over LIBSVM running on a traditional processor. We also present a GPU-based system for SVM classification which achieves speedups of 81-138x over LIBSVM (5-24x over our own CPU-based SVM classifier).

[Full paper] [Discussion]

paper ID: 497

### Stopping Conditions for Exact Computation of Leave-One-Out Error in Support Vector Machines

* Vojtech Franc, Pavel Laskov, and Klaus-R. Mueller*

We propose a new stopping condition for a Support Vector Machine (SVM) solver which precisely reflects the objective of the Leave-One-Out error computation. The stopping condition guarantees that the output on an intermediate SVM solution is identical to the output of the optimal SVM solution with one data point excluded from the training set. A simple augmentation of a general SVM training algorithm allows one to use a stopping criterion equivalent to the proposed sufficient condition. A comprehensive experimental evaluation of our method shows consistent speedup of the exact LOO computation, up to a factor of 13 for the linear kernel. The new algorithm can be seen as an example of constructive guidance of an optimization algorithm towards achieving the best attainable expected risk at optimal computational cost.

[Full paper] [Discussion]

paper ID: 502

### Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators

* Tao Shi, Mikhail Belkin, and Bin Yu*

In this paper we develop a spectral framework for estimating mixture distributions, specifically Gaussian mixture models. In physics, spectroscopy is often used for the identification of substances through their spectrum. Treating a kernel function *K(x,y)* as "light" and the sampled data as "substance", the spectrum of their interaction (eigenvalues and eigenvectors of the kernel matrix *K*) unveils certain aspects of the underlying parametric distribution *p*, such as the parameters of a Gaussian mixture. Our approach extends the intuitions and analyses underlying the existing spectral techniques, such as spectral clustering and Kernel Principal Components Analysis (KPCA). We construct algorithms to estimate parameters of Gaussian mixture models, including the number of mixture components, their means and covariance matrices, which are important in many practical applications. We provide a theoretical framework and show encouraging experimental results.

[Full paper] [Discussion]

paper ID: 503

### Fast Estimation of Relational Pattern Coverage through Randomization and Maximum Likelihood

* Ondrej Kuzelka and Filip Zelezny*

In inductive logic programming, theta-subsumption is a widely used coverage test. Unfortunately, testing theta-subsumption is NP-complete, which represents a crucial efficiency bottleneck for many relational learners. In this paper, we present a probabilistic estimator of clause coverage, based on a randomized restarted search strategy. Under a distribution assumption, our algorithm can estimate clause coverage without having to decide subsumption for all examples. We implement this algorithm in the program ReCovEr. On generated graph data and real-world datasets, we show that ReCovEr provides reasonably accurate estimates while achieving dramatic runtime improvements compared to a state-of-the-art algorithm.

[Full paper] [Discussion]

paper ID: 511

### Efficient Bandit Algorithms for Online Multiclass Prediction

* Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari*

This paper introduces the Banditron, a variant of the Perceptron, for the multiclass bandit setting. The multiclass bandit setting models a wide range of practical supervised learning applications where the learner only receives partial feedback (referred to as "bandit" feedback, in the spirit of multi-armed bandit models) with respect to the true label (e.g. in many web applications users often only provide positive "click" feedback which does not necessarily fully disclose a true label). The Banditron has the ability to learn in a multiclass classification setting with the "bandit" feedback, which only reveals whether the prediction made by the algorithm was correct (but does not necessarily reveal the true label). We provide (relative) mistake bounds which show how the Banditron enjoys favorable performance, and our experiments demonstrate the practicality of the algorithm. Furthermore, this paper pays close attention to the important special case when the data is linearly separable --- a problem which has been exhaustively studied in the full information setting yet is novel in the bandit setting.
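To make the bandit feedback concrete, here is a minimal sketch of one Banditron-style round, following the update rule as it is commonly stated for this algorithm. The function name `banditron_step`, the list-of-lists weight matrix, and the exploration parameter name `gamma` are choices made for this illustration; the true label `y` is passed in only so the sketch can simulate the single bit of feedback.

```python
import random
from typing import List


def banditron_step(W: List[List[float]], x: List[float], y: int,
                   gamma: float, rng: random.Random) -> int:
    """Run one round: predict, observe correct/incorrect, update W in place."""
    k = len(W)
    scores = [sum(w * v for w, v in zip(row, x)) for row in W]
    y_hat = max(range(k), key=lambda r: scores[r])  # ties go to lowest index

    # Explore: keep y_hat with probability (1 - gamma), else spread uniformly.
    probs = [(1.0 - gamma) * (1.0 if r == y_hat else 0.0) + gamma / k
             for r in range(k)]
    y_tilde = rng.choices(range(k), weights=probs)[0]

    # The learner only sees this bit, never y itself.
    correct = (y_tilde == y)

    for r in range(k):
        coeff = 0.0
        if correct and r == y_tilde:
            coeff += 1.0 / probs[r]   # importance-weighted reward for a hit
        if r == y_hat:
            coeff -= 1.0              # always push down the greedy label
        if coeff != 0.0:
            for j in range(len(x)):
                W[r][j] += coeff * x[j]
    return y_tilde


W = [[0.0, 0.0] for _ in range(3)]
rng = random.Random(0)
first = banditron_step(W, [1.0, 0.0], 1, 0.0, rng)   # wrong guess, label 0
second = banditron_step(W, [1.0, 0.0], 1, 0.0, rng)  # now guesses label 1
```

With `gamma = 0.0` the round is purely greedy, which keeps the example deterministic; in practice a positive `gamma` is needed so that every label has a chance of being tried.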

[Full paper] [Discussion]

paper ID: 513

### Polyhedral Classifier for Target Detection A Case Study: Colorectal Cancer

* Murat Dundar, Matthias Wolf, Sarang Lakare, Marcos Salganicoff, and Vikas C. Raykar*

In this study we introduce a novel algorithm for learning a polyhedron to describe the target class. The proposed approach takes advantage of the limited subclass information made available for the negative samples and jointly optimizes multiple hyperplane classifiers, each of which is designed to classify positive samples from a subclass of the negative samples. The flat faces of the polyhedron provide robustness, whereas multiple faces contribute to the flexibility required to deal with complex datasets. Apart from improving the prediction accuracy of the system, the proposed polyhedral classifier also provides run-time speedups as a by-product when executed in a cascaded framework in real-time. We introduce the Computer Aided Detection for Colon Cancer as a case study and evaluate the performance of the proposed technique on a real-world Colon dataset both in terms of prediction accuracy and online execution speed. We also compare the proposed technique against some benchmark classifiers.

[Full paper] [Discussion]

paper ID: 519

### Exploration Scavenging

* John Langford, Alexander Strehl, and Jennifer Wortman*

We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose and prove the correctness of a principled method for policy evaluation which works when this is not the case, even when the exploration policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independent of side information, an assumption that is typically violated by commercial systems, we show how clever uses of the theory provide non-trivial and realistic applications. We also provide an empirical demonstration of the effectiveness of our techniques on real ad placement data.
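A rough sketch of off-policy evaluation under the key assumption stated above, namely that the logged action was chosen independently of the side information. Each logged reward is counted only when the evaluated policy agrees with the logged action, and is divided by the number of times that action was explored. The function name `scavenged_value` and the `(context, action, reward)` log format are assumptions for this illustration; the exact estimator and its guarantees should be taken from the paper.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple

LogEntry = Tuple[Hashable, Hashable, float]  # (context, action, reward)


def scavenged_value(log: List[LogEntry],
                    policy: Callable[[Hashable], Hashable]) -> float:
    """Estimate the per-step value of `policy` from another policy's log."""
    # How often each action was explored in the log.
    counts = Counter(action for _, action, _ in log)
    # Keep only the rounds where the evaluated policy agrees with the log,
    # and normalize each reward by the exploration count of its action.
    return sum(reward / counts[action]
               for context, action, reward in log
               if policy(context) == action)


# Toy log: reward is 1 exactly when the action matches the context.
log = [(0, 0, 1.0), (1, 0, 0.0), (0, 1, 0.0), (1, 1, 1.0)]
value = scavenged_value(log, lambda context: context)
```

In this toy log the policy that copies the context into the action always earns reward 1, and the estimate recovers that value from data gathered by a different, context-blind exploration pattern.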

[Full paper] [Discussion]

paper ID: 520

### Multi-Task Learning for HIV Therapy Screening

* Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer*

We address the problem of learning classifiers for a large number of tasks. We derive a solution that produces resampling weights which match the pool of all examples to the target distribution of any given task. Our work is motivated by the problem of predicting the outcome of a therapy attempt for a patient who carries an HIV virus with a set of observed genetic properties. Such predictions need to be made for hundreds of possible combinations of drugs, some of which use similar biochemical mechanisms. Multi-task learning enables us to make predictions even for drug combinations with few or no training examples and substantially improves the overall prediction accuracy.

[Full paper] [Discussion]

paper ID: 523

### Empirical Bernstein Stopping

* Volodymyr Mnih, Csaba Szepesvari, and Jean-Yves Audibert*

Sampling is a popular way of scaling up machine learning algorithms to large datasets. The question often is how many samples are needed. Adaptive stopping algorithms monitor the performance in an online fashion and make it possible to stop early, sparing valuable computation time. We concentrate on the setting where probabilistic guarantees are desired and demonstrate how recently-introduced empirical Bernstein bounds can be used to design stopping rules that are efficient. We provide upper bounds on the sample complexity of the new rules as well as empirical results on model selection and boosting in the filtering setting.
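As a hedged illustration of how an empirical Bernstein bound can drive a stopping rule, the sketch below keeps sampling until the confidence half-width falls below a target absolute error. It uses the standard form of the bound, `sigma * sqrt(2 ln(3/delta) / t) + 3 R ln(3/delta) / t`, with a simple union bound `delta_t = delta / (t (t + 1))` over time steps. This naive absolute-error rule, and the names `eb_half_width` and `sample_until_precise`, are assumptions for illustration and not the stopping algorithm proposed in the paper.

```python
import math
from typing import Callable, Tuple


def eb_half_width(t: int, var: float, value_range: float, delta: float) -> float:
    """Empirical Bernstein half-width after t samples with empirical variance var."""
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / t) + 3.0 * value_range * log_term / t


def sample_until_precise(draw: Callable[[], float], value_range: float,
                         eps: float, delta: float,
                         max_samples: int = 100000) -> Tuple[float, int]:
    """Draw samples until the bound guarantees |mean - mu| <= eps."""
    total = 0.0
    total_sq = 0.0
    mean = 0.0
    for t in range(1, max_samples + 1):
        x = draw()
        total += x
        total_sq += x * x
        mean = total / t
        var = max(0.0, total_sq / t - mean * mean)  # biased empirical variance
        delta_t = delta / (t * (t + 1))             # sums to delta over all t
        if eb_half_width(t, var, value_range, delta_t) <= eps:
            return mean, t
    return mean, max_samples


# Zero-variance data: only the range term of the bound has to shrink.
mean, steps = sample_until_precise(lambda: 0.5, 1.0, 0.1, 0.05)
```

The variance term is what makes this kind of rule stop early on low-variance data; a range-only bound such as Hoeffding's would need many more samples in that case.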

[Full paper] [Discussion]

paper ID: 528

### The Asymptotics of Semi-Supervised Learning in Discriminative Probabilistic Models

* Nataliya Sokolovska, Olivier Cappé, and François Yvon*

Semi-supervised learning aims at taking advantage of unlabeled data to improve the efficiency of supervised learning procedures. For discriminative models however, this is a challenging task. In this contribution, we introduce an original methodology for using unlabeled data through the design of a simple semi-supervised objective function. We prove that the corresponding semi-supervised estimator is asymptotically optimal. The practical consequences of this result are discussed for the case of the logistic regression model.

[Full paper] [Discussion]

paper ID: 530

### Discriminative Structure and Parameter Learning for Markov Logic Networks

* Tuyen Huynh and Raymond Mooney*

Markov logic networks (MLNs) are an expressive representation for statistical relational learning that generalizes both first-order logic and graphical models. Existing methods for learning the logical structure of an MLN are not discriminative; however, many relational learning problems involve specific target predicates that must be inferred from given background information. We found that existing MLN methods perform very poorly on several such ILP benchmark problems, and we present improved discriminative methods for learning MLN clauses and weights that outperform existing MLN and traditional ILP methods.

[Full paper] [Discussion]

paper ID: 531

### Training SVM with Indefinite Kernels

* Jianhui Chen and Jieping Ye*

Similarity matrices generated from many applications may not be positive semidefinite, and hence can't fit into the kernel machine framework. In this paper, we study the problem of training support vector machines with an indefinite kernel. We consider a regularized SVM formulation, in which the indefinite kernel matrix is treated as a noisy observation of some unknown positive semidefinite one (proxy kernel) and the support vectors and the proxy kernel can be computed simultaneously. We propose a semi-infinite quadratically constrained linear program formulation for the optimization, which can be solved iteratively to find a global optimum solution. We further propose to employ an additional pruning strategy, which significantly improves the efficiency of the algorithm, while retaining the convergence property of the algorithm. In addition, we show the close relationship between the proposed formulation and multiple kernel learning. Experiments on a collection of benchmark data sets demonstrate the efficiency and effectiveness of the proposed algorithm.

[Full paper] [Discussion]

paper ID: 536

### Multi-Classification by Categorical Features via Clustering

* Yevgeny Seldin and Naftali Tishby*

We derive a generalization bound for multi-classification schemes based on grid clustering in categorical parameter product spaces. Grid clustering partitions the parameter space in the form of a Cartesian product of partitions for each of the parameters. The derived bound provides a means to evaluate clustering solutions in terms of the generalization power of a built-on classifier. For classification based on a single feature the bound serves to find a globally optimal classification rule. Comparison of the generalization power of individual features can then be used for feature ranking. Our experiments show that in this role the bound is much more precise than mutual information or normalized correlation indices.

[Full paper] [Discussion]

paper ID: 538

### The Dynamic Hierarchical Dirichlet Process

* Lu Ren, David B. Dunson, and Lawrence Carin*

The dynamic hierarchical Dirichlet process (dHDP) is developed to model the time-evolving statistical properties of sequential data sets. The data collected at any time point are represented via a mixture representation associated with an appropriate underlying model, in the framework of HDP. The statistical properties of data collected at consecutive time points are linked via a random parameter that controls their probabilistic similarity. The sharing mechanisms of the time-evolving data are derived, and a relatively simple Markov Chain Monte Carlo sampler is developed. Experimental results are presented to demonstrate the model.

[Full paper] [Discussion]

paper ID: 542

### No-Regret Learning in Convex Games

* Geoffrey J. Gordon, Amy Greenwald, and Casey Marks*

Quite a bit is known about minimizing different kinds of regret in experts problems, and how these regret types relate to types of equilibria in the multiagent setting of repeated matrix games. Much less is known about the possible kinds of regret in online convex programming problems (OCPs), or about equilibria in the analogous multiagent setting of repeated convex games. This gap is unfortunate, since convex games are much more expressive than matrix games, and since many important machine learning problems can be expressed as OCPs. In this paper, we work to close this gap: we analyze a spectrum of regret types which lie between external and swap regret, along with their corresponding equilibria, which lie between coarse correlated and correlated equilibrium. We also analyze algorithms for minimizing these regret types. As examples of our framework, we derive algorithms for learning correlated equilibria in polyhedral convex games and extensive-form correlated equilibria in extensive-form games. The former is exponentially more efficient than previous algorithms, and the latter is the first of its type.

[Full paper] [Discussion]

paper ID: 544

### Hierarchical Model-Based Reinforcement Learning: R-max + MAXQ

* Nicholas Jong and Peter Stone*

Hierarchical decomposition promises to help scale reinforcement learning algorithms naturally to real-world problems by exploiting their underlying structure. Model-based algorithms, which provided the first finite-time convergence guarantees for reinforcement learning, may also play an important role in coping with the relative scarcity of data in large environments. In this paper, we introduce an algorithm that fully integrates modern hierarchical and model-learning methods in the standard reinforcement learning setting. Our algorithm, R-maxq, inherits the efficient model-based exploration of the R-max algorithm and the opportunities for abstraction provided by the MAXQ framework. We analyze the sample complexity of our algorithm, and our experiments in a standard simulation environment illustrate the advantages of combining hierarchies and models.

[Full paper] [Discussion]

paper ID: 551

### ICA and ISA Using Schweizer-Wolff Measure of Dependence

* Sergey Kirshner and Barnabás Póczos*

We propose a new algorithm for independent component and independent subspace analysis problems. This algorithm uses a contrast based on the Schweizer-Wolff measure of pairwise dependence, a non-parametric measure based on pairwise ranks of the variables. Our algorithm frequently outperforms state-of-the-art ICA methods in the normal setting, is significantly more robust to outliers in the mixed signals, and performs well even in the presence of noise. Since pairwise dependence is evaluated explicitly, using Cardoso's conjecture, our method can be applied to solve independent subspace analysis (ISA) problems by grouping signals recovered by ICA methods. We provide an extensive empirical evaluation using simulated, sound, and image data.

[Full paper] [Discussion]

paper ID: 552

### Multiple Instance Ranking

* Charles Bergeron, Jed Zaretzki, Curt Breneman, and Kristin Bennett*

This paper introduces a novel machine learning model called multiple instance ranking (MIRank) that enables ranking to be performed in a multiple instance learning setting. The motivation for MIRank stems from the hydrogen abstraction problem in computational chemistry, that of predicting the grouping of hydrogen atoms from which a hydrogen is abstracted (removed) during metabolism. The model predicts the preferred hydrogen grouping within a molecule by ranking the groups, with the ambiguity of not knowing which hydrogen within the preferred grouping is actually abstracted. This paper formulates MIRank in its general context and proposes an algorithm for solving MIRank problems using successive linear programming. The method outperforms multiple instance classification models on several real and synthetic datasets.

[Full paper] [Discussion]

paper ID: 554

### Hierarchical Kernel Stick-Breaking Process for Multi-Task Image Analysis

* Qi An, Chunping Wang, Ivo Shterev, Eric Wang, Lawrence Carin, and David B. Dunson*

The kernel stick-breaking process (KSBP) is employed to segment general imagery, imposing the condition that patches (small blocks of pixels) that are spatially proximate are more likely to be associated with the same cluster (segment). The number of clusters is not set a priori and is inferred from the hierarchical Bayesian model. Further, KSBP is integrated with a shared Dirichlet process prior to simultaneously model multiple images, inferring their inter-relationships. This latter application may be useful for sorting and learning relationships between multiple images. The Bayesian inference algorithm is based on a hybrid of variational Bayesian analysis and local sampling. In addition to providing details on the model and associated inference framework, example results are presented for several image-analysis problems.

[Full paper] [Discussion]

paper ID: 562

### mStruct: A New Admixture Model for Inference of Population Structure in Light of Both Genetic Admixing and Allele Mutations

* Suyash Shringarpure and Eric Xing*

Traditional methods for analyzing population structure, such as the Structure program, ignore the influence of mutational effects. We propose mStruct, an admixture of population-specific mixtures of inheritance models, that addresses the task of structure inference and mutation estimation jointly through a hierarchical Bayesian framework, and a variational algorithm for inference. We validated our method on synthetic data, and used it to analyze the HGDP-CEPH cell line panel of microsatellites used in (Rosenberg et al., 2002) and the HGDP SNP data used in (Conrad et al., 2006). A comparison of the structural maps of world populations estimated by mStruct and Structure is presented, and we also report potentially interesting mutation patterns in world populations estimated by mStruct, which is not possible by Structure.

[Full paper] [Discussion]

paper ID: 564

### Sample-Based Learning and Search with Permanent and Transient Memories

* David Silver, Richard Sutton, and Martin Mueller*

We present a reinforcement learning architecture, Dyna-2, that encompasses both sample-based learning and sample-based search, and that generalises across states during both learning and search. We apply Dyna-2 to high performance Computer Go. In this domain the most successful planning methods are based on sample-based search algorithms, such as UCT, in which states are treated individually, and the most successful learning methods are based on temporal-difference learning algorithms, such as Sarsa, in which linear function approximation is used. In both cases, an estimate of the value function is formed, but in the first case it is transient, computed and then discarded after each move, whereas in the second case it is more permanent, slowly accumulating over many moves and games. The idea of Dyna-2 is for the transient planning memory and the permanent learning memory to remain separate, but for both to be based on linear function approximation and both to be updated by Sarsa. To apply Dyna-2 to 9x9 Computer Go, we use a million binary features in the function approximator, based on templates matching small fragments of the board. Using only the transient memory, Dyna-2 performed at least as well as UCT. Using both memories combined, it significantly outperformed UCT. Our program based on Dyna-2 achieved a higher rating on the Computer Go Online Server than any handcrafted or traditional search based program.

[Full paper] [Discussion]

paper ID: 565

### Fast Incremental Proximity Search in Large Graphs

* Purnamrita Sarkar, Andrew Moore, and Amit Prakash*

In this paper we investigate two aspects of ranking problems on large graphs. First, we augment the deterministic pruning algorithm in Sarkar and Moore (2007) with sampling techniques to compute approximately correct rankings with high probability under random walk based proximity measures at query time. Second, we prove some surprising locality properties of these proximity measures by examining the short term behavior of random walks. The proposed algorithm can answer queries on the fly without caching any information about the entire graph. We present empirical results on a 600,000 node author-word-citation graph from the Citeseer domain on a single CPU machine where the average query processing time is around 4 seconds. We present quantifiable link prediction tasks. On most of them our techniques outperform Personalized Pagerank, a well-known diffusion based proximity measure.

[Full paper] [Discussion]

paper ID: 571

### An Object-Oriented Representation for Efficient Reinforcement Learning

* Carlos Diuk, Andre Cohen, and Michael Littman*

Rich representations in reinforcement learning have been studied for the purpose of enabling generalization and making learning feasible in large state spaces. We introduce Object-Oriented MDPs (OO-MDPs), a representation based on objects and their interactions, which is a natural way of modeling environments and offers important generalization opportunities. We introduce a learning algorithm for deterministic OO-MDPs, and prove a polynomial bound on its sample complexity. We illustrate the performance gains of our representation and algorithm in the well-known Taxi domain, plus a real-life videogame.

[Full paper] [Discussion]

paper ID: 573

### On the Quantitative Analysis of Deep Belief Networks

* Ruslan Salakhutdinov and Iain Murray*

Deep Belief Networks (DBNs) are generative models that contain many layers of hidden variables. Efficient greedy algorithms for learning and approximate inference have allowed these models to be applied successfully in many application domains. The main building block of a DBN is a bipartite undirected graphical model called a restricted Boltzmann machine (RBM). Due to the presence of the partition function, model selection, complexity control, and exact maximum likelihood learning in RBMs are intractable. Annealed Importance Sampling (AIS) can be used to efficiently estimate the partition function of an RBM. We present a novel AIS scheme for comparing RBMs with different architectures. We further show how an AIS estimator, along with approximate inference, can be used to estimate a lower bound on the log-probability that a DBN model with multiple hidden layers assigns to the test data. This is, to our knowledge, the first step towards obtaining quantitative results that would allow us to directly assess the performance of Deep Belief Networks as generative models of data.

[Full paper] [Discussion]

paper ID: 574

### Sparse Bayesian Nonparametric Regression

* Francois Caron and Arnaud Doucet*

One of the most common problems in machine learning and statistics consists of estimating the mean response X.beta from a vector of observations y assuming y=X.beta+epsilon where X is known, beta is a vector of parameters of interest and epsilon a vector of stochastic errors. We are particularly interested here in the case where the dimension K of beta is much higher than the dimension of y. We propose some flexible Bayesian models which can yield sparse estimates of beta. We show that as K tends to infinity, these models are closely related to a class of Lévy processes. Simulations demonstrate that our models significantly outperform a range of popular alternatives.

[Full paper] [Discussion]

paper ID: 580

### Reinforcement Learning in the Presence of Rare Events

* Jordan Frank, Shie Mannor, and Doina Precup*

We consider the task of reinforcement learning in an environment in which rare significant events occur independently of the actions selected by the controlling agent. If these events are sampled according to their natural probability of occurring, convergence of standard reinforcement learning algorithms is likely to be very slow, and the learning algorithms may exhibit high variance. In this work, we assume that we have access to a simulator, in which the rare event probabilities can be artificially altered. Then, importance sampling can be used to learn with this simulation data. We introduce algorithms for policy evaluation, both using tabular and function approximation representation of the value function. We prove that in both cases, the reinforcement learning algorithms converge. In the tabular case, we also analyze the bias and variance of our approach compared to TD-learning. We evaluate empirically the performance of the algorithm on random Markov Decision Processes, as well as on a large network planning task.
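The role of importance sampling here can be seen in a single-step toy problem: sample the rare event far more often than it naturally occurs, then reweight each outcome by the ratio of its natural probability to its simulated probability. This is only a one-step illustration under assumed names (`is_estimate`, `p_rare`, `q_rare`) and is not the policy evaluation algorithms introduced in the paper.

```python
import random


def is_estimate(p_rare: float, q_rare: float, reward_rare: float,
                reward_normal: float, n: int, rng: random.Random) -> float:
    """Estimate the expected reward under p_rare while sampling with q_rare."""
    total = 0.0
    for _ in range(n):
        if rng.random() < q_rare:
            # Rare event, oversampled in the simulator: weight it down.
            total += (p_rare / q_rare) * reward_rare
        else:
            # Normal outcome, undersampled in the simulator: weight it up.
            total += ((1.0 - p_rare) / (1.0 - q_rare)) * reward_normal
    return total / n


p_rare, q_rare = 0.001, 0.5
true_mean = p_rare * (-1000.0) + (1.0 - p_rare) * 1.0
estimate = is_estimate(p_rare, q_rare, -1000.0, 1.0, 20000, random.Random(0))
```

Sampling at the natural rate, 20000 draws would contain only about 20 rare events, each worth -1000, so the plain average would be very noisy; with reweighting, every weighted sample is close to -2 or +2 and the average concentrates much faster.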

[Full paper] [Discussion]

paper ID: 581

### An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning

* Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael Littman*

We show that linear value function approximation is equivalent to a form of linear model approximation. We derive a relationship between the model approximation error and the Bellman error, and show how this relationship can guide feature selection for model improvement and/or value function improvement. We also show how these results give insight into the behavior of existing feature-selection algorithms.

[Full paper] [Discussion]

paper ID: 582

### Metric Embedding for Kernel Classification Rules

* Bharath Sriperumbudur, Omer Lang, and Gert Lanckriet*

In this paper, we consider a smoothing kernel-based classification rule and propose an algorithm for optimizing the performance of the rule by learning the bandwidth of the smoothing kernel along with a data-dependent distance metric. The data-dependent distance metric is obtained by learning a function that embeds an arbitrary metric space into a Euclidean space while minimizing an upper bound on the resubstitution estimate of the error probability of the kernel classification rule. By restricting this embedding function to a reproducing kernel Hilbert space, we reduce the problem to solving a semidefinite program and show the resulting kernel classification rule to be a variation of the k-nearest neighbor rule. We compare the performance of the kernel rule (using the learned data-dependent distance metric) to state-of-the-art distance metric learning algorithms (designed for k-nearest neighbor classification) on some benchmark datasets. The results show that the classification accuracy of the proposed rule is better than or as good as that of the other metric learning algorithms.

[Full paper] [Discussion]

paper ID: 587

### Bayesian Multiple Instance Learning: Automatic Feature Selection and Inductive Transfer

* Vikas Raykar, Balaji Krishnapuram, Jinbo Bi, Murat Dundar, and R. Bharat Rao*

We propose a novel Bayesian multiple instance learning algorithm. This algorithm automatically identifies the relevant feature subset, and utilizes inductive transfer when learning multiple (conceptually related) classifiers. Experimental results indicate that the proposed baseline MIL method is more accurate than previous MIL algorithms and selects a much smaller set of useful features. Inductive transfer further improves the accuracy of the classifier as compared to learning each task individually.

[Full paper] [Discussion]

paper ID: 588

### An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators

* Percy Liang and Michael Jordan*

Statistical and computational concerns have motivated parameter estimators based on various forms of likelihood, e.g., joint, conditional, and pseudolikelihood. In this paper, we present a unified framework for studying these estimators, which allows us to compare their relative (statistical) efficiencies. Our asymptotic analysis suggests that modeling more of the data tends to reduce variance, but at the cost of being more sensitive to model misspecification. We present experiments validating our analysis.

[Full paper] [Discussion]

paper ID: 592

### Extracting and Composing Robust Features with Denoising Autoencoders

* Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol*

Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.
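
The training principle can be sketched as follows, under the assumption that "partial corruption" is implemented as masking noise (a random subset of inputs set to zero); the abstract does not specify the corruption process, and `corrupt` and `squared_error` are hypothetical helper names. The key point is that the model sees the corrupted input while the loss is measured against the clean input.

```python
import random

def corrupt(x, corruption_fraction, rng):
    """Return a copy of x with a random subset of entries set to 0 (masking noise)."""
    n_masked = int(round(corruption_fraction * len(x)))
    masked = set(rng.sample(range(len(x)), n_masked))
    return [0.0 if i in masked else v for i, v in enumerate(x)]

def squared_error(reconstruction, clean):
    """Reconstruction loss, always measured against the uncorrupted input."""
    return sum((r - c) ** 2 for r, c in zip(reconstruction, clean))

rng = random.Random(0)
clean = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.4, 0.3]
noisy = corrupt(clean, corruption_fraction=0.25, rng=rng)

# A denoising autoencoder would be fed `noisy` but scored against `clean`.
# The identity map stands in for the model here only to show that simply
# copying the corrupted input incurs a nonzero loss.
loss_of_identity = squared_error(noisy, clean)
```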

[Full paper] [Discussion]

paper ID: 599

### Sparse Multiscale Gaussian Process Regression

* Christian Walder, Kwang In Kim, and Bernhard Schoelkopf*

Most existing sparse Gaussian process (g.p.) models seek computational advantages by basing their computations on a set of m basis functions that are the covariance function of the g.p. with one of its two inputs fixed. We generalise this for the case of Gaussian covariance function, by basing our computations on m Gaussian basis functions with arbitrary diagonal covariance matrices (or length scales). For a fixed number of basis functions and any given criterion, this additional flexibility permits approximations no worse and typically better than was previously possible. We perform gradient based optimisation of the marginal likelihood, which costs O(m²n) time where n is the number of data points, and compare the method to various other sparse g.p. methods. Although we focus on g.p. regression, the central idea is applicable to all kernel based algorithms, and we also provide some results for the support vector machine (s.v.m.) and kernel ridge regression (k.r.r.). Our approach outperforms the other methods, particularly for the case of very few basis functions, i.e. a very high sparsity ratio.

[Full paper] [Discussion]

paper ID: 600

### Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo

* Ruslan Salakhutdinov and Andriy Mnih*

Low-rank matrix approximation methods provide one of the simplest and most effective approaches to collaborative filtering. Such models are usually fitted to data by finding a MAP estimate of the model parameters, a procedure that can be efficiently performed even on very large datasets. However, unless the regularization parameters are tuned carefully, this approach is prone to overfitting because it finds a single point estimate of the parameters. In this paper we present a fully Bayesian treatment of the Probabilistic Matrix Factorization (PMF) model in which model capacity is controlled automatically by integrating over all model parameters and hyperparameters. We show that Bayesian PMF models can be efficiently trained using Markov chain Monte Carlo methods by applying them to the Netflix dataset, which consists of over 100 million user/movie ratings. The resulting models achieve significantly higher prediction accuracy than PMF models trained using MAP estimation.

[Full paper] [Discussion]

paper ID: 601

### Classification using Discriminative Restricted Boltzmann Machines

* Hugo Larochelle and Yoshua Bengio*

Recently, many applications for Restricted Boltzmann Machines (RBMs) have been developed for a large variety of learning problems. However, RBMs are usually used as feature extractors for another learning algorithm or to provide a good initialization for deep feed-forward neural network classifiers, and are not considered as a stand-alone solution to classification problems. In this paper, we argue that RBMs provide a self-contained framework for deriving competitive non-linear classifiers. We present an evaluation of different learning algorithms for RBMs which aim at introducing a discriminative component to RBM training and improve their performance as classifiers. This approach is simple in that RBMs are used directly to build a classifier, rather than as a stepping stone. Finally, we demonstrate how discriminative RBMs can also be successfully employed in a semi-supervised setting.

[Full paper] [Discussion]

paper ID: 611

### Semi-supervised Learning of Compact Document Representations with Deep Networks

* Marc'Aurelio Ranzato and Martin Szummer*

Finding a good representation of text documents is crucial in document retrieval and classification systems. Nowadays, the most popular representation is simply based on a vector of counts storing the number of occurrences of each word in the document. This representation falls short in describing the dependence existing between similar words, and it cannot disambiguate phenomena like synonymy and polysemy of words. In this paper, we propose an algorithm to learn text document representations based on the recent advances in training deep networks. This technique can efficiently produce a very compact and informative representation of a document. In our experiments, this algorithm compares favorably against similar algorithms that instead produce sparse and binary representations. Unlike other models, this method is trained by taking into account both an unsupervised and a supervised objective. We show that it is very advantageous to exploit even a few labeled samples during training, and that we can learn extremely compact representations by using deep and non-linear models.

[Full paper] [Discussion]

paper ID: 614

### Pointwise Exact Bootstrap Distributions of Cost Curves

* Charles Dugas and David Gadoury*

Cost curves have recently been introduced as an alternative or complement to ROC curves in order to visualize binary classifier performance. Of importance to both cost and ROC curves is the computation of confidence intervals along with the curves themselves so that the reliability of a classifier's performance can be assessed. Computing confidence intervals for the difference in performance between two classifiers allows one to determine whether one classifier performs significantly better than another. A simple procedure to obtain confidence intervals for costs or the difference between two costs, under various operating conditions, is to perform bootstrap resampling of the test set. In this paper, we derive exact bootstrap distributions of these values and use these distributions to obtain confidence intervals, under various operating conditions. The performance of these confidence intervals is measured in terms of coverage accuracy. Simulations show excellent results.

[Full paper] [Discussion]

paper ID: 627

### Knows What It Knows: A Framework For Self-Aware Learning

* Lihong Li, Michael Littman, and Thomas Walsh*

We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning problems. We catalog several KWIK-learnable classes and list some open problems in this area.
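
To make the "knows what it knows" idea concrete, here is a minimal sketch of a memorizing learner for a deterministic target over a finite input set. The class and function names are hypothetical, and the protocol (the learner either commits to a prediction or answers "I don't know", and sees the label only in the latter case) is a simplified reading of the framework, not a definitive implementation.

```python
class MemorizationKWIKLearner:
    """Predicts only for inputs it has already seen labeled;
    otherwise returns None, standing in for "I don't know"."""

    def __init__(self):
        self._seen = {}

    def predict(self, x):
        return self._seen.get(x)

    def observe(self, x, y):
        self._seen[x] = y

def run(learner, target, inputs):
    """Feed inputs to the learner and count the "I don't know" answers."""
    unknowns = 0
    for x in inputs:
        guess = learner.predict(x)
        if guess is None:
            unknowns += 1
            learner.observe(x, target(x))  # label revealed only after admitting ignorance
        else:
            assert guess == target(x)      # every committed prediction must be correct
    return unknowns

unknowns = run(MemorizationKWIKLearner(), lambda x: x % 3 == 0, [1, 2, 3, 1, 3, 2, 4])
```

In this sketch the number of "I don't know" answers is bounded by the number of distinct inputs, and no committed prediction is ever wrong.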

[Full paper] [Discussion]

paper ID: 628

### A Rate-Distortion One-Class Model and its Applications to Clustering

* Koby Crammer, Partha Pratim Talukdar, and Fernando Pereira*

We study the problem of one-class classification, in which we seek a rule to separate a coherent subset of instances similar to a few positive examples from a large pool of instances. We find that the problem can be formulated naturally in terms of a rate-distortion tradeoff, which can be analyzed precisely and leads to an efficient algorithm that competes well with two previous one-class methods. We also show that our model can be extended naturally to clustering problems in which it is important to remove background clutter to improve cluster purity.

[Full paper] [Discussion]

paper ID: 630

### Detecting Statistical Interactions with Additive Groves of Trees

* Daria Sorokina, Rich Caruana, Mirek Riedewald, and Daniel Fink*

Discovering additive structure is an important step towards understanding a complex multi-dimensional function because it allows the function to be expressed as the sum of lower-dimensional components. When variables interact, however, their effects are not additive and must be modeled and interpreted simultaneously. We present a new approach for the problem of interaction detection. Our method is based on comparing the performance of unrestricted and restricted prediction models, where restricted models are prevented from modeling an interaction in question. We show that an additive model-based regression ensemble, Additive Groves, can be restricted appropriately for use with this framework, and thus has the right properties for accurately detecting variable interactions.
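
The comparison of restricted and unrestricted models can be illustrated on a toy problem. The sketch below is not Additive Groves: it assumes a two-variable function tabulated on a full grid, uses the table itself as the unrestricted model, and uses the least-squares additive fit (row mean plus column mean minus grand mean) as the restricted model. A large residual of the restricted model signals an interaction. All names are hypothetical.

```python
def additive_fit_residual(table):
    """Mean squared residual of the best additive fit f(i, j) ~ a_i + b_j
    to a function tabulated on a full grid."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    squared = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            additive = row_means[i] + col_means[j] - grand
            squared += (table[i][j] - additive) ** 2
    return squared / (n_rows * n_cols)

xs = [0.0, 1.0, 2.0]
additive_table = [[a + 2 * b for b in xs] for a in xs]  # no interaction
product_table = [[a * b for b in xs] for a in xs]       # pure interaction

no_interaction = additive_fit_residual(additive_table)
interaction = additive_fit_residual(product_table)
```

The additive table is reproduced almost exactly by the restricted model, while the product table leaves a clearly nonzero residual.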

[Full paper] [Discussion]

paper ID: 632

### An Empirical Evaluation of Supervised Learning in High Dimensions

* Rich Caruana, Nikos Karampatziakis, and Ainur Yessenalina*

In this paper we perform an empirical evaluation of supervised learning methods on high dimensional data. We evaluate learning performance on three metrics: accuracy, AUC, and squared loss. We also study the effect of increasing dimensionality on the relative performance of the learning algorithms. Our findings are consistent with previous studies for problems of relatively low dimension, but suggest that as dimensionality increases the relative performance of the various learning algorithms changes. To our surprise, the methods that seem best able to learn from high dimensional data are random forests and neural nets.

[Full paper] [Discussion]

paper ID: 638

### Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient

* Tijmen Tieleman*

A new algorithm for training Restricted Boltzmann Machines is introduced. The algorithm, named Persistent Contrastive Divergence, is different from the standard Contrastive Divergence algorithms in that it aims to draw samples from almost exactly the model distribution. It is compared to some standard Contrastive Divergence algorithms on the tasks of modeling handwritten digits and classifying digit images by learning a model of the joint distribution of images and labels. The Persistent Contrastive Divergence algorithm outperforms other Contrastive Divergence algorithms, and is equally fast and simple.

[Full paper] [Discussion]

paper ID: 641

### An RKHS for Multi-View Learning and Manifold Co-Regularization

* Vikas Sindhwani and David Rosenberg*

Inspired by co-training, many multi-view semi-supervised kernel methods implement the following idea: find a function in each of multiple Reproducing Kernel Hilbert Spaces (RKHSs) such that (a) the chosen functions make similar predictions on unlabeled examples, and (b) the average prediction given by the chosen functions performs well on labeled examples. In this paper, we construct a single RKHS with a data-dependent “co-regularization” norm that reduces these approaches to standard supervised learning. The reproducing kernel for this RKHS can be explicitly derived and plugged into any kernel method, greatly extending the theoretical and algorithmic scope of co-regularization. In particular, with this development, the Rademacher complexity bound for co-regularization given in (Rosenberg & Bartlett, 2007) follows easily from well-known results. Furthermore, more refined bounds given by localized Rademacher complexity can also be easily applied. We propose a co-regularization based algorithmic alternative to manifold regularization (Belkin et al., 2006; Sindhwani et al., 2005a) that leads to major empirical improvements on semi-supervised tasks. Unlike the recently proposed transductive approach of (Yu et al., 2008), our RKHS formulation is truly semi-supervised and naturally extends to unseen test data.

[Full paper] [Discussion]

paper ID: 643

### A Generalization of Haussler's Convolution Kernel - Mapping Kernel

* Kilho Shin and Tetsuji Kuboyama*

Haussler's convolution kernel provides a successful framework for engineering new positive semidefinite kernels, and has been applied to a wide range of data types and applications. In the framework, each data object represents a finite set of finer grained components. Then, Haussler's convolution kernel takes a pair of data objects as input, and returns the sum of the return values of the predetermined primitive kernel calculated for all the possible pairs of the components of the input data objects. Due to the definition, Haussler's convolution kernel is also known as the cross product kernel, and is positive semidefinite, if so is the primitive kernel. On the other hand, the *mapping kernel* that we introduce in this paper is a natural generalization of Haussler's convolution kernel, in that the input to the primitive kernel moves over a predetermined subset rather than the entire cross product. Although we have plural instances of the mapping kernel in the literature, their positive semidefiniteness was investigated in case-by-case manners, and worse yet, was sometimes incorrectly concluded. In fact, there exists a simple and easily checkable necessary and sufficient condition, which is generic in the sense that it enables us to investigate the positive semidefiniteness of an arbitrary instance of the mapping kernel. This is the first paper that presents and proves the validity of the condition. In addition, we introduce two important instances of the mapping kernel, which we refer to as the *size-of-index-structure-distribution* kernel and the *edit-cost-distribution* kernel. Both of them are naturally derived from well known (dis)similarity measurements in the literature (e.g. the maximum agreement tree, the edit distance), and are reasonably expected to improve the performance of the existing measures by evaluating their distributional features rather than their peak (maximum/minimum) features.
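
The two definitions in this abstract can be sketched directly. In the example below, data objects are strings whose components are characters, the primitive kernel is a delta kernel, and the index subset passed to `mapping_kernel` is arbitrary; all of these are assumptions made for illustration, and the sketch does not check the paper's condition for positive semidefiniteness.

```python
from itertools import product

def convolution_kernel(xs, ys, primitive):
    """Sum of the primitive kernel over ALL pairs of components (the cross product)."""
    return sum(primitive(a, b) for a, b in product(xs, ys))

def mapping_kernel(xs, ys, primitive, index_pairs):
    """Sum of the primitive kernel over a predetermined SUBSET of component pairs,
    given as (i, j) index pairs into xs and ys."""
    return sum(primitive(xs[i], ys[j]) for i, j in index_pairs)

def delta(a, b):
    """Primitive kernel for this sketch: 1 if the components match, else 0."""
    return 1.0 if a == b else 0.0

x, y = "aab", "ab"
full = convolution_kernel(x, y, delta)                      # all 3 * 2 pairs
restricted = mapping_kernel(x, y, delta, [(0, 0), (2, 1)])  # only two chosen pairs
```

Passing every index pair to `mapping_kernel` recovers the convolution kernel, which is the sense in which the former generalizes the latter.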

[Full paper] [Discussion]

paper ID: 645

### Apprenticeship Learning Using Linear Programming

* Umar Syed, Michael Bowling, and Robert Schapire*

In apprenticeship learning, the goal is to learn a policy in a Markov decision process that is at least as good as a policy demonstrated by an expert. The difficulty arises in that the MDP's true reward function is assumed to be unknown. We show how to frame apprenticeship learning as a linear programming problem, and show that using an off-the-shelf LP solver to solve this problem results in a substantial improvement in running time over existing methods --- up to two orders of magnitude faster in our experiments. Additionally, our approach produces stationary policies, while all existing methods for apprenticeship learning output policies that are "mixed", i.e. randomized combinations of stationary policies. The technique used is general enough to convert any mixed policy to a stationary policy.

[Full paper] [Discussion]

paper ID: 652

### An Analysis of Reinforcement Learning with Function Approximation

* Francisco Melo, Sean Meyn, and Isabel Ribeiro*

We address the problem of computing the optimal Q-function in Markov decision problems with infinite state-space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis and Van Roy, 1996) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.

[Full paper] [Discussion]

paper ID: 655

### Strategy Evaluation in Extensive Games with Importance Sampling

* Michael Bowling, Michael Johanson, Neil Burch, and Duane Szafron*

Typically agent evaluation is done through Monte Carlo estimation. However, stochastic agent decisions and stochastic outcomes can make this approach inefficient, requiring many samples for an accurate estimate. We present a new technique that can be used to simultaneously evaluate many strategies while playing a single strategy in the context of an extensive game. This technique is based on importance sampling, but utilizes two new mechanisms for significantly reducing variance in the estimates. We demonstrate its effectiveness in the domain of poker, where stochasticity makes traditional evaluation problematic.

[Full paper] [Discussion]

paper ID: 665

### Composite Kernel Learning

* Marie Szafranski, Yves Grandvalet, and Alain Rakotomamonjy*

The Support Vector Machine (SVM) is an acknowledged powerful tool for building classifiers, but it lacks flexibility, in the sense that the kernel is chosen prior to learning. Multiple Kernel Learning (MKL) makes it possible to learn the kernel from an ensemble of basis kernels, whose combination is optimized in the learning process. Here, we propose Composite Kernel Learning to address the situation where distinct components give rise to a group structure among kernels. Our formulation of the learning problem encompasses several setups, putting more or less emphasis on the group structure. We characterize the convexity of the learning problem, and provide a general wrapper algorithm for computing solutions. Finally, we illustrate the behavior of our method on multi-channel data where groups correspond to channels.

[Full paper] [Discussion]

paper ID: 667

### Nonnegative Matrix Factorization via Rank-One Downdate

* Michael Biggs, Ali Ghodsi, and Stephen Vavasis*

Nonnegative matrix factorization (NMF) was popularized as a tool for data mining by Lee and Seung in 1999. NMF attempts to approximate a matrix with nonnegative entries by a product of two low-rank matrices, also with nonnegative entries. We propose an algorithm called rank-one downdate (R1D) for computing a NMF that is partly motivated by singular value decomposition. This algorithm computes the dominant singular values and vectors of adaptively determined submatrices of a matrix. On each iteration, R1D extracts a rank-one submatrix from the dataset according to an objective function. We establish a theoretical result that maximizing this objective function corresponds to correctly classifying articles in a nearly separable corpus. We also provide computational experiments showing the success of this method in identifying features in realistic datasets. The method is much faster than either LSI or other NMF routines.

[Full paper] [Discussion]

paper ID: 668

### Closed-form Supervised Dimensionality Reduction with Generalized Linear Models

* Irina Rish, Genady Grabarnik, Guillermo Cecchi, Francisco Pereira, and Geoffrey J. Gordon*

We propose a family of supervised dimensionality reduction (SDR) algorithms that combine feature extraction (dimensionality reduction) with learning a predictive model in a unified optimization framework, using data- and class-appropriate generalized linear models (GLMs), and handling both classification and regression problems. Our approach uses simple closed-form update rules and is provably convergent. Promising empirical results are demonstrated on a variety of high-dimensional datasets.

[Full paper] [Discussion]

paper ID: 673

### Structure Compilation: Trading Structure for Features

* Percy Liang, Hal Daume, and Dan Klein*

Structured models often achieve excellent performance but can be slow at test time. We investigate structure compilation, where we replace structure with features, which are often computationally simpler but unfortunately statistically more complex. We analyze this tradeoff theoretically and empirically on three natural language processing tasks. We also introduce a simple method to transfer predictive power from structure to features via unlabeled data, while incurring a minimal statistical penalty.

[Full paper] [Discussion]

paper ID: 676

### ManifoldBoost: Stagewise Function Approximation for Fully-, Semi- and Un-supervised Learning

* Nicolas Loeff, David Forsyth, and Deepak Ramachandran*

We describe a manifold learning framework that naturally accommodates supervised learning, partially supervised learning and unsupervised clustering as particular cases. Our method chooses a function by minimizing loss subject to a manifold regularization penalty. This augmented cost is minimized using a greedy stagewise functional minimization procedure, as in Gradientboost. Each stage of boosting is fast and efficient. We demonstrate our approach using both radial basis function approximations and classification trees. The performance of our method is at the state of the art on standard problems.

[Full paper] [Discussion]

paper ID: 679

### Beam Sampling for the Infinite Hidden Markov Model

* Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani*

The infinite hidden Markov model is a nonparametric extension of the widely used hidden Markov model. Our paper introduces a new inference algorithm for the infinite hidden Markov model called beam sampling. Beam sampling combines slice sampling, which limits the number of states considered at each time step to a finite number, with dynamic programming, which samples whole state trajectories efficiently. Our algorithm typically outperforms the Gibbs sampler and is more robust. We present applications of iHMM inference using the beam sampler on changepoint detection and text prediction problems.

[Full paper] [Discussion]

paper ID: 681

### Message-passing for Graph-structured Linear Programs: Proximal Projections, Convergence and Rounding Schemes

* Pradeep Ravikumar, Alekh Agarwal, and Martin J. Wainwright*

Linear programming relaxations are one promising approach to solving the MAP estimation problem in Markov random fields; in particular, a body of past work has focused on the first-order tree-based LP relaxation for the MAP problem. Although a variety of algorithms with interesting connections to this LP have been proposed, to date none is guaranteed to always solve the LP for any problem. In this paper, we develop a family of provably convergent LP solvers based on proximal minimization schemes using Bregman divergences that exploit the underlying graphical structure, and so scale well to large problems. All of our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman divergences used to compute each proximal update. The inner loop updates are distributed and respect the graph structure, and thus can be cast as message-passing algorithms. We establish various convergence guarantees for our algorithms, illustrate their performance on medium to large-scale problems, and also present a tree-based rounding scheme with provable optimality guarantees.

[Full paper] [Discussion]

paper ID: 682

### On the Hardness of Finding Symmetries in Markov Decision Processes

* Shravan Narayanamurthy and Balaraman Ravindran*

In this work we address the question of finding symmetries of a given MDP. We show that the problem is *Isomorphism Complete*, that is, the problem is polynomially equivalent to verifying whether two graphs are isomorphic. Apart from the theoretical importance of this result it has an important practical application. The reduction presented can be used together with any off-the-shelf Graph Isomorphism solver, which performs well in the average case, to find symmetries of an MDP. In fact, we present results of using NAutY (the best Graph Isomorphism solver currently available), to find symmetries of MDPs.

[Full paper] [Discussion]

paper ID: 687

### Actively Learning Level-Sets of Composite Functions

* Brent Bryan and Jeff Schneider*

Scientists frequently have multiple types of experiments and data sets on which they can test the validity of their parametrized models and locate plausible regions for the model parameters. By examining multiple data sets, these scientists can obtain inferences for their problems which typically are much more informative than the deductions derived from each of the data sources independently. Several standard data combination techniques result in a target function which is a weighted sum of the observed data sources. Computing constraints on the plausible regions of the model parameter space can be formulated as that of finding a specified level set of the target function. We propose an active learning algorithm for this problem which at each step selects both a parameter setting (from the parameter space) and an experiment type upon which to compute the next sample. Empirical tests on synthetic functions and on real data for an eight-parameter cosmological model show that our algorithm significantly reduces the number of samples required to identify desired regions.

[Full paper] [Discussion]