**Schedule**

Subject to change. The complete list of papers with supplementary material is in the JMLR proceedings.

- Sunday 19th: Tutorials and welcome reception.
- Monday 20th – Wednesday 22nd: Main Conference. The event will be live streamed by techtalks.tv. The ICML’16 awards are highlighted here.
- Thursday 23rd – Friday 24th: Workshops.

Poster sessions will run in parallel with the main conference:

- Monday afternoon 3pm – 7pm
- Tuesday morning 10am – 1pm
- Tuesday afternoon 3pm – 7pm
- Wednesday morning 10am – 1pm

Poster sessions are denoted in the tables with a |, with papers allocated as described in this schedule.

#### Birds of a Feather Unworkshop

This is a room where you can post a note with a subject, time, and place to meet other people for a discussion of your choosing. These meetings have no set format and are not pre-planned, so have fun.

**Room:** Marriott: Duffy/Columbia, 7th floor. **Time:** Thursday 8:30–5, Friday 8:30–4.

**Note:** You can also find all this information in the conference booklet.

## Main Conference

## Detailed Schedule

### Monday – Neural Networks and Deep Learning

#### Session chair: Hugo Larochelle

**Ballroom 1+2+Juliard**

We develop machine learning systems with this important capacity by developing new deep generative models, models that combine the representational power of deep learning with the inferential power of Bayesian reasoning. We develop a class of sequential generative models that are built on the principles of feedback and attention. These two characteristics lead to generative models that are among the state-of-the art in density estimation and image generation. We demonstrate the one-shot generalization ability of our models using three tasks: unconditional sampling, generating new exemplars of a given concept, and generating new exemplars of a family of concepts. In all cases our models are able to generate compelling and diverse samples—having seen new examples just once—providing an important class of general-purpose models for one-shot machine learning.

#### Session chair: Honglak Lee

**Ballroom 1+2+Juliard**

**Best paper award** – 03:44 – Pixel Recurrent Neural Networks Paper | Reviews | Rebuttal | Poster session on Tuesday morning 10am–1pm | Abstract

This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.
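The sequential prediction the abstract describes is, at bottom, the probability chain rule applied over pixels in raster-scan order. A minimal illustrative sketch, using a made-up uniform conditional rather than the paper's recurrent architecture:

```python
import math

def joint_log_likelihood(image, conditional):
    """Chain-rule factorization over pixels in raster-scan order:
    log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    flat = [v for row in image for v in row]
    total = 0.0
    for i, value in enumerate(flat):
        # Each pixel is scored conditioned on all pixels before it.
        total += math.log(conditional(flat[:i], value))
    return total

# Toy conditional: uniform over 4 intensity levels, ignoring context.
uniform = lambda context, value: 0.25

img = [[0, 1], [2, 3]]
ll = joint_log_likelihood(img, uniform)
# With 4 pixels at probability 1/4 each, log p(x) = 4 * log(1/4)
```

In the actual model, `conditional` would be a recurrent network whose output distribution depends on the previously generated pixels.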

### Monday – Reinforcement Learning

#### Session chair: Joelle Pineau

**Ballroom 3+4**

secondary agents with competing goals also adapt their strategies, yet it remains challenging because of the strategies’ complex interaction and non-stationary nature. Most previous work focuses on developing probabilistic models or parameterized strategies for specific applications. Inspired by the recent success of deep reinforcement learning, we present neural-based models that jointly learn a policy and the behavior of opponents. Instead of explicitly predicting the opponent’s action, we encode observations of the opponents into a deep Q-Network (DQN), while retaining explicit modeling under multitasking. By using a Mixture-of-Experts architecture, our model automatically discovers different strategy patterns of opponents even without extra supervision. We evaluate our models on a simulated soccer game and a popular trivia game, showing superior performance over DQN and its variants.

**Best paper award** – 12:04 – Dueling Network Architectures for Deep Reinforcement Learning Paper | Reviews | Rebuttal | Poster session on Monday afternoon 3:00pm–7:00pm | Abstract

### Monday – Optimization (Continuous)

#### Session chair: Sebastien Bubeck

**Marquis**

analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to minibatching in parallel settings.
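For readers unfamiliar with SVRG, the update the abstract refers to can be sketched as follows. This is the generic SVRG loop on a toy least-squares problem, with made-up step size and epoch counts, not the paper's nonconvex analysis:

```python
import random

def svrg(grad_i, n, w0, eta=0.1, epochs=20, m=50):
    """Stochastic Variance Reduced Gradient (sketch).
    grad_i(w, i) returns the gradient of the i-th component function at w."""
    w = w0
    for _ in range(epochs):
        snapshot = w
        # Full gradient at the snapshot: the variance-reduction anchor.
        mu = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(m):
            i = random.randrange(n)
            # Variance-reduced stochastic gradient.
            g = grad_i(w, i) - grad_i(snapshot, i) + mu
            w = w - eta * g
    return w

# Toy problem: f(w) = (1/n) * sum_i (w - a_i)^2 / 2, minimized at mean(a).
a = [1.0, 2.0, 3.0, 4.0]
w_star = svrg(lambda w, i: w - a[i], n=len(a), w0=0.0)
# w_star converges to 2.5
```

The correction term `grad_i(w, i) - grad_i(snapshot, i) + mu` keeps the stochastic gradient unbiased while shrinking its variance as `w` approaches the snapshot.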

We obtain new primal-dual convergence rates, e.g., for the Lasso as well as many L1, Elastic Net, group Lasso and TV-regularized problems. The theory applies to any norm-regularized generalized linear model. Our approach provides efficiently computable duality gaps which are globally defined, without modifying the original problems in the region of interest.

### Monday – Online Learning

#### Session chair: Alina Beygelzimer

**Lyceum**

We show a large gap between the adversarial and the stochastic cases.

In the adversarial case, we prove that even for dense feedback graphs, the learner cannot improve upon a trivial regret bound obtained by ignoring any additional feedback besides her own loss.

In contrast, in the stochastic case we give an algorithm that achieves $\widetilde{\Theta}(\sqrt{\alpha T})$ regret over $T$ rounds, provided that the independence numbers of the hidden feedback graphs are at most $\alpha$.

We also extend our results to a more general feedback model, in which the learner does not necessarily observe her own loss, and show that, even in simple cases, concealing the feedback graphs might render the problem unlearnable.

#### Session chair: Csaba Szepesvari

**Liberty**

### Monday – Clustering

#### Session chair: Alexandru Niculescu-Mizil

**Empire**

for hard clustering algorithms. In this paper, our first contribution is a two-way generalisation of this seeding, $k$-variates++, that includes the sampling of general densities rather than just a discrete set of Dirac densities anchored at the point locations, *and* a generalisation of the well known Arthur–Vassilvitskii (AV) approximation guarantee, in the form of a *bias+variance* approximation bound of the *global* optimum. This approximation exhibits a reduced dependency on the “noise” component with respect to the optimal potential, actually approaching the statistical lower bound. We show that $k$-variates++ *reduces* to efficient (biased seeding) clustering algorithms tailored to specific frameworks; these include distributed, streaming and on-line clustering, with *direct* approximation results for these algorithms.

Finally, we present a novel application of $k$-variates++ to differential privacy. For either the specific frameworks considered here, or for the differential privacy setting, there are little to no prior results on the direct application of $k$-means++ and its approximation bounds; state of the art contenders appear to be significantly more complex and/or display less favorable (approximation) properties. We stress that our algorithms can still be run in cases where there is *no* closed form solution for the population minimizer. We demonstrate the applicability of our analysis via experimental evaluation on several domains and settings, displaying competitive performances vs state of the art.
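For context, the classical $k$-means++ ($D^2$) seeding that $k$-variates++ generalises can be sketched as follows. A 1-D toy illustration of the standard procedure, not the paper's algorithm:

```python
import random

def kmeanspp_seeds(points, k, seed=0):
    """Standard k-means++ (D^2) seeding: each new center is sampled with
    probability proportional to the squared distance to the nearest
    already-chosen center. k-variates++ generalises this by sampling
    from densities anchored at the points rather than Dirac masses."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

# Two well-separated 1-D clusters; the second seed is overwhelmingly
# likely to come from the cluster far from the first seed.
pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
seeds = kmeanspp_seeds(pts, k=2)
```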

### Monday – Bayesian Nonparametric Methods

#### Session chair: Hal Daume III

**Soho**

We use completely random measures (CRM) and hierarchical CRM to define a prior for Poisson processes. We derive the marginal distribution of the resultant point process, when the underlying CRM is marginalized out. Using well-known properties unique to Poisson processes, we derive an exact approach for instantiating a Poisson process with a hierarchical CRM prior. Furthermore, we derive Gibbs sampling strategies for hierarchical CRM models based on the Chinese restaurant franchise sampling scheme. As an example, we present the sum of generalized gamma process (SGGP), and show its application in topic modelling. We show that one can determine the power-law behaviour of the topics and words in a Bayesian fashion, by defining a prior on the parameters of the SGGP.

### Monday – Matrix Factorization / Neuroscience Applications

#### Session chair: Jennifer Dy

**Liberty**

sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced *canonical* variables are maximally correlated. Sparse CCA is NP-hard. We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that variables within each set are standardized and uncorrelated. Our algorithm operates on a low rank approximation of the input data and its computational complexity scales linearly with the number of input variables. It is simple to implement and parallelizable. In contrast to most existing approaches, our algorithm administers precise control on the sparsity of the extracted canonical vectors, and comes with theoretical data-dependent global approximation guarantees that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA enforcing structure beyond sparsity. We empirically evaluate the proposed scheme and apply it on a real neuroimaging dataset to investigate associations between brain activity and behavior measurements.

### Monday – Optimization / Online Learning

#### Session chair: Satyen Kale

**Ballroom 3+4**

approximate maximal variance, based on a stream of i.i.d. data points in $\mathbb{R}^d$. A simple and computationally cheap algorithm for this is stochastic gradient descent (SGD), which incrementally updates its estimate based on each new data point. However, due to the non-convex nature of the problem, analyzing its performance has been a challenge. In particular, existing guarantees rely on a non-trivial eigengap assumption on the covariance matrix, which is intuitively unnecessary. In this paper, we provide (to the best of our knowledge) the first eigengap-free convergence guarantees for SGD in the context of PCA. This also partially resolves an open problem posed in \cite{hardt2014noisy}. Moreover, under an eigengap assumption, we show that the same techniques lead to new SGD convergence guarantees with better dependence on the eigengap.
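The incremental SGD update for the top principal component is, in its simplest form, Oja's rule. A minimal sketch on synthetic data (step size, dimensions, and the variance profile are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def oja_sgd(samples, eta=0.05):
    """Oja's-rule SGD for the top eigenvector of the covariance matrix:
    w <- normalize(w + eta * x * (x . w)) for each streamed sample x."""
    d = samples.shape[1]
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for x in samples:
        # Stochastic gradient step on the Rayleigh quotient, then project
        # back to the unit sphere.
        w = w + eta * x * (x @ w)
        w /= np.linalg.norm(w)
    return w

# Stream whose variance is dominated by the first coordinate.
X = rng.normal(size=(5000, 3)) * np.array([3.0, 1.0, 0.5])
w = oja_sgd(X)
# |w[0]| ends up close to 1: w aligns with the top principal direction.
```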

Here $\mathrm{nnz}(A)$ is the number of nonzeros in $A$, $\mathrm{sr}(A)$ is the stable rank, and $\mathrm{gap}$ is the relative eigengap.

We also consider an online setting in which, given a stream of i.i.d. samples from a distribution $D$ with covariance matrix $\Sigma$ and a vector $x_0$ which is an $O(\mathrm{gap})$ approximate top eigenvector for $\Sigma$, we show how to refine $x_0$ to an $\epsilon$ approximation using $O\!\left(\frac{\mathrm{var}(D)}{\mathrm{gap} \cdot \epsilon}\right)$ samples from $D$. Here $\mathrm{var}(D)$ is a natural notion of variance. Combining our algorithm with previous work to initialize $x_0$, we obtain improved sample complexities and runtimes under a variety of assumptions on $D$.

We achieve our results via a robust analysis of the classic shift-and-invert preconditioning method. This technique lets us reduce eigenvector computation to *approximately* solving a series of linear systems with fast stochastic gradient methods.

### Monday – Machine Learning Applications

#### Session chair: Balazs Kegl

**Marquis**

dealbreaker model, in which a student’s success probability is determined by their weakest concept mastery. We develop efficient parameter inference algorithms for this model using novel methods for nonconvex optimization. We show that the dealbreaker model achieves comparable or better prediction performance than affine models on real-world educational datasets. We further demonstrate that the parameters learned by the dealbreaker model are interpretable: they provide key insights into which concepts are critical (i.e., the “dealbreaker”) to answering a question correctly. We conclude by reporting preliminary results for a movie-rating dataset, which illustrate the broader applicability of the dealbreaker model.


Bayesian optimization is the task of finding the global maximizer of an unknown, expensive function through sequential evaluation using Bayesian decision theory. However, many interesting problems involve optimizing multiple, expensive-to-evaluate objectives simultaneously, and relatively little research has addressed this setting from a Bayesian theoretic standpoint. A prevailing choice when tackling this problem is to model the multiple objectives as being independent, typically for ease of computation. In practice, objectives are correlated to some extent. In this work, we incorporate the modelling of inter-task correlations, developing an approximation to overcome intractable integrals. We illustrate the power of modelling dependencies between objectives on a range of synthetic and real world multi-objective optimization problems.

### Monday – Matrix Factorization and Related Topics

#### Session chair: Alex Kulesza

**Lyceum**

Existing polynomial time algorithms with proven error guarantees require *each* column $N_{\cdot j}$ to have $\ell_1$ norm much smaller than $\|(BC)_{\cdot j}\|_1$, which could be very restrictive. In important applications of NMF such as topic modeling, as well as under theoretical noise models (e.g. Gaussian with high sigma), almost *every* column $N_{\cdot j}$ violates this condition. We introduce the heavy noise model, which only requires the average noise over large subsets of columns to be small. We initiate a study of noisy NMF under the heavy noise model. We show that our noise model subsumes noise models of theoretical and practical interest (for example, Gaussian noise of maximum possible sigma). We then devise an algorithm, TSVDNMF, which, under certain assumptions on $B$ and $C$, solves the problem under heavy noise. Our error guarantees match those of previous algorithms. Our running time of $O(k(d+n)^2)$ is substantially better than the $O(dn^3)$ of the previous best. Our assumption on $B$ is weaker than the “separability” assumption made by all previous results. We provide empirical justification for our assumptions on $C$. We also provide the first proof of identifiability (uniqueness of $B$) for noisy NMF which is not based on separability and does not use hard-to-check geometric conditions. Our algorithm outperforms earlier polynomial time algorithms both in time and error, particularly in the presence of high noise.

### Monday – Bandit Problems

#### Session chair: Shie Mannor

**Empire**

At each time step the agent chooses an arm, and observes the reward of the obtained sample.

Each sample is considered here as a separate item with the reward designating its value, and the goal is to find an item with the highest possible value. Our basic assumption is a known lower bound on the *tail function* of the reward distributions. Under the PAC framework, we provide a lower bound on the sample complexity of any $(\epsilon,\delta)$-correct algorithm, and propose an algorithm that attains this bound up to logarithmic factors.

We provide an analysis of the robustness of the proposed algorithm to the model assumptions, and further compare its performance to the simple non-adaptive variant, in which the arms are chosen randomly at each stage.

noise. Most of the work on linear bandits has been based on the assumption of bounded or sub-Gaussian noise. However, this assumption is often violated in common scenarios such as financial markets. We present two algorithms to tackle this problem: one based on dynamic truncation and one based on a median of means estimator. We show that, when the noise admits only a $1 + \epsilon$ moment, these algorithms are still able to achieve regret of $\widetilde{O}(T^{\frac{2 + \epsilon}{2(1 + \epsilon)}})$ and $\widetilde{O}(T^{\frac{1 + 2\epsilon}{1 + 3\epsilon}})$ respectively. In particular, they guarantee sublinear regret as long as the noise has finite variance. We also present empirical results showing that our algorithms achieve better performance than the current state of the art for bounded noise when the $L_\infty$ bound on the noise is large yet the $1 + \epsilon$ moment of the noise is small.
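The median of means estimator mentioned above is a standard robust-statistics building block. A minimal sketch of the plain mean estimator under a heavy-tailed outlier, not the bandit algorithm itself:

```python
import statistics

def median_of_means(samples, k=5):
    """Median of means: split the samples into k equal blocks, average
    each block, and return the median of the block means. A single
    heavy-tailed outlier can corrupt at most one block mean."""
    block = len(samples) // k
    means = [sum(samples[i * block:(i + 1) * block]) / block
             for i in range(k)]
    return statistics.median(means)

# One extreme outlier barely moves the estimate, while the plain
# sample mean would be dragged to 10.99.
data = [1.0] * 99 + [1000.0]
est = median_of_means(data, k=5)
# est is exactly 1.0: the outlier's block mean (50.95) is discarded
# by the median over [1.0, 1.0, 1.0, 1.0, 50.95].
```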

### Monday – Graphical Models

#### Session chair: Stefano Ermon

**Soho**

### Monday – Transfer Learning / Learning Theory

#### Session chair: Alina Beygelzimer

**Liberty**

the empirical risk of most well-known loss functions factors into a linear term aggregating all labels with a term that is label free, and can further be expressed by sums of the same loss. This holds true even for non-smooth, non-convex losses and in any RKHS. The first term is a (kernel) mean operator — the focal quantity of this work — which we characterize as the sufficient statistic for the labels. The result tightens known generalization bounds and sheds new light on their interpretation.

This factorization has a direct application to weakly supervised learning. In particular, we demonstrate that algorithms like SGD and proximal methods can be adapted with minimal effort to handle weak supervision, once the mean operator has been estimated. We apply this idea to learning with asymmetric noisy labels, connecting and extending prior work. Furthermore, we show that most losses enjoy a data-dependent (via the mean operator) form of noise robustness, in contrast with known negative results.

### Monday – Neural Networks and Deep Learning I

#### Session chair: Kyunghyun Cho

**Ballroom 1+2+Juliard**

function, we allow the optimization procedure to explore the boundary between the degenerate (saturating) and the well-behaved parts of the activation function. We also establish connections to simulated annealing: when the amount of noise is annealed down, it becomes easier to optimize hard objective functions. We find experimentally that replacing such saturating activation functions by noisy variants helps optimization in many contexts, yielding state-of-the-art or competitive results on different datasets and tasks, especially when training seems to be the most difficult, e.g., when curriculum learning is necessary to obtain good results.

ful tools for processing sequential data, finding architectures or optimization strategies that allow them to model very long term dependencies is still an active area of research. In this work, we carefully analyze two synthetic datasets originally outlined in (Hochreiter & Schmidhuber, 1997) which are used to evaluate the ability of RNNs to store information over many time steps. We explicitly construct RNN solutions to these problems, and using these constructions, illuminate both the problems themselves and the way in which RNNs store different types of information in their hidden states. These constructions furthermore explain the success of recent methods that specify unitary initializations or constraints on the transition matrices.

### Monday – Neural Networks and Deep Learning II (Computer Vision)

#### Session chair: Laurens van der Maaten

**Ballroom 3+4**

### Monday – Approximate Inference

#### Session chair: Jun Zhu

**Marquis**

probability distributions in a compact form is one of the key insights in the field of graphical models. Factored representations are ubiquitous in machine learning and lead to major computational advantages. We explore a different type of compact representation based on discrete Fourier representations, complementing the classical approach based on conditional independencies. We show that a large class of probabilistic graphical models have a compact Fourier representation. This theoretical result opens up an entirely new way of approximating a probability distribution. We demonstrate the significance of this approach by applying it to the variable elimination algorithm. Compared with the traditional bucket representation and other approximate inference algorithms, we obtain significant improvements.

These inferences are typically intractable, and are a major bottleneck for learning models with large output spaces. In this paper, we provide a new approach for amortizing the cost of a sequence of related inference queries, such as the ones arising during learning. Our technique relies on a surprising connection with algorithms developed in the past two decades for similarity search in large data bases. Our approach achieves improved running times with provable approximation guarantees. We show that it performs well both on synthetic data and neural language models with large output spaces.

### Monday – Metric and Manifold Learning / Kernel Methods

#### Session chair: Le Song

**Lyceum**

### Monday – Statistical Learning Theory

#### Session chair: Ohad Shamir

**Empire**

for a nonlinear tensor estimation problem. Low-rank tensor estimation has been used as a method to learn higher order relations among several data sources in a wide range of applications, such as multi-task learning, recommendation systems, and spatiotemporal analysis. We consider a general setting where a common linear tensor learning is extended to a nonlinear learning problem in reproducing kernel Hilbert space and propose a nonparametric Bayesian method based on the Gaussian process method. We prove its statistical convergence rate without assuming any strong convexity, such as restricted strong convexity. Remarkably, it is shown that our convergence rate achieves the minimax optimal rate. We apply our proposed method to multi-task learning and show that our method significantly outperforms existing methods through numerical experiments on real-world data sets.

information gain with respect to the global maximum, MRS aims at minimizing the expected simple regret of its ultimate recommendation for the optimum. While empirically ES and MRS perform similarly in most cases, MRS produces fewer outliers with high simple regret than ES. We provide empirical results both for a synthetic single-task optimization problem as well as for a simulated multi-task robotic control problem.

### Monday – Structured Prediction / Monte Carlo Methods

#### Session chair: Martin Jaggi

**Soho**

The core idea is to reformulate the estimation of a score – whether a loss or a prediction estimate – as an empirical expectation, and to use such a tree, whose leaves carry the samples, to focus efforts on the problematic “heavy weight” ones.

We illustrate the potential of this approach on three problems: improving Adaboost and a multi-layer perceptron on 2D synthetic tasks with several million points, training a large-scale convolutional network on several million deformations of the CIFAR dataset, and computing the response of an SVM with several hundred thousand support vectors. In each case, we show how it either cuts down computation by more than one order of magnitude and/or yields better loss estimates.

### Tuesday – Neural Networks and Deep Learning

#### Session chair: Hal Daume III

**Ballroom 1+2+Juliard**

We introduce the dynamic memory network (DMN), a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers.

The DMN can be trained end-to-end and obtains state-of-the-art results on several types of tasks and datasets:

- question answering (Facebook’s bAbI dataset),
- text classification for sentiment analysis (Stanford Sentiment Treebank), and
- sequence modeling for part-of-speech tagging (WSJ-PTB).

The training for these different tasks relies exclusively on trained word vector representations and input-question-answer triplets.

We trained a PHOG model on a large JavaScript code corpus and show that it is more precise than existing models, while similarly fast. As a result, PHOG can immediately benefit existing programming tools based on probabilistic models of code.

### Tuesday – Reinforcement Learning

#### Session chair: Tom Erez

**Ballroom 3+4**

We explain why versions of least-squares value iteration that use Boltzmann or epsilon-greedy exploration can be highly inefficient, and we present computational results that demonstrate dramatic efficiency gains enjoyed by RLSVI.

Further, we establish an upper bound on the expected regret of RLSVI that demonstrates near-optimality in a tabula rasa learning context.

More broadly, our results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.

policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local quadratic time-dependent Q-function, allowing the derivation of the policy update in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics, demonstrating improved performance in comparison to related trajectory optimization algorithms that linearize the dynamics.

#### Session chair: Lihong Li

**Marquis**

#### Session chair: Tom Erez

**Marquis**

The performance of several learning algorithms (e.g., Q-learning) is affected by the accuracy of such estimation.

Unfortunately, no unbiased estimator exists.

The usual approach of taking the maximum of the sample means leads to large overestimates that may significantly harm the performance of the learning algorithm.

Recent works have shown that the cross validation estimator—which is negatively biased—outperforms the maximum estimator in many sequential decision-making scenarios.

On the other hand, the relative performance of the two estimators is highly problem-dependent.

In this paper, we propose a new estimator for the maximum expected value, based on a weighted average of the sample means, where the weights are computed using Gaussian approximations for the distributions of the sample means.

We compare the proposed estimator with the other state-of-the-art methods both theoretically, by deriving upper bounds to the bias and the variance of the estimator, and empirically, by testing the performance on different sequential learning problems.
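The weighting idea can be sketched with a simple Monte Carlo approximation of the probability that each arm's sample mean is the largest. This is an illustrative stand-in for the paper's Gaussian-approximation computation, with made-up arm statistics:

```python
import random

def weighted_max_estimate(means, std_errs, n_mc=20000, seed=0):
    """Estimate max_i E[X_i] as sum_i w_i * mean_i, where w_i is the
    Monte Carlo probability that arm i has the largest mean under
    independent Gaussian approximations N(mean_i, std_err_i^2)."""
    rng = random.Random(seed)
    wins = [0] * len(means)
    for _ in range(n_mc):
        draws = [rng.gauss(m, s) for m, s in zip(means, std_errs)]
        wins[draws.index(max(draws))] += 1
    return sum((w / n_mc) * m for w, m in zip(wins, means))

# Two arms with close sample means and noisy estimates: the weighted
# estimate lies between the arms, rather than always reporting the
# larger sample mean (the positively biased maximum estimator).
est = weighted_max_estimate([0.5, 0.6], [0.3, 0.3])
```

Compare this with `max([0.5, 0.6]) == 0.6`, which overestimates whenever the two true means are close relative to the estimation noise.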

We provide theoretical convergence guarantees for all the proposed algorithms and also empirically demonstrate the usefulness of our algorithms.

### Tuesday – Optimization (Combinatorial)

#### Session chair: Andreas Krause

**Marquis**

### Tuesday – Unsupervised Learning / Representation Learning

#### Session chair: Jennifer Dy

**Lyceum**

Intuitively, data is passed through a series of progressively fine-grained sieves.

Each layer of the sieve recovers a single latent factor that is maximally informative about multivariate dependence in the data.

The data is transformed after each pass so that the remaining unexplained information trickles down to the next layer.

Ultimately, we are left with a set of latent factors explaining all the dependence in the original data and remainder information consisting of independent noise.

We present a practical implementation of this framework for discrete variables and apply it to a variety of fundamental tasks in unsupervised learning including independent component analysis, lossy and lossless compression, and predicting missing values in data.

We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art.

learning algorithms, because not only is it an efficient mode of data representation, but also — more importantly — it captures the generation process of most real world data. While a number of regularized auto-encoders (AE) enforce sparsity explicitly in their learned representation and others don’t, there has been little formal analysis of what encourages sparsity in these models in general. Our objective is to formally study this general problem for regularized auto-encoders. We provide sufficient conditions on both regularization and activation functions that encourage sparsity. We show that multiple popular models (e.g., de-noising and contractive auto-encoders) and activations (e.g., rectified linear and sigmoid) satisfy these conditions; thus, our conditions help explain sparsity in their learned representations. Our theoretical and empirical analysis together shed light on the properties of regularization and activation that are conducive to sparsity, and unify a number of existing auto-encoder models and activation functions under the same analytical framework.

### Tuesday – Sampling / Kernel Methods

#### Session chair: Marius Kloft

**Empire**

known lower bounds depending exponentially on dimension. A popular strategy to alleviate this curse of dimensionality has been to use additive models of *first order*, which model the regression function as a sum of independent functions on each dimension. Though useful in controlling the variance of the estimate, such models are often too restrictive in practical settings. Between non-additive models, which often have large variance, and first order additive models, which have large bias, there has been little work to exploit the trade-off in the middle via additive models of intermediate order. In this work, we propose SALSA, which bridges this gap by allowing interactions between variables but controlling model capacity by limiting the order of interactions. SALSA minimises the residual sum of squares with squared RKHS norm penalties. Algorithmically, it can be viewed as kernel ridge regression with an additive kernel. When the regression function is additive, the excess risk is only polynomial in dimension. Using the Girard–Newton formulae, we efficiently sum over a combinatorial number of terms in the additive expansion. Via a comparison on 15 real datasets, we show that our method is competitive against 21 other alternatives.
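The Girard–Newton formulae referenced here relate elementary symmetric polynomials to power sums. A minimal sketch of that identity on scalars (in the kernel setting the same recursion is applied to per-dimension base kernel evaluations, which is what avoids the combinatorial sum):

```python
def elementary_symmetric(xs, k):
    """Girard-Newton identities: compute e_1, ..., e_k of xs from the
    power sums p_j = sum_i xs[i]**j, in O(k * (len(xs) + k)) time
    instead of summing over all C(n, k) subsets."""
    p = [sum(x ** j for x in xs) for j in range(1, k + 1)]
    e = [1.0]  # e_0 = 1
    for m in range(1, k + 1):
        # m * e_m = sum_{j=1}^{m} (-1)^(j-1) * e_{m-j} * p_j
        s = sum((-1) ** (j - 1) * e[m - j] * p[j - 1]
                for j in range(1, m + 1))
        e.append(s / m)
    return e[1:]

# e_1, e_2, e_3 of (1, 2, 3): 1+2+3 = 6, 1*2+1*3+2*3 = 11, 1*2*3 = 6
e1, e2, e3 = elementary_symmetric([1.0, 2.0, 3.0], 3)
```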

### Tuesday – Sparsity and Compressed Sensing

#### Session chair: Gal Chechik

**Soho**

A proximal gradient homotopy algorithm is proposed to solve the proposed optimization problem.

Theoretically, we first prove that the proposed estimator attains a faster statistical rate than the traditional low-rank matrix estimator with nuclear norm penalty.

Moreover, we rigorously show that under a certain condition on the magnitude of the nonzero singular values, the proposed estimator enjoys oracle property (i.e., exactly recovers the true rank of the matrix), besides attaining a faster rate.

Extensive numerical experiments on both synthetic and real world datasets corroborate our theoretical findings.

### Tuesday – Approximate Inference

#### Session chair: Jon Mcauliffe

**Liberty**

We extend the multi-sample approach to discrete latent variables and analyze the difficulty encountered when estimating the gradients involved. We then develop the first unbiased gradient estimator designed for importance-sampled objectives and evaluate it on training generative and structured output prediction models. The resulting estimator, which is based on low-variance per-sample learning signals, is both simpler and more effective than the NVIL estimator proposed for the single-sample variational objective, and is competitive with the currently used biased estimators.

### Tuesday – Neural Networks and Deep Learning I

#### Session chair: Nicolas Le Roux

**Ballroom 1+2+Juliard**

#### Session chair: Yoshua Bengio

**Ballroom 1+2+Juliard**

### Tuesday – Neural Networks and Deep Learning II

#### Session chair: Alexandru Niculescu-Mizil

**Ballroom 3+4**

Unfortunately, they usually do not take into account the often unknown but nevertheless rich relationships between labels. In this paper, we propose to make use of this underlying structure by learning to partition the labels into a Markov Blanket Chain and then applying a novel deep architecture that exploits the partition.

Experiments on several popular and large multi-label datasets demonstrate that our approach not only yields significant improvements, but also helps to overcome trade-offs specific to the multi-label classification setting.

Relatively little work has focused on learning representations for clustering.

In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks.

DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective.

Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.

We propose a framework for learning convolutional neural networks for arbitrary graphs. These graphs may be undirected or directed, with both discrete and continuous node and edge attributes. Analogous to image-based convolutional networks that operate on locally connected regions of the input, we present a general approach to extracting locally connected regions from graphs. Using established benchmark data sets, we demonstrate that the learned feature representations are competitive with state-of-the-art graph kernels and that their computation is highly efficient.

#### Session chair: David Sontag

**Ballroom 3+4**

The system has an associative memory based on complex-valued vectors and is closely related to Holographic Reduced Representations and Long Short-Term Memory networks. Holographic Reduced Representations have limited capacity: as they store more information, each retrieval becomes noisier due to interference.

Our system in contrast creates redundant copies of stored information, which enables retrieval with reduced noise.

Experiments demonstrate faster learning on multiple memorization tasks.
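The binding and noisy-retrieval behaviour described here can be sketched in a few lines. This is only an illustration of classical Holographic Reduced Representations, not the paper's architecture; the dimension and the number of stored pairs are arbitrary choices.

```python
import numpy as np

def bind(a, b):
    """HRR binding: circular convolution, computed via the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def unbind(trace, key):
    """Approximate retrieval: circular correlation of the trace with the key."""
    return np.fft.irfft(np.fft.rfft(trace) * np.conj(np.fft.rfft(key)),
                        n=len(trace))

rng = np.random.default_rng(1)
n = 1024
key1, val1, key2, val2 = rng.normal(scale=n ** -0.5, size=(4, n))

memory = bind(key1, val1) + bind(key2, val2)   # superposed storage trace
guess = unbind(memory, key1)                   # noisy reconstruction of val1

cos = guess @ val1 / (np.linalg.norm(guess) * np.linalg.norm(val1))
```

The cosine similarity between `guess` and `val1` is well above chance but clearly below 1: the interference that grows with the number of stored pairs is exactly the capacity limit the abstract refers to.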

### Tuesday – Optimization (Continuous)

#### Session chair: Le Song

**Lyceum**

Second, we get an efficient randomized interior point method with an efficiently computable universal barrier for any convex set described by a membership oracle. Previously, efficiently computable barriers were known only for particular convex sets.

#### Session chair: Tong Zhang

**Lyceum**

We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved.

We obtain our results by reducing CCA to the top-$k$ generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent, we obtain a running time of $O\left(\frac{z k \sqrt{\kappa}}{\rho} \log(1/\epsilon) \log\left(k\kappa/\rho\right)\right)$ where $z$ is the total number of nonzero entries, $\kappa$ is the condition number and $\rho$ is the relative eigenvalue gap of the appropriate matrices.

Our algorithm is linear in the input size and the number of components $k$ up to a $\log(k)$ factor. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets.

### Tuesday – Matrix Factorization and Related Topics

#### Session chair: Roger Grosse

**Empire**

By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependence on the number of top principal components. We show that it can be used to give a fast iterative method for the popular principal component regression problem, giving the first major runtime improvement over the naive method of combining PCA with regression.

To achieve our results, we first observe that ridge regression can be used to obtain a “smooth projection” onto the top principal components. We then sharpen this approximation to true projection using a low-degree polynomial approximation to the matrix step function. Step function approximation is a topic of long-term interest in scientific computing. We extend prior theory by constructing polynomials with simple iterative structure and rigorously analyzing their behavior under limited precision.
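The "smooth projection" observation can be verified numerically. The ridge operator $(A^\top A + \lambda I)^{-1} A^\top A$ scales the component along the $i$-th right singular vector by $\sigma_i^2/(\sigma_i^2+\lambda)$, which is near 1 for large singular values and near 0 for small ones; with $\lambda$ placed inside a spectral gap it approximates projection onto the top components. The spectrum below is contrived for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# A data matrix with a sharp spectral gap: two dominant directions.
U, _ = np.linalg.qr(rng.normal(size=(100, 6)))
V, _ = np.linalg.qr(rng.normal(size=(6, 6)))
sigma = np.array([10.0, 9.0, 0.1, 0.1, 0.1, 0.1])
A = U @ np.diag(sigma) @ V.T

lam = 1.0                      # sits between the large and small sigma_i^2
x = rng.normal(size=6)

# Ridge-based "smooth projection": (A'A + lam I)^{-1} A'A x
smooth = np.linalg.solve(A.T @ A + lam * np.eye(6), A.T @ (A @ x))
# Exact projection onto the top-2 right singular vectors
exact = V[:, :2] @ (V[:, :2].T @ x)

err = float(np.linalg.norm(smooth - exact))
```

Here the filter factors are about 0.99 on the top directions and about 0.01 on the rest, so the two vectors agree to roughly 1%; the paper then sharpens this crude approximation using a low-degree polynomial of the matrix step function.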

In this paper, we provide a provable recovery guarantee for weighted low-rank approximation via a simple alternating minimization algorithm. In particular, for a natural class of matrices and weights and without any assumption on the noise, we bound the spectral norm of the difference between the recovered matrix and the ground truth by the spectral norm of the weighted noise plus an additive error term that decreases exponentially with the number of rounds of alternating minimization, from either initialization by SVD or, more importantly, random initialization.

These provide the first theoretical results for weighted low-rank approximation via alternating minimization with non-binary deterministic weights, significantly generalizing those for matrix completion, the special case with binary weights, since our assumptions are similar to or weaker than those made in existing works. Furthermore, this is achieved by a very simple algorithm that improves the vanilla alternating minimization with a simple clipping step.

### Tuesday – Unsupervised Learning / Applications

#### Session chair: Jeff Bilmes

**Soho**

Our model is formulated as a hierarchical Bayesian mixture model with cell-specific scalings that aid the iterative normalization and clustering of cells, teasing apart technical variation from biological signals. We demonstrate that this approach is superior to global normalization followed by clustering. We show identifiability and weak convergence guarantees of our method and present a scalable Gibbs inference algorithm. This method improves cluster inference in both synthetic and real single-cell data compared with previous methods, and allows easy interpretation and recovery of the underlying structure and cell types.

### Tuesday – Learning Theory

#### Session chair: Jon Mcauliffe

**Liberty**

Previous rigorous approaches for this problem rely on dynamic programming (DP) and, while sample efficient, have running time quadratic in the sample size. As our main contribution, we provide new sample-efficient, near-linear time algorithms for the problem that – while not being minimax optimal – achieve a significantly better sample-time tradeoff on large datasets compared to the DP approach. Our experimental evaluation shows that, compared with the DP approach, our algorithms provide a convergence rate that is only off by a factor of 2 to 4, while achieving speedups of three orders of magnitude.

Designing provable algorithms for inference has proved more difficult. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds demonstrating that for shorter documents it can be information-theoretically impossible to find the hidden topics. Finally, we give empirical results showing that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling.

### Tuesday – Large Scale Learning and Big Data

#### Session chair: Andreas Krause

**Empire**

### Tuesday – Graphical Models

#### Session chair: Jun Zhu

**Soho**

**Best paper award** – 05:44 – Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling Paper | Reviews | Rebuttal | Poster session on Wednesday morning 10am-1pm | Abstract

output, maximum marginal likelihood faces two computational obstacles: non-convexity of the objective and intractability of even a single gradient computation. In this paper, we bypass both obstacles for a class of what we call linear indirectly-supervised problems. Our approach is simple: we solve a linear system to estimate sufficient statistics of the model, which we then use to estimate parameters via convex optimization. We analyze the statistical properties of our approach and show empirically that it is effective in two settings: learning with local privacy constraints and learning from low-cost count-based annotations.

### Tuesday – Supervised Learning

#### Session chair: Nicolas Le Roux

**Liberty**

### Wednesday – Neural Networks and Deep Learning I

#### Session chair: Samy Bengio

**Ballroom 1+2+Juliard**

First, we prove that the popular model of Dawid and Skene, which assumes that all classifiers are conditionally independent, is *equivalent* to a Restricted Boltzmann Machine (RBM) with a single hidden node. Hence, under this model, the posterior probabilities of the true labels can instead be estimated via a trained RBM. Next, to address the more general case, where classifiers may strongly violate the conditional independence assumption, we propose to apply an RBM-based Deep Neural Network (DNN). Experimental results on various simulated and real-world datasets demonstrate that our proposed DNN approach outperforms other state-of-the-art methods, in particular when the data violates the conditional independence assumption.

### Wednesday – Optimization (Continuous)

#### Session chair: Shie Mannor

**Ballroom 3+4**

#### Session chair: Satyen Kale

**Ballroom 3+4**

With the knowledge of these two parameters, we show that any such algorithm attains an iteration complexity lower bound of $\Omega(\sqrt{L/\epsilon})$ for $L$-smooth convex functions, and $\tilde{\Omega}(\sqrt{L/\mu}\ln(1/\epsilon))$ for $L$-smooth $\mu$-strongly convex functions. These lower bounds are stronger than those in the traditional oracle model, as they hold independently of the dimension. To attain these, we abandon the oracle model in favor of a structure-based approach which builds upon a framework recently proposed by Arjevani et al. (2015). We further show that without knowing the strong convexity parameter, it is impossible to attain an iteration complexity better than $\tilde{\Omega}((L/\mu)\ln(1/\epsilon))$. This result is then used to formalize an observation regarding $L$-smooth convex functions, namely, that the iteration complexity of algorithms employing time-invariant step sizes must be at least $\Omega(L/\epsilon)$.

In this paper we describe a new first-order algorithm based on graduated optimization and analyze its performance. We characterize a family of non-convex functions for which this algorithm provably converges to a global optimum. In particular, we prove that the algorithm converges to an $\epsilon$-approximate solution within $O(1/\epsilon^2)$ gradient-based steps.

We extend our algorithm and analysis to the setting of stochastic non-convex optimization with noisy gradient feedback, attaining the same convergence rate.

Additionally, we discuss the setting of “zero-order optimization” and devise a variant of our algorithm which converges at a rate of $O(d^2/\epsilon^4)$.

### Wednesday – Multi-label, multi-task, and neural networks

#### Session chair: Joelle Pineau

**Marquis**

### Wednesday – Gaussian Processes

#### Session chair: Roger Grosse

**Lyceum**

### Wednesday – Feature Selection and Dimensionality Reduction

#### Session chair: Laurens van der Maaten

**Empire**

not applicable for problems with dimensionality larger than the sample size. For these problems, we advocate the use of a generalized version of OLS motivated by ridge regression, and propose two novel three-step algorithms involving least squares fitting and hard thresholding. The algorithms are methodologically simple to understand intuitively, computationally easy to implement efficiently, and theoretically appealing for choosing models consistently. Numerical experiments comparing our methods with penalization-based approaches in simulations and data analyses illustrate the great potential of the proposed algorithms.
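One concrete instance of such a three-step scheme can be sketched as follows. This is our illustration, not necessarily the paper's exact estimator: a ridge fit stands in for OLS when $p > n$, hard thresholding keeps a target number of coefficients, and OLS is refit on the selected support.

```python
import numpy as np

def ridge_threshold_refit(X, y, lam=1.0, k=3):
    """Three-step sketch: (1) ridge fit, usable when p > n; (2) hard
    thresholding to the k largest coefficients; (3) OLS refit on the support."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # step 1
    support = np.sort(np.argsort(-np.abs(beta))[:k])             # step 2
    beta_hat = np.zeros(p)
    beta_hat[support] = np.linalg.lstsq(X[:, support], y,
                                        rcond=None)[0]           # step 3
    return beta_hat, support

rng = np.random.default_rng(7)
n, p = 100, 200                       # p > n, so plain OLS is not applicable
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [5.0, -4.0, 3.0]      # a 3-sparse signal
y = X @ beta_true + 0.05 * rng.normal(size=n)

beta_hat, support = ridge_threshold_refit(X, y, k=3)
```

On this easy synthetic instance the ridge coefficients already rank the three true variables first, so thresholding recovers the support and the refit removes the ridge shrinkage.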

on the left by an $m \times n$ matrix $\tilde G$ of i.i.d. Gaussian random variables, but could not afford to do it because it was too slow? In this work we propose a new randomized $m \times n$ matrix $T$, for which one can compute $T \cdot X$ in only $O(\mathrm{nnz}(X)) + \tilde O(m^{1.5} \cdot d^{3})$ time, and for which the total variation distance between the distributions $T \cdot X$ and $\tilde G \cdot X$ is as small as desired, i.e., less than any positive constant. Here $\mathrm{nnz}(X)$ denotes the number of non-zero entries of $X$. Assuming $\mathrm{nnz}(X) \gg m^{1.5} \cdot d^{3}$, this is a significant savings over the naïve $O(\mathrm{nnz}(X) \cdot m)$ time to compute $\tilde G \cdot X$. Moreover, since the total variation distance is small, we can provably use $T \cdot X$ in place of $\tilde G \cdot X$ in any application and have the same guarantees as if we were using $\tilde G \cdot X$, up to a small positive constant in error probability.

We apply this transform to nonnegative matrix factorization (NMF) and support vector machines (SVM).

### Wednesday – Graph Analysis/ Spectral Methods

#### Session chair: Alex Kulesza

**Soho**

### Wednesday – Ranking and Preference Learning

#### Session chair: Tong Zhang

**Liberty**

### Wednesday – Neural Networks and Deep Learning

#### Session chair: Yoshua Bengio

**Ballroom 1+2+Juliard**

By combining a variational autoencoder (VAE) with a generative adversarial network (GAN) we can use learned feature representations in the GAN discriminator as basis for the VAE reconstruction objective.

Thereby, we replace element-wise errors with feature-wise errors to better capture the data distribution while offering invariance towards e.g. translation.

We apply our method to images of faces and show that it outperforms VAEs with element-wise similarity measures in terms of visual fidelity.

Moreover, we show that the method learns an embedding in which high-level abstract visual features (e.g. wearing glasses) can be modified using simple arithmetic.

that can adaptively assign its capacity across different portions of the input data. This is achieved by combining modules of two types: low-capacity sub-networks and high-capacity sub-networks. The low-capacity sub-networks are applied across most of the input, but also provide a guide to select a few portions of the input on which to apply the high-capacity sub-networks. The selection is made using a novel gradient-based attention mechanism that efficiently identifies input regions for which the DCN’s output is most sensitive and to which we should devote more capacity. We focus our empirical evaluation on the Cluttered MNIST and SVHN image datasets. Our findings indicate that DCNs are able to drastically reduce the number of computations, compared to traditional convolutional neural networks, while maintaining similar or even better performance.

### Wednesday – Applications and Time-Series Analysis

#### Session chair: Jon Mcauliffe

**Marquis**

state-of-the-art alternatives.

### Wednesday – Dimensionality Reduction / Private Learning

#### Session chair: Jeff Bilmes

**Lyceum**

In this paper, we answer this question in the affirmative, by showing that for the class of generalized linear functions, we can obtain excess risk bounds of $O(w(\Theta)^{2/3}/n^{1/3})$ under $\epsilon$-differential privacy, and $O((w(\Theta)/n)^{1/2})$ under $(\epsilon,\delta)$-differential privacy, given only the projected data and the projection matrix. Here $n$ is the sample size and $w(\Theta)$ is the Gaussian width of the parameter space that we optimize over. Our strategy is based on adding noise for privacy in the projected subspace and then lifting the solution to the original space by using high-dimensional estimation techniques. A simple consequence of these results is that, for a large class of ERM problems, in the traditional setting (i.e., with access to the original data), under $\epsilon$-differential privacy, we improve the worst-case risk bounds of Bassily et al. (FOCS 2014).

binary embeddings, that involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudorandom projection is described by a matrix where not all entries are independent random variables; instead, a fixed “budget of randomness” is distributed across the matrix. Such matrices can be efficiently stored in sub-quadratic or even linear space, provide a reduction in randomness usage (i.e. the number of required random values), and very often lead to computational speed-ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first that give theoretical grounding for the use of general structured matrices in the nonlinear setting. In particular, they generalize previous extensions of the Johnson-Lindenstrauss lemma and prove the plausibility of the approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings build a better understanding of certain deep architectures, which contain randomly weighted and untrained layers, and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show the dependence of learning via structured hashed projections on the performance of neural networks as well as nearest neighbor classifiers.
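As a concrete instance of such a structured projection, consider a circulant matrix built from a single random column (so the "budget of randomness" is $n$ values instead of $n^2$), preceded by random signs and followed by the sign nonlinearity. This sketch uses our own parameter choices, not the paper's constructions; the normalized Hamming distance between the binary codes estimates the angle between the inputs.

```python
import numpy as np

def circulant_sign_embed(x, col, signs):
    """Binary code: sign of a circulant projection, applied in O(n log n)
    via the FFT. `col` is the circulant's first column; `signs` are random
    +/-1 values that decorrelate the input (a common structured variant)."""
    proj = np.fft.irfft(np.fft.rfft(col) * np.fft.rfft(signs * x), n=len(x))
    return np.sign(proj)

rng = np.random.default_rng(3)
n = 4096
col = rng.normal(size=n)                    # n random values, not n*n
signs = rng.choice([-1.0, 1.0], size=n)

x = rng.normal(size=n)
theta = 0.5                                 # target angle between the inputs
y = np.cos(theta) * x + np.sin(theta) * rng.normal(size=n)

bx = circulant_sign_embed(x, col, signs)
by = circulant_sign_embed(y, col, signs)
angle_est = float(np.mean(bx != by)) * np.pi   # sign-projection angle estimate
```

For sign projections, the probability that a bit flips equals angle/π; the theoretical content of the paper is that this kind of guarantee survives when the dense Gaussian matrix is replaced by structured ones.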

### Wednesday – Monte Carlo Methods

#### Session chair: Stefano Ermon

**Empire**

We show strong performance of our algorithms on synthetic datasets and high-dimensional Poisson factor analysis-based topic modeling scenarios.

by learning heuristic approximations to stochastic inverses, designed specifically for use as proposal distributions in sequential Monte Carlo methods. We describe a procedure for constructing and learning a structured neural network which represents an inverse factorization of the graphical model, resulting in a conditional density estimator that takes as input particular values of the observed random variables and returns an approximation to the distribution of the latent variables. This recognition model can be learned offline, independent of any particular dataset, prior to performing inference. The output of these networks can be used as automatically learned, high-quality proposal distributions to accelerate sequential Monte Carlo across a diverse range of problem settings.

the multinomial probability law of the inverse temperatures, and provides estimates of the partition function in terms of a simple quotient of Rao-Blackwellized marginal inverse temperature probability estimates, which are updated while sampling. We show that the method has interesting connections with several alternative popular methods, and offers some significant advantages. In particular, we empirically find that the new method provides more accurate estimates than Annealed Importance Sampling when calculating partition functions of large Restricted Boltzmann Machines (RBM);

moreover, the method is sufficiently accurate to track training and validation log-likelihoods during learning of RBMs, at minimal computational cost.

### Wednesday – Crowdsourcing and Interactive Learning

#### Session chair: Yisong Yue

**Soho**

### Wednesday – Learning Theory

#### Session chair: Alina Beygelzimer

**Liberty**

More precisely, we provide new analysis to improve the state-of-the-art running times in both settings by either applying SVRG or its novel variant. Since non-strongly convex objectives include important examples such as Lasso or logistic regression, and sum-of-non-convex objectives include famous examples such as stochastic PCA and are even believed to be relevant to training deep neural nets, our results also imply better performance in these applications.

We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$.

We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.

In this paper, we improve the best known running time of accelerated coordinate descent by a factor up to $\sqrt{n}$. Our improvement is based on a clean, novel non-uniform sampling that selects each coordinate with a probability proportional to the square root of its smoothness parameter. Our proof technique also deviates from the classical estimation sequence technique used in prior work. Our speed-up applies to important problems such as empirical risk minimization and solving linear systems, both in theory and in practice.

with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Applying our results to the convex case, we provide new insights into why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit.

### Wednesday – Supervised Learning

#### Session chair: Csaba Szepesvari

**Ballroom 3+4**

Finally, we validate the empirical performance of this method on the estimation of regularization constants of $L_2$-regularized logistic regression and kernel ridge regression. Empirical benchmarks indicate that our approach is highly competitive with state-of-the-art methods.

### Wednesday – Kernel Methods

#### Session chair: Marius Kloft

**Marquis**

to incorporate kernel-based regression from conditional distributions, thus appropriately taking into account the specific structure of the posited generative model. We show that our approach can be implemented in a computationally and statistically efficient way using the random Fourier features framework for large-scale kernel learning. In addition, our framework outperforms related methods by a large margin on toy and real-world data, including hierarchical and time series models.

We propose structure2vec, an effective and scalable approach for structured data representation, based on the idea of embedding latent variable models into feature spaces and learning such feature spaces using discriminative information. Interestingly, structure2vec extracts features by performing a sequence of function mappings in a way similar to graphical model inference procedures such as mean field and belief propagation. In applications involving millions of data points, we show that structure2vec runs 2 times faster and produces models that are 10,000 times smaller, while at the same time achieving state-of-the-art predictive performance.

### Wednesday – Matrix Factorization and Related Topics

#### Session chair: Stefanie Jegelka

**Lyceum**

vectors into structured matrices to approximate various kernel functions in sublinear time via random embeddings. Our framework includes the Fastfood construction of Le et al. (2013) as a special case, but also extends to Circulant, Toeplitz and Hankel matrices, and the broader family of structured matrices that are characterized by the concept of low displacement rank. We introduce notions of coherence and graph-theoretic structural constants that control the approximation quality, and prove unbiasedness and low-variance properties of random feature maps that arise within our framework. For the case of low-displacement matrices, we show how the degree of structure and randomness can be controlled to reduce statistical variance at the cost of increased computation and storage requirements. Empirical results strongly support our theory and justify the use of a broader family of structured matrices for scaling up kernel methods using random features.

Our key result is that for most multivariate losses, the expected loss is provably optimized by sorting the items by the conditional probability of the label being positive and then selecting the top $k$ items. Such a result was previously known only for the $F$-measure. Leveraging the optimality characterization, we give an algorithm for estimating optimal predictions in practice with runtime quadratic in the size of the item set for many losses. We provide empirical results on benchmark datasets, comparing the proposed algorithm to state-of-the-art methods for optimizing multivariate losses.
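The optimal prediction rule characterized here is simply "sort by the conditional probability of a positive label and keep the top $k$". A minimal sketch with made-up probabilities:

```python
import numpy as np

def predict_top_k(probs, k):
    """Return the indices of the k items with the largest P(y = 1 | x)."""
    order = np.argsort(-np.asarray(probs))   # descending by probability
    return set(order[:k].tolist())

probs = [0.1, 0.8, 0.35, 0.9, 0.05]
print(sorted(predict_top_k(probs, 2)))       # → [1, 3]
```

The nontrivial part in the paper is estimating which $k$ optimizes a given multivariate loss; the selection itself is always this sort-and-cut.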

### Wednesday – Privacy, Anonymity, and Security

#### Session chair: Gal Chechik

**Empire**

### Wednesday – Causal Inference

#### Session chair: David Sontag

**Soho**

multivariate autoregressive moving average (VARMA) model satisfies the same model assumption in the reversed time direction, too, if all innovations are normally distributed. This reversibility breaks down if the innovations are non-Gaussian. This means that under the assumption of a VARMA process with non-Gaussian noise, the arrow of time becomes detectable. Our work thereby provides a theoretical justification of an algorithm that has been used for inferring the direction of video snippets.

We present a slightly modified practical algorithm that estimates the time direction for a given sample and prove its consistency. We further investigate how the performance of the algorithm depends on sample size, number of dimensions of the time series and the order of the process.

An application to real world data from economics shows that considering multivariate processes instead of univariate processes can be beneficial for estimating the time direction.

Our result extends earlier work on univariate time series. It relates to the concept of causal inference, where recent methods exploit non-Gaussianity of the error terms for causal structure learning.
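The detectability claim is easy to check on a univariate AR(1) special case (our simplification; the paper treats general multivariate VARMA processes, and the Laplace innovations and kurtosis score here are illustrative choices): fit the autoregression in both time directions and compare how non-Gaussian the residuals look.

```python
import numpy as np

def ar1_residuals(x):
    """Least-squares AR(1) fit; returns the estimated innovations."""
    a = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
    return x[1:] - a * x[:-1]

def excess_kurtosis(r):
    """A simple non-Gaussianity score (zero for Gaussian data)."""
    r = (r - r.mean()) / r.std()
    return float(np.mean(r ** 4) - 3.0)

rng = np.random.default_rng(5)
T = 20_000
eps = rng.laplace(size=T)              # non-Gaussian innovations
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.8 * x[t - 1] + eps[t]

fwd = abs(excess_kurtosis(ar1_residuals(x)))        # true direction
bwd = abs(excess_kurtosis(ar1_residuals(x[::-1])))  # reversed direction
```

Forward-direction residuals recover the Laplace innovations and are markedly non-Gaussian, while the time-reversed fit mixes innovations and looks closer to Gaussian, so `fwd > bwd` reveals the arrow of time. With Gaussian innovations both scores are near zero and the directions are indistinguishable, matching the reversibility statement above.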

and propose a novel fixed-$k$ nearest neighbor estimator, demonstrating its consistency. Finally, we present an application to single-cell flow cytometry, where the proposed estimators significantly reduce sample complexity.

We propose an effective method for learning Granger causality for a special but significant type of point processes: Hawkes processes.

Focusing on Hawkes processes, we reveal the relationship between Hawkes process’s impact functions and its Granger causality graph.

Specifically, our model represents impact functions using a series of basis functions and recovers the Granger causality graph via group sparsity of the impact functions’ coefficients.

We propose an effective learning algorithm combining a maximum likelihood estimator (MLE) with a sparse-group-lasso (SGL) regularizer.

Additionally, the pairwise similarity between the dimensions of the process is considered when their clustering structure is available.

We analyze our learning method and discuss the selection of the basis functions.

Experiments on synthetic data and real-world data show that our method can learn the Granger causality graph and the triggering patterns of Hawkes processes simultaneously.

### Wednesday – Optimization

#### Session chair: Lihong Li

**Liberty**

In this paper, we address the problem of decentralized minimization of pairwise functions of the data points, where these points are distributed over the nodes of a graph defining the communication topology of the network.

This general problem finds applications in ranking, distance metric learning and graph inference, among others.

We propose new gossip algorithms based on dual averaging which aim to solve such problems in both synchronous and asynchronous settings. The proposed framework is flexible enough to deal with constrained and regularized variants of the optimization problem. Our theoretical analysis reveals that the proposed algorithms preserve the convergence rate of centralized dual averaging up to an additive bias term. We present numerical simulations on Area Under the ROC Curve (AUC) maximization and metric learning problems which illustrate the practical interest of our approach.

More specifically, we maintain a distribution over classes (instead of individual instances) that is adaptively estimated during the course of optimization to give the maximum reduction in the variance of the gradient. Intuitively, we sample more from those regions in space that have a *larger* gradient contribution.

Our experiments on highly multiclass datasets show that our proposal converges significantly faster than existing techniques.
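The variance-reduction mechanism can be seen in a toy importance-sampling calculation (our sketch; in the paper the class distribution is estimated adaptively during optimization rather than assumed known). Sampling classes in proportion to their gradient contribution keeps the estimator unbiased while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(6)
# Per-class gradient contributions (one scalar per class, for simplicity);
# these magnitudes are made up for illustration.
grads = np.array([10.0, 0.5, 0.3, 0.2, 0.1])
target = float(grads.sum())            # the full-sum "gradient" to estimate

def is_estimates(p, n_draws=100_000):
    """Importance-sampled estimates of `target`: draw class c ~ p and
    return g_c / p_c, which is unbiased for any p with full support."""
    c = rng.choice(len(p), size=n_draws, p=p)
    return grads[c] / p[c]

uniform = np.full(len(grads), 1.0 / len(grads))
adaptive = grads / grads.sum()         # sample a class prop. to its gradient

u = is_estimates(uniform)
a = is_estimates(adaptive)
```

Both estimators have mean `target`, but the adaptive one has (here, exactly) zero variance because `g_c / p_c` is constant; in training, gradient magnitudes drift, which is why the distribution must be re-estimated during the course of optimization.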