Tutorials | ICML New York City

Tutorials

ICML 2016 tutorials took place on June 19, 2016. (Due to capacity constraints, one tutorial track took place at the Crowne Plaza hotel, a short walk from the Marriott.)

Schedule:

Room	Crowne Plaza Broadway + Breakout room	Marriott Astor + Empire (simulcast)	Marriott Soho + Duffy (simulcast)	Marriott Cantor	Marriott Times Square
Session 1 8:30-10:30am	Causal inference	Deep Residual Networks	Convex optimization	Deep Residual Networks	Convex optimization
10:30-11am Coffee break
Session 2 11am-1pm	Memory Networks	Stochastic Gradient	Rigorous Data Dredging	Stochastic Gradient	Rigorous Data Dredging
1-2:30pm Lunch break (on your own)
Session 3 2:30-4:30pm	Non-convex optimization	Deep RL	Deep RL	Graph sketching and streaming	Graph sketching and streaming

All tutorials have been recorded and will be available after the conference.

All tutorials at a glance

Click on the title of any tutorial to get to the tutorial’s page/slides.

Deep Reinforcement Learning

David Silver (Google DeepMind)

A major goal of artificial intelligence is to create general-purpose agents that can perform effectively in a wide range of challenging tasks. To achieve this goal, it is necessary to combine reinforcement learning (RL) agents with powerful and flexible representations. The key idea of deep RL is to use neural networks to provide this representational power. In this tutorial we will present a family of algorithms in which deep neural networks are used for value functions, policies, or environment models. State-of-the-art results will be presented in a variety of domains, including Atari games, 3D navigation tasks, continuous control domains and the game of Go.
[slides]
[AlphaGo slides]

Memory Networks for Language Understanding

Jason Weston (Facebook)

There has been a recent resurgence of interest in combining reasoning, attention and memory to solve tasks, particularly in the field of language understanding. I will review some of these recent efforts and focus on one of my own group’s contributions, memory networks, an architecture that we have applied to question answering, language modeling and general dialog. As we try to move towards the goal of true language understanding, I will also discuss recent datasets and tests that have been built to assess these models’ abilities and to see how far we have come.

Deep Residual Networks: Deep Learning Gets Way Deeper

Kaiming He (Facebook, starting July, 2016)

Deeper neural networks are more difficult to train. Beyond a certain depth, traditional deeper networks start to show severe underfitting caused by optimization difficulties. This tutorial will describe the recently developed residual learning framework, which eases the training of networks that are substantially deeper than those used previously. These residual networks are easier to optimize and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. These deep residual networks are the foundations of our 1st-place winning entries in all five main tracks of the ImageNet and COCO 2015 competitions, which cover image classification, object detection, and semantic segmentation.

In this tutorial we will further look into the propagation formulations of residual networks. Our latest work reveals that when the residual networks have identity mappings as skip connections and inter-block activations, the forward and backward signals can be directly propagated from one block to any other block. This leads us to promising results of 1001-layer residual networks. Our work suggests that there is much room to exploit the dimension of network depth, a key to the success of modern deep learning.
[slides]
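The propagation property described above can be checked numerically. The following is a minimal sketch, assuming each block computes x_{l+1} = x_l + F(x_l) with an identity skip connection and no activation between blocks; the function `residual_branch` and its weights are hypothetical stand-ins for the layers of a real residual block, not the tutorial's architecture. Unrolling the recursion gives x_L = x_l + F(x_l) + ... + F(x_{L-1}), so the signal at block l reaches block L unchanged, plus a sum of residuals.

```python
import math

def residual_branch(x, weight):
    # Hypothetical residual function F: a scaled tanh standing in for the
    # convolutional layers of a real residual block.
    return [weight * math.tanh(v) for v in x]

def forward(x, weights):
    # Apply x_{l+1} = x_l + F(x_l) block by block, recording every state x_l
    # and every residual output F(x_l).
    states, residuals = [x], []
    for w in weights:
        f = residual_branch(states[-1], w)
        residuals.append(f)
        states.append([a + b for a, b in zip(states[-1], f)])
    return states, residuals

weights = [0.5, -0.3, 0.8, 0.1]  # one illustrative weight per block
states, residuals = forward([1.0, -2.0, 0.5], weights)

# Unrolled form: x_L = x_l + sum of F(x_i) for i = l, ..., L-1.
l, L = 1, len(weights)
unrolled = list(states[l])
for f in residuals[l:L]:
    unrolled = [a + b for a, b in zip(unrolled, f)]

max_gap = max(abs(a - b) for a, b in zip(unrolled, states[L]))
print(max_gap)
```

The printed gap between the unrolled sum and the block-by-block forward pass should be zero up to floating-point error, for any choice of `l` below `L`.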

Recent Advances in Non-Convex Optimization

Anima Anandkumar (University of California Irvine)

Most machine learning tasks require solving non-convex optimization problems. The number of critical points in a non-convex problem grows exponentially with the data dimension. Local search methods such as gradient descent can get stuck in one of these critical points, and therefore finding the globally optimal solution is computationally hard. Despite this hardness barrier, we have seen many advances in guaranteed non-convex optimization. The focus has shifted to characterizing transparent conditions under which the global solution can be found efficiently. In many instances, these conditions turn out to be mild and natural for machine learning applications. This tutorial will provide an overview of the recent theoretical success stories in non-convex optimization, including learning latent variable models, dictionary learning, and robust principal component analysis. Simple iterative methods such as spectral methods and alternating projections are proven to learn consistent models with polynomial sample and computational complexity. This tutorial will present the main ingredients for establishing these results. The tutorial will conclude with open challenges and possible paths towards tackling them.

Stochastic Gradient Methods for Large-Scale Machine Learning

Leon Bottou (Facebook AI Research), Frank E. Curtis (Lehigh University), and Jorge Nocedal (Northwestern University)

This tutorial provides an accessible introduction to the mathematical properties of stochastic gradient methods and their consequences for large-scale machine learning. After reviewing the computational needs for solving optimization problems in two typical examples of large-scale machine learning, namely the training of sparse linear classifiers and deep neural networks, we present the theory of the simple yet versatile stochastic gradient algorithm, explain its theoretical and practical behavior, and expose the opportunities available for designing improved algorithms. We then provide specific examples of advanced algorithms to illustrate the two essential directions for improving stochastic gradient methods, namely managing the noise and making use of second-order information.
[slides1] [slides2] [slides3]
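As a rough illustration of the basic stochastic gradient algorithm mentioned above, here is a minimal sketch that trains a two-dimensional linear classifier with the logistic loss, using one randomly drawn example per step. The synthetic data, the constant step size and the function names are assumptions made for this example; they are not taken from the tutorial.

```python
import math
import random

random.seed(0)

# Synthetic, linearly separable data (an assumption for illustration):
# the label is the sign of x0 + x1.
data = []
for _ in range(200):
    x = (random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0))
    y = 1.0 if x[0] + x[1] > 0 else -1.0
    data.append((x, y))

def average_loss(w):
    # Mean logistic loss log(1 + exp(-y * <w, x>)) over the whole data set.
    total = 0.0
    for x, y in data:
        margin = y * (w[0] * x[0] + w[1] * x[1])
        total += math.log(1.0 + math.exp(-margin))
    return total / len(data)

def sgd(steps=1000, step_size=0.1):
    w = [0.0, 0.0]
    for _ in range(steps):
        x, y = random.choice(data)  # one example drawn uniformly at random
        margin = y * (w[0] * x[0] + w[1] * x[1])
        # The gradient of the logistic loss on this example is scale * x.
        scale = -y / (1.0 + math.exp(margin))
        w[0] -= step_size * scale * x[0]
        w[1] -= step_size * scale * x[1]
    return w

initial_loss = average_loss([0.0, 0.0])
w = sgd()
final_loss = average_loss(w)
print(initial_loss, final_loss)
```

Each step touches a single example, so its cost does not depend on the size of the data set. The price is a noisy gradient estimate, which is the noise that the more advanced methods aim to manage.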

The convex optimization, game-theoretic approach to learning

Elad Hazan (Princeton University) and Satyen Kale (Yahoo Research)

In recent years convex optimization and the notion of regret minimization in games have been combined and applied to machine learning in a general framework called online convex optimization. We will survey the basics of this framework, its applications, main algorithmic techniques and future research directions.
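To make the notion of regret concrete, here is a minimal sketch of online gradient descent on the interval [-1, 1]. It assumes the losses are f_t(x) = (x - z_t)^2 and that the values z_t are drawn at random rather than chosen by an adversary. The step size D / (G * sqrt(t)), with diameter D = 2 and gradient bound G = 4, is the standard textbook choice and is an assumption here, not something stated in the abstract.

```python
import math
import random

random.seed(0)
T = 1000
# Values that define the losses f_t(x) = (x - z_t)^2 (random for illustration).
targets = [random.uniform(-1.0, 1.0) for _ in range(T)]

def project(x):
    # Euclidean projection onto the feasible interval [-1, 1].
    return max(-1.0, min(1.0, x))

x = 0.0
learner_loss = 0.0
for t, z in enumerate(targets, start=1):
    learner_loss += (x - z) ** 2  # the loss is revealed after x is played
    grad = 2.0 * (x - z)
    # Step size D / (G * sqrt(t)) with D = 2 and G = 4.
    x = project(x - grad / (2.0 * math.sqrt(t)))

# Best fixed decision in hindsight: the mean of the targets minimizes the sum.
best = sum(targets) / T
best_loss = sum((best - z) ** 2 for z in targets)

regret = learner_loss - best_loss
print(regret, regret / T)
```

Regret compares the learner's cumulative loss with that of the best single decision chosen in hindsight. For this step size the known guarantee is regret of at most 1.5 * G * D * sqrt(T), so the average regret per round goes to zero as T grows.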

Rigorous Data Dredging: Theory and Tools for Adaptive Data Analysis

Moritz Hardt (Google) and Aaron Roth (University of Pennsylvania)

Reliable tools for inference and model selection are necessary in all applications of machine learning and statistics. Much of the existing theory breaks down in the now common situation where the data analyst works interactively with the data, adaptively choosing which methods to use by probing the same data many times. We illustrate the problem through the lens of machine learning benchmarks, which currently all rely on the standard holdout method. After understanding why and when the standard holdout method fails, we will see practical alternatives to the holdout method that can be used many times without losing the guarantees of fresh data. We then transition into the emerging theory on this topic touching on deep connections to differential privacy, compression schemes, and hypothesis testing (although no prior knowledge will be assumed).

Graph Sketching, Streaming, and Space-Efficient Optimization

Sudipto Guha (University of Pennsylvania) and Andrew McGregor (University of Massachusetts Amherst)

Graphs are one of the most commonly used data representation tools but existing algorithmic approaches are typically not appropriate when the graphs of interest are dynamic, stochastic, or do not fit into the memory of a single machine. Such graphs are often encountered as machine learning techniques are increasingly deployed to manage graph data and large-scale graph optimization problems. Graph sketching is a form of dimensionality reduction for graph data that is based on using random linear projections and exploiting connections between linear algebra and combinatorial structure. The technique has been studied extensively over the last five years and can be applied in many computational settings. It enables small-space online and data stream computation where we are permitted only a few passes (ideally only one) over an input sequence of updates to a large underlying graph. The technique parallelizes easily and can naturally be applied in various distributed settings. It can also be used in the context of convex programming to enable more efficient algorithms for combinatorial optimization problems such as correlation clustering. One of the main goals of the research on graph sketching is understanding and characterizing the types of graph structure and features that can be inferred from compressed representations of the relevant graphs.
[slides1] [slides2]
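The link between linear algebra and combinatorial structure can be illustrated with signed node-incidence vectors. The sketch below is only a toy, with a hypothetical six-node graph and a plain random ±1 projection standing in for the more carefully designed projections used in practice. It shows two facts: summing the vectors of a node set cancels the edges inside the set and leaves only the edges that cross its boundary, and because the projection is linear, the sketches can be summed instead of the full vectors.

```python
import itertools
import random

random.seed(0)
n = 6
edges = {(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)}
# One coordinate per possible edge (i, j) with i < j.
slots = list(itertools.combinations(range(n), 2))

def incidence_vector(v):
    # +1 if v is the smaller endpoint of a present edge, -1 if it is the
    # larger endpoint, 0 otherwise.
    vec = []
    for (i, j) in slots:
        if (i, j) in edges and v == i:
            vec.append(1)
        elif (i, j) in edges and v == j:
            vec.append(-1)
        else:
            vec.append(0)
    return vec

def add(u, w):
    return [a + b for a, b in zip(u, w)]

# Summing the vectors of a node set cancels every edge inside the set.
S = [0, 1, 2]
total = [0] * len(slots)
for v in S:
    total = add(total, incidence_vector(v))
crossing = {slots[k] for k, value in enumerate(total) if value != 0}
print(crossing)  # edges with exactly one endpoint in S

# A random +/-1 projection to k dimensions (placeholder for a real sketch).
k = 4
projection = [[random.choice((-1, 1)) for _ in slots] for _ in range(k)]

def sketch(vec):
    return [sum(p * x for p, x in zip(row, vec)) for row in projection]

# Linearity: the sum of the node sketches equals the sketch of the sum.
sum_of_sketches = [0] * k
for v in S:
    sum_of_sketches = add(sum_of_sketches, sketch(incidence_vector(v)))
print(sum_of_sketches == sketch(total))
```

The same linearity is what lets a sketch absorb a stream of edge insertions and deletions as additions and subtractions, and lets sketches computed on different machines be merged by summing them.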

Causal inference for observational studies

David Sontag and Uri Shalit (New York University)

In many fields such as healthcare, education, and economics, policy makers have increasing amounts of data at their disposal. Making policy decisions based on this data often involves causal questions: Does medication X lead to lower blood sugar, compared with medication Y? Does longer maternity leave lead to better child social and cognitive skills? These questions have to be addressed in practice, every day, by scientists working across many different disciplines.

The goal of this tutorial is to bring machine learning practitioners closer to the vast field of causal inference as practiced by statisticians, epidemiologists and economists. We believe that machine learning has much to contribute in helping answer such questions, especially given the massive growth in the available data and its complexity. We also believe the machine learning community could and should be highly interested in engaging with such problems, considering the great impact they have on society in general.

We hope that participants in the tutorial will: a) learn the basic language of causal inference as exemplified by the two most dominant paradigms today: the potential outcomes framework, and causal graphs; b) understand the similarities and the differences between problems machine learning practitioners usually face and problems of causal inference; c) become familiar with the basic tools employed by practicing scientists performing causal inference; and d) be informed about the latest research efforts in bringing machine learning techniques to address problems of causal inference.