Test of Time Award


Outstanding Paper

[ Pacific Ballroom ]
The key idea behind the unsupervised learning of disentangled representations is that real-world data is generated by a few explanatory factors of variation which can be recovered by unsupervised learning algorithms.
In this paper, we provide a sober look at recent progress in the field and challenge some common assumptions.
We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data.
Then, we train more than $12000$ models covering most prominent methods and evaluation metrics in a reproducible large-scale experimental study on seven different data sets.
We observe that while the different methods successfully enforce properties ``encouraged'' by the corresponding losses, well-disentangled models seemingly cannot be identified without supervision.
Furthermore, increased disentanglement does not seem to lead to a decreased sample complexity of learning for downstream tasks.
Our results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.

Outstanding Paper

[ Hall A ]
Excellent variational approximations to Gaussian process posteriors have been developed which avoid the $\mathcal{O}\left(N^3\right)$ scaling with dataset size $N$. They reduce the computational cost to $\mathcal{O}\left(NM^2\right)$, with $M\ll N$ the number of \emph{inducing variables}, which summarise the process. While the computational cost seems to be linear in $N$, the true complexity of the algorithm depends on how $M$ must increase to ensure a certain quality of approximation. We show that with high probability the KL divergence can be made arbitrarily small by growing $M$ more slowly than $N$. A particular case is that for regression with normally distributed inputs in $D$ dimensions with the Squared Exponential kernel, $M=\mathcal{O}(\log^D N)$ suffices. Our results show that as datasets grow, Gaussian process posteriors can be approximated cheaply, and provide a concrete rule for how to increase $M$ in continual learning scenarios.
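
To make the scaling claim concrete, the following is a minimal sketch that compares the $\mathcal{O}\left(NM^2\right)$ cost against the exact $\mathcal{O}\left(N^3\right)$ cost when $M$ grows like $\log^D N$. The function names and the constant `c` are assumptions for illustration only; the abstract does not state the constant hidden in the $\mathcal{O}$ notation, which in practice depends on the kernel and the input distribution.

```python
import math


def inducing_points_needed(n, d, c=1.0):
    """Illustrative rule M = ceil(c * log(n)**d).

    `c` is a hypothetical constant; the abstract only gives the
    asymptotic rate M = O(log^D N), not its constant.
    """
    return max(1, math.ceil(c * math.log(n) ** d))


def sparse_cost(n, m):
    """Operation count proportional to N * M^2 (sparse variational GP)."""
    return n * m * m


def exact_cost(n):
    """Operation count proportional to N^3 (exact GP regression)."""
    return n ** 3


if __name__ == "__main__":
    # D = 2 input dimensions, chosen arbitrarily for the illustration.
    for n in (10**3, 10**4, 10**5, 10**6):
        m = inducing_points_needed(n, d=2)
        ratio = sparse_cost(n, m) / exact_cost(n)
        print(f"N={n:>8}  M={m:>4}  cost ratio (sparse / exact) = {ratio:.2e}")
```

Under these assumptions the ratio of sparse to exact cost shrinks as $N$ grows, because $M$ increases only polylogarithmically in $N$.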

Outstanding Paper

[ Pacific Ballroom ]
Excellent variational approximations to Gaussian process posteriors have been developed which avoid the $\mathcal{O}\left(N^3\right)$ scaling with dataset size $N$. They reduce the computational cost to $\mathcal{O}\left(NM^2\right)$, with $M\ll N$ the number of \emph{inducing variables}, which summarise the process. While the computational cost seems to be linear in $N$, the true complexity of the algorithm depends on how $M$ must increase to ensure a certain quality of approximation. We show that with high probability the KL divergence can be made arbitrarily small by growing $M$ more slowly than $N$. A particular case is that for regression with normally distributed inputs in $D$ dimensions with the Squared Exponential kernel, $M=\mathcal{O}(\log^D N)$ suffices. Our results show that as datasets grow, Gaussian process posteriors can be approximated cheaply, and provide a concrete rule for how to increase $M$ in continual learning scenarios.
