Research at the intersection of vision and language has been attracting considerable attention in recent years. Topics include the study of multi-modal representations, translation between modalities, bootstrapping of labels from one modality into another, visually-grounded question answering, segmentation and storytelling, and grounding the meaning of language in visual data. An ever-increasing number of tasks and datasets are appearing around this recently established field.
At NeurIPS 2018, we released the How2 dataset, containing more than 85,000 videos (about 2,000 hours), with audio, transcriptions, translations, and textual summaries. We believe it is an ideal resource for bringing together researchers who work on these otherwise separate tasks around a single, large dataset. This rich dataset will facilitate the comparison of tools and algorithms, and hopefully foster the creation of additional annotations and tasks. We aim to stimulate discussion about useful tasks, metrics, and labeling techniques, in order to develop a better understanding of the role and value of multi-modality in vision and language. We seek to create a venue that encourages collaboration between different sub-fields, and helps establish new research directions and collaborations that we believe will sustain machine learning research for years to come.
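As a rough illustration of what one record in the dataset bundles together, here is a hypothetical sketch; the field names and example values are assumptions for illustration, not the official How2 release format.

```python
# Hypothetical sketch of a single How2 sample; field names and paths are
# illustrative assumptions, not the official release layout. Each video
# pairs several modalities: audio, an English transcript, a Portuguese
# translation, and a short English summary.
from dataclasses import dataclass

@dataclass
class How2Sample:
    video_id: str       # identifier of the instructional video clip
    audio_path: str     # path to the extracted audio (or acoustic features)
    transcript_en: str  # English transcription of the speech
    translation_pt: str # Portuguese translation of the transcript
    summary_en: str     # short textual summary of the whole video

sample = How2Sample(
    video_id="example_clip_000",
    audio_path="audio/example_clip_000.wav",
    transcript_en="first, whisk the eggs until they are fluffy ...",
    translation_pt="primeiro, bata os ovos até ficarem fofos ...",
    summary_en="A short cooking tutorial on whisking eggs.",
)
```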
Sat 8:45 a.m. - 9:00 a.m. | Welcome
Sat 9:00 a.m. - 10:15 a.m. | The How2 Database and Challenge (Presentation) | Lucia Specia · Ramon Sanabria
Sat 10:15 a.m. - 10:30 a.m. | Coffee Break
Sat 10:30 a.m. - 11:00 a.m. | Forcing Vision + Language Models To Actually See, Not Just Talk (Invited Talk 1) | Devi Parikh
Sat 11:00 a.m. - 11:30 a.m. | Topics in Vision and Language: Grounding, Segmentation and Author Anonymity (Invited Talk 2) | Bernt Schiele
Sat 11:30 a.m. - 12:00 p.m. | Learning to Reason: Modular and Relational Representations for Visual Questions and Referring Expressions (Invited Talk 3) | Kate Saenko
Sat 1:30 p.m. - 2:00 p.m. | Multi-agent communication from raw perceptual input: what works, what doesn't and what's next (Invited Talk 4) | Angeliki Lazaridou

Multi-agent communication has traditionally been used as a computational tool to study language evolution. Recently, it has also attracted attention as a means of achieving better coordination among multiple interacting agents in complex environments. But how easily does earlier research scale to the new deep learning era? In this talk, I will first give a brief overview of previous approaches that study emergent communication in settings where agents receive symbolic data as input. I will then present some of the challenges agents face when placed in grounded environments where they receive raw perceptual information, and how environmental or pre-linguistic conditions affect the nature of the communication protocols they learn. Finally, I will discuss some potential remedies inspired by human language and communication.
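To make the referential-game setup behind this line of work concrete, here is a minimal sender-receiver sketch in PyTorch; the architectures, sizes, and the note about REINFORCE are illustrative assumptions, not the speaker's exact models.

```python
# A minimal sketch of the sender-receiver signaling game common in
# emergent-communication research; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, FEAT = 16, 32  # message vocabulary size, input feature size

class Sender(nn.Module):
    """Maps a target object's features to a distribution over messages."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(FEAT, VOCAB)
    def forward(self, target):
        return torch.distributions.Categorical(logits=self.fc(target))

class Receiver(nn.Module):
    """Scores candidate objects given the received (embedded) message."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, FEAT)
    def forward(self, message, candidates):
        # dot product between the message embedding and each candidate
        return (candidates @ self.embed(message).unsqueeze(-1)).squeeze(-1)

# One game round: the sender sees the target, the receiver must pick it
# out of a set of candidates based only on the discrete message.
target = torch.randn(1, FEAT)
candidates = torch.cat([target, torch.randn(3, FEAT)]).unsqueeze(0)
msg = Sender()(target).sample()       # discrete symbol; trainable via REINFORCE
scores = Receiver()(msg, candidates)  # higher score = more likely guess
print("receiver guessed target:", scores.argmax(-1).item() == 0)
```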
Sat 2:00 p.m. - 2:30 p.m. | Overcoming Bias in Captioning Models (Invited Talk 5) | Lisa Anne Hendricks

Most machine learning models are known to capture and exploit bias. While this can be beneficial for many classification tasks (e.g., it might be easier to recognize a computer mouse given the context of a computer and a desk), exploiting bias can also lead to incorrect predictions. In this talk, I will first consider how over-reliance on bias can lead to incorrect predictions in a scenario where it is inappropriate to rely on bias: gender prediction in image captioning. I will present the Equalizer model, which describes people and their gender more accurately by considering appropriate gender evidence. Next, I will consider how bias relates to hallucination, an interesting error mode in image captioning. I will present a metric designed to measure hallucination and consider questions such as: what causes hallucination, which models are prone to it, and do current metrics capture it accurately?
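The hallucination metric described above compares the objects a caption mentions against the image's ground-truth annotations. Below is a minimal sketch of that idea; the function name and toy object lists are illustrative assumptions, not the exact metric from the talk.

```python
# A minimal sketch of an object-hallucination rate: the fraction of objects
# mentioned in a caption that have no support in the image's ground-truth
# annotations. Object lists here are toy examples, not real data.
def hallucination_rate(caption_objects, gt_objects):
    """Share of mentioned objects that do not appear in the annotations."""
    mentioned = list(caption_objects)
    if not mentioned:
        return 0.0
    hallucinated = [obj for obj in mentioned if obj not in set(gt_objects)]
    return len(hallucinated) / len(mentioned)

# The caption mentions a "fork" that is not annotated in the image.
print(hallucination_rate(["man", "pizza", "fork"], {"man", "pizza", "table"}))
# -> 0.333...
```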
Sat 2:30 p.m. - 3:00 p.m. | Embodied language grounding (Invited Talk 6) | Katerina Fragkiadaki
Sat 3:00 p.m. - 4:30 p.m. | Poster Session and Coffee | Ramon Sanabria · Tejas Srinivasan · Vikas Raunak · Luowei Zhou · Gourab Kundu · Roma Patel · Lucia Specia · Sang Keun Choe · Anna Belova
Sat 4:30 p.m. - 5:00 p.m. | Unsupervised Bilingual Lexicon Induction from mono-lingual multimodal data (Invited Talk 7, remote) | Qin Jin
Sat 5:00 p.m. - 6:00 p.m. | New Directions for Vision & Language (Discussion Panel) | Florian Metze · Shruti Palaskar
Author Information
Florian Metze (Carnegie Mellon University)
Lucia Specia (Imperial College London)
Desmond Elliott (University of Copenhagen)
Loic Barrault (Le Mans University)
Ramon Sanabria (Carnegie Mellon University)
Shruti Palaskar (Carnegie Mellon University)