The ICML Expressive Vocalizations (ExVo) Workshop and Competition 2022 introduces, for the first time in a competition setting, the machine learning problem of understanding and generating vocal bursts – a wide range of emotional non-linguistic utterances. Participants of ExVo are presented with three tasks that utilize a single dataset. The dataset and three tasks draw attention to recent innovations in emotion science and capture 10 dimensions of emotion reliably perceived in distinct vocal bursts: Awe, Excitement, Amusement, Awkwardness, Fear, Horror, Distress, Triumph, Sadness, and Surprise. Of particular interest to the ICML community, these tasks highlight the need for advanced machine learning techniques for multi-task learning, audio generation, and personalized few-shot learning of nonverbal expressive style.
With studies of vocal emotional expression often relying on datasets too small to apply the latest machine learning innovations, the ExVo competition and workshop provides an unprecedented platform for the development and discussion of novel strategies for understanding vocal bursts, and will enable unique forms of collaboration among researchers from diverse disciplines. The workshop, organized by leading researchers in emotion science and machine learning, proposes the following three tasks: the Multi-task High-Dimensional Emotion, Age & Country Task (ExVo Multi-Task); the Generative Emotional Vocal Burst Task (ExVo Generate); and the Few-Shot Emotion Recognition task (ExVo Few-Shot).
Important dates (AoE)
- Challenge Opening (data available): April 1, 2022.
- Baselines and paper released: April 8, 2022.
- ExVo MultiTask submission deadline: May 12, 2022.
- ExVo Few-Shot (test-labels): May 13, 2022.
- Workshop paper submission: June 6, 2022 (extended from May 20, 2022).
For those interested in submitting research to the ExVo workshop outside of the competition, we encourage contributions covering the following topics:
- Detecting and Understanding Vocal Emotional Behavior
- Multi-Task Learning in Affective Computing
- Generating Nonverbal Vocalizations or Speech Prosody
- Personalized Machine Learning for Affective Computing
- Other topics related to Affective Verbal and Nonverbal Vocalization
Sat 6:00 a.m. - 6:15 a.m. | ExVo Welcome (Opening Remarks)
Alice Baird
Sat 6:15 a.m. - 6:25 a.m. | The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts (Spotlight)
The ICML Expressive Vocalization (ExVo) Competition is focused on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers. The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts. The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions. The third, ExVo-FewShot, requires participants to leverage few-shot learning incorporating speaker identity to train a model for the recognition of 10 emotions conveyed by vocal bursts. This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies. The baselines are as follows: for ExVo-MultiTask, a combined score (SMTL), computed as the harmonic mean of the Concordance Correlation Coefficient (CCC), Unweighted Average Recall (UAR), and inverted Mean Absolute Error (MAE), reaches at best 0.335; for ExVo-Generate, Fréchet inception distance (FID) scores between the training set and generated samples range from 4.81 to 8.27 depending on the emotion, and combining the inverted FID with perceptual ratings of the generated samples yields an SGen of 0.174; for ExVo-FewShot, a mean CCC of 0.444 is obtained.
Alice Baird · Panagiotis Tzirakis · Alan Cowen · Gauthier Gidel · Marco Jiralerspong · Eilif Muller · Kory Mathewson · Bjoern Schuller · Erik Cambria · Dacher Keltner
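As a quick illustration of the ExVo-MultiTask combined score described above, the sketch below takes the three per-task metrics and returns their harmonic mean; the exact inversion/normalisation of the age MAE is an assumption here, so treat this as illustrative rather than the organisers' official scoring code.

```python
import numpy as np

def harmonic_mean(scores):
    """Harmonic mean of strictly positive scores."""
    scores = np.asarray(scores, dtype=float)
    return len(scores) / np.sum(1.0 / scores)

def combined_multitask_score(ccc_emotion, uar_country, mae_age):
    """Harmonic mean of emotion CCC, country UAR, and inverted age MAE.

    NOTE: the inversion 1 / (1 + MAE) is an illustrative assumption; the
    challenge baseline paper defines the exact normalisation used for SMTL.
    """
    inverted_mae = 1.0 / (1.0 + mae_age)
    return harmonic_mean([ccc_emotion, uar_country, inverted_mae])

# Example with made-up validation metrics.
print(combined_multitask_score(ccc_emotion=0.42, uar_country=0.41, mae_age=4.2))
```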
Sat 6:25 a.m. - 6:30 a.m. | Questions
Sat 6:30 a.m. - 6:40 a.m. | Exploring speaker enrolment for few-shot personalisation in emotional vocalisation prediction (Spotlight)
In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction. The core contribution is an "enrolment" encoder which utilises two unlabelled samples of the target speaker to adjust the output of the emotion encoder; the adjustment is based on dot-product attention, thus effectively functioning as a form of "soft" feature selection. The emotion and enrolment encoders are based on two standard audio architectures: CNN14 and CNN10. The two encoders are further guided to forget or learn auxiliary emotion and/or speaker information. Our best approach achieves a CCC of .650 on the ExVo Few-Shot dev set, a 2.5% increase over our baseline CNN14 CCC of .634.
Andreas Triantafyllopoulos · Meishu Song · Zijiang Yang · Xin Jing · Björn Schuller
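A minimal sketch of the "soft feature selection" idea described above: dot-product attention between an emotion embedding and two unlabelled enrolment embeddings of the target speaker produces per-dimension weights that modulate the emotion features. The embedding dimension and the sigmoid gating are illustrative assumptions, not the authors' exact CNN14/CNN10-based architecture.

```python
import torch
import torch.nn.functional as F

def enrolment_adjust(emotion_emb, enrolment_embs):
    """Adjust an emotion embedding with speaker enrolment samples.

    emotion_emb:    (batch, dim)    output of the emotion encoder
    enrolment_embs: (batch, 2, dim) embeddings of two unlabelled samples
                    of the target speaker (the "enrolment" encoder output)

    Dot-product attention over the enrolment samples yields a speaker
    summary, which is squashed into per-dimension weights acting as a
    soft feature-selection mask. Illustrative sketch only.
    """
    # attention scores between the emotion query and each enrolment sample
    scores = torch.einsum("bd,bkd->bk", emotion_emb, enrolment_embs)
    attn = F.softmax(scores / emotion_emb.shape[-1] ** 0.5, dim=-1)
    speaker_summary = torch.einsum("bk,bkd->bd", attn, enrolment_embs)
    # per-dimension gate in (0, 1) applied to the emotion features
    gate = torch.sigmoid(speaker_summary)
    return emotion_emb * gate

emotion = torch.randn(4, 512)
enrolment = torch.randn(4, 2, 512)
print(enrolment_adjust(emotion, enrolment).shape)  # torch.Size([4, 512])
```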
Sat 6:40 a.m. - 6:45 a.m. | Questions
Sat 6:45 a.m. - 6:55 a.m. | Redundancy Reduction Twins Network: A Training Framework for Multi-output Emotion Regression (Spotlight)
In this paper, we propose the Redundancy Reduction Twins Network (RRTN), a redundancy reduction training framework that minimizes redundancy by measuring the cross-correlation matrix between the outputs of the same network fed with distorted versions of a sample and bringing it as close to the identity matrix as possible. RRTN also applies the Barlow Twins loss function to help maximize the similarity of representations obtained from different distorted versions of a sample. However, as the distribution of losses can cause performance fluctuations in the network, we also propose a Restrained Uncertainty Weight Loss (RUWL) for joint training to identify the best weights for the loss function. Our best approach, CNN14 with the proposed methodology, obtains a CCC of .678 for emotion regression on the ExVo Multi-task dev set, a 4.8% increase over the vanilla CNN14 CCC of .647, a statistically significant difference at the 95% confidence interval (two-tailed).
Xin Jing · Andreas Triantafyllopoulos · Zijiang Yang · Björn Schuller · Meishu Song
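For reference, a compact PyTorch sketch of the Barlow Twins objective mentioned above: the cross-correlation matrix between batch-normalised outputs for two distorted views of the same samples is pushed towards the identity. The lambda weighting and normalisation follow the original Barlow Twins formulation and are assumptions relative to RRTN's exact setup.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Redundancy-reduction loss over two views.

    z_a, z_b: (batch, dim) outputs of the same network for two distorted
    versions of the same inputs. The empirical cross-correlation matrix is
    driven towards the identity: diagonal terms encourage invariance,
    off-diagonal terms reduce redundancy between feature dimensions.
    """
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.T @ z_b) / n                         # (dim, dim) cross-correlation
    identity = torch.eye(d, device=c.device)
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
    off_diag = ((c * (1 - identity)) ** 2).sum()
    return on_diag + lam * off_diag

loss = barlow_twins_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```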
Sat 6:55 a.m. - 7:00 a.m. | Questions
Sat 7:00 a.m. - 7:30 a.m. | Tea/Coffee Break
Sat 7:30 a.m. - 8:15 a.m. | "Using WaveNet to reunite speech-impaired users with their original voices" (invited talk, Keynote)
Dr Yutian Chen is a staff research scientist at DeepMind. He obtained his PhD in machine learning at the University of California, Irvine, and later worked at the University of Cambridge as a research associate (postdoc) before joining DeepMind. Yutian took part in the AlphaGo and AlphaGo Zero projects, developing Go-playing AI programs that defeated the world champions. The AlphaGo project was ranked among the top 10 discoveries of the 2010s by New Scientist magazine. Yutian has conducted research in multiple machine learning areas, including Bayesian methods, deep learning, reinforcement learning, generative models, and meta-learning, with applications in gaming AI, computer vision, and text-to-speech. Yutian also serves as a reviewer and area chair for multiple academic conferences and journals.
Yutian Chen
Sat 8:15 a.m. - 8:30 a.m. | Questions
Sat 8:30 a.m. - 8:40 a.m. | Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations (Spotlight)
We formulated non-speech vocalization (NSV) modeling as a text-to-speech (TTS) task and verified its viability. Specifically, we evaluated the phonetic expressivity of HuBERT speech units on NSVs and verified our model's ability to generalize to few-shot speakers. In addition, we explicated one of the major challenges in the ExVo dataset by visualizing the speaker space our model learned and discussed possible improvements for future research. Audio samples of synthesized NSVs can be found on our anonymized demo page.
Chin-Cheng Hsu
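One common recipe for the discrete speech units mentioned above is to quantise intermediate HuBERT features with k-means. The sketch below uses torchaudio's pre-trained HUBERT_BASE bundle; the layer index, cluster count, and dummy waveforms are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE          # downloads pre-trained weights
model = bundle.get_model().eval()

def hubert_features(waveform, sample_rate, layer=6):
    """Frame-level features from an intermediate HuBERT transformer layer."""
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
    with torch.no_grad():
        layer_outputs, _ = model.extract_features(waveform)  # list of (1, frames, dim)
    return layer_outputs[layer].squeeze(0)                   # (frames, dim)

# Fit k-means on features pooled from (a subset of) training audio, then map
# every frame to its nearest centroid to obtain a discrete unit sequence.
# Random noise stands in for real vocal bursts; 100 clusters is illustrative.
train_feats = torch.cat([hubert_features(torch.randn(1, 16000), 16000) for _ in range(8)])
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(train_feats.numpy())
units = kmeans.predict(hubert_features(torch.randn(1, 16000), 16000).numpy())
print(units[:20])
```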
Sat 8:40 a.m. - 8:45 a.m. | Questions
Sat 8:45 a.m. - 8:55 a.m. | Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms (Spotlight)
We describe our approach for the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition (Baird et al., 2022). We train a conditional StyleGAN2 (Karras et al., 2019) architecture on mel-spectrograms of preprocessed versions of the audio samples. The mel-spectrograms generated by the model are then inverted back to the audio domain using Griffin-Lim. As a result, our generated samples significantly improve upon the baseline provided by Baird et al. (2022) from both a qualitative and a quantitative perspective. More precisely, on all emotions, we improve the FAD of the baseline by a significant factor ranging from 1.97 (Awe) to 3.9 (Sadness).
Marco Jiralerspong · Gauthier Gidel
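The mel-to-waveform step above can be reproduced with librosa's Griffin-Lim-based mel inversion; the STFT/mel parameters and the example clip below are illustrative assumptions, not the authors' exact preprocessing.

```python
import librosa
import soundfile as sf

sr = 16000
# A bundled librosa example clip stands in for an audio sample / generated spectrogram source.
y, _ = librosa.load(librosa.example("trumpet"), sr=sr)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128)

# Invert the mel-spectrogram back to audio with Griffin-Lim phase reconstruction,
# mirroring the inversion step applied to the StyleGAN2 outputs.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256, n_iter=64)
sf.write("reconstructed.wav", y_hat, sr)
```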
Sat 8:55 a.m. - 9:00 a.m. | Questions
Sat 9:00 a.m. - 10:30 a.m. | Lunch
Sat 10:30 a.m. - 11:00 a.m. | "Fundamental advances in understanding nonverbal behavior" (invited talk, Keynote)
Dr. Alan Cowen is an applied mathematician and computational emotion scientist developing new data-driven methods to study human experience and expression. He was previously a researcher at the University of California and visiting scientist at Google, where he helped establish affective computing research efforts. His discoveries have been featured in leading journals such as Nature, PNAS, Science Advances, and Nature Human Behavior and covered in press outlets ranging from CNN to Scientific American. His research applies new computational tools to address how emotional behaviors can be evoked, conceptualized, predicted, and annotated, how they influence our social interactions, and how they bring meaning to our everyday lives.
Alan Cowen
Sat 11:00 a.m. - 11:15 a.m. | Questions
Sat 11:15 a.m. - 11:25 a.m. | Dynamic Restrained Uncertainty Weighting Loss for Multitask Learning of Vocal Expression (Spotlight)
We propose a novel Dynamic Restrained Uncertainty Weighting Loss to handle the problem of balancing the contributions of multiple tasks in the ICML ExVo 2022 Challenge. The multi-task track aims to jointly recognize expressed emotions and demographic traits from vocal bursts. Our strategy combines the advantages of Uncertainty Weighting and Dynamic Weight Average by extending the weights with a restraint term to make the learning process more explainable. We use a lightweight multi-exit CNN architecture to implement our proposed loss approach. The experimental H-Mean score (0.394) shows a substantial improvement over the baseline H-Mean score (0.335).
Meishu Song · Zijiang Yang · Andreas Triantafyllopoulos · Xin Jing · Vincent Karas · Jiangjian Xie · Zixing Zhang · Yamamoto Yoshiharu · Björn Schuller
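A simplified sketch of the homoscedastic uncertainty weighting (Kendall et al.) that the proposed loss builds on: each task loss is scaled by a learned log-variance plus a regularising term. The restraint term and dynamic averaging of the proposed loss are not reproduced here; this is illustrative only.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic uncertainty weighting for multi-task training.

    Each task loss L_i is combined as exp(-s_i) * L_i + s_i, where
    s_i = log(sigma_i^2) is a learnable parameter. This is the classic
    Kendall et al. formulation; the dynamic/restrained variants described
    above add further terms not shown here.
    """
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

criterion = UncertaintyWeightedLoss(num_tasks=3)
emotion_loss, age_loss, country_loss = torch.rand(3, requires_grad=True)  # placeholder losses
print(criterion([emotion_loss, age_loss, country_loss]))
```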
Sat 11:25 a.m. - 11:30 a.m. | Questions
Sat 11:30 a.m. - 11:40 a.m. | Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers (Spotlight)
This technical report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask). We first applied image classification models of various sizes on mel-spectrogram representations of the vocal bursts, as is standard in sound event detection literature. Results from these models show an increase of 21.24% over the baseline system with respect to the harmonic mean of the task metrics, and comprise our team's main submission to the MultiTask track. We then sought to characterize the headroom in the MultiTask track by applying a large pre-trained Conformer model that previously achieved state-of-the-art results on paralinguistic tasks like speech emotion recognition and mask detection. We additionally investigated the relationship between the sub-tasks of emotional expression, country of origin, and age prediction, and discovered that the best performing models are trained as single-task models, questioning whether the problem truly benefits from a multitask setting.
Josh Belanich · Krishna Somandepalli · Brian Eoff · Brendan Jou
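The image-classification setup on mel-spectrograms described above can be sketched as a standard vision backbone with three output heads (10 emotion intensities, age, country). The ResNet-18 choice, single-channel input handling, and head sizes below are illustrative assumptions, not the authors' exact models.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MultiTaskVocalBurstModel(nn.Module):
    """ResNet backbone on single-channel mel-spectrograms with three heads."""
    def __init__(self, num_emotions=10, num_countries=4):
        super().__init__()
        backbone = resnet18(weights=None)
        # mel-spectrograms have one channel instead of three
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.emotion_head = nn.Linear(512, num_emotions)   # emotion intensities
        self.age_head = nn.Linear(512, 1)                  # age regression
        self.country_head = nn.Linear(512, num_countries)  # country classification

    def forward(self, mel):                                # mel: (batch, 1, n_mels, frames)
        h = self.backbone(mel)
        return self.emotion_head(h), self.age_head(h), self.country_head(h)

model = MultiTaskVocalBurstModel()
emotions, age, country = model(torch.randn(2, 1, 128, 256))
print(emotions.shape, age.shape, country.shape)
```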
Sat 11:40 a.m. - 11:45 a.m. | Questions
Sat 11:45 a.m. - 11:55 a.m. | Exploring the Effectiveness of Self-supervised Learning and Classifier Chains in Emotion Recognition of Nonverbal Vocalizations (Spotlight)
We present an emotion recognition system for nonverbal vocalizations (NVs) submitted to the ExVo Few-Shot track of the ICML Expressive Vocalizations Competition 2022. The proposed method uses self-supervised learning (SSL) models to extract features from NVs and uses a classifier chain to model the label dependency between emotions. Experimental results demonstrate that the proposed method can significantly improve the performance of this task compared to several baseline methods. Our proposed method obtained a mean concordance correlation coefficient (CCC) of 0.725 on the validation set and 0.739 on the test set, while the best baseline method only obtained 0.554 on the validation set. We publish our code at https://github.com/Aria-K-Alethia/ExVo to help others reproduce our experimental results.
Detai Xin · Shinnosuke Takamichi · Hiroshi Saruwatari
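The chain idea above, adapted to continuous emotion intensities, can be sketched with scikit-learn's RegressorChain, where each output is predicted from the input features plus the previously predicted emotions. The base regressor, feature dimensions, and random data below are illustrative assumptions.

```python
import numpy as np
from sklearn.multioutput import RegressorChain
from sklearn.linear_model import Ridge

# X: utterance-level SSL embeddings (e.g. mean-pooled), y: 10 emotion intensities.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))
y = rng.uniform(size=(200, 10))

# Each regressor in the chain sees the features plus all previously
# predicted emotions, which models dependencies between emotion outputs.
chain = RegressorChain(Ridge(alpha=1.0), order="random", random_state=0)
chain.fit(X, y)
print(chain.predict(X[:2]).shape)  # (2, 10)
```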
Sat 11:55 a.m. - 12:00 p.m. | Questions
Sat 12:00 p.m. - 12:30 p.m. | Tea/Coffee Break
Sat 12:30 p.m. - 12:50 p.m. | "Neurosymbolic AI for Sentiment Analysis" (invited talk, Keynote)
Dr. Erik Cambria is the Founder of SenticNet, a Singapore-based company offering B2B sentiment analysis services, and an Associate Professor at NTU, where he also holds the appointment of Provost Chair in Computer Science and Engineering. Prior to joining NTU, he worked at Microsoft Research Asia (Beijing) and HP Labs India (Bangalore) and earned his PhD through a joint programme between the University of Stirling and MIT Media Lab. His research focuses on neurosymbolic AI for explainable natural language processing in domains like sentiment analysis, dialogue systems, and financial forecasting. He is the recipient of several awards, e.g., the IEEE Outstanding Career Award, was listed among the AI's 10 to Watch, and was featured in Forbes as one of the 5 People Building Our AI Future. He is an IEEE Fellow, Associate Editor of many top-tier AI journals, e.g., INFFUS and IEEE TAFFC, and is involved in various international conferences as program chair and SPC member.
Erik Cambria
Sat 12:50 p.m. - 1:00 p.m. | Self-supervision and Learnable STRFs for Age, Emotion and Country Prediction (Spotlight)
This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge. The method of choice utilized a combination of spectro-temporal modulation and self-supervised features, followed by an encoder-decoder network organized in a multitask paradigm. We evaluate the complementarity between the tasks posed by examining independent task-specific and joint models, and explore the relative strengths of different feature sets. We also introduce a simple score fusion mechanism to leverage the complementarity of different feature sets for this task. We find that robust data pre-processing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best test score of 41.2.
Roshan Sharma · Tyler Vuong · Mark Lindsey · Hira Dhamyal · Bhiksha Raj · Rita Singh
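The score-fusion step mentioned above can be illustrated as a simple convex combination of per-task predictions from two systems (e.g. an STRF-based and a HuBERT-based model), with the fusion weight tuned on the validation set. The weight value and placeholder predictions below are assumptions for illustration.

```python
import numpy as np

def fuse_scores(preds_a, preds_b, weight=0.5):
    """Late (score-level) fusion of two models' predictions.

    preds_a, preds_b: arrays of shape (num_samples, num_outputs) containing
    each system's predictions for the same task (e.g. 10 emotion intensities).
    The fusion weight would normally be selected on the validation set.
    """
    return weight * np.asarray(preds_a) + (1.0 - weight) * np.asarray(preds_b)

strf_preds = np.random.rand(5, 10)     # placeholder predictions, STRF-based model
hubert_preds = np.random.rand(5, 10)   # placeholder predictions, HuBERT-based model
print(fuse_scores(strf_preds, hubert_preds, weight=0.6).shape)
```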
Sat 1:00 p.m. - 1:05 p.m. | Questions
Sat 1:05 p.m. - 1:15 p.m. | Comparing supervised and self-supervised embedding for ExVo Multi-Task learning track (Spotlight)
The ICML Expressive Vocalizations (ExVo) Multi-task challenge 2022 focuses on understanding the emotional facets of non-linguistic vocalizations, or vocal bursts (VBs). The objective of this challenge is to predict emotional intensities for VBs; being a multi-task challenge, it also requires predicting speakers' age and native country. For this challenge we study and compare two distinct embedding spaces: self-supervised learning (SSL) based embeddings and task-specific supervised learning based embeddings. Towards that, we investigate feature representations obtained from several pre-trained SSL neural networks and task-specific supervised classification neural networks. Our studies show that the best performance is obtained with a hybrid approach, where predictions derived via both SSL and task-specific supervised learning are used. Our best system on the test set surpasses the ComParE baseline by a relative margin of 13%.
Tilak Purohit · Imen Ben Mahmoud · Bogdan Vlasenko · Mathew Magimai.-Doss
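As one simple way to combine the two embedding spaces discussed above, the sketch below concatenates utterance-level SSL and task-specific supervised embeddings before a shallow multi-output regressor; note the authors fuse predictions from both rather than embeddings, and all dimensions, data, and the regressor choice here are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 300
ssl_emb = rng.normal(size=(n, 768))         # e.g. mean-pooled SSL (wav2vec 2.0-style) features
supervised_emb = rng.normal(size=(n, 256))  # e.g. penultimate layer of a task-trained classifier
emotion_targets = rng.uniform(size=(n, 10))

# Hybrid representation: concatenate both embedding spaces, then fit a
# simple multi-output regressor on top.
hybrid = np.concatenate([ssl_emb, supervised_emb], axis=1)
model = Ridge(alpha=1.0).fit(hybrid, emotion_targets)
print(model.predict(hybrid[:2]).shape)  # (2, 10)
```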
Sat 1:15 p.m. - 1:20 p.m. | Questions
Sat 1:20 p.m. - 1:30 p.m. | Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion, Age, and Origin from Vocal Bursts (Spotlight)
We present Burst2Vec, our multi-task learning approach to predict emotion, age, and origin (i.e., native country/language) from vocal bursts. Burst2Vec utilises pre-trained speech representations to capture acoustic information from raw waveforms and incorporates the concept of model debiasing via adversarial training. Our models achieve a relative 30% performance gain over baselines using pre-extracted features and score the highest amongst all participants in the ICML ExVo 2022 Multi-Task Challenge.
Atijit Anuchitanukul · Lucia Specia
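The adversarial debiasing mentioned above is commonly implemented with a gradient reversal layer: an auxiliary head tries to predict the nuisance attribute while reversed gradients push the shared representation to discard it. This is a generic sketch with placeholder losses and head sizes, not Burst2Vec's exact training recipe.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, gradient negation in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Shared encoder features feed the main emotion head normally, while an
# adversarial head sees gradient-reversed features: training the adversary
# to predict a nuisance attribute then removes that information upstream.
features = torch.randn(8, 256, requires_grad=True)   # stand-in for encoder outputs
emotion_head = nn.Linear(256, 10)
adversary_head = nn.Linear(256, 4)                    # e.g. nuisance-attribute classifier

emotion_loss = emotion_head(features).pow(2).mean()             # placeholder loss
adv_loss = adversary_head(grad_reverse(features)).pow(2).mean() # placeholder loss
(emotion_loss + adv_loss).backward()
print(features.grad.shape)
```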
Sat 1:30 p.m. - 1:35 p.m. | Questions
Sat 1:35 p.m. - 2:00 p.m. | Winner Announcements (Closing Remarks)
Alice Baird