

Oral Presentation in Workshop: Synthetic Realities: Deep Learning for Detecting AudioVisual Fakes

Contributed Talk: Tampered Speaker Inconsistency Detection with Phonetically Aware Audio-visual Features

Pavel Korshunov

2019 Oral Presentation

Abstract:

The recent increase in social media based propaganda, i.e., 'fake news', calls for automated methods to detect tampered content. In this paper, we focus on detecting tampering in a video of a person speaking to a camera. This form of manipulation is easy to perform, since one can simply replace a part of the audio, dramatically changing the meaning of the video. We consider several detection approaches based on phonetic features and recurrent networks. We demonstrate that by replacing standard MFCC features with embeddings from a DNN trained for automatic speech recognition, combined with mouth landmarks (visual features), we can achieve a significant performance improvement on several challenging publicly available databases of speakers (VidTIMIT, AMI, and GRID), for which we generated sets of tampered data. The evaluations demonstrate a relative equal error rate reduction of 55% (from 10.0% to 4.5%) on the large GRID-corpus-based dataset and satisfactory generalization of the model to other datasets.
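To make the described pipeline concrete, below is a minimal sketch, assuming PyTorch, of how per-frame audio features (ASR-DNN embeddings or MFCCs) and mouth-landmark features might be fused and scored by a recurrent network. The feature dimensions, the single-layer LSTM, and the classification head are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (PyTorch) of an audio-visual tamper detector in the spirit
# of the abstract: per-frame audio embeddings concatenated with mouth-landmark
# features, scored by an LSTM. All dimensions, the single-layer LSTM, and the
# sigmoid head are illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn

class TamperDetector(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=40, hidden_dim=64):
        super().__init__()
        # Recurrent layer over the fused per-frame feature sequence.
        self.rnn = nn.LSTM(audio_dim + visual_dim, hidden_dim, batch_first=True)
        # Single logit: tampered vs. genuine.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, time, audio_dim)  ASR-DNN embeddings or MFCCs
        # visual_feats: (batch, time, visual_dim) mouth-landmark coordinates
        fused = torch.cat([audio_feats, visual_feats], dim=-1)
        out, _ = self.rnn(fused)
        # Score the final time step; a sigmoid gives a tampering probability.
        return torch.sigmoid(self.head(out[:, -1]))

# Usage: 2 clips, 100 frames, 40-dim audio + 40-dim (20 x/y landmark) visual.
model = TamperDetector()
scores = model(torch.randn(2, 100, 40), torch.randn(2, 100, 40))
print(scores.shape)  # torch.Size([2, 1])
```

Thresholding the per-clip score would then yield the equal error rates reported in the abstract; the 40-dimensional feature sizes above are placeholders.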
