

Contributed Talk in Workshop: Self-supervision in Audio and Speech

Self-supervised Pitch Detection by Inverse Audio Synthesis

Jesse Engel


Abstract:

Audio scene understanding, parsing sound into a hierarchy of meaningful parts, is an open problem in representation learning. Sound is a particularly challenging domain due to its high dimensionality, sequential dependencies, and hierarchical structure. Differentiable Digital Signal Processing (DDSP) greatly simplifies the forward problem of generating audio by introducing differentiable synthesizer and effects modules that combine strong signal priors with end-to-end learning. Here, we focus on the inverse problem: inferring synthesis parameters to approximate an audio scene. We demonstrate that DDSP modules can enable a new approach to self-supervision, generating synthetic audio with differentiable synthesizers and training feature extractor networks to infer the synthesis parameters. By building a hierarchy from sinusoidal to harmonic representations, we show that it is possible to use such an inverse modeling approach to disentangle pitch from timbre, an important task in audio scene understanding.
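To make the self-supervision idea concrete, here is a minimal sketch of how synthetic training pairs might be generated: sample synthesis parameters, render audio from them, and keep the parameters as labels for a feature extractor to predict. This is a toy, non-differentiable harmonic synthesizer in plain Python, not the DDSP library's API or the talk's actual model; the names `harmonic_synth` and `make_training_pair`, the sample rate, and the parameter ranges are all assumptions chosen for illustration.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz; an assumed value for illustration


def harmonic_synth(f0_hz, harmonic_amps, n_samples, sample_rate=SAMPLE_RATE):
    """Toy harmonic synthesizer: a sum of sinusoids at integer multiples of f0."""
    audio = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = 0.0
        for k, amp in enumerate(harmonic_amps, start=1):
            freq = k * f0_hz
            if freq < sample_rate / 2:  # skip harmonics above Nyquist
                sample += amp * math.sin(2.0 * math.pi * freq * t)
        audio.append(sample)
    return audio


def make_training_pair(rng, n_samples=1024, n_harmonics=8):
    """Sample random synthesis parameters and render the matching audio.

    The parameters serve as free labels: pitch is f0_hz, and the
    harmonic amplitude distribution is a crude stand-in for timbre.
    """
    f0_hz = rng.uniform(80.0, 800.0)  # hypothetical pitch range
    raw = [rng.random() for _ in range(n_harmonics)]
    total = sum(raw)
    harmonic_amps = [a / total for a in raw]  # normalize so amplitudes sum to 1
    audio = harmonic_synth(f0_hz, harmonic_amps, n_samples)
    return audio, {"f0_hz": f0_hz, "harmonic_amps": harmonic_amps}


rng = random.Random(0)
audio, params = make_training_pair(rng)
```

In this sketch a feature extractor network would take `audio` as input and be trained to predict `params`, so no human annotation is needed. Keeping pitch (`f0_hz`) separate from the harmonic amplitudes mirrors, in a very simplified way, the pitch and timbre disentanglement described above.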

Link to the video: https://slideslive.com/38930745/selfsupervised-pitch-detection-by-inverse-audio-synthesis
