Expo Demonstration
Excitement-Driven AI Sports Commentary Generation
Yang Zhang
West Exhibition Hall A-B1
Speech language models are language models that can process and understand speech. A key capability for such models is capturing the intricate interdependency between content and prosody, which many existing approaches do not handle satisfactorily. We propose a speech language model that explicitly represents prosody information and its relationship with the text, and can therefore generate expressive speech appropriate to the context.
In this demo, we combine our speech modeling technology with multi-modal language models to build an expressive AI sports commentary generation system. The system analyzes videos of tennis matches and generates expressive play-by-play speech commentary. Notably, it detects the excitement level of the play from crowd and player reactions and adjusts the excitement level of the generated speech to match.
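To make the idea of excitement-conditioned speech concrete, the following is a minimal sketch and not the demo's actual method. The system described above uses a speech language model that represents prosody explicitly. This sketch swaps in a hand-written linear rule purely for illustration. The names `ProsodyControls`, `fuse_excitement`, and `prosody_for`, the 0.6 crowd weight, and all numeric ranges are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyControls:
    """Hypothetical prosody settings that a speech synthesizer might accept."""
    pitch_shift_semitones: float
    rate_multiplier: float
    gain_db: float


def _clamp01(x: float) -> float:
    """Restrict a score to the range [0, 1]."""
    return min(max(x, 0.0), 1.0)


def fuse_excitement(crowd_score: float, player_score: float,
                    crowd_weight: float = 0.6) -> float:
    """Weighted average of crowd and player reaction scores (assumed in [0, 1])."""
    crowd = _clamp01(crowd_score)
    player = _clamp01(player_score)
    return crowd_weight * crowd + (1.0 - crowd_weight) * player


def prosody_for(excitement: float) -> ProsodyControls:
    """Linearly interpolate from calm to excited settings (illustrative values)."""
    e = _clamp01(excitement)
    return ProsodyControls(
        pitch_shift_semitones=4.0 * e,   # higher pitch when excited
        rate_multiplier=1.0 + 0.3 * e,   # faster delivery when excited
        gain_db=6.0 * e,                 # louder delivery when excited
    )


if __name__ == "__main__":
    level = fuse_excitement(crowd_score=0.9, player_score=0.5)
    print(level, prosody_for(level))
```

A learned model would replace both functions: reaction scores would come from the multi-modal analysis of the video, and prosody would be generated jointly with the commentary text rather than set by a fixed rule.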