MFCL-Audio: An Audio Function Calling Evaluation for Large Language Models
Huanzhi Mao ⋅ Aditya Ghai ⋅ Imra Dawoodani ⋅ Tony Ginart ⋅ Shishir G. Patil ⋅ John Emmons ⋅ Joseph E. Gonzalez
Abstract
Audio agents are increasingly deployed to execute tools from spoken requests, yet audio tool use poses challenges beyond text-only function calling: perception errors (e.g., homophones, noise, disfluencies) can corrupt entities and arguments, and natural interactions often require clarification that changes the tool-calling protocol. We introduce MFCL-Audio, a large-scale benchmark for audio function calling with 6.2K expert-verified tasks across two suites that mirror common deployments: MFCL Text Audio (pipelined ASR$\rightarrow$LLM$\rightarrow$tools via transcripts) and MFCL True Audio (end-to-end audio-in$\rightarrow$tool calls). MFCL-Audio includes controlled speech and acoustic perturbations (accent and speaking-rate variation, content disfluencies, and background noise) generated through a controllable audio synthesis and augmentation pipeline. We provide automatic grading for both function names and argument values, using AST-based matching for single-turn calls and response/state-based metrics for multi-turn interactions, enabling scalable evaluation without LLM judges. Evaluating a broad set of models, we propose a failure-mode taxonomy and analyze which speech and noise factors most strongly degrade tool-calling accuracy. We release the benchmark, evaluation harness, and audio pipeline to support research on reliable speech-based agents.
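To make the grading criterion concrete, below is a minimal sketch of AST-based call matching in Python. It is not the released harness, only an illustration of the idea under simplifying assumptions: the model emits a call as a Python-style expression, and a prediction is correct when the function name and argument values match the ground truth, regardless of argument order or surface formatting. The `get_weather` call and its parameters are hypothetical examples, not taken from the benchmark.

```python
import ast

def parse_call(call_str: str):
    """Parse a Python-style function call string into (name, kwargs).

    Assumes calls are emitted like: get_weather(city="Boston", unit="F").
    """
    tree = ast.parse(call_str, mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call):
        raise ValueError(f"not a function call: {call_str!r}")
    name = ast.unparse(call.func)
    # literal_eval turns each argument's AST node into a plain Python value,
    # so "1" vs 1 and '"F"' vs "F" are compared as values, not as strings.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, kwargs

def calls_match(predicted: str, expected: str) -> bool:
    """AST-based match: same function name and same argument values,
    independent of argument order and formatting; unparseable output fails."""
    try:
        p_name, p_args = parse_call(predicted)
    except (SyntaxError, ValueError):
        return False
    e_name, e_args = parse_call(expected)
    return p_name == e_name and p_args == e_args

# Argument order and spacing differ, but the calls are semantically equal.
assert calls_match(
    'get_weather(unit="F", city="Boston")',
    'get_weather(city="Boston", unit="F")',
)
```

Comparing parsed structures rather than raw strings is what allows this style of grading to scale without an LLM judge: correctness reduces to an exact check over names and values.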