Teaching Physical Awareness to LLMs through Sounds
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, yet they fundamentally lack physical awareness, i.e., an understanding of real-world physical phenomena. In this work, we present ACORN, a framework that teaches LLMs physical awareness through sound, focusing on fundamental physical phenomena such as the Doppler effect, the multipath effect, and spatial relationships. To overcome data scarcity, ACORN introduces a physics-based simulator that combines real-world sound sources with controlled physical channels to generate diverse training data. Using this simulator, we build AQA-PHY, a comprehensive Audio Question-Answer dataset, and propose an audio encoder that processes both magnitude and phase information. By connecting our audio encoder to state-of-the-art LLMs, we demonstrate reasonable results on both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation, paving the way for LLMs that understand the physical world.
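As a rough illustration of the kind of input such an encoder consumes, the sketch below computes magnitude and phase spectrograms from a waveform with NumPy. The STFT parameters and the helper name magnitude_phase_features are illustrative assumptions, not the encoder's actual configuration.

```python
# Illustrative sketch (not the paper's implementation): extracting the
# magnitude and phase spectrograms that a dual-stream audio encoder
# could consume. Window size and hop length are assumed values.
import numpy as np

def magnitude_phase_features(waveform, n_fft=512, hop=256):
    """Return (magnitude, phase) spectrograms of a mono waveform."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(waveform) - n_fft + 1, hop):
        frames.append(waveform[start:start + n_fft] * window)
    spec = np.fft.rfft(np.stack(frames), axis=-1)  # (time, freq) complex STFT
    return np.abs(spec), np.angle(spec)            # magnitude and phase branches

# Example: 1 second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
mag, phase = magnitude_phase_features(np.sin(2 * np.pi * 440 * t))
print(mag.shape, phase.shape)
```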
Lay Summary
Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, but they struggle with a fundamental challenge: understanding the physical world. In contrast, humans rely heavily on sound to perceive their surroundings: we can hear whether someone is nearby, whether a car is moving toward us, or whether a voice is coming from behind a wall.

We present ACORN, a framework that teaches LLMs to understand the physical world through sound and to develop human-like hearing. Instead of collecting expensive real-world recordings, we simulate how sound travels through space, capturing physical effects such as motion, echoes, and direction, and generate a large dataset of sound-based questions and answers. We also design a new audio encoder that mimics how humans process both the intensity and timing of sound. Our results show that LLMs can learn to detect whether a voice is blocked by a wall, estimate motion via Doppler shifts, and locate where sounds are coming from. This opens the door to safer voice-controlled systems, smarter robots, and AI that responds more naturally to the real world, making them not just smart, but also physically aware.
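To give a concrete feel for this kind of channel simulation, the sketch below passes a dry signal through a simple two-path channel (direct path plus a wall echo) and then resamples it to mimic a Doppler shift from an approaching source. The delays, attenuation, and speed are hypothetical illustrative values, not the simulator's actual parameters.

```python
# Hypothetical sketch of physics-based channel simulation: convolve a
# dry source with a two-path impulse response (direct + echo), then
# resample to approximate the Doppler shift of a source moving closer.
import numpy as np

sr = 16000
t = np.arange(sr) / sr
dry = np.sin(2 * np.pi * 300 * t)           # stand-in for a real sound source

# Two-path channel: direct arrival plus an attenuated echo 20 ms later
impulse_response = np.zeros(int(0.05 * sr))
impulse_response[0] = 1.0                    # line-of-sight path
impulse_response[int(0.02 * sr)] = 0.4       # reflected (multipath) component
reverberant = np.convolve(dry, impulse_response)

# Doppler: a source approaching at 10 m/s scales frequencies by c / (c - v)
c, v = 343.0, 10.0
factor = c / (c - v)
n_out = int(len(reverberant) / factor)
doppler = np.interp(np.arange(n_out) * factor,
                    np.arange(len(reverberant)), reverberant)
print(reverberant.shape, doppler.shape)
```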