Models like BERT or GPT-2 can do amazing things with language, which raises the interesting question of whether such text-based models could ever really "understand" language. One clear difference between BERT-understanding and human understanding is that BERT doesn't learn to connect language to its actions or to its perception of the world it inhabits. I'll discuss an alternative approach to language understanding in which a neural-network-based agent is trained to associate words and phrases with things that it learns to see and do. First, I'll provide some evidence for the promise of this approach by showing that the interactive, first-person perspective of an agent affords it a particular inductive bias that helps it to generalize from its training experience to out-of-distribution settings in ways that seem natural or 'systematic'. Second, I'll show that the amount of 'propositional' (i.e. linguistic) knowledge that emerges in the internal states of the agent as it interacts with the world can be increased significantly if the agent learns to make predictions about observations multiple timesteps into the future. This underlines some important common ground between the agent-based and BERT-style approaches: both attest to the power of prediction and the importance of context in acquiring semantic representations. Finally, I'll connect BERT and agent-based learning in a more literal way, by showing how an agent endowed with BERT representations can achieve substantial (zero-shot) transfer from template-based language to noisy natural instructions given by humans with access to the agent's world.