I will focus on the problem of executing natural language instructions in a collaborative environment. I will propose the task of learning to follow sequences of instructions in a collaborative scenario in which two agents, a leader and a follower, both execute actions in the environment, and the leader directs the follower using natural language. To study this problem, we build CerealBar, a multi-player 3D game where a leader instructs a follower and the two act in the environment together to accomplish complex goals. My emphasis is on learning an autonomous follower that executes the instructions of a human leader. I will briefly describe a model that addresses this problem and a learning method that relies on static recordings of human-human interactions while still learning to recover from errors that cascade from one instruction to the next.