Morning Poster in Workshop: Artificial Intelligence & Human Computer Interaction
Demystifying the Role of Feedback in GPT Self-Repair for Code Generation
Theo X. Olausson · Jeevana Priya Inala · Chenglong Wang · Jianfeng Gao · Armando Solar-Lezama
Large Language Models (LLMs) have shown remarkable aptitude for generating code from natural language specifications, but they still struggle on challenging programming tasks. Self-repair, in which the user provides executable unit tests and the model uses them to debug and fix mistakes in its own code, may improve performance in these settings without significantly altering the way programmers interface with the system. However, existing studies of how and when self-repair works effectively have been limited in scope, and one might wonder how self-repair compares to keeping a software engineer in the loop to give feedback on the code model's outputs. In this paper, we analyze the ability of GPT-3.5 and GPT-4 to perform self-repair on APPS, a challenging dataset of diverse programming problems. We find that when the cost of generating both feedback and repaired code is taken into account, performance gains from self-repair are marginal and appear only with GPT-4. In contrast, when human programmers provide the feedback, the success rate of repair increases by as much as 57%. These findings suggest that self-repair still trails far behind what can be achieved with a feedback-giving human kept closely in the loop.
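For concreteness, the generate-test-feedback-repair loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `model` object with a `complete(prompt)` method, the prompt wording, the `max_repairs` budget, and the test harness are all hypothetical placeholders.

```python
import subprocess
import sys
from typing import Optional

def run_tests(program: str, tests: list[tuple[str, str]]) -> Optional[str]:
    """Run a candidate program on (stdin, expected stdout) unit tests.
    Returns None if every test passes, otherwise a textual error report."""
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin_data, capture_output=True, text=True, timeout=10,
            )
        except subprocess.TimeoutExpired:
            return f"Timeout on input {stdin_data!r}"
        if result.returncode != 0:
            return f"Runtime error on input {stdin_data!r}:\n{result.stderr}"
        if result.stdout.strip() != expected.strip():
            return (f"Wrong answer on input {stdin_data!r}: "
                    f"expected {expected!r}, got {result.stdout!r}")
    return None

def self_repair(spec: str, tests: list[tuple[str, str]], model,
                max_repairs: int = 1) -> Optional[str]:
    """One generate -> test -> explain -> repair loop (hypothetical sketch)."""
    program = model.complete(f"Write a Python program for:\n{spec}")
    for _ in range(max_repairs):
        error = run_tests(program, tests)
        if error is None:
            return program  # all unit tests already pass
        # Feedback stage: the model (or, in the human-in-the-loop variant,
        # a programmer) explains what went wrong given the failing test.
        feedback = model.complete(
            f"Specification:\n{spec}\n\nProgram:\n{program}\n\n"
            f"Test result:\n{error}\n\nExplain the bug in a few sentences.")
        # Repair stage: regenerate the program conditioned on the feedback.
        program = model.complete(
            f"Specification:\n{spec}\n\nBroken program:\n{program}\n\n"
            f"Feedback:\n{feedback}\n\nWrite a corrected Python program.")
    return program if run_tests(program, tests) is None else None
```

Replacing the feedback stage with an explanation written by a human programmer gives the human-in-the-loop variant the paper compares against, while everything else in the loop stays the same.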