Poster in Workshop: The 3rd DL4C Workshop: Emergent Possibilities and Challenges in Deep Learning for Code

Generate-Feedback-Refine: How Much Does Model Quality in Each Role Matter?

Xiang Pan · Jason Phang · Guy Davidson · Ethan Perez


Abstract:

From early in grade school, people learn from explicit feedback on assignments and other interactions. In this work, we explore how effectively language models incorporate textual feedback, focusing on whether weaker models can usefully provide feedback to stronger ones, a potential pathway to scalable oversight. Using code generation as a test domain, we experimentally investigate a generate-feedback-refine process, varying model strength in the generation, feedback, and refinement roles across the MBPP, APPS, and DS-1000 datasets. We find that in some cases weaker models provide feedback as effectively as stronger ones. Feedback and refinement consistently improve performance on APPS and DS-1000, while on MBPP feedback mainly benefits weaker generation models, underscoring differences across tasks.
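For concreteness, one round of the generate-feedback-refine process can be sketched as three sequential model calls, with a potentially different model filling each role. The Python sketch below is illustrative only: the prompt-to-completion interface, prompt wording, and the echo stand-in model are hypothetical and not taken from the paper.

# Minimal sketch of one generate-feedback-refine round; the interface and
# prompts are hypothetical illustrations, not the authors' implementation.
from typing import Callable

Model = Callable[[str], str]  # a "model" is any prompt -> completion function

def generate_feedback_refine(task: str, generator: Model,
                             critic: Model, refiner: Model) -> str:
    # Role 1: the generation model drafts a candidate solution.
    draft = generator(f"Write a Python solution for this task:\n{task}")
    # Role 2: the feedback model (possibly weaker) critiques the draft in text.
    feedback = critic(
        f"Task:\n{task}\n\nCandidate solution:\n{draft}\n\n"
        "Describe any bugs or issues in this solution."
    )
    # Role 3: the refinement model revises the draft given the feedback.
    return refiner(
        f"Task:\n{task}\n\nDraft:\n{draft}\n\nFeedback:\n{feedback}\n\n"
        "Return an improved solution."
    )

# Usage with a trivial stand-in model (replace with real LLM API calls):
echo: Model = lambda prompt: f"[model output for: {prompt[:40]}...]"
print(generate_feedback_refine("Reverse a string.", echo, echo, echo))

Passing the three roles as separate callables mirrors the paper's setup, where model strength is varied independently for generation, feedback, and refinement.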
