Poster in Workshop: The 3rd DL4C Workshop: Emergent Possibilities and Challenges in Deep Learning for Code

Generate-Feedback-Refine: How Much Does Model Quality in Each Role Matter?

Xiang Pan · Jason Phang · Guy Davidson · Ethan Perez


Abstract:

From early in grade school, people learn from explicit feedback on assignments and other interactions. In this work, we explore how effectively language models incorporate textual feedback, focusing on whether weaker models can usefully provide feedback to stronger ones, a potential pathway to scalable oversight. Using code generation as a test domain, we experimentally investigate a generate-feedback-refine process, varying model strength in the generation, feedback, and refinement roles across the MBPP, APPS, and DS-1000 datasets. We find that in some cases weaker models provide feedback as effectively as stronger ones. Feedback and refinement consistently improve performance on APPS and DS-1000, while on MBPP feedback mainly benefits weaker generation models, underscoring differences across tasks.
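For concreteness, one round of the generate-feedback-refine process can be sketched as three sequential model calls, with a potentially different model filling each role. The Python sketch below is illustrative only: the prompt-to-completion interface, prompt wording, and the echo stand-in model are hypothetical and not taken from the paper.

# Minimal sketch of one generate-feedback-refine round; the interface and
# prompts are hypothetical illustrations, not the authors' implementation.
from typing import Callable

Model = Callable[[str], str]  # a "model" is any prompt -> completion function

def generate_feedback_refine(task: str, generator: Model,
                             critic: Model, refiner: Model) -> str:
    # Role 1: the generation model drafts a candidate solution.
    draft = generator(f"Write a Python solution for this task:\n{task}")
    # Role 2: the feedback model (possibly weaker) critiques the draft in text.
    feedback = critic(
        f"Task:\n{task}\n\nCandidate solution:\n{draft}\n\n"
        "Describe any bugs or issues in this solution."
    )
    # Role 3: the refinement model revises the draft given the feedback.
    return refiner(
        f"Task:\n{task}\n\nDraft:\n{draft}\n\nFeedback:\n{feedback}\n\n"
        "Return an improved solution."
    )

# Usage with a trivial stand-in model (replace with real LLM API calls):
echo: Model = lambda prompt: f"[model output for: {prompt[:40]}...]"
print(generate_feedback_refine("Reverse a string.", echo, echo, echo))

Passing the three roles as separate callables mirrors the paper's setup, where model strength is varied independently for generation, feedback, and refinement.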
