ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
Abstract
Vision–Language–Action (VLA) agents follow instructions to perform multi-step tasks in multimodal environments. To support planning and execution in such settings, many approaches typically adopt structured post-hoc or rely on fixed decomposition and rigid alignment to improve success rate. However, once an intermediate subgoal or action is mis-specified and without a flexible correction mechanism, local errors propagate through subsequent steps and eventually accumulate into cascading failures in long-horizon reasoning. To mitigate this compounding effect, we propose Reflective Contrastive Alignment and Planning Architecture (ReCAPA), a framework that uses predictive correction to anticipate deviations and adjust representations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The corrective signals, derived from predictive correction and alignment mechanisms, jointly update the execution network during training, enabling it to flexibly adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and MAP-THOR, outperforming strong proprietary and open-source Large Language Model (LLM) baselines.