Belief-Based Offline Reinforcement Learning for Delay-Robust Policy Optimization
Simon Zhan · Qingyuan Wu · Zhaofeng Wang · Frank Yang · Xiangyu Shi · Chao Huang · Qi Zhu
Abstract
Offline–to–online deployment of reinforcement learning (RL) agents often stumbles over two fundamental gaps: (1) the sim-to-real gap, where real-world systems exhibit latency and other physical imperfections not captured in simulation; and (2) the interaction gap, where policies trained purely offline face out-of-distribution (OOD) issues during online execution, as collecting new interaction data is costly or risky. As a result, agents must generalize from static, delay-free datasets to dynamic, delay-prone environments. In this work, we propose $\textbf{DT-CORL}$ ($\textbf{D}$elay-$\textbf{T}$ransformer belief policy $\textbf{C}$onstrained $\textbf{O}$ffline $\textbf{RL}$), a novel framework for learning delay-resilient policies solely from static, delay-free offline data. DT-CORL introduces a transformer-based belief model to infer latent states from delayed observations and jointly trains this belief model with a constrained policy objective, ensuring that value estimation and belief representation remain aligned throughout learning. Crucially, our method does not require access to delayed transitions during training and outperforms naive history-augmented baselines, SOTA delayed RL methods, and existing belief-based approaches. Empirically, we demonstrate that DT-CORL achieves strong delay-robust generalization across both locomotion and goal-conditioned tasks in the D4RL benchmark under varying delay regimes. Our results highlight that joint belief-policy optimization is essential for bridging the sim-to-real latency gap and achieving stable performance in delayed environments.
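To make the two ingredients named in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch: a transformer belief encoder that maps a d-step delayed observation plus the d actions executed since then to a latent state estimate, and a policy trained on that belief with a constrained offline-RL surrogate (here a TD3+BC-style term). All module names, dimensions, hyperparameters, and the specific loss form are illustrative assumptions and are not taken from the paper; only the general structure (transformer belief over delayed observations, jointly optimized with a constrained policy objective) follows the abstract.

```python
# Illustrative sketch, not the paper's implementation.
import torch
import torch.nn as nn

class TransformerBelief(nn.Module):
    """Infer a latent (belief) state from a d-step delayed observation and
    the d actions applied since that observation was recorded."""
    def __init__(self, obs_dim, act_dim, delay, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.act_proj = nn.Linear(act_dim, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, delay + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, obs_dim)  # predicted current latent state

    def forward(self, delayed_obs, action_window):
        # delayed_obs: (B, obs_dim); action_window: (B, delay, act_dim)
        tokens = torch.cat([self.obs_proj(delayed_obs).unsqueeze(1),
                            self.act_proj(action_window)], dim=1) + self.pos_emb
        return self.head(self.encoder(tokens)[:, -1])

class Policy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Tanh())
    def forward(self, belief_state):
        return self.net(belief_state)

class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, state_action):
        return self.net(state_action)

def constrained_policy_loss(policy, critic, belief_state, dataset_action, alpha=2.5):
    """TD3+BC-style surrogate (an assumed stand-in for the paper's constraint):
    maximize Q at the belief state while staying close to the dataset action."""
    pi_action = policy(belief_state)
    q = critic(torch.cat([belief_state, pi_action], dim=-1))
    lam = alpha / q.abs().mean().detach()  # adaptive scaling, as in TD3+BC
    return -lam * q.mean() + ((pi_action - dataset_action) ** 2).mean()

if __name__ == "__main__":
    obs_dim, act_dim, delay, B = 17, 6, 4, 32  # arbitrary example dimensions
    belief = TransformerBelief(obs_dim, act_dim, delay)
    policy, critic = Policy(obs_dim, act_dim), Critic(obs_dim, act_dim)
    delayed_obs = torch.randn(B, obs_dim)
    action_window = torch.randn(B, delay, act_dim)
    dataset_action = torch.randn(B, act_dim).clamp(-1, 1)
    loss = constrained_policy_loss(policy, critic,
                                   belief(delayed_obs, action_window), dataset_action)
    loss.backward()  # gradients flow through both the policy and the belief encoder
```

In this sketch, backpropagating the constrained policy loss through the belief encoder is what realizes the "joint belief-policy optimization" idea; how DT-CORL actually couples the two objectives is specified in the paper itself.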