GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL
Abstract
Offline Safe Reinforcement Learning (OSRL) aims to learn a policy that achieves high performance in sequential decision-making while satisfying safety constraints, using only pre-collected datasets. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process in which GMs generate desirable actions conditioned on predefined reward and cost return-to-go values. However, GM-assisted methods face two major challenges in constrained settings: (1) they lack the ability to ``stitch'' optimal transitions from suboptimal trajectories within the dataset, and (2) they struggle to balance reward maximization with constraint satisfaction, particularly when evaluated under imbalanced, human-specified reward-cost conditions. To address these issues, we propose Goal-Assisted Stitching (GAS), a novel algorithm that enhances stitching capability while effectively balancing reward maximization and constraint satisfaction. To improve stitching, GAS first augments and relabels the dataset at the transition level, enabling high-quality trajectories to be constructed from suboptimal ones. GAS also introduces goal functions that estimate the optimal achievable reward and cost goals from the dataset. These goal functions, trained with expectile regression on the relabeled and augmented dataset, allow GAS to accommodate a broader range of reward-cost return pairs and achieve a better tradeoff between reward maximization and constraint satisfaction than human-specified values. The estimated goals then guide policy training, ensuring robust performance in constrained settings. Furthermore, to improve training stability and efficiency, we reshape the dataset to yield a more uniform reward-cost return distribution. Empirical results validate the effectiveness of GAS, demonstrating superior performance in balancing reward maximization and constraint satisfaction compared to existing methods.
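As a concrete illustration of the expectile-regression goal functions mentioned above, the following is a minimal sketch under assumed notation; the symbols $G_r$, $G_c$, $\tau_r$, $\tau_c$, and the state-only conditioning are illustrative assumptions and may differ from the exact formulation used by GAS.
\begin{align}
  % Standard asymmetric expectile loss (as in expectile regression / IQL-style value learning)
  L^{\tau}(u) &= \left|\tau - \mathbb{1}(u < 0)\right|\, u^{2}, \\
  % Assumed reward goal function: a high expectile tau_r biases G_r toward the
  % highest reward return-to-go R observed from state s in the relabeled, augmented dataset D
  \mathcal{L}_{r} &= \mathbb{E}_{(s, R) \sim \mathcal{D}}\!\left[ L^{\tau_r}\!\big(R - G_{r}(s)\big) \right], \\
  % Assumed cost goal function: a low expectile tau_c biases G_c toward the
  % lowest cost return-to-go C achievable from state s
  \mathcal{L}_{c} &= \mathbb{E}_{(s, C) \sim \mathcal{D}}\!\left[ L^{\tau_c}\!\big(C - G_{c}(s)\big) \right].
\end{align}
Under this reading, the estimated pair $\big(G_r(s), G_c(s)\big)$ would replace human-specified return-to-go conditions when guiding the conditional generative policy.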