Poster in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)
Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons
Banghua Zhu · Jiantao Jiao · Michael Jordan
Keywords: [ offline reinforcement learning ] [ maximum likelihood estimator ] [ Bradley-Terry-Luce model ] [ Plackett-Luce model ] [ reinforcement learning with human feedback (RLHF) ] [ Pessimism ]