

Poster in Workshop: Self-Improving Foundation Models Without Human Supervision

Scalable Thompson Sampling via Ensemble++

Yingru Li · Jiawei Xu · Baoxiang Wang · Zhi-Quan Luo

Keywords: [ Online Learning ] [ Foundation Model ] [ Sequential Decision Making ] [ Self-improvement ]


Abstract: Thompson Sampling is a principled uncertainty-driven method for active exploration, but its real-world adoption is impeded by the high computational overhead of posterior maintenance in large-scale or non-conjugate settings. Ensemble-based approaches offer partial remedies but often require a large ensemble size. This paper proposes Ensemble++, a scalable agent that sidesteps these limitations through a shared-factor ensemble update architecture and a random linear combination scheme. We theoretically justify that in linear bandits, the Ensemble++ agent needs an ensemble size of only $\Theta(d \log T)$ to achieve regret guarantees comparable to exact Thompson Sampling. Further, to handle nonlinear rewards and complex environments, we introduce a neural extension that replaces fixed features with a learnable representation, preserving the same underlying objective via gradient-based updates. Empirical results confirm that the Ensemble++ agent excels in both sample efficiency and computational scalability across linear and nonlinear environments, including GPT-based contextual bandits for automated content moderation, a safety-critical online decision-making task for foundation models.
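To make the general mechanism concrete, below is a minimal Python sketch of the ensemble-plus-random-combination idea in a linear bandit: shared ridge statistics are updated as usual, each ensemble member carries an incrementally perturbed factor, and an approximate posterior sample is drawn as a random linear combination of those factors. All variable names, the Gaussian index distribution `zeta`, and the per-member noise injection are illustrative assumptions for this sketch; the paper's exact shared-factor update architecture is specified in the paper itself, not here.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): approximate
# Thompson Sampling in a d-dimensional linear bandit by maintaining M
# ensemble perturbation factors and drawing a posterior-like sample as a
# random linear combination of them.

rng = np.random.default_rng(0)
d, M, T = 8, 32, 2000            # feature dim, ensemble size, horizon
lam, noise = 1.0, 0.1            # ridge regularizer, reward noise scale

theta_true = rng.normal(size=d) / np.sqrt(d)  # unknown parameter

A = lam * np.eye(d)              # shared precision (regularized Gram matrix)
b = np.zeros(d)                  # shared response vector
B = rng.normal(size=(d, M))      # perturbation factor, one column per member

for t in range(T):
    arms = rng.normal(size=(10, d))               # 10 candidate actions
    theta_hat = np.linalg.solve(A, b)             # ridge point estimate
    zeta = rng.normal(size=M) / np.sqrt(M)        # random combination weights
    # Approximate posterior draw: point estimate plus a randomly
    # combined, precision-whitened ensemble perturbation.
    theta_sample = theta_hat + np.linalg.solve(A, B @ zeta)
    x = arms[np.argmax(arms @ theta_sample)]      # act greedily on the sample
    r = x @ theta_true + noise * rng.normal()     # observe noisy reward
    # Incremental updates: shared statistics plus per-member noise injection.
    A += np.outer(x, x)
    b += r * x
    B += np.outer(x, noise * rng.normal(size=M))
```

Because all members share the statistics `A` and `b` and differ only through the columns of `B`, one fresh draw of `zeta` per step yields a new sample without refitting M separate models, which is the computational advantage the shared-factor design targets.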
