ARINBEV: Bird's-Eye View Layout Estimation with Conditional Autoregressive Model
Jiyong Kwag · Charles Toth · Alper Yilmaz
Abstract
Recent advances in Bird’s Eye View (BEV) layout estimation have advanced through refinements in architectural and geometric design. However, existing methods often overlook the structured relationships among traffic elements. Components such as drivable areas, lane dividers, and pedestrian crossings constitute an interdependent system governed by civil engineering standards. For instance, stop lines precede crosswalks, which align with sidewalks, while lane dividers follow road curvature. To capture these interdependencies, we propose \textbf{ARINBEV}, an autoregressive model for BEV map estimation. Unlike prior generative approaches that rely on complex multiphase training or encoder-decoder architectures, ARINBEV employs a single-stage, decoder-only autoregressive design. This architecture enables semantically consistent BEV map estimation. On nuScenes and Argoverse2, ARINBEV attains 64.3 and 65.6 mIoU, respectively, while using $1.7\times$ fewer parameters and training $1.8\times$ faster than state-of-the-art models.
Successful Page Load