Poster in Workshop: 3rd ICLR Workshop on Machine Learning for Remote Sensing
An Analysis of Multimodal Large Language Models for Object Localization in Earth Observation Imagery
Darryl Hannan · John Cooper · Dylan White · Henry Kvinge · Timothy Doster · Yijing Watkins
Multimodal large language models (MLLMs) have altered the landscape of computer vision, obtaining impressive results across a wide range of tasks, especially in zero-shot settings. Unfortunately, their strong performance does not always transfer to out-of-distribution domains, such as earth observation (EO) imagery. Prior work has demonstrated that MLLMs excel at some EO tasks, such as image captioning and scene understanding, while failing at tasks that require more fine-grained spatial reasoning, such as object localization. However, MLLMs are advancing rapidly, and insights quickly become outdated. In this work, we analyze more recent MLLMs that have been explicitly trained with fine-grained spatial reasoning capabilities, benchmarking them on EO object localization tasks. We demonstrate that these models perform well in certain settings, making them well suited for zero-shot or limited-data scenarios. We then directly compare their performance to a few-shot Faster R-CNN, quantifying the amount of data needed to surpass MLLM performance in various settings. We hope that this work will prove valuable as others evaluate whether an MLLM is worth employing for a given EO task, and that it will encourage further research into utilizing these models in the overhead domain.
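
To make the zero-shot localization setup concrete, below is a minimal sketch of how an MLLM's free-form text output might be parsed into a bounding box and scored against ground truth with intersection-over-union (IoU). The prompt/response format, the `[x1, y1, x2, y2]` pixel-coordinate convention, and the IoU >= 0.5 hit criterion are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import re
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def parse_box(mllm_response: str) -> Optional[Box]:
    """Extract the first '[x1, y1, x2, y2]'-style box from free-form model text."""
    match = re.search(
        r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]",
        mllm_response,
    )
    if match is None:
        return None  # model failed to produce a parseable localization
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    return (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical MLLM response to a localization prompt on an EO image chip.
response = "The storage tank is at [120, 48, 188, 110]."
pred = parse_box(response)
gt: Box = (115.0, 50.0, 190.0, 112.0)
if pred is not None:
    score = iou(pred, gt)
    print(f"IoU = {score:.3f}, hit @0.5 = {score >= 0.5}")
```

Parsing failures (no extractable box) can simply be counted as misses, which keeps the comparison to a conventional detector straightforward.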
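For the few-shot baseline, the following is a sketch of one standard way to fine-tune a COCO-pretrained Faster R-CNN on a handful of labeled EO chips using torchvision. The single-category setup, the synthetic one-shot batch, and the optimizer hyperparameters are assumptions for illustration; the paper's actual training configuration may differ.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a COCO-pretrained Faster R-CNN and swap the box head for EO classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
num_classes = 2  # background + one EO object category (assumption)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

model.train()
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=5e-3, momentum=0.9, weight_decay=5e-4,
)

# A synthetic one-image "shot" standing in for the k labeled EO chips.
images = [torch.rand(3, 256, 256)]
targets = [{
    "boxes": torch.tensor([[30.0, 40.0, 120.0, 150.0]]),
    "labels": torch.tensor([1]),
}]

for _ in range(2):  # a couple of gradient steps on the toy example
    loss_dict = model(images, targets)  # dict of classification/box losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Sweeping the number of labeled chips in such a setup is one way to quantify how much data a conventional detector needs before it overtakes zero-shot MLLM localization.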