Poster in Workshop: ICLR 2025 Workshop on Human-AI Coevolution
R-LLAVA: IMPROVING MED-VQA UNDERSTANDING THROUGH VISUAL REGION OF INTEREST
Xupeng Chen · Zhixin Lai · Kangrui Ruan · Shichu Chen · Jiaxiang Liu · Zuozhu Liu
Artificial intelligence has made significant strides in medical visual question answering (Med-VQA), yet prevalent studies often interpret images holistically, overlooking the visual regions of interest that may contain crucial information, potentially aligning with a doctor's prior knowledge that can be incorporated with minimal annotations (e.g., bounding boxes). To address this gap, this paper introduces R-LLaVA, designed to enhance biomedical VQA understanding by integrating simple medical annotations as prior knowledge directly into the image space through CLIP. These annotated visual regions of interest are then fed into the LLaVA model during training, aiming to enrich the model's understanding of biomedical queries. Experimental evaluation on four standard Med-VQA datasets demonstrates R-LLaVA's superiority over existing state-of-the-art (SoTA) methods. Additionally, to verify the model's capability in visual comprehension, a novel multiple-choice medical visual understanding dataset is introduced, confirming the positive impact of focusing on visual regions of interest in advancing biomedical VQA understanding.
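The abstract does not include code, so the following minimal Python sketch only illustrates one plausible reading of "integrating annotations directly into the image space": doctor-provided bounding boxes are drawn onto the pixels before the image reaches the CLIP vision encoder. The `overlay_roi` helper, the placeholder X-ray, and the example box coordinates are all hypothetical, not taken from the paper.

```python
from PIL import Image, ImageDraw


def overlay_roi(image: Image.Image, boxes, color=(255, 0, 0), width=3) -> Image.Image:
    """Draw bounding-box annotations onto the image itself, so the
    region of interest becomes part of the pixel input to the vision encoder."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for (x0, y0, x1, y1) in boxes:
        draw.rectangle([x0, y0, x1, y1], outline=color, width=width)
    return annotated


# Hypothetical example: a placeholder chest X-ray with one annotated region.
xray = Image.new("RGB", (512, 512), "black")        # stand-in for a real scan
roi_boxes = [(120, 200, 260, 340)]                  # assumed doctor annotation
annotated_xray = overlay_roi(xray, roi_boxes)

# In a pipeline of the kind the abstract describes, `annotated_xray` would then
# be encoded by a CLIP vision encoder and passed to LLaVA during training.
```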