

Poster in Workshop: ICLR 2025 Workshop on Human-AI Coevolution

R-LLAVA: IMPROVING MED-VQA UNDERSTANDING THROUGH VISUAL REGION OF INTEREST

Xupeng Chen · Zhixin Lai · Kangrui Ruan · Shichu Chen · Jiaxiang Liu · Zuozhu Liu


Abstract:

Artificial intelligence has made significant strides in medical visual question answering (Med-VQA), yet prevalent studies often interpret images holistically, overlooking the visual regions of interest that may contain crucial information, potentially aligning with a doctor's prior knowledge that can be incorporated with minimal annotations (e.g., bounding boxes). To address this gap, this paper introduces R-LLaVA, designed to enhance biomedical VQA understanding by integrating simple medical annotations as prior knowledge directly into the image space through CLIP. These annotated visual regions of interest are then fed into the LLaVA model during training, aiming to enrich the model's understanding of biomedical queries. Experimental evaluation on four standard Med-VQA datasets demonstrates R-LLaVA's superiority over existing state-of-the-art (SoTA) methods. Additionally, to verify the model's capability in visual comprehension, a novel multiple-choice medical visual understanding dataset is introduced, confirming the positive impact of focusing on visual regions of interest in advancing biomedical VQA understanding.
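
As a rough illustration of the idea described in the abstract, the sketch below crops a bounding-box region of interest from a medical image and encodes both the full image and the ROI with a CLIP vision encoder. This is not the authors' released implementation: the checkpoint name, the (x0, y0, x1, y1) annotation format, and the encode_with_roi helper are assumptions made for illustration, and the step that merges these visual tokens before the LLaVA projector/language model is omitted.

```python
# Hypothetical sketch (not the paper's code): encode a full image and an
# annotated region of interest with a CLIP vision encoder, so the ROI can be
# injected alongside the holistic image features during LLaVA training.
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed checkpoint; LLaVA-style models commonly use a CLIP ViT vision tower.
checkpoint = "openai/clip-vit-large-patch14"
processor = CLIPImageProcessor.from_pretrained(checkpoint)
vision_encoder = CLIPVisionModel.from_pretrained(checkpoint)

def encode_with_roi(image_path, bbox):
    """bbox = (x0, y0, x1, y1): a doctor's bounding-box annotation (assumed format)."""
    image = Image.open(image_path).convert("RGB")
    roi = image.crop(bbox)  # the visual region of interest

    # Preprocess and encode both views in one batch.
    inputs = processor(images=[image, roi], return_tensors="pt")
    outputs = vision_encoder(**inputs)

    # Patch-level features for the full image and the ROI; downstream these
    # would be projected and fed to the language model as visual tokens.
    full_feats = outputs.last_hidden_state[0]
    roi_feats = outputs.last_hidden_state[1]
    return full_feats, roi_feats
```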
