KGLMQA: Enhancing Medical Visual Question Answering with Knowledge Graphs and LLMs
Abstract
Medical Visual Question Answering (MedVQA) integrates computer vision and natural language processing to assist clinical decision-making and reduce diagnostic errors. However, existing MedVQA models often suffer from limited multimodal feature interaction, insufficient integration of medical knowledge, and a lack of diagnostic logic in their answers. To address these challenges, we propose KGLMQA, a novel framework that integrates knowledge graphs with Large Language Models (LLMs) to enhance MedVQA performance. KGLMQA consists of three modules: a MedVQA classification model that employs a gating mechanism and multi-stage feature fusion, a Knowledge Graph Retrieval-Augmented Generation (KGRAG) module that dynamically retrieves and refines medical knowledge, and an LLM that generates professional, semantically aware answers. Experimental results on the P-VQA, VQA-RAD, and SLAKE datasets show that KGLMQA achieves state-of-the-art accuracy and precision, with particular strength on open-ended questions. Case analyses further demonstrate that, compared with baseline LLMs such as ChatGPT and DeepSeek-V3, KGLMQA produces answers with markedly better diagnostic logic and completeness. These results indicate that integrating visual diagnosis, structured medical knowledge, and LLMs can effectively advance MedVQA systems toward real clinical applications.