FCCR: Fine-grained cultural image captioning via knowledge-driven self-refinement
Abstract
Despite the remarkable progress of Vision-Language Models (VLMs), existing image captioning systems still struggle to accurately recognize and describe fine-grained cultural elements. These limitations primarily stem from the lack of culturally diverse training data and the absence of structured cultural knowledge. To address these challenges, we propose FCCR (Fine-Grained Cultural Image Captioning via Knowledge-Driven Self-Refinement), a novel framework that iteratively refines captions using structured cultural knowledge to generate culturally precise and expressive descriptions. Central to FCCR is a self-refinement mechanism guided by a scoring function that extends beyond conventional natural language feedback. By leveraging structured and fine-grained cultural data, the model produces semantically rich feedback and performs multi-step refinement. To facilitate this process, we introduce FACA-1K (Fine-grained Attributes for Cultural Architecture), a curated dataset consisting of 1,000 high-resolution images annotated with expert-level architectural attributes. Furthermore, we develop a custom scoring function, the Cultural Awareness Score (CAS), to quantitatively assess the cultural appropriateness of generated captions, along with a new metric, the Keyword Awareness Rate (KAR), which measures the incorporation of culturally specific terminology. Experimental results demonstrate that FCCR significantly outperforms baseline models on both CAS and KAR, generating captions that are richer and more culturally accurate. This study lays a solid foundation for culturally aware captioning systems that can capture and preserve fine-grained cultural contexts.
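As one possible reading of the KAR metric described above (the abstract does not give its exact formula), the following minimal Python sketch computes the rate at which expert-annotated cultural keywords appear in a generated caption. The function names, the substring-based keyword matching, and the macro-averaging over images are illustrative assumptions, not the paper's definition.

```python
# Minimal sketch of a Keyword Awareness Rate (KAR) computation.
# Assumption: KAR is the fraction of expert-annotated cultural keywords
# that appear in the generated caption, averaged over the dataset.
# The exact definition used by FCCR may differ.

from typing import Iterable


def caption_kar(caption: str, keywords: Iterable[str]) -> float:
    """Fraction of cultural keywords mentioned in a single caption."""
    text = caption.lower()
    kws = [k.lower() for k in keywords]
    if not kws:
        return 0.0
    hits = sum(1 for k in kws if k in text)
    return hits / len(kws)


def corpus_kar(captions: list[str], keyword_sets: list[list[str]]) -> float:
    """Macro-average of per-caption KAR over a dataset."""
    scores = [caption_kar(c, k) for c, k in zip(captions, keyword_sets)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    caption = "A Joseon-era palace gate with a dancheong-painted bracket system."
    keywords = ["dancheong", "bracket system", "giwa roof"]
    print(f"KAR = {caption_kar(caption, keywords):.2f}")  # 2 of 3 keywords -> 0.67
```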