VistoGen: A compact multi-task vision-language model for unified histopathology report generation, survival prediction, and tumor grading
Abstract
Deep learning has significantly advanced histopathology image analysis, aiding cancer diagnosis, prognosis, and precision medicine, particularly through the emergence of Vision-Language Models (VLMs). However, the inherent complexity of pathological images, the need for domain-specific expertise, and the computational demands of large, fragmented models make it challenging to automate tasks such as report generation, tumor grading, and patient survival prediction. This study proposes VistoGen, a novel multi-task VLM designed to overcome these limitations. To address the computational demands, VistoGen is a compact, lightweight (256M-parameter) dual-encoder framework. To handle the complexity of pathological images, it employs a shape-optimized vision encoder pre-trained on a large corpus of histopathology images. To embed domain-specific expertise, we introduce a two-stage adaptation strategy: first, fine-tuning on whole-slide image (WSI)-caption pairs to learn domain-specific visual-linguistic mappings, and second, applying robust multi-task optimization with specialized heads for report generation, survival prediction, and tumor grading. We evaluate VistoGen extensively on WSI patch datasets: PatchGastricADC22 for report generation, and TCGA-GBMLGG and TCGA-KIRC for cancer grade classification and patient survival prediction. Experimental results demonstrate that VistoGen achieves state-of-the-art performance across all tasks, surpassing previous methods with a C-Index of 0.782 for survival analysis on TCGA-KIRC and a C-Index of 0.8322 for survival analysis on TCGA-GBMLGG. Furthermore, for histopathology captioning, VistoGen establishes a new state of the art with a ROUGE-L score of 0.665 on the PatchGastricADC22 dataset.
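The abstract describes a shared backbone feeding three specialized task heads. The following minimal PyTorch sketch illustrates how such heads could be attached to a fused visual-linguistic representation; all module names, dimensions, and the pooling choice are illustrative assumptions, not VistoGen's actual implementation.

```python
import torch
import torch.nn as nn


class MultiTaskHeads(nn.Module):
    """Hypothetical sketch of VistoGen-style multi-task heads over a shared
    fused representation. Sizes and design choices are assumptions."""

    def __init__(self, hidden_dim: int = 512, vocab_size: int = 30522,
                 num_grades: int = 4):
        super().__init__()
        # Report generation: per-token logits over the vocabulary
        # (in practice this would sit on top of an autoregressive decoder).
        self.report_head = nn.Linear(hidden_dim, vocab_size)
        # Survival prediction: one risk score per slide, typically trained
        # with a Cox partial-likelihood or ranking loss scored by C-Index.
        self.survival_head = nn.Linear(hidden_dim, 1)
        # Tumor grading: standard classification over grade categories.
        self.grading_head = nn.Linear(hidden_dim, num_grades)

    def forward(self, fused: torch.Tensor) -> dict[str, torch.Tensor]:
        # fused: (batch, seq_len, hidden_dim) multimodal encoder features.
        pooled = fused.mean(dim=1)  # simple mean pooling over tokens
        return {
            "report_logits": self.report_head(fused),   # (batch, seq, vocab)
            "risk_score": self.survival_head(pooled),   # (batch, 1)
            "grade_logits": self.grading_head(pooled),  # (batch, num_grades)
        }


# Toy usage with random features standing in for encoder outputs.
heads = MultiTaskHeads()
fused = torch.randn(2, 196, 512)
outputs = heads(fused)
print({k: tuple(v.shape) for k, v in outputs.items()})
```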