MDPENet: Multimodal-driven prototype evolving network for few-shot semantic segmentation
Abstract
Few-shot Semantic Segmentation (FSS) aims to predict the masks of novel targets given only a few labeled samples. Prototype learning is a commonly used approach in FSS: prototype vectors extracted from known categories (support images) are transferred to novel categories (query images) to predict the masks of unseen objects. Despite their success, prototype-based FSS methods still suffer from prototype bias and insufficient utilization of the limited available information. In this work, we propose a Multimodal-Driven Prototype Evolving Network (MDPENet) to alleviate these problems. Our method comprises three main components: the Support Feature Enhancement Module (SFEM), the Query Feature Disentanglement Module (QFDM), and the Prototype Evolution Module (PEM). Concretely, the SFEM first establishes multimodal interaction between the text label features encoded by CLIP and the separated support foreground features, improving the reliability of the support foreground features. The QFDM then combines the CLIP-encoded text label features with the support foreground features to disentangle the query features, reducing mutual interference between different semantics within them. Finally, the PEM generates a fine-grained prototype set from the enhanced support foreground features and the disentangled query foreground features. Extensive experiments on the benchmark datasets PASCAL-5^i and COCO-20^i demonstrate the superiority of MDPENet over classical FSS methods.
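To make the three-stage pipeline described above concrete, the following is a minimal PyTorch sketch of the data flow (support feature enhancement with text features, query feature disentanglement, then prototype generation). It is not the authors' implementation: the module interfaces, the use of masked average pooling, the similarity-threshold disentanglement, and the random tensors standing in for backbone and CLIP text features are all assumptions made for illustration.

```python
# Illustrative sketch of the SFEM -> QFDM -> PEM flow; all interfaces are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


def masked_avg_pool(feat, mask):
    """Pool a C x H x W feature map over the masked region into a C vector."""
    mask = F.interpolate(mask[None, None].float(), size=feat.shape[-2:], mode="nearest")[0]
    return (feat * mask).sum(dim=(1, 2)) / (mask.sum() + 1e-6)


class SFEM(nn.Module):
    """Fuse CLIP text-label features with the separated support foreground features."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, support_fg, text_feat):
        return self.fuse(torch.cat([support_fg, text_feat], dim=-1))


class QFDM(nn.Module):
    """Split query features into foreground/background parts by similarity
    to the text-guided support prototype (a simplified stand-in)."""
    def forward(self, query_feat, guide):
        sim = F.cosine_similarity(query_feat, guide[:, None, None], dim=0)
        fg_mask = (sim > 0).float()
        return query_feat * fg_mask, query_feat * (1 - fg_mask), fg_mask


class PEM(nn.Module):
    """Build a small prototype set from the enhanced support prototype and
    the disentangled query foreground features (simple pooled prototypes here)."""
    def forward(self, support_proto, query_fg, fg_mask):
        query_proto = masked_avg_pool(query_fg, fg_mask)
        return torch.stack([support_proto, query_proto], dim=0)


if __name__ == "__main__":
    C, H, W = 512, 32, 32
    support_feat = torch.randn(C, H, W)         # backbone features of a support image
    support_mask = torch.randint(0, 2, (H, W))  # its ground-truth mask
    query_feat = torch.randn(C, H, W)           # backbone features of the query image
    text_feat = torch.randn(C)                  # class-label embedding (e.g., from CLIP's text encoder)

    support_fg = masked_avg_pool(support_feat, support_mask)
    enhanced = SFEM(C)(support_fg, text_feat)            # stage 1: support feature enhancement
    q_fg, q_bg, fg_mask = QFDM()(query_feat, enhanced)   # stage 2: query feature disentanglement
    prototypes = PEM()(enhanced, q_fg, fg_mask)          # stage 3: prototype set generation
    print(prototypes.shape)                              # torch.Size([2, 512])
```

In this simplified view, the prototype set would then be matched against the query feature map to produce the final segmentation mask.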