RDTextSpotter: Enhancing text spotting with resampling-driven query update in transformer
Abstract
End-to-end scene text spotting methods based on the DETR framework have recently achieved significant success, showing that efficient text spotters can be designed around query initialization and update. However, these methods rely solely on the features sampled during the query initialization phase and therefore fail to fully exploit the synergy between the detection and recognition tasks. This results in performance degradation and poor generalization. To address these limitations, we propose RDTextSpotter, which employs a resampling-driven query update technique to optimize the synergy between text detection and recognition. Specifically, RDTextSpotter is built on three core components. First, enhanced query initialization selects bounding boxes that account for both text category and positional information, improving the quality of the initial queries. Second, after the initial modeling of text queries, a resampling-driven query update module resamples and updates the text queries in subsequent decoder layers based on the iteratively refined detection boxes, optimizing the synergy between the two tasks. Finally, during the prediction phase, a weighted query fusion module improves the stability of Hungarian matching. Extensive experiments show that RDTextSpotter outperforms state-of-the-art methods in both quantitative and qualitative evaluations.
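The idea of resampling text queries from refined detection boxes can be illustrated with a toy sketch. It is not the RDTextSpotter implementation. It assumes a single-channel feature map, axis-aligned boxes, sampling points along the horizontal center line of the box, and a simple blending rule with a hypothetical weight `alpha`. The function names `bilinear`, `resample_query`, and `update_query` are likewise hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in feature-map coordinates


def bilinear(feat: List[List[float]], x: float, y: float) -> float:
    """Bilinearly interpolate a 2D single-channel feature map at (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp the sampling location to the valid range of the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def resample_query(feat: List[List[float]], box: Box, n_points: int = 4) -> List[float]:
    """Sample n_points evenly spaced along the horizontal center line of the box."""
    x0, y0, x1, y1 = box
    cy = (y0 + y1) / 2
    xs = [x0 + (x1 - x0) * (i + 0.5) / n_points for i in range(n_points)]
    return [bilinear(feat, x, cy) for x in xs]


def update_query(query: List[float], sampled: List[float], alpha: float = 0.5) -> List[float]:
    """Blend the previous query with resampled features (assumed update rule)."""
    return [(1 - alpha) * q + alpha * s for q, s in zip(query, sampled)]


# Toy feature map in which each value equals its column index.
feat = [[float(x) for x in range(8)] for _ in range(4)]

# Initial query sampled from the first box estimate.
query = resample_query(feat, (0.0, 0.0, 4.0, 2.0))

# A later decoder layer refines the box; the query is resampled and updated.
refined_box = (2.0, 0.0, 6.0, 2.0)
query = update_query(query, resample_query(feat, refined_box))
```

In this sketch the query tracks the refined box rather than staying tied to the features sampled at initialization, which is the behavior the abstract attributes to the resampling-driven update.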