ConBGAT: a novel model combining convolutional neural networks, transformer and graph attention network for information extraction from scanned image

PeerJ Computer Science


Introduction

  • We propose ConBGAT, a new model for extracting information from scanned images. ConBGAT combines features from two advanced models: a CNN for image feature extraction and a Transformer (DistilBERT) for text feature extraction.

  • The proposed model utilizes graph modeling techniques to represent the interrelations among regions within text images, enabling the model to effectively capture the complex relationships between components.

  • We perform comprehensive experiments and multidimensional evaluations on real-world datasets, comparing the performance of our proposed model, ConBGAT, against existing methods.

Background theory

Convolutional neural networks

The transformer

Graph neural network

Proposed methods

The proposed ConBGAT model

Extraction and graph neural networks model

Word recognition

Feature extraction

where $a^{(k)}_{1:i} = [a^{(k)}_1, a^{(k)}_2, \ldots, a^{(k)}_i]$ is the input sequence, padded with $a^{(k)}_1 = [\mathrm{CLS}]$. The [CLS] token, used for capturing the full sequence context, was first introduced by Devlin et al. (2018). $TEmb^{(k)}_{1:i} = [TEmb^{(k)}_1, TEmb^{(k)}_2, \ldots, TEmb^{(k)}_i] \in \mathbb{R}^{i \times d_{model}}$ represents the output sequence embeddings, where $d_{model}$ is the dimension of the model. $TEmb^{(k)}_i$ represents the $i$th output of the pre-trained DistilBERT model for the $k$th document, and $\Theta_{DistilBERT}$ represents the parameters of the pre-trained DistilBERT model. Since each sentence is encoded independently, we obtain the text embeddings of document $\beta$ with $\eta$ sentences or text paragraphs. We define it according to Eq. (2).
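The role of the [CLS] embedding can be sketched with a toy Transformer encoder standing in for DistilBERT (all class names and sizes here are hypothetical, chosen only for illustration): a learnable [CLS] vector is prepended to each token sequence, and its final hidden state serves as the sentence embedding $TEmb^{(k)}_i$.

```python
import torch
import torch.nn as nn

# Toy stand-in for DistilBERT (hypothetical sizes): a Transformer encoder
# whose first position holds a learnable [CLS] token; the final hidden
# state at that position summarizes the whole sequence, as in Devlin
# et al. (2018).
class ClsTextEncoder(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # [CLS] embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)  # prepend [CLS]
        x = torch.cat([cls, x], dim=1)
        h = self.encoder(x)                       # (batch, seq_len + 1, d_model)
        return h[:, 0]                            # [CLS] state: (batch, d_model)

enc = ClsTextEncoder()
emb = enc(torch.randint(0, 1000, (3, 12)))  # three sentences of 12 tokens
print(emb.shape)  # torch.Size([3, 64])
```

Each sentence is encoded independently, so stacking the per-sentence [CLS] states yields the text embedding matrix for a document.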

where $b^{(k)}_{0:i} = [b^{(k)}_0, b^{(k)}_1, \ldots, b^{(k)}_i]$ denotes the input image segments, prepended with $b^{(k)}_0 = \mathrm{fullimg}$. We utilize $b^{(k)}_0$ to represent the overall structural characteristics of the document image. $b^{(k)}_i \in \mathbb{R}^{H \times W \times 3}$ represents the $i$th image segment of the $k$th document, where $H$ and $W$ are the height and width of the image, respectively. $IEmb^{(k)}_{0:i} = [IEmb^{(k)}_0, IEmb^{(k)}_1, \ldots, IEmb^{(k)}_n] \in \mathbb{R}^{n \times d_{model}}$ represents the output image embeddings, where $d_{model}$ is the dimension of the model. We use a CNN variant, ResNet50 (He et al., 2016), followed by a fully connected layer that resizes the output to $d_{model}$. $IEmb^{(k)}_i$ represents the $i$th output of the CNN model for the $k$th document, and $\Theta_{CNN}$ denotes the parameters of the CNN model. By encoding each segment independently, we obtain the image embeddings of document $\beta$. We define it according to Eq. (4).
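The image branch can be sketched as follows, with a small CNN standing in for ResNet50 (class name and layer sizes are hypothetical): each crop, plus the full-page image at index 0, passes through the CNN, and a fully connected layer resizes the feature vector to $d_{model}$ so it aligns with the text embeddings.

```python
import torch
import torch.nn as nn

# Minimal sketch of the image branch (toy CNN standing in for ResNet50,
# sizes hypothetical): encode the full image b_0 and each word crop b_i,
# then resize features to d_model with a fully connected layer.
class SegmentImageEncoder(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),        # global pooling: accepts any H x W
        )
        self.fc = nn.Linear(16, d_model)    # resize features to d_model

    def forward(self, segments):               # (n, 3, H, W); index 0 = full image
        feats = self.cnn(segments).flatten(1)  # (n, 16)
        return self.fc(feats)                  # IEmb: (n, d_model)

enc = SegmentImageEncoder()
iemb = enc(torch.rand(5, 3, 32, 96))  # full image + four word crops
print(iemb.shape)  # torch.Size([5, 64])
```

In the actual model, ResNet50's pooled 2048-dimensional features would replace the toy CNN's output before the fully connected resize.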

Graph modeling

where $D_L$, $D_T$, $D_R$, and $D_B$ denote the relative distances from the word $Box_{root}$ (root box) to the neighboring boxes on its left, above, on its right, and below, respectively.
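A minimal sketch of these spatial edge features (function name and sign conventions are assumptions for illustration): given a root box and a neighbor in `(x_min, y_min, x_max, y_max)` pixel coordinates, compute the four relative distances.

```python
# Hypothetical helper: relative distances D_L, D_T, D_R, D_B from a root
# box to a neighboring box. Boxes are (x_min, y_min, x_max, y_max); the
# exact sign conventions here are an assumption for illustration.
def relative_distances(root, neighbor):
    dl = root[0] - neighbor[2]   # gap to a neighbor on the left
    dt = root[1] - neighbor[3]   # gap to a neighbor above
    dr = neighbor[0] - root[2]   # gap to a neighbor on the right
    db = neighbor[1] - root[3]   # gap to a neighbor below
    return dl, dt, dr, db

root = (100, 50, 180, 70)
left_word = (10, 50, 90, 70)
print(relative_distances(root, left_word))  # (10, -20, -170, -20)
```

A positive distance indicates a neighbor lying in that direction; negative values arise when the boxes overlap along that axis, which lets the graph distinguish same-line from cross-line neighbors.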

Graph attention network models

where $a \in \mathbb{R}^{2F'}$ is a learned weight vector, $W$ is a learned weight matrix, and $\|$ is the concatenation operator that joins two vectors. A softmax then normalizes the attention weights so that they sum to 1.

where $N_u$ represents the set of vertices adjacent to vertex $u$.

where $\delta$ is a sigmoid function.
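One graph-attention step, as described above, can be sketched in numpy (toy sizes, all variable names hypothetical): unnormalized scores are computed with the learned vector $a$ over concatenated transformed features, softmax-normalized over the neighborhood $N_u$ so they sum to 1, and the aggregated message passes through the sigmoid $\delta$.

```python
import numpy as np

# Sketch of one GAT attention step (toy sizes): scores
# e_uv = LeakyReLU(a^T [W h_u || W h_v]) are softmaxed over N_u,
# and the weighted neighbor aggregation passes through a sigmoid.
rng = np.random.default_rng(0)
F_in, F_out, n = 4, 3, 5
H = rng.normal(size=(n, F_in))        # node features h_u
W = rng.normal(size=(F_in, F_out))    # learned weight matrix
a = rng.normal(size=2 * F_out)        # learned attention vector, in R^{2F'}

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gat_update(u, neighbors):
    Wh = H @ W
    e = np.array([leaky_relu(a @ np.concatenate([Wh[u], Wh[v]]))
                  for v in neighbors])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                         # softmax over N_u: sums to 1
    return sigmoid(alpha @ Wh[neighbors]), alpha

h_u, alpha = gat_update(0, np.array([1, 2, 3]))
print(h_u)  # updated feature of node 0; each entry lies in (0, 1)
```

Stacking this update for every vertex, with multiple attention heads, gives the full GAT layer.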

Loss function and optimization

Implementation and experimental result

Dataset

Data processing

Implementation details

Evaluation metrics

where TP, FP, and FN denote True Positives, False Positives, and False Negatives, respectively.
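These metrics follow directly from the counts; a minimal sketch (with made-up example counts):

```python
# Precision, recall, and F1 from TP/FP/FN counts (example numbers are
# illustrative only, not results from the article).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.80 R=0.80 F1=0.80
```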

Experimental results and discussion

Detailed of results

Conclusion

Supplemental Information

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Duy Ho Vo Hoang performed the experiments, performed the computation work, prepared figures and/or tables, and approved the final draft.

Huy Vo Quoc analyzed the data, prepared figures and/or tables, and approved the final draft.

Bui Thanh Hung conceived and designed the experiments, performed the experiments, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The CORD raw data are available in the Supplemental File. They came from the CORD Dataset: Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., & Lee, H. (2019). CORD: a consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019.

The FUNSD Dataset is available at: Jaume, G., Ekenel, H. K., & Thiran, J. P. (2019, September). Funsd: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (Vol. 2, pp. 1-6). IEEE. https://doi.org/10.48550/arXiv.1905.13538.

The SROIE Dataset is available at: Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., & Jawahar, C. V. (2019, September). Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 1516-1520). IEEE. https://doi.org/10.1109/ICDAR.2019.00244.

Funding

The authors received no funding for this work.
