bioPDFX: preparing PDF scientific articles for biomedical text mining

Department of Computer Science and Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California, United States
Health System Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California, United States
DOI
10.7287/peerj.preprints.2993v1
Subject Areas
Bioinformatics, Data Science, Databases, Digital Libraries, World Wide Web and Web Science
Keywords
bioPDFX, PDF conversion, PDF transcribing, biomedical text mining, natural language processing
Copyright
© 2017 Bhargava et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Bhargava S, Kuo T, Goyal A, Kuri V, Lin G, Hsu C. 2017. bioPDFX: preparing PDF scientific articles for biomedical text mining. PeerJ Preprints 5:e2993v1

Abstract

Background. There is a huge amount of full-text biomedical literature available in public repositories such as PubMed Central (PMC). However, a substantial number of the papers are in Portable Document Format (PDF) and do not provide a plain-text format ready for text mining and natural language processing (NLP). Although many PDF-to-text converters exist, they still face several challenges when processing biomedical PDFs, such as correctly transcribing titles and abstracts, segmenting references and acknowledgements, handling special characters, avoiding jumbling errors (text emitted in the wrong order), and detecting word boundaries.

Methods. In this paper, we present bioPDFX, a novel tool that combines multiple state-of-the-art methods so that the strengths of each offset the weaknesses of the others, and then applies machine learning methods to address all of the issues above.

Results. The experimental results on publications of Genome-Wide Association Studies (GWAS) demonstrated that bioPDFX significantly improved the quality of the XML compared with a state-of-the-art PDF-to-XML converter, leading to a biomedical database more suitable for text mining.

Discussion. Overall, the pipeline developed in this paper makes published literature in the form of PDF files much better suited to text mining tasks, while also slightly improving the overall text quality. The service is freely accessible at http://textmining.ucsd.edu:9000 . A list of the PubMed Central IDs of the 941 articles used in this study (see Supplemental File 1) is available for download at the same URL. Instructions for running the service with a PubMed ID are described in Supplemental File 2.

Author Comment

This is a submission to PeerJ Computer Science for review.

Supplemental Information

Supplemental File 1

Raw data: a list of the PubMed Central IDs of the 941 articles.

DOI: 10.7287/peerj.preprints.2993v1/supp-1

Supplemental File 2

Instructions for running the bioPDFX service with a PubMed ID.

DOI: 10.7287/peerj.preprints.2993v1/supp-2