Data-driven classification of the certainty of scholarly assertions

doi:10.7287/peerj.preprints.27829v2

Mario Prieto¹, Helena Deus², Anita De Waard³, Erik Schultes⁴, Beatriz García-Jiménez⁵, Mark D Wilkinson⁶

1 Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Madrid, Madrid, Spain

2 Elsevier Inc., Cambridge, MA, United States

3 Elsevier Research Collaborations Unit, Jericho, VT, United States

4 GO FAIR International Support and Coordination Office, Leiden, The Netherlands

5 Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Madrid, Madrid, Spain

6 Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Madrid, Madrid, Spain

Published: 2019-12-20
Accepted: 2019-12-20

Subject Areas: Bioinformatics, Computational Science, Data Mining and Machine Learning
Keywords: text mining, scholarly communication, certainty, FAIR Data, machine learning

Copyright: © 2019 Prieto et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Prieto M, Deus H, De Waard A, Schultes E, García-Jiménez B, Wilkinson MD. 2019. Data-driven classification of the certainty of scholarly assertions. PeerJ Preprints 7:e27829v2 https://doi.org/10.7287/peerj.preprints.27829v2

Abstract

The grammatical structures scholars use to express their assertions are intended to convey various degrees of certainty or speculation. Prior studies have suggested a variety of categorization systems for scholarly certainty; however, these have not been objectively tested for their validity, particularly with respect to representing the interpretation by the reader, rather than the intention of the author. In this study, we use a series of questionnaires to determine how researchers classify various scholarly assertions, using three distinct certainty classification systems. We find that there are three distinct categories of certainty along a spectrum from high to low. We show that these categories can be detected in an automated manner, using a machine learning model, with a cross-validation accuracy of 89.2% relative to an author-annotated corpus, and 82.2% accuracy against a publicly-annotated corpus. This finding provides an opportunity for contextual metadata related to certainty to be captured as a part of text-mining pipelines, which currently miss these subtle linguistic cues. We provide an exemplar machine-accessible representation - a Nanopublication - where certainty category is embedded as metadata in a formal, ontology-based manner within text-mined scholarly assertions.

Author Comment

The statistics used to generate the results have changed in one important respect: because the data are compositional, an additional pre-processing step was required. The results are unaffected. We added the machine-learning component, together with output from applying the machine learning model to novel statements, showing that it is now possible to automatically detect hedging erosion in citation chains. Finally, we updated the structure of the nanopublication.
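
For readers unfamiliar with compositional data, the sketch below illustrates one common pre-processing step of this kind, the centered log-ratio (CLR) transform. The comment above does not say which transform was applied, so CLR is an assumption made purely for illustration, and the names `clr`, `pseudocount`, and `votes` are hypothetical.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one row of non-negative counts.

    Assumption: CLR is used here only as a typical example of
    compositional pre-processing; it may not be the authors' method.
    """
    # Shift every count so that zero counts do not produce log(0).
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    # Close the data: convert shifted counts to proportions summing to 1.
    proportions = [s / total for s in shifted]
    logs = [math.log(p) for p in proportions]
    # Centering step: subtract the mean log-proportion
    # (the log of the geometric mean) from each log-proportion.
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical example: 20 respondents assign one statement
# to three certainty categories.
votes = [12, 5, 3]
transformed = clr(votes)
```

The transformed values sum to zero, which removes the constraint that category proportions must add up to one and makes the data better suited to standard multivariate statistics.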

Supplemental Information

Horn’s parallel analysis result for S1

Horn’s parallel analysis result for S2

Horn’s parallel analysis result for S3