The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central

David Schindler; Felix Bensmann; Stefan Dietze; Frank Krüger

doi:10.7717/peerj-cs.835

David Schindler¹, Felix Bensmann², Stefan Dietze²,³, Frank Krüger¹,⁴

¹ Institute of Communications Engineering, University of Rostock, Rostock, Germany

² GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany

³ Heinrich-Heine-University, Düsseldorf, Germany

⁴ Department Knowledge, Culture & Transformation, University of Rostock, Rostock, Germany

Published: 2022-01-14
Accepted: 2021-12-07
Received: 2021-10-08

Academic Editor: Sedat Akleylek

Subject Areas: Data Mining and Machine Learning, Data Science, Digital Libraries, Natural Language and Speech, World Wide Web and Web Science
Keywords: Knowledge graph, Software mention, Named entity recognition, Software citation

Copyright: © 2022 Schindler et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Schindler D, Bensmann F, Dietze S, Krüger F. 2022. The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Computer Science 8:e835 https://doi.org/10.7717/peerj-cs.835

The authors have chosen to make the review history of this article public.

Abstract

Science across all disciplines has become increasingly data-driven, leading to additional needs with respect to software for collecting, processing and analysing data. Thus, transparency about software used as part of the scientific process is crucial for understanding the provenance of individual research data and insights, is a prerequisite for reproducibility and can enable macro-analysis of the evolution of scientific methods over time. However, a lack of rigor in software citation practices renders the automated detection and disambiguation of software mentions a challenging problem. In this work, we provide a large-scale analysis of software usage and citation practices facilitated through an unprecedented knowledge graph of software mentions and affiliated metadata generated through supervised information extraction models trained on a unique gold standard corpus and applied to more than 3 million scientific articles. Our information extraction approach distinguishes different types of software and mentions, disambiguates mentions and outperforms the state of the art significantly, leading to the most comprehensive corpus of 11.8 M software mentions that are described through a knowledge graph consisting of more than 300 M triples. Our analysis provides insights into the evolution of software usage and citation patterns across various fields, ranks of journals, and impact of publications. To the best of our knowledge, this is the most comprehensive analysis of software use and citation to date. All data and models are shared publicly to facilitate further research into scientific use and citation of software.

Introduction

Science across all disciplines has become increasingly data-driven, leading to additional needs with respect to software for collecting, processing and analyzing data. Hence, transparency about software used as part of the scientific process is crucial to ensure reproducibility and to understand provenance of individual research data and insights. Knowledge about the particular version or software development state is a prerequisite for reproducibility of scientific results as even minor changes to the software might impact them significantly.

Furthermore, from a macro-perspective, understanding software usage, varying citation habits and their evolution over time within and across distinct disciplines can shape the understanding of the evolution of scientific disciplines, the varying influence of software on scientific impact and the emerging needs for computational support within particular disciplines and fields. Initial efforts are made to provide publicly accessible datasets that link open access articles to respective software that is used and cited, for instance, the OpenAIRE Knowledge Graph (Manghi et al., 2019) or SoftwareKG (Schindler, Zapilko & Krüger, 2020). Given the scale and heterogeneity of software citations, robust automated methods are required, able to detect and disambiguate mentions of software and related metadata.

Despite the existence of software citation principles (Smith, Katz & Niemeyer, 2016; Katz et al., 2021), software mentions in scientific articles are usually informal and often incomplete: information about the developer or the version is often missing entirely (see Fig. 1). Spelling variations and mistakes for software names, even common ones (Schindler, Zapilko & Krüger, 2020), increase the complexity of automatic detection and disambiguation. Training and evaluation of information extraction approaches require reliable ground truth data of sufficient size, raising the need for manually annotated gold standard corpora of software mentions.

Figure 1: Annotated sentences from SOMESCI missing information required by software citation standards.

DOI: 10.7717/peerj-cs.835/fig-1

Most works concerned with recognition of software mentions in scientific articles apply manual analysis on small corpora in order to answer specific questions (Howison & Bullard, 2016; Nangia & Katz, 2017) or are limited to specific software (Li, Lin & Greenberg, 2016; Li, Yan & Feng, 2017). Automatic methods, enabling large scale analysis, have been implemented by iterative bootstrapping (Pan et al., 2015) as well as machine learning on manually engineered rules (Duck et al., 2016). However, both achieve only moderate performance. Extraction through deep learning with a Bi-LSTM-CRF (Schindler, Zapilko & Krüger, 2020) shows promise, but requires sufficient and reliable ground truth data which only recently became available.

Available corpora (Duck et al., 2016; Schindler, Zapilko & Krüger, 2020; Du et al., 2021) do not cover all available metadata features, cater for disambiguation of different spelling variations of the same software or distinguish between the purpose of the mention such as creation or usage. In SOMESCI (Schindler et al., 2021b), we have introduced a gold standard knowledge graph of software mentions in scientific articles. To the best of our knowledge, SOMESCI is the most comprehensive gold standard corpus of software mentions in scientific articles, created by manually annotating 3,756 software mentions with additional information about types of software, mentions and related features, resulting in 7,237 labeled entities in 47,524 sentences from 1,367 PMC articles.

In this work, we provide a large-scale analysis of software usage and citation practices facilitated through an unprecedented knowledge graph of software mentions and affiliated metadata generated through a supervised information extraction model trained on SOMESCI and applied to more than 3 million scientific articles. In summary, our contributions include:

A large-scale analysis of software usage across 3,215,386 scholarly publications covering a range of diverse fields and providing unprecedented insights into the evolution of software usage and citation patterns across various domains, distinguishing between different types of software and mentions as well as rank of journals and impact of publications. Results indicate strongly discipline-specific usage of software and an overall increase in software adoption. To the best of our knowledge, this is the most comprehensive analysis of software use and citation to date.
A comprehensive knowledge graph of software citations in scholarly publications comprising 301,825,757 triples describing 11.8 M software mentions together with types and additional metadata. The knowledge graph is represented using established vocabularies capturing the relations between citation contexts, disambiguated software mentions and related metadata, and provides a unique resource for further research into software use and citation patterns.
Robust supervised information extraction models for disambiguating software mentions and related knowledge in scholarly publications. As part of our experimental evaluation, our model based on SciBERT and trained on SOMESCI (Schindler et al., 2021b) for NER and classification outperforms state-of-the-art methods for software extraction by 5 pp on average. Software mentions are disambiguated and different variations of the same software, e.g., abbreviations as well as name and spelling alternatives, are interlinked.

Through these contributions, we advance the understanding of software use and citation practices across various fields and provide a significant foundation for further large-scale analysis through an unprecedented dataset as well as robust information extraction models.

The remainder of the paper is organized as follows. Related work is discussed in the following section. Methods and Materials introduces the developed information extraction methods together with the datasets used for training and testing. Results: Information Extraction Performance describes the performance obtained on the various information extraction tasks, while Results: Analysis of Software Mentions presents an in-depth analysis of the extracted data. Key findings are discussed in the Discussion, followed by a brief conclusion and an outlook on future work.

Related Work

Requirements for large scale software citation analyses

Software mentions in scientific articles have been analyzed for several reasons including mapping the landscape of available scientific software, analyses of software citation practices and measuring the impact of software in science (Krüger & Schindler, 2020). This includes manual analyses based on high quality data, such as Howison & Bullard (2016), Du et al. (2021), Nangia & Katz (2017) and Schindler et al. (2021b), but also automatic analyses such as Pan et al. (2015), Duck et al. (2016) and Schindler, Zapilko & Krüger (2020). While manual analyses provide highly reliable data, results often only provide a small excerpt and do not generalize due to small sample size. Analyses based on automatic data processing, in contrast, allow more general statements, for instance, regarding trends over time or across disciplines, but require high quality information extraction methods which themselves rely on reliable ground truth labels for supervised training. Table 1 compares manual and automatic approaches with respect to sample size and quality indicators such as inter-rater reliability (IRR) or FScore. Manual approaches provide substantial to almost perfect IRR, but are restricted to fewer than 5,000 articles. Howison & Bullard (2016), for instance, analyzed software mentions in science by content analysis in 90 articles. The main objective of Du et al. (2021) and Schindler et al. (2021b) was to create annotated corpora of high quality for supervised learning of software mentions in scientific articles. Du et al. (2021) provide labels for software, version, developer, and URL for articles from PMC, which is multidisciplinary but strongly skewed towards Medicine (see Table A11), and from Economics. Schindler et al. (2021b) exclusively used articles from PMC, and provide labels for software, a broad range of associated information, software type, mention type, and for disambiguation of software names.

Table 1:

Summary of investigations concerning software in science together with source of the articles, number of articles and software, and a quality indicator.

Level of extracted details varies between listed approaches. Note that PLoS is a subset of PMC. M, manual; A, automatic; κ, Cohen's κ; F, FScore; O, percentage overlap.

	Approach	Quality	Source	Articles	Software
M	Howison & Bullard (2016)	O = 0.68–0.83	Biology	90	286
	Nangia & Katz (2017)	–	Nature (Journal)	40	211
	Du et al. (2021)	O = 0.76	PMC, Economics	4,971	4,093
	Schindler et al. (2021b)	κ = 0.82, F = 0.93	PMC	1,367	3,756
A	Pan et al. (2015)	F = 0.58	PLoS ONE	10 K	26 K
	Duck et al. (2016)	F = 0.67	PMC	714 K	3.9 M
	Schindler, Zapilko & Krüger (2020)	F = 0.82	PLoS (Social Science)	51 K	133 K

DOI: 10.7717/peerj-cs.835/table-1

Early automatic approaches, such as Pan et al. (2015) and Duck et al. (2016) achieve only moderate recognition performance of 0.58 and 0.67 FScore, but perform analyses on up to 714 K articles raising doubts about the reliability and generalizability of the described results. Pan et al. (2015) used iterative bootstrapping—a rule-based method that learns context rules—as well as a dictionary of software names based on an initial set of seed names. Duck et al. (2016) employ machine learning classifiers on top of manually engineered rules. With the availability of large language models and deep learning methods for sequence labeling, Schindler, Zapilko & Krüger (2020) employed a Bi-LSTM-CRF and achieved an FScore of 0.82 for the recognition of software mentions in scientific articles. Most recently, Lopez et al. (2021) compare Bi-LSTM-CRF and SciBERT-CRF models on Softcite (Du et al., 2021) software entity recognition at paragraph level. They achieve 0.66 and 0.71 FScore, respectively, and further improved performance to 0.74 FScore by linking entities to Wikidata during postprocessing.

Besides high recognition rates, which are the basis for reliable statements, Schindler, Zapilko & Krüger (2020) demonstrate the capabilities of semantic web technologies for information structuring and data integration with respect to analyzing software usage. They provide a knowledge graph (KG), SoftwareKG, as a source of structured data for analyses. Moreover, the performed disambiguation of software mentions allows conclusions to be drawn on the level of software rather than software mentions, even in the presence of spelling variations. Finally, the linked nature of KGs allows the integration of external data sources, enabling further analyses. Following the direction of Schindler, Zapilko & Krüger (2020), large scale analyses of software mentions in scholarly articles require (1) robust information extraction and disambiguation techniques that achieve results on the level of manual approaches, and (2) the provision of all data in a standardized way that allows reuse and the integration of external knowledge.

Previous analyses of software in scholarly publication

As described above, previous studies on software mentions in scholarly publication were based on high quality manual analyses with small sample sizes or automatic analyses with large sample size but moderate quality. Most studies report basic descriptive statistics such as the number of overall mentions given in Table 1 or the distribution of software mentions over different software. Howison & Bullard (2016) report an average of 3.2 software mentions per article in Biology while Duck et al. (2016) report 12.9. In PMC, Duck et al. (2016) report an average of 5.5 mentions while Du et al. (2021) report 1.4 and Schindler et al. (2021b) 2.6. Similarly, Pan et al. (2015) and Schindler, Zapilko & Krüger (2020) report values of 2.7 and 2.6 for sub-selections of PLoS. Interestingly, Du et al. (2021) report a low value of 0.2 for Economics and Duck et al. (2016) a high value of 30.8 for Bioinformatics. Some of those results clearly show disciplinary differences, while others such as the PMC discrepancies might be attributed to methodological differences, for instance, publication time of articles in the investigated sets. Articles within Du et al. (2021) are significantly older than articles in Schindler et al. (2021b), which could result in a lower average software usage. This is also supported by the finding of Duck et al. (2016), who analyze software mentions up to 2013 and report a rapid increase in software usage between 2000 and 2006.

Other findings regard the distribution with respect to unique software names. Pan et al. (2015) report that 20% of software names account for 80% of mentions. Duck et al. (2016) report that 5% of software names account for 47% of mentions, and, similarly, 6.6% of entities are responsible for 50% of mentions in Schindler et al. (2021b). Therefore, all prior studies agree that the distribution of software within articles is highly skewed, pointing towards the fact that there are few pieces of general purpose software such as SPSS or R that support the scientific infrastructure. On the other hand, there is a high number of rarely mentioned software that is likely to be highly specialized towards problems and domains. Duck et al. (2016) perform an analysis of domain specific software to investigate disciplinary differences in software usage. They were able to confirm the existence of domain specific software and showed, for instance, that 65% of software used in medicine was not used in other analyzed domains. They also analyze journal specific software and applied a clustering analysis with respect to journal and software names.

Completeness of software mentions and citations is of high importance since employed software can only be clearly identified with sufficient information. Providing information such as the specific version or developer of software is, therefore, essential for provenance of study results or to provide credit for the creation of scientific software. For this purpose, guidelines for proper software citation have been established (Smith, Katz & Niemeyer, 2016; Katz et al., 2021) that recommend the following information to be included: name, author, version/release/date, location, venue, and unique ID, e.g., DOI. Howison & Bullard (2016) analyze the completeness of software mentions with respect to formal citation (44%), version (28%), developer (18%) and URL (5%). Based on the given information they were able to locate 86% of software online, but only 5% with the specific version. Completeness analyses by Du et al. (2021) showed that a total of only 44% of software mentions include further information, with version being included in 27%, publisher in 31%, and URL in 17%. An analysis by Schindler et al. (2021b) showed that 39% of mentions included a version, 23% a developer, 4% a URL and 16% a formal citation. Overall, the studies show that software mentions are still often informal and incomplete, but exhibit some notable differences between reported values. The problem of formal and informal software citation was also included in the automatic analysis of Pan et al. (2015), who identified formal citations for recognized software by automatic string pattern matching. They report a correlation between the number of mentions of a software and its formal citation frequency.

Availability of used software is crucial as studies conducted with commercial software might not be reproducible by other research teams. Furthermore, implementation details for non open source software cannot be reviewed by the scientific community and can potentially bias study results. Therefore, different studies included analyses regarding commercial, free and open source software usage. Pan et al. (2015) found that of the most frequent software mentions, which were labeled for availability manually, 64% are free for academic use. Moreover, they found that free software received more formal citations than commercial software. Howison & Bullard (2016) include an analysis for accessibility, license and source code availability and report that commercial software is more likely to be mentioned similar to scientific instruments (including details on developer and its location) while open access software is more often attributed with formal citations. However, they note that there is no overall preferred style for any group of software. Schindler, Zapilko & Krüger (2020) show a comparison of software mention numbers for free, open source and commercial software over time that showed no clear trend towards a specific group.

Besides analyses about software in scholarly publications in general, several studies focus on particular aspects such as specific software or the relation of software usage to bibliometric measures. Li, Lin & Greenberg (2016) analyze mentions of a specific engineering software (LAMMPS) and found that the given information is often not complete enough to determine how it was applied, with respect to version but also regarding software specific settings. Li, Yan & Feng (2017) analyze software citation for R and R packages. They report inconsistency resulting from a variety in citation standards, which are also not followed well by authors. Overall, they show a trend towards more package mentions, and find a comparably high number of formal citations for R packages (72%). Mayernik et al. (2017) discuss data and software citation and conclude that there is no impact measure for software available. Allen, Teuben & Ryan (2018) analyse the availability of source code in astrophysics and report that it could only be located for 58% of all mentions. Pan et al. (2018) analyze the completeness of usage statements of three specific bibliometric mapping tools and find provided versions in 30% of cases, URLs in 24%, and formal citations in 76%. They argue that the high formal citation rate might be due to good author citation instructions given by the tools. Howison & Bullard (2016) report that articles published in high impact journals mention more software. The platform swMATH (Greuel & Sperber, 2014) aims to establish a mapping of software used in mathematical literature by manually labeling software present in zbMATH articles pre-filtered through an automatic, heuristic search.

Most studies agree that software citation is important but often incomplete and report similar trends about the frequency of software mentions. They deviate, however, when it comes to particular numbers such as software mentions per article. This could be the result of (1) discipline specific citation habits, (2) small sample sizes in analysis studies, and (3) insufficient quality of automatic information extraction. A large scale study based on reliable automatic information extraction is required to draw conclusions across different disciplines.

Methods and Materials

Information extraction

Training dataset

We apply automatic information extraction based on supervised machine learning for recognizing software in science and use SOMESCI (Software Mentions in Science), a corpus of annotated software mentions in scientific articles (Schindler et al., 2021b). It contains 3,756 software annotations in 1,367 PubMed Central (PMC) articles as well as annotations for different software types such as Programming Environment or Plug-In, mention types such as Usage or Creation, and additional information such as Version or Developer. Moreover, it provides unique entity identities for all software annotations, which allows the development of a system not only for software name recognition but also for disambiguating names, an essential inference step in building a software Knowledge Graph. This level of detail is not represented in other available software datasets such as BioNerDs (Duck et al., 2016) or Softcite (Du et al., 2021). SOMESCI also contains recent articles and is, therefore, suited to represent the recent shift in awareness and recommendations for software citation. Quality of SOMESCI annotations was assessed through IRR and is reported to be high with a value of κ = 0.82. SOMESCI is available from Zenodo (https://doi.org/10.5281/zenodo.4968738) and an annotated example with markup from the web-based annotation tool BRAT (Stenetorp et al., 2012) is given in Fig. 2. For all information extraction problems described below we use the same 60:20:20 division into train, development, and test set as the SOMESCI baseline.

Figure 2: Sentence from SOMESCI annotated with respect to software, additional information, mention type, and software type as well as corresponding relations.

DOI: 10.7717/peerj-cs.835/fig-2

An overview of the different annotations along with the overall statistics of the SOMESCI dataset is given in Table 2. SOMESCI distinguishes each mention of a software by two types: mention and software. Mention type can take the values of Usage if the software was actively used and contributes to the article's results, Creation if it was created in the scope of the article, Deposition if it was created and additionally published, and Allusion if its name was merely stated, e.g., in a comparison with another software. Similarly, software type distinguishes between Application if the software can be run as a stand-alone software, PlugIn if it is an extension to an existing host software, Operating System, and Programming Environment if it is a framework for writing and executing program code. More details on the different types and relations are provided in the Taxonomy for Software and Related Information.

Table 2:

Overview of the SOMESCI corpus.

Further details can be found in Schindler et al. (2021b).

SOMESCI statistics
# Articles	1,367
# Sentences w/ Software	2,728
# Sentences w/o Software	44,796
# Annotations	7,237
# Software	3,756
# unique Software	883
# Relations	3,776
Software Type	Application, PlugIn, Operating System (OS), Programming Environment (PE)
Mention Type	Allusion, Usage, Creation, Deposition
Additional Information	Developer, Version, URL, Citation, Extension, Release, License, Abbreviation, Alternative Name

DOI: 10.7717/peerj-cs.835/table-2

Inference dataset

The inference dataset includes 3,215,386 articles indexed in PMC, acquired via bulk download (https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/) on January 22, 2021. Construction of SoftwareKG requires metadata and plain text of each article. To acquire this information, the Journal Article Tag Suite (JATS) format was used instead of the also available Portable Document Format (PDF). PDF is the standard form in which humans consume scientific articles; however, it has drawbacks for machines due to formatting artifacts caused by elements such as headers, footers, page numbering, or multi-column layouts. While some tools, such as GROBID (2021), perform well on PDF-to-text conversion, using JATS prevents errors resulting from text formatting. JATS is an XML-based format, and while specific tagging conventions vary between the journals indexed in PMC, they all follow a common scheme, making it a suitable source for both metadata and plain text. Both were extracted using a custom implementation available in the associated source code (https://github.com/dave-s477/SoftwareKG).
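The following minimal sketch illustrates how metadata and plain text might be read from a JATS file with Python's standard library. It is not the custom implementation referenced above; the sample document, the function name `extract_article` and the selection of fields are assumptions made for illustration, and real PMC files vary in their tagging conventions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened JATS document used only for illustration.
JATS_SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-title-group><journal-title>Example Journal</journal-title></journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.0000/example.1</article-id>
      <title-group><article-title>An example article</article-title></title-group>
      <pub-date><year>2020</year></pub-date>
    </article-meta>
  </front>
  <body>
    <sec><title>Methods</title>
      <p>Data were analysed with <italic>R</italic> version 3.6.</p>
      <p>Plots were created using ggplot2.</p>
    </sec>
  </body>
</article>"""


def extract_article(jats_xml: str) -> dict:
    """Return a few metadata fields and the body paragraphs as plain text."""
    root = ET.fromstring(jats_xml)

    def first_text(path: str):
        node = root.find(path)
        # itertext() flattens inline markup such as <italic> into plain text.
        return "".join(node.itertext()).strip() if node is not None else None

    paragraphs = [
        " ".join("".join(p.itertext()).split())  # collapse runs of whitespace
        for p in root.iterfind("./body//p")
    ]
    return {
        "doi": first_text(".//article-id[@pub-id-type='doi']"),
        "title": first_text(".//article-title"),
        "journal": first_text(".//journal-title"),
        "year": first_text(".//pub-date/year"),
        "paragraphs": paragraphs,
    }


article = extract_article(JATS_SAMPLE)
print(article["doi"], article["paragraphs"])
```

Because the text is taken from the XML tree rather than from a rendered page, headers, footers and column breaks never enter the extracted paragraphs.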

Entity recognition and classification

The objective of this information extraction step is to recognize software mentions and associated additional information, and to classify software according to its Software Type and Mention Type. The target labels are summarized in Table 2. The task is modelled as an NER sequence tagging problem where each sentence is considered as a sequence of tokens each of which has to be assigned a correct output tag.
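As a minimal illustration of the sequence tagging formulation, the sketch below converts entity spans of one sentence into per-token B/I/O tags. The helper `spans_to_bio`, the example sentence and the label names `Software` and `Version` are assumptions chosen for illustration; the actual label inventory is the one summarized in Table 2.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into one BIO tag per token.

    spans: list of (start, end, label) in token indices, `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical example sentence with one software name and one version.
tokens = ["We", "used", "SPSS", "Statistics", "version", "22", "."]
spans = [(2, 4, "Software"), (5, 6, "Version")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A tagger trained on such data has to predict exactly one tag per token, which is what makes the task a sequence labeling problem.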

Several suitable state-of-the-art machine learning models are considered for the task. We compare the baseline results on SOMESCI (Schindler et al., 2021b), which were established by an un-optimized Bi-LSTM-CRF model, to other machine learning models suited for scientific literature, for instance, SciBERT (Beltagy, Lo & Cohan, 2019). To establish a consistent naming scheme we label all implemented and tested models by type, classification target and optimization state: M_{type,target,optimization}. Results for NER are reported by mean and standard deviation over repeated training runs because performance can vary between runs due to randomization in initialization and training. Results of at least 4 different training runs are provided for hyper-parameter optimization and 16 for final performance estimation. The best model is selected on the problem of identifying software mentions (M_{−,sw,−}) as we consider it the most important quality measure and the main problem all other tasks relate to.

Bi-LSTM-CRFs (M_{L,sw,−}) were selected as they are well established for NER and have been reported to achieve state-of-the-art results (Ma & Hovy, 2016). Further, they have previously been applied to the problem of recognizing software in scientific literature (Schindler, Zapilko & Krüger, 2020; Schindler et al., 2021b; Lopez et al., 2021). More details on the model can be found in Ma & Hovy (2016), Schindler, Zapilko & Krüger (2020) as well as in the implementation details in our published code.

BERT (Devlin et al., 2019) is a transformer-based model that is pre-trained on a masked language prediction task and has proven to achieve state-of-the-art performance across a wide range of NLP problems after fine-tuning. Different adaptations of the BERT pre-training procedure exist for scientific literature, resulting in the two well established models BioBERT (Lee et al., 2019) (M_{BB,sw,−}) and SciBERT (Beltagy, Lo & Cohan, 2019) (M_{SB,sw,−}). While BioBERT is pre-trained on PubMed abstracts as well as PMC full-texts, SciBERT is pre-trained on full-text articles from Semantic Scholar with 18% of articles coming from the domain of Computer Science and the remaining 82% from Biomedicine. To reduce run-time requirements, hyper-parameter optimization was only performed for the best performing BERT model, which was chosen by comparing both models after fine-tuning with the default configuration summarized in Table 3. The parameter Sampling reduces the size of the training set by randomly suppressing sentences from the training corpus that do not contain software.

Table 3:

Hyper-parameters considered for BERT models including their default setting.

Parameter	Default
Learning Rate (LR)	1e−5
Sampling	all data
Dropout	0.1
Gradient Clipping	1.0

DOI: 10.7717/peerj-cs.835/table-3
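The effect of the Sampling parameter can be sketched as follows. How the parameter is encoded in the published code is not stated here, so the `keep_fraction` argument, the function name and the rule that any non-O tag marks a sentence as containing software are assumptions for illustration only.

```python
import random


def sample_training_sentences(sentences, keep_fraction, seed=0):
    """Keep every sentence with an annotation and a random share of the rest.

    sentences: list of (tokens, tags) pairs with BIO tags.
    keep_fraction: assumed share of software-free sentences to keep;
                   1.0 corresponds to the default "all data".
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    kept = []
    for tokens, tags in sentences:
        has_software = any(tag != "O" for tag in tags)
        if has_software or rng.random() < keep_fraction:
            kept.append((tokens, tags))
    return kept


# Hypothetical toy corpus: one sentence with software, three without.
corpus = [
    (["We", "used", "R"], ["O", "O", "B-Software"]),
    (["Results", "improved"], ["O", "O"]),
    (["Samples", "were", "frozen"], ["O", "O", "O"]),
    (["See", "below"], ["O", "O"]),
]
all_data = sample_training_sentences(corpus, keep_fraction=1.0)
only_software = sample_training_sentences(corpus, keep_fraction=0.0)
print(len(all_data), len(only_software))
```

Since only 2,728 of the 47,524 SOMESCI sentences contain software (Table 2), suppressing software-free sentences shifts the class balance considerably towards the entity tags.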

The overall best model on the development set is selected and extended to solve the three main objectives (M_{−,sw+info,−}) of the initial information extraction step: (1) recognize software mentions and corresponding additional information, (2) classify software type, (3) classify mention type of extracted software mentions. The combined problem is modeled as hierarchical multi-task sequence labeling and illustrated in Fig. 3. Multi-task learning can improve recognition performance and help to learn better representations if the given tasks are related, as it implicitly increases the sample size (Ruder, 2017). Therefore, the main layers of the model share their weights across all sub-tasks and are updated with loss signals from all individual tasks. The output of each sub-task is calculated by a separate fully connected layer with softmax activation. For backpropagation we chose the simple approach of summing the three cross-entropy losses; alternative weightings could be explored in the future, for instance, as described by Kendall, Gal & Cipolla (2018).
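The summed loss can be written down in a few lines. The sketch below is plain Python rather than the deep learning code used for training; the task names, the toy logits and the choice to sum over tokens without averaging are assumptions for illustration.

```python
import math


def cross_entropy(logits, gold_index):
    """Cross-entropy of a softmax over `logits` against one gold class index."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_norm - logits[gold_index]


def multi_task_loss(task_logits, task_gold):
    """Sum the per-token cross-entropy losses of all tasks with equal weight."""
    total = 0.0
    for task, logits_per_token in task_logits.items():
        for logits, gold in zip(logits_per_token, task_gold[task]):
            total += cross_entropy(logits, gold)
    return total


# Hypothetical outputs for a single token: three tasks, two classes each,
# all logits equal, so every task contributes log(2) to the loss.
task_logits = {
    "software": [[0.0, 0.0]],
    "mention_type": [[0.0, 0.0]],
    "software_type": [[0.0, 0.0]],
}
task_gold = {"software": [0], "mention_type": [1], "software_type": [0]}
loss = multi_task_loss(task_logits, task_gold)
print(loss)
```

With equal weights each task influences the shared layers in proportion to the size of its own loss, which is the behavior that uncertainty-based weighting schemes aim to refine.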

Figure 3: Illustration of the employed multi-task, hierarchical, sequence labeling model.
Features are generated based on shared layers. The features are passed to 3 separate tasks and loss signals are summed to update shared weights. Outputs of classification layers are passed back to the network as input features to other classification layers, depicted from left to right in the image. Teacher forcing—replacing lower level classification outputs with gold label data—is used during training to stop potentially wrong classification outputs from being passed to other classification layers. Colors represent similar types of information.

DOI: 10.7717/peerj-cs.835/fig-3

The hierarchical component is added by passing the classification result of lower hierarchy sub-tasks as input to higher sub-tasks. The classification layer for mention type receives the output of software recognition, and the software type layer receives the output of both software recognition and mention type. No gradient is passed backward through the hierarchy, so the weight updates in each classification layer are based only on the individual task loss. Teacher forcing, i.e., passing the gold label regardless of the actual prediction, is performed during training with respect to the output of lower layers in the hierarchy. As a result, we expect better update steps and faster learning convergence by providing more gold label information to higher classification layers. Additionally, teacher forcing should encourage the constraint that a software type or mention type is only classified if a software was classified before. Note that the hyper-parameters for M_{−,sw+info,opt} are based on the best set of parameters identified for software recognition, M_{−,sw,opt}.

As labels for multiple tasks have to be combined, and each task can introduce tagging inconsistencies, we experimented with adding a CRF layer on top of BERT to improve performance by learning inter-dependencies and constraints between labels. This did not improve performance but added time complexity, so we did not pursue the model further. Instead, we enforce tagging consistency by applying a simple set of rules: (1) all I-tags without leading B-tags are transformed to B-tags, including I-tags that do not match their leading B-tags; (2) entity boundaries for higher hierarchy tasks are adjusted to the base task entity boundaries; (3) when there are multiple conflicting labels in higher hierarchy steps for one identified software entity, the label for the first token is chosen. An example is given in Table 4.

Table 4:

Example for enforcing tagging consistency.

Inconsistencies are underlined.

Sentence	We	Used	SPSS	Statistics	16	.
Entities	O	O	B-App	I-App	I-Ver	O
Types	O	O	B-Use	I-Mention	O	O
Fixed	O	O	B-App-Use	I-App-Use	B-Ver	O

DOI: 10.7717/peerj-cs.835/table-4
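The three consistency rules can be sketched as follows. This is a minimal illustration and not the original implementation: the short type labels in `SOFTWARE_TYPES`, the function names, and the decision to leave a software entity without a mention type when its first token carries none are all assumptions.

```python
# Assumed short labels for the four software types
SOFTWARE_TYPES = {"App", "PE", "PlugIn", "OS"}


def fix_bio(tags):
    """Rule 1: an I-tag without a matching leading tag becomes a B-tag."""
    fixed, prev_type = [], None
    for tag in tags:
        if tag == "O":
            fixed.append(tag)
            prev_type = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "I" and etype != prev_type:
            prefix = "B"
        fixed.append(f"{prefix}-{etype}")
        prev_type = etype
    return fixed


def merge_tags(entity_tags, mention_tags):
    """Rules 2 and 3: mention types follow the base entity boundaries and
    the mention type of the first token is used for the whole entity."""
    merged, current = [], None
    for ent, men in zip(fix_bio(entity_tags), mention_tags):
        if ent == "O":
            merged.append("O")
            continue
        prefix, etype = ent.split("-", 1)
        if etype not in SOFTWARE_TYPES:
            merged.append(ent)  # additional information carries no mention type
            continue
        if prefix == "B":
            current = men.split("-", 1)[1] if men != "O" else None
        merged.append(f"{ent}-{current}" if current else ent)
    return merged


# Hypothetical tags for the tokens: scikit learn for Python 3.8
entities = ["B-PlugIn", "I-PlugIn", "O", "I-PE", "I-Ver"]
mentions = ["B-Use", "I-Creation", "O", "B-Use", "O"]
fixed = merge_tags(entities, mentions)
```

In this example the conflicting I-Creation label is overridden by the first-token label Use, and the two dangling I-tags are promoted to B-tags.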

The performance of M_{−,sw+info,opt} is evaluated against the SOMESCI baseline (Schindler et al., 2021b) described above. In contrast to our implementation, information is not shared between tasks in the baseline model. Instead, all classifications are performed hierarchically and individually. The reported results for the baseline therefore do not reflect error propagation, as recognition of additional information, software type classification and mention type classification all assume an underlying perfect software recognition. As our implementation does take error propagation into account, the SOMESCI baseline overestimates performance in a direct comparison.

Relation extraction

For Relation Extraction (RE), the task of classifying if and which relationships exist between entities, we considered all relations available from the training dataset. All types of additional information can be related to software; in addition, versions and developers can be related to licenses, and URLs to licenses or developers. Software mentions can be related to each other by the plugin-of relation, representing one mention as the host software and the other as the PlugIn, or by the specification-of relation if both mentions refer to the same real-world entity. Some possible relations are also depicted in Fig. 2. It is important to note that RE is the second information extraction step and, therefore, directly dependent on entity extraction. For developing and testing RE we rely on gold level entities, but in practice RE performance is expected to be lower due to false negatives and false positives resulting from entity extraction errors.

SOMESCI (Schindler et al., 2021b) provides a baseline model for classifying relations between software associated entities based on manually engineered features and an optimized Random Forest classifier. All features are implemented to yield Integer or Boolean results and take into account (1) entity order, (2) entity types, (3) entity length, (4) entity distance, (5) number of software entities, (6) sub-string relations, and (7) automatically generated acronyms.
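To make the seven feature groups concrete, the following sketch computes one integer or Boolean value per group for a candidate entity pair. It is a simplified illustration, not the SOMESCI feature implementation: the `Entity` class, the type list, and the naive acronym rule are assumptions.

```python
from dataclasses import dataclass

# Illustrative subset of entity types, encoded by their list index
ENTITY_TYPES = ["Application", "PlugIn", "Version", "Developer", "URL", "License"]


@dataclass
class Entity:
    text: str
    etype: str
    start: int  # index of the first token
    end: int    # index after the last token


def acronym(text):
    """Naive acronym: upper-cased first letters of all words."""
    return "".join(word[0] for word in text.split()).upper()


def pair_features(e1, e2, n_software):
    """Integer-valued features for a candidate relation between e1 and e2."""
    t1, t2 = e1.text.lower(), e2.text.lower()
    return [
        int(e1.start < e2.start),                        # (1) entity order
        ENTITY_TYPES.index(e1.etype),                    # (2) entity types
        ENTITY_TYPES.index(e2.etype),
        e1.end - e1.start,                               # (3) entity lengths
        e2.end - e2.start,
        max(e1.start, e2.start) - min(e1.end, e2.end),   # (4) token distance
        n_software,                                      # (5) software entities in sentence
        int(t1 in t2 or t2 in t1),                       # (6) sub-string relation
        int(acronym(e1.text) == e2.text.upper()
            or acronym(e2.text) == e1.text.upper()),     # (7) generated acronym
    ]


software = Entity("SPSS Statistics", "Application", 2, 4)
version = Entity("16", "Version", 4, 5)
features = pair_features(software, version, n_software=1)
```

A vector like this would then be passed to the classifier, which decides whether the pair stands in one of the admissible relations.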

We chose to adapt and enhance the SOMESCI baseline model instead of using more complex deep learning models because the baseline achieved good results. Moreover, RE for software associated entities is less challenging than general RE problems, as we impose a large number of constraints on how entities can be related. To improve the given rule set we individually fine-tuned the implementation of each rule. Moreover, we experimented with multi-layer perceptrons and SVMs as alternatives to the Random Forest classifier. In initial tests, they did not achieve better performance, and we chose to retain the Random Forest classifier as it has the benefit of offering better explainability. The Random Forest was trained with 100 trees, unlimited maximum depth, and no restrictions on splitting samples.

Software disambiguation

Software is referred to by different names due to abbreviations, geographical differences, or time. Schindler, Zapilko & Krüger (2020), for instance, report up to 179 different spelling variations for the commonly used software SPSS. This raises the need for software name disambiguation as a core requirement for constructing SoftwareKG. SOMESCI provides a gold standard for this problem through manually assigned unique identifiers in the form of links to external knowledge bases. However, existing knowledge bases, such as Wikidata (Vrandečić, 2012) or DBpedia (Auer et al., 2007), are sparse when it comes to scientific software, which is illustrated by an analysis of the SOMESCI disambiguation ground truth: only 205 of 883 (23%) unique software entities and 2,228 of 3,717 (60%) software mentions are represented in Wikidata. Therefore, we adapt and develop an entity disambiguation method that is able to handle previously unknown software names, such as those from creation statements, without the need to link to external knowledge bases. In consequence, we contribute to establishing a more complete KG of software.

The objective of software entity disambiguation is as follows: given a pair (E₁, E₂) of software entities, the goal is to determine whether they refer to the same real-world entity. For that purpose we employ agglomerative clustering following the procedure illustrated in Fig. 4. First, manually engineered features are calculated for each pair, resulting in a feature vector v_E1,E2. Features take into account: (1) string similarity, (2) similarity of extracted context information, (3) automatically generated abbreviations, and (4) software related information queried from DBpedia.

Figure 4: Overview of the software name disambiguation.
For all pairs of extracted software entities (E₁, E₂), features are extracted (feature extraction) and used to determine a probability of linking (perceptron). Finally, agglomerative clustering is performed to cluster similar software names.

DOI: 10.7717/peerj-cs.835/fig-4

For each pair (E₁,E₂), the vector v_E1,E2 is mapped to a probability estimate p_link of whether the two entities should be linked, using a low-complexity 4-layer perceptron (15 × 10 × 5 × 1): p_link = f_perceptron (v_E1,E2). The model is trained in a supervised manner to predict whether a link exists, l ∈ {0,1}, by splitting all possible combinations from the ground truth set into train, development and test sets in a 60:20:20 ratio. Since the model was trained as a binary classifier, the output of the perceptron is the result of a sigmoid layer, d ∈ [0,1], and is used in combination with a threshold in the following steps. We considered applying dropout but observed decreased performance in initial tests. We also did not find any increase in performance when increasing model complexity.

For disambiguation we have to consider the influence of the sample size on the density of samples in the resulting feature space. For n extracted mentions of software, the number of entity pairs that need to be disambiguated accumulates to n² − n. In the small training set, data points are less dense than in the large inference set. Moreover, the inference set contains false positive mentions with strong resemblance to software, resulting from prediction errors in the entity extraction step. This makes it difficult to find reliable decision boundaries on the training set alone. During testing it became apparent that, due to the described effect, the perceptron trained only on gold standard labels could not learn suitable decision boundaries to disambiguate entity pairs in the inference set. To counteract this problem, data augmentation was applied to add further entities resembling false positive extracted software names, which should not be linked to any other mentions. To simulate closeness to existing software names, the new samples were generated by recombining sub-strings of existing samples; for instance, ImageJ and SPSS Statistics could be combined to form Image Statistics. During creation we made sure not to re-create given software names or duplicates. In total, 2n augmented samples were created once for the n original software mentions and included at each training epoch. They were also included in the test set with the same factor in order to estimate performance in the presence of false positive samples. As we only add negative samples to the test set, there is no risk of overestimating the performance with the employed metrics of Precision and Recall.
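The augmentation step can be sketched as follows. The text only states that sub-strings of existing names are recombined; the concrete recombination used here (a random prefix of one name joined with a random suffix of another) and the function name `augment_names` are assumptions made for illustration.

```python
import random


def augment_names(names, n_new, seed=0, max_tries=100_000):
    """Create up to n_new artificial names by joining a random prefix of one
    known name with a random suffix of another. Known names and duplicates
    are rejected, so every result is a new negative sample."""
    rng = random.Random(seed)
    pool = sorted(set(names))
    known = set(pool)
    created, seen = [], set()
    tries = 0
    while len(created) < n_new and tries < max_tries:
        tries += 1
        a, b = rng.sample(pool, 2)
        head = a[: rng.randint(1, len(a))]
        tail = b[rng.randint(0, len(b) - 1):]
        candidate = " ".join((head + tail).split())  # normalise whitespace
        if candidate and candidate not in known and candidate not in seen:
            seen.add(candidate)
            created.append(candidate)
    return created


names = ["ImageJ", "SPSS Statistics", "GraphPad Prism", "MATLAB"]
augmented = augment_names(names, n_new=2 * len(names))
```

Here `n_new=2 * len(names)` mirrors the factor of 2n augmented samples described above. None of the generated names is linked to any other mention during training.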

Based on the predicted probabilities p_link for entity pairs, agglomerative clustering is performed. In each step, the two clusters with the largest linking probability are combined. As stopping criterion, the threshold t is introduced and defined as the minimal probability for which pairs are linked. It is optimized based on the available gold standard labels. Here, the creation of reliable decision boundaries within the densely populated feature space is also an issue. To counteract it, the threshold is optimized taking into account all available data points from gold standard and inference set by combining both sets. This approach allows evaluating how well the gold labeled mentions are clustered within the densely populated feature space. The performance is estimated in terms of Precision, Recall and F-score at t.

We considered single and average linkage for clustering and found almost identical performance for varying thresholds based on gold standard mentions only. Given the similar performance during the initial tests, single linkage was preferred as it offers benefits in run-time and space complexity because it allows re-using the initially calculated similarities. Average linkage, in contrast, would require additional computation for the per-cluster-pair average similarity. Due to the run-time issues described below, an evaluation of average linkage would not have been feasible with the evaluation method described above. Single linkage was then applied and evaluated as described.
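Single-linkage clustering with a probability threshold can be sketched with a union-find structure, since merging the most probable pair first and stopping at t yields the connected components of all pairs with p_link ≥ t. The function name and the example probabilities below are hypothetical; this is an illustration of the technique, not the original implementation.

```python
def single_linkage_clusters(items, p_link, threshold):
    """Merge clusters in order of decreasing link probability and stop
    once the best remaining probability falls below the threshold."""
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), p in sorted(p_link.items(), key=lambda kv: -kv[1]):
        if p < threshold:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


items = ["SPSS", "IBM SPSS", "spss statistics", "ImageJ"]
# Hypothetical perceptron outputs for some entity pairs
p_link = {
    ("SPSS", "IBM SPSS"): 0.95,
    ("IBM SPSS", "spss statistics"): 0.90,
    ("SPSS", "spss statistics"): 0.40,
    ("SPSS", "ImageJ"): 0.05,
}
clusters = single_linkage_clusters(items, p_link, threshold=0.8)
```

The sketch only needs the pairwise probabilities that were computed once, which reflects the run-time and space argument for single linkage: no per-cluster averages have to be recomputed after a merge.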

A major issue we faced for disambiguation was run-time complexity, as the number of pairs accumulates to n² − n with n > 11M software mentions. Therefore, we had to optimize for run-time. Our initial optimization step was to assume symmetric feature vectors between entities E₁ and E₂, v_E1,E2 = v_E2,E1, reducing the number of required comparisons to $\frac{n (n - 1)}{2}$, even though they are not strictly symmetric because the string lengths of entities are included as normalization factors. Further, we made the assumption that all software mentions with the exact same string refer to the same real-world software entity and only included a limited number of n_unique = 6 samples of each name. The work of Schindler et al. (2021b) showed that this can in rare cases lead to false positive clustering, but in this case the benefit outweighs the risk because the computation would not have been feasible otherwise. Disambiguation on the remaining set of ~1.4 M mentions took approximately 6 days, with feature calculations parallelized over 6 Intel® Xeon® Gold 6248 CPUs (2.50 GHz, 40 Threads).
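The effect of both optimizations on the number of comparisons can be sketched as follows. The helper names are hypothetical, and keeping the first n_unique occurrences of each string is an assumption, since the text does not state how the retained samples were chosen.

```python
from collections import defaultdict


def cap_per_name(mentions, n_unique=6):
    """Keep at most n_unique mentions for every exact name string."""
    counts = defaultdict(int)
    kept = []
    for name in mentions:
        if counts[name] < n_unique:
            counts[name] += 1
            kept.append(name)
    return kept


def ordered_pairs(n):
    """Number of pairs without the symmetry assumption: n^2 - n."""
    return n * n - n


def unordered_pairs(n):
    """Number of pairs with symmetric features: n (n - 1) / 2."""
    return n * (n - 1) // 2


mentions = ["SPSS"] * 10 + ["R"] * 3
kept = cap_per_name(mentions)
reduction = ordered_pairs(len(mentions)) / unordered_pairs(len(kept))
```

Applied to the figures given in the text, more than 11 M mentions correspond to a number of ordered pairs on the order of 10^14, whereas about 1.4 M retained mentions with symmetric features require on the order of 10^12 comparisons.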

Schindler et al. (2021b) provide a baseline implementation for entity disambiguation on SOMESCI, which uses manually engineered rules and external knowledge from DBpedia to disambiguate software names. For completeness we provide baseline results. However, as explained above, the density of the feature space increases strongly by including additional data samples, and our evaluation specifically includes augmented negative samples. Thus, the baseline cannot directly be compared to the implemented method in terms of disambiguation quality, but serves as an indicator.

SoftwareKG: knowledge graph of software mentions

Taxonomy for software and related information

We define software and its related information following the taxonomy presented by Schindler et al. (2021b) that describes the intricacies of in-text software mentions in scientific publications. The taxonomy distinguishes Type of Software, describing which artifacts are considered as software, Type of Mention, describing the context in which software was applied, and Additional Information, which is provided to describe a software entity more closely.

Type of software

Based on the distinction between end-user application (software) and package introduced by Li, Yan & Feng (2017), Schindler et al. (2021b) distinguish the following categories of software:

Applications are standalone programs, designed for end-users, that usually result in associated data or project files, e.g., Excel sheets. This definition includes software applications that are only hosted and available through web-based services, but excludes web-based collections of data. The definition also excludes databases that are used to store collections of scientific data. To be considered as an application a web-service has to provide functionality beyond filtered access to data.

Programming Environments (PE) are environments for implementing and executing computer programs or scripts. They are built around programming languages such as Python but also integrate compilers or interpreters in order to create executables from developed code. PEs play an important role in many scientific investigations and are particularly important for computationally heavy scientific disciplines such as computer science.

PlugIns are extensions specifically developed to be used with existing applications or PEs and cannot be used individually. As such, in the context of PEs, the category PlugIn could also be called package or library. Often, the original application can be inferred from the PlugIn, e.g., scikit-learn is a frequently used Python package for machine learning. The usage of PlugIns is well established in the scientific community as it allows extending the function range of well-established software libraries. This makes it possible to implement custom software without the need to establish more complex stand-alone applications.

Operating Systems (OS) build the basis for running software on a computer by managing its hardware components and the execution of all other software. OS are necessary when running a software application and are, overall, mentioned less frequently than other software. In many cases authors still choose to attribute common operating systems such as Windows, OS X, or Android as well as less commonly used ones such as Ubuntu or Raspbian.

Type of mention

The definition of Schindler et al. (2021b) introduces a hierarchy of reasons why software is mentioned within scholarly articles based on the basic distinction between mention and usage introduced by Howison & Bullard (2016):

Allusion of software describes each mention of a software name within a scholarly article. Aside from the appearance of the software name, there are no further requirements for an allusion. It should especially be noted that no indication of actual usage is required; for instance, a fact about the software can be stated or multiple software can be compared. In the context of software mentions, allusions are comparable with scholarly citations used to refer to related work.

Usage (sub-type of Allusion) defines that a software made a contribution to a study and was actively used during the investigation, which makes the software a part of the research’s provenance. Therefore, usage statements are required to allow conclusions regarding provenance. This is in line with the definition of software usage by Lopez et al. (2021).

Creation (sub-type of Allusion) indicates that software was developed and implemented as part of a scientific investigation and is itself a research contribution. Knowledge of creation statements allows tracking research software to its developers in order to provide credit to them as well as to discover and map newly published scientific software.

Deposition (sub-type of Creation) indicates that a software was published in the scope of a scientific investigation on top of being developed. In contrast to creation statements, depositions require that authors provide either a URL to access the software or the corresponding publication license. Deposition statements, therefore, make it possible to provide additional information about discovered scientific software.

Both Type of Software and Type of Mention are required to fully describe a software mention in a scientific publication.

Additional Information and Declarations

Software is constantly updated and changing. Moreover, software names are ambiguous (Schindler, Zapilko & Krüger, 2020). Therefore, software citation principles (Smith, Katz & Niemeyer, 2016; Katz et al., 2021) have been established to precisely identify software in publications. They require that software mentions in scholarly articles are accompanied by additional information allowing the unique identification of the actually used software, information that is often missing in practice (Howison & Bullard, 2016; Du et al., 2021; Schindler et al., 2021b). Here we employ the following definitions for additional information about software as defined by Schindler et al. (2021b). Developer describes the person or organization that developed a software, Version indicates a defined state in the software life-cycle, typically identified by a version number, Release indicates a defined state in the software life-cycle by using a date-based identifier, and Extension indicates different function ranges for the same base software, such as professional and basic versions. URL gives a location for further information and download, Citation provides a formal, bibliographic citation, and License covers the permission and terms of usage. Lastly, Abbreviation gives a shortened name for a software while Alternative Name provides a longer name. All additional information is related to the specific entity it describes. In most cases this is a software, but licenses can also be specified by versions, URLs and abbreviations, while developers can be described more closely by URLs and abbreviations.

Data Model and RDF/S lifting

In order to ensure interpretability and reusability, extracted data is lifted into a structured KG based on established vocabularies. KGs represent a meaningful way to semantically structure information in an unambiguous way and provide a reasonable approach to make data accessible for later reuse. In particular, KGs enable the FAIR publication of research data.

The data model of the KG is depicted in Fig. 5. It can be subdivided into different areas that represent different types of information. Bibliographic information about articles, journals and authors (depicted in violet color) is represented by employing terms from the Bibliographic Ontology (BIBO) (D’Arcus & Giasson, 2009), Dublin Core Metadata Initiative Terms (DCT) (DCMI Usage Board, 2020), Simple Knowledge Organization System (SKOS) (Miles et al., 2005), and schema.org (Guha, Brickley & Macbeth, 2016). The representation of entity mentions that were automatically extracted from the texts (orange color) is mainly built upon the NLP Interchange Format (NIF) (Hellmann et al., 2013) and Datacite (Peroni et al., 2016). Disambiguated software is represented by Software from Informatics Research Artifacts Ontology (IRAO) (Bach, 2021). For the metadata of software we examined dedicated vocabularies and ontologies including DOAP (Description of a Project) (Wilder-James, 2018), SDO (Software Description Ontology) (Garijo et al., 2019), SWO (Software Ontology) (Malone et al., 2014), OS (OntoSoft) (Gil, Ratnakar & Garijo, 2015), and Codemeta (Jones et al., 2017) (including their crosswalk), but did not use those terms as they do not represent the textual information but the real entities. For clear separation of fact and prediction we opted not to create entities from our mentions but to model the mentions as they are and provide information inferred on top of them in the form of reification statements (green color). Whenever we were not able to identify existing vocabularies that allow the representation, we introduced new terms under the prefix skg (http://data.gesis.org/softwarekg/vocab/). This was necessary for modelling the information, mention and software types.

Figure 5: Data model of the Knowledge Graph representing extracted software mentions and their related information.
For reasons of conciseness some details are left out.

DOI: 10.7717/peerj-cs.835/fig-5

Articles and mentions are central entities of the KG. Mentions of all pieces of information extracted from an article (schema:ScholarlyArticle) such as software, version or developer are represented by nif:String. Software mentions are assigned a software type (skg:softwareType) and a mention type (skg:mentionType, yellow). For all other mentions the type is noted using the skg:informationType property (yellow). To represent relations at the textual level, we introduced predicates for each possible relation. The mention of a software, for instance, refers to the corresponding version via skg:referredToByVersion.
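The mention-level modelling can be illustrated by serialising a software mention and its version as N-Triples-style lines in plain Python. The skg terms and the class nif:String are taken from the text; the mention URIs under example.org, the use of nif:anchorOf for the surface string, and the representation of type values as plain literals are assumptions for illustration and may differ from the published graph.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#",
    "skg": "http://data.gesis.org/softwarekg/vocab/",
}


def iri(curie):
    """Expand a prefixed name such as skg:softwareType to a full IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def mention_triples(software_uri, version_uri):
    """Triples for one software mention and the version it refers to."""
    sw, ver = f"<{software_uri}>", f"<{version_uri}>"
    return [
        f"{sw} {iri('rdf:type')} {iri('nif:String')} .",
        f"{sw} {iri('nif:anchorOf')} {literal('SPSS Statistics')} .",
        f"{sw} {iri('skg:softwareType')} {literal('Application')} .",
        f"{sw} {iri('skg:mentionType')} {literal('Usage')} .",
        f"{sw} {iri('skg:referredToByVersion')} {ver} .",
        f"{ver} {iri('rdf:type')} {iri('nif:String')} .",
        f"{ver} {iri('skg:informationType')} {literal('Version')} .",
    ]


# Hypothetical mention identifiers
lines = mention_triples("http://example.org/mention/1", "http://example.org/mention/2")
```

The sketch keeps both the software and the version as text-level mentions, in line with the decision to model mentions as they appear rather than as real-world entities.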

In order to indicate different degrees of probability for information aggregated over disambiguated software entities, we use reification statements (rdf:Statement) instead of domain entities. Confidence values based on the frequency within and across articles are used to provide a measure of certainty. Formally, let I_r,x be the set of all forms of a piece of information for a given relation r and software x. Further, let D be the set of all articles and m_r,a,x the mapping of a piece of information a ∈ I_r,x to x under the relation r. We then define the confidence score c_m_r,a,x as given in:

$c_{m_{r, a, x}} = \frac{1}{| \{ d \in D \mid m_{r, b, x} \in d, b \in I_{r, x} \} |} \cdot \sum_{d \in D} \frac{| \{ m_{r, a, x} \in d \} |}{\sum_{b \in I_{r, x}} | \{ m_{r, b, x} \in d \} |}, \quad a \equiv b .$

a ≡ b signals that both a and b represent the same type of information, e.g., name or developer. This way we achieve a ratio based fair weighting on mention level and on document level. All values range from 0 to 1 and also add up to 1.
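The confidence score can be sketched for one software x and one relation r as follows. The input format, a list with one entry per article holding all values mentioned for that relation, and the function name are assumptions made for illustration.

```python
from collections import Counter


def confidence_scores(values_per_article):
    """Average, over all articles that mention any value for the relation,
    the within-article share of mentions that carry a given value."""
    articles = [Counter(values) for values in values_per_article if values]
    totals = Counter()
    for counts in articles:
        n_mentions = sum(counts.values())
        for value, count in counts.items():
            totals[value] += count / n_mentions
    return {value: total / len(articles) for value, total in totals.items()}


# Hypothetical developer mentions for one software in three articles
developer_mentions = [
    ["IBM", "IBM", "SPSS Inc."],  # article 1: shares 2/3 and 1/3
    ["IBM"],                      # article 2: share 1
    [],                           # article 3: no developer mentioned, ignored
]
scores = confidence_scores(developer_mentions)
```

In this example IBM receives (2/3 + 1) / 2 = 5/6 and SPSS Inc. receives (1/3) / 2 = 1/6, so the scores add up to 1. An article with many repeated mentions carries the same weight as an article with a single mention, which is the document-level fairness described above.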

Additional information sources

SoftwareKG was built upon data from PMC, making use of the PMC OA JATS XML data set as a structured information source for article metadata. Data from PubMedKG (Xu et al., 2020) was integrated to allow bibliometric and domain specific analyses. In particular, we used PKG2020S4 (1781-Dec. 2020), Version 4 available from http://er.tacc.utexas.edu/datasets/ped. It includes Scimago data on journal H-index, journal rank, best quartiles as well as their domains and publishers. Moreover, it includes citation information for articles in PubMed from PubMed itself and Web of Science. For integration of PubMedKG we matched PMC identifiers to PubMed identifiers based on PMC’s mapping service, available at https://www.ncbi.nlm.nih.gov/pmc/pmctopmid/. Specifically, we used their CSV table to match the PMC-ID in PubMed Central with the PM-ID from PubMedKG.
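The identifier matching can be sketched with the standard csv module. The column names PMCID and PMID and the sample rows are assumptions for illustration; they should be checked against the header of the CSV table that is actually downloaded.

```python
import csv
import io


def load_pmc_to_pmid(csv_file, pmc_col="PMCID", pmid_col="PMID"):
    """Build a PMC-ID to PubMed-ID lookup, skipping rows without a PubMed ID."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        pmcid = (row.get(pmc_col) or "").strip()
        pmid = (row.get(pmid_col) or "").strip()
        if pmcid and pmid:
            mapping[pmcid] = pmid
    return mapping


# Made-up rows standing in for the downloaded mapping table
sample = io.StringIO("PMCID,PMID\nPMC0000001,10000001\nPMC0000002,\n")
pmc_to_pmid = load_pmc_to_pmid(sample)
```

Articles without a PubMed identifier, such as the second sample row, cannot be joined with PubMedKG and are therefore left out of the lookup.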

Journal-specific information varies over time, so we modelled it in an skg:JournalInfo entity that encapsulates information per year. Citation data are integrated in two ways: (1) all citations between PMC Open Access articles are inserted as schema:citation in the KG and (2) the overall number of citations an article received is included as a citation count. This allows analyses based on citation counts, but also provides a basis to identify particular citation paths.