PeerJ Computer Science Preprints: Digital Libraries
https://peerj.com/preprints/index.atom?journal=cs&subject=9800
Digital Libraries articles published in PeerJ Computer Science Preprints

Citation.js: a format-independent, modular bibliography tool for the browser and command line
https://peerj.com/preprints/27466
Published: 2019-07-11
Author: Lars G Willighagen
Background. Given the vast number of standards and formats for bibliographical data, any program working with bibliographies and citations has to be able to interpret such data. This paper describes the development of Citation.js (https://citation.js.org/), a tool to parse and format bibliographic data according to those standards. The program follows modern guidelines for software in general and JavaScript in particular, such as version control, source code analysis, integration testing and semantic versioning.
Results. The result is an extensible tool that has already seen adoption in a variety of projects and use cases: as part of a server-side page generator of a publishing platform, as part of a local extensible document generator, and as part of an in-browser converter of extracted references. Use cases range from transforming a list of DOIs or Wikidata identifiers into a BibTeX file on the command line, to displaying RIS references on a webpage with added Altmetric badges, to generating "How to cite this" sections on a blog. The accuracy of conversions is currently 27% for properties and 60% for types on average, and a typical initialization takes 120 ms in browsers and 1 s with Node.js on the command line.
Conclusions. Citation.js is a library supporting various formats of bibliographic information in a broad selection of use cases and environments. Given the support for plugins, more formats can be added with relative ease.
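The modular, plugin-based design described above can be illustrated with a minimal sketch. This is not Citation.js's actual API; the registry, function names, and the DOI/CSL-JSON detectors are illustrative assumptions showing how format-independent parsing via registered plugins can work (CSL-JSON being the internal interchange format such tools typically use):

```python
import json
import re

PARSERS = {}  # format name -> (detect, parse) plugin pair

def register(name, detect, parse):
    """Register a plugin that can detect and parse one input format."""
    PARSERS[name] = (detect, parse)

def parse(data):
    """Dispatch to the first plugin whose detector accepts the input."""
    for name, (detect, parse_fn) in PARSERS.items():
        if detect(data):
            return parse_fn(data)
    raise ValueError("no plugin recognizes this input")

# Plugin 1: bare DOIs such as "10.7717/peerj-cs.214"
register(
    "doi",
    detect=lambda s: isinstance(s, str) and re.fullmatch(r"10\.\d{4,9}/\S+", s),
    parse=lambda s: [{"DOI": s, "type": "article-journal"}],
)

# Plugin 2: CSL-JSON objects or arrays
register(
    "csl-json",
    detect=lambda s: isinstance(s, str) and s.lstrip().startswith(("[", "{")),
    parse=lambda s: json.loads(s) if s.lstrip().startswith("[") else [json.loads(s)],
)

print(parse("10.7717/peerj-cs.214"))
```

Adding support for a new format then only requires registering one more detect/parse pair, without touching the dispatcher.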
Predicting the results of evaluation procedures of academics
https://peerj.com/preprints/27582
Published: 2019-03-12
Authors: Francesco Poggi, Paolo Ciancarini, Aldo Gangemi, Andrea Giovanni Nuzzolese, Silvio Peroni, Valentina Presutti
Background. The 2010 reform of the Italian university system introduced the National Scientific Habilitation (ASN) as a requirement for applying to permanent professor positions. Since the CVs of the 59,149 candidates and the results of their assessments have been made publicly available, the ASN constitutes an opportunity to perform analyses of a nation-wide evaluation process.
Objective. The main goals of this paper are: (i) predicting the ASN results using the information contained in the candidates’ CVs; (ii) identifying a small set of quantitative indicators that can be used to perform accurate predictions.
Approach. Semantic technologies are used to extract, systematize and enrich the information contained in the applicants’ CVs, and machine learning methods are used to predict the ASN results and to identify a subset of relevant predictors.
Results. For predicting the success in the role of associate professor, our best models using all and the top 15 predictors make accurate predictions (F-measure values higher than 0.6) in 88% and 88.6% of the cases, respectively. Similar results have been achieved for the role of full professor.
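The F-measure reported above is the harmonic mean of precision and recall. A short stdlib-only sketch of its computation, using synthetic labels (1 = qualified) rather than the paper's data:

```python
def f_measure(y_true, y_pred, positive=1):
    """F1 score for one positive class, computed from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Synthetic habilitation outcomes vs. model predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(f_measure(y_true, y_pred))  # → 0.75
```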
Evaluation. The proposed approach outperforms the other models developed to predict the results of researchers’ evaluation procedures.
Conclusions. Such results allow the development of an automated system for supporting both candidates and committees in future ASN sessions and other scholars’ evaluation procedures.
Evaluating social network extraction for classic and modern fiction literature
https://peerj.com/preprints/27263
Published: 2018-10-08
Authors: Niels Dekker, Tobias Kuhn, Marieke van Erp
The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations to construct these networks, but most of these tools are not specifically created for the literary domain. Furthermore, studies on information extraction from literature typically focus on 19th-century source material. Because of this, it is unclear whether these techniques are as suitable for modern-day science fiction and fantasy literature as they are for those 19th-century classics. We present a study comparing classic and modern literature in terms of the performance of natural language processing tools for the automatic extraction of social networks, as well as the structure of the resulting networks. We find no significant differences between the two sets of novels, but both are subject to a high amount of variance. Furthermore, we identify several issues that complicate named entity recognition in modern novels and present methods to remedy them.
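A common way to build such networks is sentence-level co-occurrence: characters found by a named-entity recognizer in the same sentence get an edge whose weight counts their joint appearances. A minimal sketch with hand-made NER output (the paper's actual pipeline and thresholds may differ):

```python
from collections import Counter
from itertools import combinations

def build_network(sentences):
    """sentences: iterable of sets of character names detected per sentence.
    Returns edge weights counting sentence-level co-occurrences."""
    edges = Counter()
    for chars in sentences:
        for a, b in combinations(sorted(set(chars)), 2):
            edges[(a, b)] += 1
    return edges

# Characters detected per sentence (toy data standing in for NER output)
sentences = [
    {"Alice", "Bob"},
    {"Alice", "Bob", "Carol"},
    {"Bob", "Carol"},
]
network = build_network(sentences)
print(network[("Alice", "Bob")])  # Alice and Bob co-occur in 2 sentences
```

Network measures such as degree or density can then be computed on the resulting weighted edge list to compare classic and modern novels.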
Fostering Open Science at WSL with the EnviDat Environmental Data Portal
https://peerj.com/preprints/27211
Published: 2018-09-15
Authors: Ionut Iosifescu Enescu, Marielle Fraefel, Gian-Kasper Plattner, Lucia Espona-Pernas, Dominik Haas-Artho, Michael Lehning, Konrad Steffen
EnviDat is the institutional research data portal of the Swiss Federal Institute for Forest, Snow and Landscape Research WSL. The portal is designed to provide efficient, unified and managed access to WSL’s comprehensive reservoir of monitoring and research data, in accordance with the WSL data policy. Through EnviDat, WSL is fostering open science by making curated, quality-controlled, publication-ready research data accessible. Data producers can document author contributions for a particular data set through the EnviDat-DataCRediT taxonomy. The publication of research data sets can be complemented with additional digital resources, such as supplementary documentation, processing software or detailed descriptions of code (e.g. as Jupyter Notebooks). The EnviDat team is working towards generic solutions for enhancing open science, in line with WSL’s commitment to accessible research data.
g.citation: Scientific citation for individual GRASS GIS software modules
https://peerj.com/preprints/27206
Published: 2018-09-14
Authors: Peter Löwe, Vaclav Petras, Markus Neteler, Helena Mitasova
The authors introduce the GRASS GIS add-on module g.citation. The module extends the existing citation capabilities of GRASS GIS, which until now only provide for automated citation of the software project as a whole, authored by the GRASS Development Team, without reference to individual persons. The new module enables individual code citation for each of the over 500 implemented functionalities, including add-on modules. Three different classes of citation output are provided in a variety of human- and machine-readable formats. The implications of this reference implementation of scientific software citation for both the GRASS GIS project and the OSGeo Foundation are outlined.
Mapping ISO metadata standards to codemeta
https://peerj.com/preprints/27153
Published: 2018-08-28
Author: Ted Habermann
The codemeta project recently proposed a vocabulary for software metadata. ISO Technical Committee 211 has published a set of metadata standards for geographic data and many kinds of related resources, including software. In order for ISO metadata creators and users to take advantage of the codemeta recommendations, a mapping from ISO elements to the codemeta vocabulary must exist. This mapping is complicated by differences in the approaches used by ISO and codemeta, primarily a difference between hard and soft typing of metadata elements. These differences are described in detail and a mapping is proposed that includes sixty-four of the sixty-eight codemeta V2 terms. The codemeta terms have also been mapped to dialects used by twenty-one software repositories, registries and archives. The average number of terms mapped in these cases is 11.2. The disparity between these numbers reflects the fact that many of the dialects that have been mapped to codemeta are focused on citation or dependency identification and management while ISO and codemeta share additional targets that include access, use, and understanding. Addressing this broader set of use cases requires more metadata elements.
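A crosswalk of this kind can be sketched as a dictionary-driven translation. The ISO element paths below are abbreviated illustrations (not the full ISO 19115 notation) and the mapping pairs are assumptions for the sketch, not the paper's published mapping; the target terms (`name`, `author`, `downloadUrl`, `softwareRequirements`) are genuine codemeta V2 terms:

```python
# Illustrative crosswalk from (abbreviated) ISO element paths to codemeta terms
ISO_TO_CODEMETA = {
    "CI_Citation.title": "name",
    "CI_Responsibility.party": "author",
    "MD_Distribution.onLine.linkage": "downloadUrl",
    "LE_Processing.softwareReference": "softwareRequirements",
}

def to_codemeta(iso_record):
    """Translate a flat ISO record into a codemeta-style dictionary.
    Elements without a mapped term are dropped, mirroring the fact
    that not every element survives a crosswalk."""
    out = {"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}
    for iso_path, value in iso_record.items():
        term = ISO_TO_CODEMETA.get(iso_path)
        if term is not None:
            out[term] = value
    return out

record = {
    "CI_Citation.title": "g.citation",
    "CI_Responsibility.party": "GRASS Development Team",
}
print(to_codemeta(record)["name"])  # → g.citation
```

The hard-vs-soft typing difference the paper discusses shows up here as the need for path-like keys on the ISO side versus flat terms on the codemeta side.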
Towards computational reproducibility: researcher perspectives on the use and sharing of software
https://peerj.com/preprints/26727
Published: 2018-03-19
Authors: Yasmin Alnoamany, John A. Borghi
Research software, which includes both the source code and executables used as part of the research process, presents a significant challenge for efforts aimed at ensuring reproducibility. In order to inform such efforts, we conducted a survey to better understand the characteristics of research software as well as how it is created, used, and shared by researchers. Based on the responses of 215 participants, representing a range of research disciplines, we found that researchers create, use, and share software in a wide variety of forms for a wide variety of purposes, including data collection, data analysis, data visualization, data cleaning and organization, and automation. More participants indicated that they use open source software than commercial software. While a relatively small number of programming languages (e.g. Python, R, JavaScript, C++, Matlab) are used by a large number of participants, there is a long tail of languages used by relatively few. Between-group comparisons revealed that significantly more participants from computer science write source code and create executables than participants from other disciplines. Group comparisons related to knowledge of best practices for software creation or sharing were not significant. While many participants indicated that they draw a distinction between the sharing and preservation of software, related practices and perceptions were often not aligned with those of the broader scholarly communications community.
Digital scientific notations as a human-computer interface in computer-aided research
https://peerj.com/preprints/26633
Published: 2018-03-07
Author: Konrad Hinsen
Most of today’s scientific research relies on computers and software not only for administrative tasks, but also for processing scientific information. Examples of such computer-aided research are the analysis of experimental data or the simulation of phenomena based on theoretical models. With the rapid increase of computational power, scientific software has integrated more and more complex scientific knowledge in a black-box fashion. As a consequence, its users do not know, and do not even have a chance of finding out, which assumptions and approximations their computations are based on. The black-box nature of scientific software has thereby become a major cause of mistakes. The present work starts with an analysis of this situation from the point of view of human-computer interaction in scientific research. It identifies the key role of digital scientific notations at the human-computer interface, reviews the most popular ones in use today, and describes a proof-of-concept implementation of Leibniz, a language explicitly designed as a digital scientific notation for models formulated as mathematical equations.
Code of practice for research data usage metrics release 1
https://peerj.com/preprints/26505
Published: 2018-02-11
Authors: Martin Fenner, Daniella Lowenberg, Matt Jones, Paul Needham, Dave Vieglais, Stephen Abrams, Patricia Cruse, John Chodacki
The Code of Practice for Research Data Usage Metrics standardizes the generation and distribution of usage metrics for research data, enabling for the first time the consistent and credible reporting of research data usage. This is the first release of the Code of Practice, and the recommendations are aligned as much as possible with the COUNTER Code of Practice Release 5, which standardizes usage metrics for many scholarly resources, including journals and books. With the Code of Practice for Research Data Usage Metrics, data repositories and platform providers can report usage metrics following common best practices and using a standard report format. This is an essential step towards realizing usage metrics as a critical component in our understanding of how publicly available research data are being reused. This complements ongoing work on establishing best practices and services for data citation.
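The core of such standardization is rolling raw access events up into comparable counts, typically distinguishing total requests from unique (per-session) requests. A sketch of that aggregation; the field names here are illustrative, not the Code of Practice's normative report format:

```python
from collections import defaultdict

def usage_report(events):
    """events: iterable of (dataset_id, session_id) access events.
    Returns per-dataset total and unique (per-session) request counts."""
    totals = defaultdict(int)
    sessions = defaultdict(set)
    for dataset_id, session_id in events:
        totals[dataset_id] += 1
        sessions[dataset_id].add(session_id)
    return {
        d: {"total_requests": totals[d], "unique_requests": len(sessions[d])}
        for d in totals
    }

events = [("ds-1", "s1"), ("ds-1", "s1"), ("ds-1", "s2"), ("ds-2", "s3")]
print(usage_report(events)["ds-1"])  # 3 total requests from 2 sessions
```

Because every repository computes the same counts the same way, reported numbers become comparable across platforms, which is the point of a shared code of practice.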
Assessing value of biomedical digital repositories
https://peerj.com/preprints/2688
Published: 2017-11-29
Authors: Chun-Nan Hsu, Anita Bandrowski, Jeffrey S. Grethe, Maryann E. Martone
Digital repositories bring direct impact and influence on the research community and society, but measuring their value using formal metrics remains challenging; it is difficult to define a single perfect metric that covers all quality aspects. Here, we distinguish between impact and influence and discuss measures and mentions as the basis of quality metrics for a digital repository. We argue that these challenges may potentially be overcome through the introduction of standard resource identification and data citation practices. We briefly summarize our research and experience in the Neuroscience Information Framework, the BD2K BioCaddie project on data citation, and the Resource Identification Initiative. Full implementation of these standards will depend on cooperation from all stakeholders: digital repositories, authors, publishers, and funding agencies, but both resource and data citation have been gaining support among researchers and publishers.
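Mention-based metrics of the kind discussed above become tractable once resources carry standard identifiers. Research Resource Identifiers (RRIDs) in article text have the form `RRID:<prefix>_<number>` (e.g. `RRID:SCR_003070`), so counting mentions reduces to pattern matching. A sketch with toy article text; real RRID syntax has more variants than this simple pattern covers:

```python
import re
from collections import Counter

# Simplified RRID pattern: "RRID:" followed by a prefixed identifier
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_[A-Za-z0-9:_-]+)")

def count_mentions(texts):
    """Count RRID mentions across a corpus of article texts."""
    mentions = Counter()
    for text in texts:
        mentions.update(RRID_PATTERN.findall(text))
    return mentions

articles = [
    "Antibodies were validated (RRID:AB_2298772) before imaging.",
    "Analysis used ImageJ (RRID:SCR_003070) and the same antibody (RRID:AB_2298772).",
]
print(count_mentions(articles).most_common(1))  # → [('AB_2298772', 2)]
```

Aggregated over the literature, such counts give a repository a defensible, citation-like measure of how often its resources are actually reused.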