All research extends or uses knowledge from other research. With the primary goal of scholarly publishing being the distribution of knowledge, it is important that the publications—and the literature they cite—are distributed in an accessible, identifiable and findable manner (Shotton, 2013). That also allows the analysis and visualisation of how research cites each other (Shotton, 2013; van Eck & Waltman, 2014). While traditionally journals required text-based citations, each formatted in their own specific style, the last few decades the use of Persistent IDentifiers (PIDs) has become commonplace, with Digital Object Identifiers (DOIs) being the most common for scholarly articles, and International Standard Book Numbers (ISBNs) for books. These PIDs are then linked to central stores that provide machine-readable bibliographic information, such as Crossref and DataCite (Lammey, 2015; Brase, 2009; Neumann & Brase, 2014).
This leads to reduced findability between organisations and disciplines (Godby, Young & Childress, 2004; Zinn et al., 2016), and reference managers need to maintain parsers for numerous formats and different types of citable resources (articles, books, data, software (Smith, Katz & Niemeyer, 2016), etc.). The management requires a detailed description of the source being referenced and preferably link to the full-text too (Hull, Pettifer & Kell, 2008). A second requirement is their ability to convert references into citations, according to the norms for formatting citations in writing (Gilmour & Cobus-Kuo, 2011). Reference managers assist in keeping references accessible and machine-readable, ready to be formatted for use in citation (Fenner, Scheliga & Bartling, 2014).
To convert one data format (or scheme) to another, a crosswalk is used. A crosswalk is a set of mappings between equivalent properties and entry types in different formats (Pierre & LaPlant, 1998). First of all, for most properties a simple mapping suffices: title in BibTeX refers to the same concept as it does in BibJSON and CSL-JSON. On top of that, the property coincides with TI in RIS. This mapping could come in the form of a JSON Linked Data (JSON-LD) context, as done by CodeMeta (Jones et al., 2017), as eXtensible Stylesheet Language Transformations (XSLT), as discussed by Godby, Smith & Childress (2003).
Second, some mappings are context-dependent. For example, consider the CSL-JSON properties author and reviewed-author in relation to the RIS properties author (AU) and reviewer (usually C4). Normally, AU maps to author. However, if the entry being converted has the review entry type, AU maps to reviewed-author while author maps to C4.
Third, the data format of the values needs to be converted. While title in BibTeX can have formatting in the form of TeX, title in CSL-JSON uses a subset of HTML for formatting, and TI in RIS does not have formatting at all. Properties can also have different data types. In CSL-JSON, author is a list of objects, while authors in BibTeX, which describes the same concept, is serialized text delimited by ” and ”.
Finally, there might not be a one-to-one mapping between properties. For instance, page in CSL-JSON maps to both start and end in Citation File Format (CFF) (Druskat et al., 2018). Similarly, page, issue, volume and ISSN are all top-level properties in CSL-JSON, while the corresponding properties in Wikidata are proposed to be nested in the journal property.
Since the last two aspects can lead to information loss, crosswalks often need to be one-directional converters between two formats. To not have to create crosswalks between every possible combination of supported formats, one could define a central format, similar to the “interoperable core” in the ”long translation path” proposed by Godby, Smith & Childress (2003). It is however important that the central format can hold as much information as should be represented in any of the output formats, as to prevent information loss when converting between two formats.
Bibutils is very similar to Citation.js in that it also is a set of converters with a central format, there the Metadata Object Description Schema by the Library of Congress (Putnam, 2005). Bibutils is used as a set of CLI programs, and does not directly allow formatting as citations and bibliographies. More recently, astrocite was created, a set of parsers that all output CSL-JSON (Sifford, 2019). Astrocite uses Abstract Syntax Trees (ASTs) and formal grammars, which should make it easier to write, read and maintain parsers. For drawbacks of formal grammars, see the Outlook.
The impact of reference managers should also not be underestimated. However, most reference managers have proprietary backends or require human input to use. Even Zotero, which is open source, is only commonly used via the client or alternatively as a server. While the fact that it has a server already allows for many more possibilities than other managers, it remains difficult to run a standalone program, which Citation.js does allow.
As mentioned, Citation.js converts different bibliographical formats into each other. To achieve interoperability without too much work, a central format is chosen. All input is converted into this central format, and all output is created from it, adopting the approach of the “long translation path” with the “interoperable core” by Godby, Smith & Childress, 2003. See also Fig. 1.
Input is parsed iteratively: for each distinct format along the way from the input to the central format, a separate parser function is defined. This allow progressive enhancement, easily replacing parts of the parsing process without touching the rest, and lets users input intermediate formats without second thought.
Parsing iteratively is relatively simple, because all different kinds of input should get turned into a single type anyway: you do not have to choose what format you need next, you only have to recognize and parse what you have now. However, a similar process for output does not make as much sense. With output formatting, there is only one kind of input, the central format, and several kinds of output. If output formatting should be done iteratively as well, the paths to reach final output formats would have to be defined separately.
The Approach section is organized as follows. First, methods used while developing the software are listed. Then, design choices of the code itself, and how to install and use it is explained. Last, the method to evaluate the results is described.
The software was developed using modern standards: version control with Git, semantic versioning for releases (Preston-Werner, 2013), open source archives on GitHub (https://github.com/larsgw/citation.js; https://github.com/citation-js) and Zenodo (https://doi.org/10.5281/zenodo.1005176), browser bundles with browserify (Halliday et al., 2018), compatible code with Babel (Zhu et al., 2018), integration testing using the Travis-CI service (Travis, 2018), code linting (source code analysis) with ESLint (Zakas et al., 2018) and Standard (Aboukhadijeh et al., 2018), checking RegExp’s for ReDOS vulnerabilities with vuln-regex-detector (Davis et al., 2018), and detailed documentation with JSDoc (Williams et al., 2018).
The development process took place with Node.js and npm. First off, any changes would be linted and tested with the aforementioned tools. Bugs or new features can also warrant the introduction of new test cases. If the changes work properly, they are then committed into the version control. If the changes warrant a new release, or if enough changes have piled up for a new release, the change log is updated. Updating the version in the package metadata automatically triggers the linters and test runners, preventing accidental mistakes. Afterwards, publishing the package to npm automatically triggers the generation of files necessary for the package. The scripts used for this are described in https://github.com/larsgw/citation.js/blob/90cd68c/CONTRIBUTING.md#installing.
Apart from tools used for development, Citation.js also uses a number of runtime libraries. Their function and the reason for using them is explained below.
is a runtime library which fills in support for modern APIs on older platforms. It comes with the use of Babel to transform modern syntax for older platforms (Zhu et al., 2018).
is a utility library, only used for the Command Line Interface (CLI). It parses the command line arguments and generates documentation (Holowaychuk et al., 2018).
is a specific polyfill, a library filling in support, for the Fetch Application Programming Interface (Fetch API), a modern way of requesting web resources. It works in both Node.js and browsers (Andrews et al., 2018).
Citation.js employs a number of ways to achieve a balance between function and ease of use. The program consists of three major parts: the bibliography interface, code handling input parsing, and code handling output formatting. The bibliography interface itself is quite simple; it mainly acts as a wrapper around the parsing and formatting parts. These two parts behave in a modularised way, with a common plugin system.
Input parsing works by registered input formats. These registrations include an optional type recognizer and a synchronous and/or an asynchronous function transforming the input into a format closer to the final format: CSL-JSON. The new input can then be tested again, and will be parsed iteratively until the final format is reached. Plugin authors are encouraged to create input parsers with as small steps as possible, to allow users to input a variety of different formats.
Types can then provide a list of predicates, testing if input belongs to that format. To avoid code repetition and make plugin registration code easier to read, certain common tasks can also be accomplished using shortcuts. These shortcuts include testing text against a RegExp pattern, checking for certain properties and checking for the value of elements in an array. These properties can also eliminate the need for an explicit data type: for example, if a RegExp is provided, input can be expected to be a String.
Output formatting is less complicated. Users and developers only have to provide the identifier of the formatter. Further customization can then be done by providing options, which are automatically forwarded to the formatter. This allows the CSL plugin to take in options specifying the template and locale, for example. All formatting producing bibliographies and citations is done with citeproc-js (Bennett et al., 2018).
For configuring plugins there is also a config option. As an example a labelForm option is added, which could control the way the BibTeX output formatter generates labels. Users of this plugin can then retrieve and modify this configuration. It is also possible to offer internal functions this way, for more fine-grained control.
The methods for parsing input and formatting output are also included in a general class, Cite. Class instances also have access to opt-in version control—changes are tracked if an explicit flag is passed—and sorting. The latter currently does not have effect on CSL bibliographies unless set with the nosort option, as the styles define their own sorting method.
Table 1 shows the formats supported by Citation.js at the moment.
For simple use cases like inserting static bibliographies, a separate tool, citation.js-replacer, was developed. When included on a page, this replaces every HyperText Markup Language (HTML) element matching a certain selector with a bibliography.
Figure 3 shows an example of another use case. For example, the basic use can be extended to add additional information to citations, such as an Altmetric (Adie & Roe, 2013) score icon or Dimensions citation count (Thelwall, 2018). The output is shown in Fig. 4.
Citation.js is published as an npm package on the main npm registry, as citation-js. Use of the package is the same anywhere, apart from platform limitations. For example, synchronous requests for web resources, used to get metadata for DOIs, is limited on Chrome as discussed in Willighagen (2017b). Also, the Node.js platform, not being a browser, doesn’t have access to the Document Object Model (DOM), and so can’t easily use HTML elements as input or output.
Separate components, including formats not included in the standard configuration are available under the @citation-js scope.
Use cases for the npm package include using it when generating content (either at runtime or for static websites) like PubPub (Shihipar & Rich, 2018), and setting up APIs (Willighagen, 2017a). It is also useful for converting metadata when text mining. For example, BibJSON is one of the input formats, and can then be converted to BibTeX or formatted. All references for GitHub projects were created with a simple script running Citation.js.
Simple one-time conversions, with no extensive customization, can also be done with Command Line Interface (CLI). The command can be installed with npm, which may require root privileges depending your setup. Alternatively, any commands can be prefixed with npx instead. The command can get input text from files, command line arguments or via standard in. Output can be configured with a number of options detailed in the man file, also available by running with the -h, -help option. Any output is then written to a file or redirected to standard out.
The Citation.js npm package can also be used as a library to create integrations with, among other things, word processing systems. For example, ReLaXed (Zulko et al., 2018) integrates Citation.js into the Pug templating language to generate citations when creating Portable Document Format (PDF) documents, and the npm package citation-js-showdown was created as a demo on how to introduce syntax for citations in Markdown.
Coverage of types and properties was determined by creating two spreadsheets, one with all of the CSL types and one with all of the CSL variables. Then, columns where created for other supported formats and filled in with the corresponding type or property in that format. The amount of mappings were counted as a percentage of the total possible mappings, i.e., the total amount of types or properties, available in CSL.
This method of counting skews the perspective as not all properties and types can plausibly be mapped, either because no equivalent term exists in the other format, or because the existence is currently unknown to the authors. This is explained in more detail in the results.
To collect dependent projects, the GitHub Dependency Networks was used, which lists other GitHub repositories listing Citation.js as a dependency. To find dependent repositories on different hosting platforms, a search on the respective hosting platforms and Google was carried out. Additionally, projects known to the authors to use the library were listed. Of those lists, a diverse set of projects was extracted by hand, followed by a check to see if and how Citation.js is used.
For download counts, npm-stat.com was chosen because of its ability to collect download statistics over specific, multi-year time frames.
Performance statistics were gathered by recording runtime performance in the Chrome DevTools and the Firefox Developer Tools while importing Citation.js in the browser. The results were obtained with the default settings for each of the platforms. In Chrome, this was achieved by using guest mode, while no custom settings were used for Firefox and Node was used without command-line options affecting performance.
The sizes of the different components was determined with the disc tool (Kennedy et al., 2019).
An important aspect of Citation.js, other than parsing and formatting specific syntaxes, is the mappings between different formats. Since creating mappings between each format is often unnecessarily much work, CSL-JSON was chosen as a central format. To review the existing mappings, the number of mapped properties and entry types relative to CSL-JSON were counted. Table 2 shows those results.
|Properties||78||24 (31%)||46 (59%)||21 (27%)||42 (54%)||20 (27%)|
|Types||35||31 (89%)||31 (89%)||11 (31%)||30 (86%)||11 (31%)|
While the numbers may seem low, note that not every CSL-JSON property can be mapped: the intentions behind at least three properties are contested (Wiernik, 2018), the values of two other properties can usually be derived from other fields, ten properties are specific to references and may not apply to resource-describing schemas, and twenty properties could be reduced to just eight with linked data. Additionally, a number of properties have limited documentation and usage, making it difficult to determine what the exact meaning is.
On top of that, the other formats may not have enough well-defined properties to map either. The BibTeX and BibJSON mappings, the latter being based on BibTeX, are limited by the low number of known properties and types. Without an authoritative list of properties, examples from a range of sources were used to define a mapping, which consequently lacked lesser-used properties. The Wikidata mapping has the most potential for expansion due to the large number of described properties. In fact, a recent Citation.js update nearly doubled the number of mapped properties to 46 (59%). At that point, the CSL specification becomes limiting again.
The number of mapped RIS properties is actually higher than expected; the RIS mapping has a lot of one-to-many and many-to-one mappings, which makes it inconsistent while artificially raising the number of mapped properties. Since there is no authoritative document other than an Internet-Archived spreadsheet linked to on Wikipedia (Reference Manager, 2012), the current mapping is partially based on the Zotero translator for RIS.
Apart from property and type mappings, value conversion affect the accuracy of results as well. For example, since BibTeX does not encode information about how names are built up, such information has to be estimated. While this is seems to be going fine for Western names, other naming systems may not work as well. The same goes for how RIS encodes names: a first name, a last name and an optional suffix. Note, however, that Zotero, which uses CSL, is not particularly geared towards any naming schemes other than the most simple, as noted by D’Arcus (2008). Also, style guides themselves may not have rules for non-Western names either (Qiu, 2008; Puniamoorthy, Jeevanandam & Narayanan Kutty, 2008).
Since Citation.js was published as an npm module, it has been used independently of the authors in a variety of use cases. With the GitHub Dependency graph, Google, and via Twitter, a number of those uses can be identified, listed in Table 3. The download count is also increasing, as can be seen in Fig. 5. Between the first version in October 2016 and 26 April 2019, our package was downloaded a total of 26,718 times.
|wcite (Voß, 2019)||CLI tool for managing a Wikidata bibliography||Wikidata||Yes|
|Reference Extractor (Zelle & Zumstein, 2018)||Browser tool extracting references from Word documents||No||Yes|
|PubPub (Shihipar & Rich, 2018)||Live server-side generation of citations (with REST API)||No||CSL, BibTeX|
|service-based-antipatterns (Boceck, Popp & JREB, 2019)||Live in-browser generation of citations||No||CSL, BibTeX|
|ds.korea.ac.kr, ccv.brown.edu||Static site generation of lists of publications||BibTeX/DOI||CSL|
|dabai.compute.dtu.dk||Live in-browser generation of lists of publications||Wikidata||CSL|
|schol-js, RelaxedJS (Czerwinski, 2018; Zulko et al., 2018)||Document generation||Yes||Yes|
|Ovide, Fonio (de Mourat & Rabot, 2019; de Mourat, Plique & Pichon, 2019)||Experimental publishing platform||BibTeX||CSL|
|PolarisOS (Ribeyre, Louis-Marie & MyScienceWork, 2019)||Library management system and repository||No||CSL|
As shown in Fig. 6 and Table 4, time taken to import the library mainly consists of importing @babel/polyfill. This is because adding the polyfills requires repeated feature detection. After that the actual code is imported in two parts. In the first part, where core functionality like the Cite interface is loaded, the main culprit is addTypeParser, with 0.13 ms per call on average. In the second part, loading output-related code, importing citeproc-js takes the longest with a single call of 2.82 ms. Note that that Firefox uses Just-In-Time (JIT) compilation, compiling pieces of code when they are used a lot.
|Citation.js||Backport||5.9 kB||–||5.9 kB||5.9 ms|
|core||99.7 kB||33.9 kB||133.5 kB||8.3 ms|
|plugin-bibjson||7.6 kB||–||7.6 kB||3.5 ms|
|plugin-bibtex||42.9 kB||–||42.9 kB||2.6 ms|
|plugin-csl||87.0 kB||461.8 kB||548.9 kB||3.2 ms|
|plugin-doi||6.6 kB||–||6.6 kB||0.6 ms|
|plugin-ris||11.1 kB||–||11.1 kB||0.5 ms|
|plugin-wikidata||22.1 kB||40.4 kB||62.4 kB||2.2 ms|
|name||16.6 kB||–||16.6 kB||1.1 ms|
|date||7.3 kB||–||7.3 kB||0.2 ms|
|Additional||@babel/polyfill||–||197.3 kB||197.3 kB||32.0 ms|
|browserify||–||6.8 kB||6.8 kB||0.2 ms|
|Total||306.8 kB||740.1 kB||1046.8 kB||60.3 ms|
While code execution is one part, one should also look into the file size. This is especially important in the browser, which has to fetch the library when loading the page. The biggest part is citeproc-js, accounting for almost half of the file size. Additionally, built-in CSL styles and locales should also be counted. A complete overview can be found in Table 4.
Converting between formats and standardized crosswalks with linked data
Converting input data like parsed BibTeX, BibJSON or Wikidata API results into another format and back can get very repetitive in terms of code. Yet, there are still cases where special handling is needed. Since different formats call for different needs, each plugin has developed its own system to deal with this. Unifying this into a single, performant, reusable and developer-friendly system would be preferable.
Jones et al. (2017) use a JSON-LD context for this. While a JSON-LD context would scale very well without a central format, most cases restrict the usefulness. Consider the page property in CSL-JSON, mapping to the first and last properties in CFF. If the nested values were deserialized, this could be expressed in JSON-LD contexts. However, in the first example it cannot, since JSON-LD cannot distinguish between parts of strings.
Alternatively, a custom system could be developed that defines as much mapping as possible to and from a central format, with special cases for context-dependent mappings and one-to-many mappings. It would be difficult to do this entirely language-agnostic, since serialization and deserialization usually requires some amount of scripting.
CSL-JSON as a central format
As mentioned in the Background, an important feature of the central format is that it can hold any information needed for the output formats—if it cannot, information loss can occur. Citation.js currently uses CSL-JSON as a central format, as it has a (mostly) well-defined list of properties and entry types, an authority to clear up any confusion, and relatively good support for most metadata while still being simple to work with.
However, CSL-JSON is not perfect, and information loss is definitely possible. This is currently the case with the software entry type, which does not exist in CSL 1.0.1, and is represented by the book type by convention. When converting Wikidata input consisting of a computer program to RIS output, the fact that it is in fact a computer program is lost, even though both formats support it as an entry type.
A solution for this would be to extend the format to allow for the missing entry type or property, assuming the specification authors agree. Otherwise, a custom extension could be made. Doing that for every future shortcoming may not be sustainable though, since it effectively creates a new poorly-supported standard to add to the mix. Choosing a different format is among the possibilities, but we have not found a suitable candidate; even Wikidata, which is intended to cover everything, is changing constantly and even if a proper specification is created, the question remains whether the bulk of the publication data is up to date.
If no suitable format is found, one might add additional mappings to common formats, to cover properties missing from the central format. Then, find a path for every property through the crosswalks to get to the end format. That way, one could bypass the central format in specific cases only, keeping an easy mapping for common properties.
Scraping from source versus fetching from central stores
When getting data from, for example, the Wikidata API or scraped from a web page, that data may be incomplete. However, if part of the data you get is the DOI linked to the entity queried, you could amend that data with data fetched from a central store like Crossref or DataCite. Due to difficulties with prioritizing data sources and non-trivial merging conflicts this has not been implemented yet, although linked-data formats such as Resource Description Framework (RDF) and JSON-LD would be possibilities. Additionally, if the user specifically requests data from a specific API, it can be assumed they want that specific data to be used.
Use of formal grammars for parsing
Apart from more common formats like JSON, XML and YAML (YAML Ain’t Markup Language), Citation.js has to parse a number of text formats with syntax specific to that format, like BibTeX and RIS. While one can use standard or even built-in parsers for common formats, that is usually not possible for the latter formats. To solve this, one can employ formal grammars, which can be translated into code parsing and validating input. Examples of libraries working with grammars are PEG.js (Majda et al., 2018) and nearley.js (Kartik et al., 2018). Creating grammars has the benefits of not having to write and maintain code validating and parsing input, and having a readable grammar instead of a complex program file.
However, there are also drawbacks. Generating these grammars requires an extra build step, which in the case of Citation.js cannot be integrated with the preceding step due to a lack of appropriate tooling. On top of that, early tests have shown generated code to have poor performance or large size, and in the case of nearley.js requires a runtime library. Therefore, it may be preferable to write custom parsers in some cases. For example for RIS, which has no balanced brackets or quotes, and does not need much more than simply iterating over the individual lines.
Use of GraphQL for API queries
With REpresentational State Transfer (REST) APIs comes the problem of over-fetching and under-fetching: when fetching a resource it may contain too much unneeded information, require additional calls to the API, or both. This causes unnecessary load on both the client and the server, as both have to process more calls with more network bandwidth. One case where this is especially relevant is Wikidata, which in its linked-data nature does not expose the name of the authors or journals it links to; to retrieve that, additional requests are needed. To overcome this a better way of generating those requests locally could be implemented, or a GraphQL server could be developed by allowing the client to specify exactly what data it needs (Facebook, 2018).
Support for additional formats
Apart from the formats currently supported in Citation.js (see Table 1), there are plans to include more formats such as EndNote import files, MARC XML (Avram, 2003), the Zotero API JSON schema and Office XML. These will be published in thematic plugins. For example, formats used to describe software projects are joined in the plugin @citation-js/plugin-software-formats. These formats will also include linked data scraped from web pages.