PeerJ Computer Science:Digital Libraries

PeerJ Computer Science:Digital Libraries https://peerj.com/articles/index.atom?journal=cs&subject=9800 Digital Libraries articles published in PeerJ Computer Science Adapting multilingual vision language transformers for low-resource Urdu optical character recognition (OCR) https://peerj.com/articles/cs-1964 2024-04-29 2024-04-29 Musa Dildar Ahmed Cheema Mohammad Daniyal Shaiq Farhaan Mirza Ali Kamal M. Asif Naeem

In the realm of digitizing written content, the challenges posed by low-resource languages are noteworthy. These languages, often lacking in comprehensive linguistic resources, require specialized attention to develop robust systems for accurate optical character recognition (OCR). This article addresses the significance of focusing on such languages and introduces ViLanOCR, an innovative bilingual OCR system tailored for Urdu and English. Unlike existing systems, which struggle with the intricacies of low-resource languages, ViLanOCR leverages advanced multilingual transformer-based language models to achieve superior performances. The proposed approach is evaluated using the character error rate (CER) metric and achieves state-of-the-art results on the Urdu UHWR dataset, with a CER of 1.1%. The experimental results demonstrate the effectiveness of the proposed approach, surpassing state of the-art baselines in Urdu handwriting digitization.

Software plays a fundamental role in research as a tool, an output, or even as an object of study. This special issue on software citation, indexing, and discoverability brings together five papers examining different aspects of how the use of software is recorded and made available to others. It describes new work on datasets that enable large-scale analysis of the evolution of software usage and citation, that presents evidence of increased citation rates when software artifacts are released, that provides guidance for registries and repositories to support software citation and findability, and that shows there are still barriers to improving and formalising software citation and publication practice. As the use of software increases further, driven by modern research methods, addressing the barriers to software citation and discoverability will encourage greater sharing and reuse of software, in turn enabling research progress.

Object detection methods based on deep learning have been used in a variety of sectors including banking, healthcare, e-governance, and academia. In recent years, there has been a lot of attention paid to research endeavors made towards text detection and recognition from different scenesor images of unstructured document processing. The article’s novelty lies in the detailed discussion and implementation of the various transfer learning-based different backbone architectures for printed text recognition. In this research article, the authors compared the ResNet50, ResNet50V2, ResNet152V2, Inception, Xception, and VGG19 backbone architectures with preprocessing techniques as data resizing, normalization, and noise removal on a standard OCR Kaggle dataset. Further, the top three backbone architectures selected based on the accuracy achieved and then hyper parameter tunning has been performed to achieve more accurate results. Xception performed well compared with the ResNet, Inception, VGG19, MobileNet architectures by achieving high evaluation scores with accuracy (98.90%) and min loss (0.19). As per existing research in this domain, until now, transfer learning-based backbone architectures that have been used on printed or handwritten data recognition are not well represented in literature. We split the total dataset into 80 percent for training and 20 percent for testing purpose and then into different backbone architecture models with the same number of epochs, and found that the Xception architecture achieved higher accuracy than the others. In addition, the ResNet50V2 model gave us higher accuracy (96.92%) than the ResNet152V2 model (96.34%).

Background Joint local context that is primarily processed by pre-trained models has emerged as a prevailing technique for text classification. Nevertheless, there are relatively few classification applications on small sample of industrial text datasets. Methods In this study, an approach of employing global enhanced context representation of the pre-trained model to classify industrial domain text is proposed. To achieve the application of the proposed technique, we extract primary text representations and local context information as embeddings by leveraging the BERT pre-trained model. Moreover, we create a text information entropy matrix through statistical computation, which fuses features to construct the matrix. Subsequently, we adopt BERT embedding and hyper variational graph to guide the updating of the existing text information entropy matrix. This process is subjected to iteration three times. It produces a hypergraph primary text representation that includes global context information. Additionally, we feed the primary BERT text feature representation into capsule networks for purification and expansion as well. Finally, the above two representations are fused to obtain the final text representation and apply it to text classification through feature fusion module. Results The effectiveness of this method is validated through experiments on multiple datasets. Specifically, on the CHIP-CTC dataset, it achieves an accuracy of 86.82% and an F1 score of 82.87%. On the CLUEEmotion2020 dataset, the proposed model obtains an accuracy of 61.22% and an F1 score of 51.56%. On the N15News dataset, the accuracy and F1 score are 72.21% and 69.06% respectively. Furthermore, when applied to an industrial patent dataset, the model produced promising results with an accuracy of 91.84% and F1 score of 79.71%. All four datasets are significantly improved by using the proposed model compared to the baselines. The evaluation result of the four dataset indicates that our proposed model effectively solves the classification problem.

This research examines the prevalence of research software as independent records of output within UK academic institutional repositories (IRs). There has been a steep decline in numbers of research software submissions to the UK’s Research Excellence Framework from 2008 to 2021, but there has been no investigation into whether and how the official academic IRs have affected the low return rates. In what we believe to be the first such census of its kind, we queried the 182 online repositories of 157 UK universities. Our findings show that the prevalence of software within UK Academic IRs is incredibly low. Fewer than 28% contain software as recognised academic output. Of greater concern, we found that over 63% of repositories do not currently record software as a type of research output and that several Universities appeared to have removed software as a defined type from default settings of their repository. We also explored potential correlations, such as being a member of the Russell group, but found no correlation between these metadata and prevalence of records of software. Finally, we discuss the implications of these findings with regards to the lack of recognition of software as a discrete research output in institutions, despite the opposite being mandated by funders, and we make recommendations for changes in policies and operating procedures.

Digital evolution is a branch of artificial life in which self-replicating computer programs—digital organisms—mutate and evolve within a user-defined computational environment. In spite of its value in biology, we still lack an up-to-date and comprehensive database on digital organisms resulting from evolution experiments. Therefore, we have developed an ontology-based semantic database—avidaDB—and an R package—avidaR—that provides users of the R programming language with an easy-to-use tool for performing complex queries without specific knowledge of SPARQL or RDF. avidaR can be used to do research on robustness, evolvability, complexity, phenotypic plasticity, gene regulatory networks, and genomic architecture by retrieving the genomes, phenotypes, and transcriptomes of more than a million digital organisms available on avidaDB. avidaR is already accepted on CRAN (i.e., a comprehensive collection of R packages contributed by the R community) and will make biologists better equipped to embrace the field of digital evolution.

Citation analysis is used extensively in the bibliometrics literature to assess the impact of individual works, researchers, institutions, and even entire fields of study. In this article, we analyze citations in one large and influential field within computer science, namely computer systems. Using citation data from a cross-sectional sample of 2,088 papers in 50 systems conferences from 2017, we examine four research areas of investigation: overall distribution of systems citations; their evolution over time; the differences between databases (Google Scholar and Scopus), and; the characteristics of self-citations in the field. On citation distribution, we find that overall, systems papers were well cited, with the most cited subfields and conference areas within systems being security, databases, and computer architecture. Only 1.5% of papers remain uncited after five years, while 12.8% accrued at least 100 citations. For the second area, we find that most papers achieved their first citation within a year from publication, and the median citation count continued to grow at an almost linear rate over five years, with only a few papers peaking before that. We also find that early citations could be linked to papers with a freely available preprint, or may be primarily composed of self-citations. For the third area, it appears that the choice of citation database makes little difference in relative citation comparisons, despite marked differences in absolute counts. On the fourth area, we find that the ratio of self-citations to total citations starts relatively high for most papers but appears to stabilize by 12–18 months, at which point highly cited papers revert to predominately external citations. Past self-citation count (taken from each paper’s reference list) appears to bear little if any relationship with the future self-citation count of each paper. The primary practical implication of these results is that the impact of systems papers, as measured in citations, tends to be high relative to comparable studies of other fields and that it takes at least five years to stabilize. A secondary implication is that at least for this field, Google Scholar appears to be a reliable source of citation data for relative comparisons.

With the rapidly increasing amount of scientific literature, it is getting continuously more difficult for researchers in different disciplines to keep up-to-date with the recent findings in their field of study. Processing scientific articles in an automated fashion has been proposed as a solution to this problem, but the accuracy of such processing remains very poor for extraction tasks beyond the most basic ones (like locating and identifying entities and simple classification based on predefined categories). Few approaches have tried to change how we publish scientific results in the first place, such as by making articles machine-interpretable by expressing them with formal semantics from the start. In the work presented here, we propose a first step in this direction by setting out to demonstrate that we can formally publish high-level scientific claims in formal logic, and publish the results in a special issue of an existing journal. We use the concept and technology of nanopublications for this endeavor, and represent not just the submissions and final papers in this RDF-based format, but also the whole process in between, including reviews, responses, and decisions. We do this by performing a field study with what we call formalization papers, which contribute a novel formalization of a previously published claim. We received 15 submissions from 18 authors, who then went through the whole publication process leading to the publication of their contributions in the special issue. Our evaluation shows the technical and practical feasibility of our approach. The participating authors mostly showed high levels of interest and confidence, and mostly experienced the process as not very difficult, despite the technical nature of the current user interfaces. We believe that these results indicate that it is possible to publish scientific results from different fields with machine-interpretable semantics from the start, which in turn opens countless possibilities to radically improve in the future the effectiveness and efficiency of the scientific endeavor as a whole.

FCMpy is an open-source Python module for building and analyzing Fuzzy Cognitive Maps (FCMs). The module provides tools for end-to-end projects involving FCMs. It is able to derive fuzzy causal weights from qualitative data or simulating the system behavior. Additionally, it includes machine learning algorithms (e.g., Nonlinear Hebbian Learning, Active Hebbian Learning, Genetic Algorithms, and Deterministic Learning) to adjust the FCM causal weight matrix and to solve classification problems. Finally, users can easily implement scenario analysis by simulating hypothetical interventions (i.e., analyzing what-if scenarios). FCMpy is the first open-source module that contains all the functionalities necessary for FCM oriented projects. This work aims to enable researchers from different areas, such as psychology, cognitive science, or engineering, to easily and efficiently develop and test their FCM models without the need for extensive programming knowledge.

Background Sharing knowledge such as resources, research results, and scholarly documents, is of key importance to improving collaboration between researchers worldwide. Research results from the field of artificial intelligence (AI) are vital to share because of the extensive applicability of AI to several other fields of research. This has led to a significant increase in the number of AI publications over the past decade. The metadata of AI publications, including bibliometrics and altmetrics indicators, can be accessed by searching familiar bibliographical databases such as Web of Science (WoS), which enables the impact of research to be evaluated and identify rising researchers and trending topics in the field of AI. Problem description In general, bibliographical databases have two limitations in terms of the type and form of metadata we aim to improve. First, most bibliographical databases, such as WoS, are more concerned with bibliometric indicators and do not offer a wide range of altmetric indicators to complement traditional bibliometric indicators. Second, the traditional format in which data is downloaded from bibliographical databases limits users to keyword-based searches without considering the semantics of the data. Proposed solution To overcome these limitations, we developed a repository, named AI-SPedia. The repository contains semantic knowledge of scientific publications concerned with AI and considers both the bibliometric and altmetric indicators. Moreover, it uses semantic web technology to produce and store data to enable semantic-based searches. Furthermore, we devised related competency questions to be answered by posing smart queries against the AI-SPedia datasets. Results The results revealed that AI-SPedia can evaluate the impact of AI research by exploiting knowledge that is not explicitly mentioned but extracted using the power of semantics. Moreover, a simple analysis was performed based on the answered questions to help make research policy decisions in the AI domain. The end product, AI-SPedia, is considered the first attempt to evaluate the impacts of AI scientific publications using both bibliometric and altmetric indicators and the power of semantic web technology.