PeerJ Computer Science: Bioinformatics
https://peerj.com/articles/index.atom?journal=cs&subject=540
Bioinformatics articles published in PeerJ Computer Science

Performance discrepancy mitigation in heart disease prediction for multisensory inter-datasets
https://peerj.com/articles/cs-1917 (published 2024-03-18)
Mahmudul Hasan, Md Abdus Sahid, Md Palash Uddin, Md Abu Marjan, Seifedine Kadry, Jungeun Kim
Heart disease is one of the primary causes of morbidity and death worldwide. Millions of people suffer heart attacks every year, and only early-stage prediction can help reduce that number. Researchers are working on designing and developing early-stage prediction systems using different advanced technologies, and machine learning (ML) is one of them. Almost all existing ML-based works consider the same dataset (intra-dataset) for the training and validation of their method. In particular, they do not consider inter-dataset performance checks, where different datasets are used in the training and testing phases. In the inter-dataset setup, existing ML models show poor performance, a phenomenon known as the inter-dataset discrepancy problem. This work focuses on mitigating the inter-dataset discrepancy problem by considering five available heart disease datasets and their combined form. All potential training and testing mode combinations are systematically executed to assess discrepancies before and after applying the proposed methods. Imbalanced data handling using SMOTE-Tomek, feature selection using random forest (RF), and feature extraction using principal component analysis (PCA), together with a long preprocessing pipeline, are used to mitigate the inter-dataset discrepancy problem. The preprocessing pipeline builds on missing-value handling using RF regression, log transformation, outlier removal, normalization, and data balancing, which make the datasets more amenable to ML. Support vector machine, K-nearest neighbors, decision tree, RF, eXtreme Gradient Boosting, Gaussian naive Bayes, logistic regression, and multilayer perceptron are used as classifiers. Experimental results show that feature selection and classification using RF produce better results than other combination strategies in both single- and inter-dataset setups.
In certain configurations of individual datasets, RF demonstrates 100% accuracy, and it achieves 96% accuracy with feature selection in the inter-dataset setup, exhibiting commendable precision, recall, F1 score, specificity, and AUC scores. The results indicate that an effective preprocessing technique has the potential to improve the performance of an ML model without necessitating the development of intricate prediction models. Addressing inter-dataset discrepancies opens a novel research avenue, enabling the amalgamation of identical features from various datasets to construct a comprehensive global dataset within a specific domain.
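The pipeline the abstract describes (balancing, RF-based feature selection, and a PCA branch) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code: the data here are synthetic, plain minority oversampling stands in for SMOTE-Tomek (which lives in the separate imbalanced-learn package), and the mean-importance threshold is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

# Synthetic stand-in for a heart-disease table: 13 features, imbalanced classes.
X, y = make_classification(n_samples=400, n_features=13, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)

# Crude balancing by oversampling the minority class
# (a stand-in for SMOTE-Tomek from imbalanced-learn).
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
Xb = np.vstack([X, X[extra]])
yb = np.concatenate([y, y[extra]])

# Feature selection: keep the features RF ranks at or above the mean importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xb, yb)
keep = rf.feature_importances_ >= rf.feature_importances_.mean()
X_sel = Xb[:, keep]

# Alternative branch: PCA feature extraction down to 5 components.
X_pca = PCA(n_components=5).fit_transform(Xb)

print(X_sel.shape[1], X_pca.shape[1])
```

Either reduced representation would then be fed to the eight classifiers listed above, in both the intra- and inter-dataset train/test combinations.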
Sensor-based systems for the measurement of Functional Reach Test results: a systematic review
https://peerj.com/articles/cs-1823 (published 2024-03-15)
Luís Francisco, João Duarte, António Nunes Godinho, Eftim Zdravevski, Carlos Albuquerque, Ivan Miguel Pires, Paulo Jorge Coelho
The Functional Reach Test (FRT) is a widely used assessment tool in various fields, including physical therapy, rehabilitation, and geriatrics. The test evaluates a person’s balance, mobility, and functional ability to reach forward while maintaining stability. Recently, there has been growing interest in utilizing sensor-based systems to objectively and accurately measure FRT results. This systematic review covered various scientific databases and publishers, including PubMed Central, IEEE Xplore, Elsevier, Springer, the Multidisciplinary Digital Publishing Institute (MDPI), and the Association for Computing Machinery (ACM), and considered studies published between January 2017 and October 2022 related to methods for automating the measurement of FRT variables and results with sensors. Camera-based devices and motion-based sensors are used for Functional Reach Tests, with statistical models extracting meaningful information. Sensor-based systems offer several advantages over traditional manual measurement techniques, as they can provide objective and precise measurements of reach distance, quantify postural sway, and capture additional parameters related to the movement.
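The two core quantities mentioned above, reach distance and postural sway, reduce to simple statistics over a tracked position trace. A minimal numpy sketch, using a purely simulated marker trajectory (no relation to any reviewed system):

```python
import numpy as np

# Simulated anterior-posterior (x) and medio-lateral (y) positions of a
# hand marker during a 3-second reach, in centimetres (illustrative data).
t = np.linspace(0.0, 3.0, 300)
x = 35.0 * np.clip(t - 0.5, 0, 1) + 0.3 * np.sin(8 * np.pi * t)
y = 0.5 * np.sin(5 * np.pi * t)

# Reach distance: maximum forward displacement from the starting position.
reach_cm = float(x.max() - x[0])

# Postural sway: RMS of the medio-lateral excursion about its mean.
sway_cm = float(np.sqrt(np.mean((y - y.mean()) ** 2)))

print(round(reach_cm, 1), round(sway_cm, 2))
```

Real sensor pipelines would first convert raw camera or IMU streams into such position traces; the statistics themselves stay this simple.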
SUTrans-NET: a hybrid transformer approach to skin lesion segmentation
https://peerj.com/articles/cs-1935 (published 2024-03-13)
Yaqin Li, Tonghe Tian, Jing Hu, Cao Yuan
Melanoma is a malignant skin tumor that threatens human life and health. Early detection is essential for effective treatment. However, the low contrast between melanoma lesions and normal skin and the irregularity of lesions in size and shape make them difficult to detect with the naked eye in the early stages, making skin lesion segmentation a challenging task. Traditional encoder-decoders built on U-shaped convolutional neural network (CNN) architectures have limitations in establishing long-term dependencies and global contextual connections, while the Transformer architecture is limited in its application to small medical datasets. To address these issues, we propose a new skin lesion segmentation network, SUTrans-NET, which combines a CNN and a Transformer in parallel to form a dual encoder, where both branches perform dynamic interactive fusion of image information in each layer. At the same time, we introduce our multi-grouping SpatialGroupAttention (SGA) module to complement the spatial and texture information of the Transformer branch, and we utilize the Focus idea of YOLOv5 to construct the Patch Embedding module in the Transformer to prevent the loss of pixel accuracy. In addition, we design a decoder with full-scale information fusion capability to fully fuse shallow and deep features at different stages of the encoder. The effectiveness of our method is demonstrated on the ISIC 2016, ISIC 2017, ISIC 2018 and PH2 datasets, and its advantages over existing methods are verified.
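The "Focus" idea borrowed from YOLOv5 is a space-to-depth slicing: every second pixel goes into the channel axis, so resolution halves without discarding any pixel, which is why it can feed a Patch Embedding without loss of pixel accuracy. A minimal numpy sketch of that operation (not the authors' implementation):

```python
import numpy as np

def focus_slice(img):
    """YOLOv5-style Focus: slice every second pixel into the channel axis,
    halving H and W while quadrupling C, with no pixels discarded."""
    return np.concatenate([img[0::2, 0::2], img[1::2, 0::2],
                           img[0::2, 1::2], img[1::2, 1::2]], axis=-1)

img = np.arange(8 * 8 * 3).reshape(8, 8, 3).astype(np.float32)
patch = focus_slice(img)
print(img.shape, "->", patch.shape)
```

Because the operation is a pure rearrangement, the output contains exactly the same values as the input, only regrouped.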
Efficient-gastro: optimized EfficientNet model for the detection of gastrointestinal disorders using transfer learning and wireless capsule endoscopy images
https://peerj.com/articles/cs-1902 (published 2024-03-11)
Shaha Al-Otaibi, Amjad Rehman, Muhammad Mujahid, Sarah Alotaibi, Tanzila Saba
Gastrointestinal diseases cause around two million deaths globally. Wireless capsule endoscopy is a recent advancement in medical imaging, but manual diagnosis is challenging due to the large number of images generated. This has led to research into computer-assisted methodologies for diagnosing these images. Endoscopy produces thousands of frames for each patient, making manual examination difficult, laborious, and error-prone. An automated approach is essential to speed up the diagnosis process, reduce costs, and potentially save lives. This study proposes transfer learning-based efficient deep learning methods for detecting gastrointestinal disorders from multiple modalities, aiming to detect gastrointestinal diseases with superior accuracy and reduce the efforts and costs of medical experts. The Kvasir eight-class dataset was used for the experiment, where endoscopic images were preprocessed and enriched with augmentation techniques. An EfficientNet model was optimized via transfer learning and fine-tuning, and the model was compared to the most widely used pre-trained deep learning models. The model’s efficacy was tested on another independent endoscopic dataset to prove its robustness and reliability.
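The augmentation step mentioned above, enriching a small endoscopic dataset with label-preserving transforms, can be sketched in a few lines of numpy. The specific transforms (flips and 90-degree rotations) are common choices, not necessarily the ones used in the paper:

```python
import numpy as np

def augment(img, rng):
    """Randomly flip and rotate a frame; such label-preserving transforms
    multiply the effective size of a small image dataset."""
    if rng.random() < 0.5:
        img = np.fliplr(img)
    if rng.random() < 0.5:
        img = np.flipud(img)
    return np.rot90(img, k=int(rng.integers(0, 4)))

rng = np.random.default_rng(42)
frame = np.zeros((224, 224, 3), dtype=np.uint8)  # dummy endoscopic frame
views = [augment(frame, rng) for _ in range(8)]
print(len(views), views[0].shape)
```

Frameworks such as Keras or torchvision provide equivalent transforms as part of their input pipelines; the principle is the same.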
Electroencephalography (EEG) based epilepsy diagnosis via multiple feature space fusion using shared hidden space-driven multi-view learning
https://peerj.com/articles/cs-1874 (published 2024-03-07)
Xiujian Hu, Yicheng Xie, Hui Zhao, Guanglei Sheng, Khin Wee Lai, Yuanpeng Zhang
Epilepsy is a chronic, non-communicable disease caused by paroxysmal abnormal synchronized electrical activity of brain neurons, and is one of the most common neurological diseases worldwide. Electroencephalography (EEG) is currently a crucial tool for epilepsy diagnosis. With the development of artificial intelligence, multi-view learning-based EEG analysis has become an important method for automatic epilepsy recognition because EEG contains diverse feature types, such as time-frequency, frequency-domain, and time-domain features. However, current multi-view learning still faces some challenges; for example, the difference between samples of the same class from different views can be greater than the difference between samples of different classes from the same view. In view of this, we propose a shared hidden space-driven multi-view learning algorithm. The algorithm uses kernel density estimation to construct a shared hidden space and combines the shared hidden space with the original space to obtain an expanded space for multi-view learning. By constructing the expanded space and utilizing the information of both the shared hidden space and the original space for learning, the relevant information of samples within and across views can be fully utilized. Experimental results on an epilepsy dataset provided by the University of Bonn show that the proposed algorithm has promising performance, with an average classification accuracy of 0.9787, at least a 4% improvement over single-view methods.
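The general idea of deriving a kernel-density-based shared representation and concatenating it with the original view can be sketched with scipy. This is a loose illustration of the concept, not the authors' algorithm: the "shared hidden space" here is simply the stack of per-view KDE densities, and the view data are random placeholders.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Two "views" of the same 200 samples (e.g., time- and frequency-domain features).
view_time = rng.normal(size=(200, 4))
view_freq = rng.normal(size=(200, 3))

def kde_scores(view):
    # Density of each sample under a KDE fit to its own view;
    # gaussian_kde expects variables in rows, samples in columns.
    kde = gaussian_kde(view.T)
    return kde(view.T)

# Shared hidden part: per-sample densities from every view, stacked.
hidden = np.column_stack([kde_scores(view_time), kde_scores(view_freq)])

# Expanded space = original view concatenated with the shared hidden part.
expanded_time = np.hstack([view_time, hidden])
print(expanded_time.shape)
```

Any standard classifier can then be trained on the expanded space, so information from both the original and shared representations contributes to learning.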
Proj2Proj: self-supervised low-dose CT reconstruction
https://peerj.com/articles/cs-1849 (published 2024-02-29)
Mehmet Ozan Unal, Metin Ertas, Isa Yildirim
In Computed Tomography (CT) imaging, one of the most serious concerns has always been ionizing radiation. Several approaches have been proposed to reduce the dose level without compromising image quality. With the emergence of deep learning, thanks to the increasing availability of computational power and huge datasets, data-driven methods have recently received a lot of attention. Deep learning-based methods have also been applied in various ways to address the low-dose CT reconstruction problem. However, the success of these methods largely depends on the availability of labeled data. On the other hand, recent studies showed that training can be done successfully without the need for labeled datasets. In this study, a training scheme was defined to use low-dose projections as their own training targets. The self-supervision principle was applied in the projection domain. The parameters of a denoiser neural network were optimized through self-supervised training. It was shown that our method outperformed traditional and compressed sensing-based iterative methods, as well as deep learning-based unsupervised methods, in the reconstruction of analytic CT phantoms and human CT images in low-dose CT imaging. Our method’s reconstruction quality is also comparable to a well-known supervised method.
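To make the "projections as their own targets" idea concrete, here is a generic blind-spot-style construction of self-supervised training pairs in the projection domain. This is an illustration of the self-supervision principle (in the spirit of Noise2Void-type masking), not the authors' exact scheme; the sinogram is random placeholder data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Noisy sinogram stand-in: 180 projection angles x 128 detector bins.
sino = rng.normal(loc=1.0, scale=0.2, size=(180, 128))

# Blind-spot pair construction: mask a random subset of pixels and replace
# each with a shifted neighbour; a denoiser network would be trained to
# predict the original values at the masked sites from their context.
mask = rng.random(sino.shape) < 0.05
inp = sino.copy()
inp[mask] = np.roll(sino, 1, axis=1)[mask]
target = sino  # the projections serve as their own training target

print(int(mask.sum()), inp.shape)
```

Because the noise at a masked detector bin is independent of its neighbours, minimizing the loss at masked sites drives the network toward the clean signal, and no labeled (normal-dose) data are required.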
A SE-DenseNet-LSTM model for locomotion mode recognition in lower limb exoskeleton
https://peerj.com/articles/cs-1881 (published 2024-02-29)
Jing Tang, Lun Zhao, Minghu Wu, Zequan Jiang, Jiaxun Cao, Xiang Bao
Locomotion mode recognition in humans is fundamental for flexible control in wearable powered-exoskeleton robots. This article proposes a hybrid model that combines a dense convolutional network (DenseNet) and long short-term memory (LSTM) with a channel attention mechanism (SENet) for locomotion mode recognition. DenseNet can automatically extract deep-level features from data, while LSTM effectively captures long-term dependencies in time series. To evaluate the validity of the hybrid model, inertial measurement units (IMUs) and pressure sensors were used to obtain motion data from 15 subjects. Five locomotion modes were tested for the hybrid model: level-ground walking, stair ascending, stair descending, ramp ascending, and ramp descending. The data features of the ramp modes were inconspicuous, leading to large recognition errors. To address this challenge, the SENet module was incorporated, which improved recognition rates to some extent. The proposed model automatically extracted the features and achieved an average recognition rate of 97.93%. Compared with known algorithms, the proposed model delivers strong recognition performance and robustness. This work holds promising potential for applications such as limb support and weight bearing.
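The SENet channel attention referred to above follows a standard squeeze-and-excitation recipe: global-average pool per channel, two small dense layers, and a sigmoid gate that rescales each channel. A minimal numpy sketch with random weights (illustrative only, not the paper's trained module):

```python
import numpy as np

def se_block(feat, w1, w2):
    """Squeeze-and-excitation channel attention on an (H, W, C) feature map:
    global-average pool, two small dense layers, per-channel sigmoid gate."""
    squeeze = feat.mean(axis=(0, 1))                 # squeeze: (C,)
    hidden = np.maximum(squeeze @ w1, 0.0)           # ReLU bottleneck: (C//r,)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # sigmoid gate: (C,)
    return feat * gate                               # rescale each channel

rng = np.random.default_rng(0)
C, r = 16, 4                                         # channels, reduction ratio
feat = rng.normal(size=(8, 8, C))
w1 = rng.normal(size=(C, C // r)) * 0.1
w2 = rng.normal(size=(C // r, C)) * 0.1
out = se_block(feat, w1, w2)
print(out.shape)
```

Trained end to end, such a gate can up-weight the channels that carry the faint ramp-related features and suppress uninformative ones, which is the stated reason for incorporating SENet here.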
Integrating hybrid transfer learning with attention-enhanced deep learning models to improve breast cancer diagnosis
https://peerj.com/articles/cs-1850 (published 2024-02-28)
Sudha Prathyusha Jakkaladiki, Filip Maly
Cancer, with its high fatality rate, instills fear in countless individuals worldwide. However, effective diagnosis and treatment can often lead to a successful cure. Computer-assisted diagnostics, especially in the context of deep learning, have become prominent methods for primary screening of various diseases, including cancer. Deep learning, an artificial intelligence technique that enables computers to reason like humans, has recently gained significant attention. This study focuses on training a deep neural network to predict breast cancer. With the advancements in medical imaging technologies such as X-ray, magnetic resonance imaging (MRI), and computed tomography (CT) scans, deep learning has become essential in analyzing and managing extensive image datasets. The objective of this research is to propose a deep-learning model for the identification and categorization of breast tumors. The system’s performance was evaluated using the breast cancer identification (BreakHis) classification datasets from the Kaggle repository and the Wisconsin Breast Cancer Dataset (WBC) from the UCI repository. The study’s findings demonstrated an impressive accuracy rate of 100%, surpassing other state-of-the-art approaches. The suggested model was thoroughly evaluated using F1-score, recall, precision, and accuracy metrics on the WBC dataset. Training, validation, and testing were conducted using pre-processed datasets, leading to remarkable results of 99.8% recall rate, 99.06% F1-score, and 100% accuracy rate on the BreakHis dataset. Similarly, on the WBC dataset, the model achieved a 99% accuracy rate, a 98.7% recall rate, and a 99.03% F1-score. These outcomes highlight the potential of deep learning models in accurately diagnosing breast cancer. Based on our research, it is evident that the proposed system outperforms existing approaches in this field.
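The F1-score, recall, precision, and accuracy figures quoted above all derive from the four cells of a binary confusion matrix. A small self-contained sketch of those definitions on toy predictions (the arrays are made up, not the study's data):

```python
import numpy as np

def clf_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / y_true.size
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])
acc, prec, rec, f1 = clf_metrics(y_true, y_pred)
print(acc, round(prec, 2), round(rec, 2), round(f1, 2))
```

On imbalanced medical datasets, recall and F1 are the more informative of these, since accuracy alone can be inflated by the majority class.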
An efficient combined intelligent system for segmentation and classification of lung cancer computed tomography images
https://peerj.com/articles/cs-1802 (published 2024-02-27)
Maheswari Sivakumar, Sundar Chinnasamy, Thanabal MS
Background and Objective
Lung cancer is one of the illnesses with the most significant mortality and morbidity rates worldwide. Automatic lung tumor segmentation from CT images is essential, but it faces several difficulties, such as varying tumor sizes, variable shapes, and complex surrounding tissue. Therefore, a novel enhanced combined intelligent system is presented to predict lung cancer in this research.
Methods
Non-small cell lung cancer must be recognized for detecting lung cancer. In the pre-processing stage of the proposed model, noise in the CT images is eliminated using an average filter and an adaptive median filter, and histogram equalization is then applied to the filtered images to improve lung image quality. The adapted deep belief network (ADBN) is used to segment the affected region, through its network layers, from the noise-removed lung CT image. Two cascaded restricted Boltzmann machines (RBMs) are used for the segmentation process in the ADBN structure, one Bernoulli–Bernoulli (BB) and one Gaussian–Bernoulli (GB), and relevant significant features are then extracted. The hybrid spiral optimization intelligent-generalized rough set (SOI-GRS) approach is used to select compelling features of the CT image. Then, a light gradient boosting machine (LightGBM) model optimized with the Ensemble Harris hawk optimization (EHHO) algorithm is used for lung cancer classification.
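The pre-processing stage described above can be sketched with scipy and numpy. This is a generic illustration on random data: a fixed-window median filter stands in for the adaptive variant, and the equalization is the standard CDF remapping.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

rng = np.random.default_rng(0)
# Noisy CT-slice stand-in with grey levels in [0, 255].
img = np.clip(rng.normal(128, 40, size=(64, 64)), 0, 255).astype(np.uint8)

# Average and median filtering, as in the pre-processing stage
# (a fixed 3x3 median stands in for the adaptive median filter here).
smoothed = uniform_filter(img.astype(float), size=3)
denoised = median_filter(img, size=3)

# Histogram equalization: remap grey levels through the normalized CDF.
hist = np.bincount(denoised.ravel(), minlength=256)
cdf = hist.cumsum() / denoised.size
equalized = (cdf[denoised] * 255).astype(np.uint8)

print(denoised.shape, int(equalized.min()), int(equalized.max()))
```

Equalization stretches the occupied grey levels across the full range, which is what "enhancing the filtered images" amounts to before segmentation.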
Results
LUNA 16, the Kaggle Data Science Bowl (KDSB), the Cancer Imaging Archive (CIA), and local datasets are used to train and test the proposed approach. Python and several well-known modules, including TensorFlow and Scikit-Learn, are used for the extensive experimental analysis. According to the results, the proposed approach accurately identifies people with lung cancer. The method produced minimal classification error while maintaining 99.87% accuracy.
Conclusion
The integrated intelligent system (ADBN-Optimized LightGBM) gives the best results among all evaluated prediction models, taking the performance criteria into account and boosting the system’s effectiveness, thereby enabling physicians and radiologists to diagnose lung cancer patients more effectively.
An approach to the dermatological classification of histopathological skin images using a hybridized CNN-DenseNet model
https://peerj.com/articles/cs-1884 (published 2024-02-26)
Anubhav De, Nilamadhab Mishra, Hsien-Tsung Chang
This research addresses the challenge of automating skin disease diagnosis using dermatoscopic images. The primary issue lies in accurately classifying pigmented skin lesions, which traditionally relies on manual assessment by dermatologists and is prone to subjectivity and time consumption. By integrating a hybrid CNN-DenseNet model, this study aimed to overcome the complexities of differentiating various skin diseases and automating the diagnostic process effectively. Our methodology involved rigorous data preprocessing, exploratory data analysis, normalization, and label encoding. Techniques such as model hybridization and batch normalization were employed to optimize the model architecture and data fitting. Initial iterations of our convolutional neural network (CNN) model achieved an accuracy of 76.22% on the test data and 75.69% on the validation data. Recognizing the need for improvement, the model was hybridized with the DenseNet architecture, the ResNet architecture was implemented for feature extraction, and the model was further trained on the HAM10000 and PAD-UFES-20 datasets. Overall, our efforts resulted in a hybrid model that demonstrated an accuracy of 95.7% on the HAM10000 dataset and 91.07% on the PAD-UFES-20 dataset. In comparison to recently published works, our model stands out because of its potential to effectively diagnose skin diseases such as melanocytic nevi, melanoma, benign keratosis-like lesions, basal cell carcinoma, actinic keratoses, vascular lesions, and dermatofibroma, not only rivaling the diagnostic accuracy of real-world clinical specialists but also offering customization potential for more nuanced clinical uses.
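The label-encoding and normalization steps named in the methodology are standard tabular/image preparation. A minimal numpy sketch, using made-up HAM10000-style class abbreviations and a random image, purely for illustration:

```python
import numpy as np

# Illustrative diagnosis labels in the style of the HAM10000 classes.
labels = np.array(["nv", "mel", "bkl", "nv", "bcc", "mel", "nv"])

# Label encoding: map each class name to an integer index.
classes, encoded = np.unique(labels, return_inverse=True)

# Min-max normalization of pixel intensities to [0, 1] for network input.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(28, 28)).astype(np.float64)
normed = (img - img.min()) / (img.max() - img.min())

print(list(classes), float(normed.min()), float(normed.max()))
```

Libraries such as scikit-learn (`LabelEncoder`) and Keras preprocessing layers wrap the same two operations; the encoded integers are what a softmax classification head is trained against.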