PeerJ Computer Science: Data Science
https://peerj.com/articles/index.atom?journal=cs&subject=9600
Data Science articles published in PeerJ Computer Science

A secure fingerprint hiding technique based on DNA sequence and mathematical function
https://peerj.com/articles/cs-1847
Published: 2024-03-19
Authors: Wala’a Essa Al-Ahmadi, Asia Othman Aljahdali, Fursan Thabit, Asmaa Munshi
DNA steganography is a technique for securely transmitting important data using DNA sequences. It involves encrypting and hiding messages within DNA sequences to prevent unauthorized access and decoding of sensitive information. Biometric systems, such as fingerprinting and iris scanning, are used for individual recognition. Since biometric information cannot be changed if compromised, it is essential to ensure its security. This research aims to develop a secure technique that combines steganography and cryptography to protect fingerprint images during communication while maintaining confidentiality. The technique converts fingerprint images into binary data, encrypts them, and embeds them into a DNA sequence. It utilizes a Feistel network encryption process, along with a mathematical function and an insertion technique for hiding the data. The proposed method offers a low probability of being cracked, a high number of hiding positions, and efficient execution times. Four randomly chosen keys are used for hiding and decoding, providing a large key space and enhanced key sensitivity. The technique is evaluated using the NIST statistical test suite and compared against related work. It demonstrates resilience against various attacks, including known-plaintext and chosen-plaintext attacks. To enhance security, random ambiguous bits are introduced at random locations in the fingerprint image, increasing noise. However, this technique is limited to hiding small images within DNA sequences and cannot handle video, audio, or large images.
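The hiding pipeline the abstract describes (binary conversion, encryption, insertion into a DNA sequence) can be sketched in miniature. The 2-bit-per-base mapping and the fixed keyed insertion interval below are illustrative assumptions, not the paper's exact scheme, and the Feistel encryption step is omitted:

```python
# Illustrative DNA insertion steganography sketch (not the paper's algorithm).
# Assumption: 00->A, 01->C, 10->G, 11->T, and one payload base inserted after
# every `key` cover bases.

BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}

def bits_to_dna(bits):
    """Encode a binary string (even length) as DNA bases, two bits per base."""
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))

def hide(cover, payload, key):
    """Insert one payload base after every `key` cover bases."""
    assert len(payload) <= len(cover) // key, "payload must fit in the cover"
    out, p = [], 0
    for i, base in enumerate(cover, 1):
        out.append(base)
        if i % key == 0 and p < len(payload):
            out.append(payload[p])
            p += 1
    return "".join(out)

def extract(stego, n_payload, key):
    """Recover the payload by skipping `key` cover bases before each read."""
    payload, pos = [], 0
    for _ in range(n_payload):
        pos += key            # skip a run of cover bases
        payload.append(stego[pos])
        pos += 1
    return "".join(payload)
```

In the real scheme the payload would be the encrypted fingerprint bits and the insertion positions would be derived from the secret keys rather than a fixed interval.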

The reconstruction of equivalent underlying model based on direct causality for multivariate time series
https://peerj.com/articles/cs-1922
Published: 2024-03-18
Authors: Liyang Xu, Dezheng Wang
This article presents a novel approach for reconstructing an equivalent underlying model and deriving a precise equivalent expression through the use of direct causality topology. Central to this methodology is the transfer entropy method, which is instrumental in revealing the causality topology. The polynomial fitting method is then applied to determine the coefficients and intrinsic order of the causality structure, leveraging the foundational elements extracted from the direct causality topology. Notably, this approach efficiently discovers the core topology from the data, reducing redundancy without requiring prior domain-specific knowledge. Furthermore, it yields a precise equivalent model expression, offering a robust foundation for further analysis and exploration in various fields. Additionally, the proposed model for reconstructing an equivalent underlying framework demonstrates strong forecasting capabilities in multivariate time series scenarios.
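The transfer entropy that drives the causality topology can be estimated for discrete series with a plug-in histogram estimator. This is a generic sketch of the quantity (lag 1, discrete values), not the authors' implementation:

```python
# Plug-in estimator of transfer entropy TE_{Y->X} for discrete series,
# TE = sum p(x_{t+1}, x_t, y_t) * log2[ p(x_{t+1}|x_t, y_t) / p(x_{t+1}|x_t) ].
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Estimate TE from y to x (lag 1) with histogram probabilities."""
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_xx = Counter(zip(x[1:], x[:-1]))          # (x_{t+1}, x_t)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))         # (x_t, y_t)
    singles = Counter(x[:-1])                       # x_t
    n = len(x) - 1
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / n
        p_x1_given_x0y0 = c / pairs_xy[(x0, y0)]
        p_x1_given_x0 = pairs_xx[(x1, x0)] / singles[x0]
        te += p_joint * log2(p_x1_given_x0y0 / p_x1_given_x0)
    return te
```

A series that copies another with one step of lag yields a transfer entropy near 1 bit in that direction, while independent series yield a value near zero (up to small-sample bias).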

AutoSCAN: automatic detection of DBSCAN parameters and efficient clustering of data in overlapping density regions
https://peerj.com/articles/cs-1921
Published: 2024-03-14
Authors: Adil Abdu Bushra, Dongyeon Kim, Yejin Kan, Gangman Yi
Density-based clustering is considered a robust approach among unsupervised clustering techniques due to its ability to identify outliers, form clusters of irregular shapes, and automatically determine the number of clusters. These properties helped its pioneering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), become applicable to datasets in which varying numbers of clusters of different shapes and sizes can be detected without much interference from the user. However, the original algorithm exhibits limitations, especially its sensitivity to the user-supplied parameters minPts and ɛ. Additionally, the algorithm assigns inconsistent cluster labels to data objects found in overlapping density regions of separate clusters, lowering its accuracy. To alleviate these problems and increase clustering accuracy, we propose two methods that use statistics of a given dataset’s k-nearest-neighbor density distribution to determine optimal ɛ values. Our approach removes this burden from users and automatically detects the clusters of a given dataset. Furthermore, a method to accurately identify the border objects of separate clusters is proposed and implemented to resolve the unpredictability of the original algorithm. Finally, our experiments show that our efficient re-implementation of the original algorithm, which clusters datasets automatically and improves the clustering quality of adjoining cluster members, provides increased clustering accuracy and faster running times compared to earlier approaches.
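A common way to derive ɛ from the k-nearest-neighbor distance distribution is to read it off the sharpest bend of the sorted k-distance curve. The largest-jump heuristic below is a simple illustration of that idea, not the statistical criterion the paper proposes:

```python
# Sketch: estimate DBSCAN's eps from the sorted k-distance curve.
# The "largest jump" knee detector is an illustrative heuristic only.
import math

def kth_nn_distances(points, k):
    """Sorted distance of each point to its k-th nearest neighbor."""
    dists = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dists.append(d[k - 1])
    return sorted(dists)

def estimate_eps(points, k=4):
    """Pick eps just below the sharpest bend of the k-distance curve."""
    d = kth_nn_distances(points, k)
    jumps = [d[i + 1] - d[i] for i in range(len(d) - 1)]
    knee = max(range(len(jumps)), key=jumps.__getitem__)
    return d[knee]
```

On data with tight clusters and distant noise, the k-distance curve stays flat for cluster members and jumps for outliers, so the returned value separates the two regimes.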

Heart failure survival prediction using novel transfer learning based probabilistic features
https://peerj.com/articles/cs-1894
Published: 2024-03-12
Authors: Azam Mehmood Qadri, Muhammad Shadab Alam Hashmi, Ali Raza, Syed Ali Jafar Zaidi, Atiq ur Rehman
Heart failure is a complex cardiovascular condition characterized by the heart’s inability to pump blood effectively, leading to a cascade of physiological changes. Predicting survival in heart failure patients is crucial for optimizing patient care and resource allocation. This research aims to develop a robust survival prediction model for heart failure patients using advanced machine learning techniques. We analyzed data from 299 hospitalized heart failure patients, addressing the issue of imbalanced data with the Synthetic Minority Over-sampling Technique (SMOTE). Additionally, we proposed a novel transfer learning-based feature engineering approach that generates a new probabilistic feature set from patient data using ensemble trees. Nine fine-tuned machine learning models are built and compared to evaluate performance in patient survival prediction. Our novel transfer learning mechanism applied to the random forest model outperformed other models and state-of-the-art studies, achieving a remarkable accuracy of 0.975. All models underwent evaluation using 10-fold cross-validation and tuning through hyperparameter optimization. The findings of this study have the potential to advance the field of cardiovascular medicine by providing more accurate and personalized prognostic assessments for individuals with heart failure.
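The probabilistic-feature idea can be sketched as a stacking step: tree ensembles are fit on the raw features, and their class-probability outputs become the new feature set for a downstream model. The synthetic dataset, model choices, and hyperparameters below are illustrative assumptions, not the paper's configuration:

```python
# Sketch of transfer-learning-style probabilistic features via tree ensembles.
# Dataset and models here are stand-ins, not the paper's clinical setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=12, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: fit tree ensembles on the original features.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

# Step 2: their predicted class probabilities form the new feature space.
Ptr = np.hstack([rf.predict_proba(Xtr), et.predict_proba(Xtr)])
Pte = np.hstack([rf.predict_proba(Xte), et.predict_proba(Xte)])

# Step 3: train the final classifier on the probabilistic features.
clf = LogisticRegression().fit(Ptr, ytr)
acc = clf.score(Pte, yte)
```

In practice, out-of-fold probabilities (rather than probabilities predicted on the same training rows) are used for the meta-model to avoid leakage.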

Predicting Chinese stock market using XGBoost multi-objective optimization with optimal weighting
https://peerj.com/articles/cs-1931
Published: 2024-03-08
Authors: Jichen Liu
The application of artificial intelligence (AI) technology in various fields has been a recent research hotspot. As a representative technology of AI, the application of machine learning models in economics and finance undoubtedly holds significant research value. This article proposes the Extreme Gradient Boosting Multi-Objective Optimization Model with Optimal Weights (OW-XGBoost) to comprehensively balance the returns and risks of investment portfolios. The model fuses labels using optimal weights to handle multi-objective tasks, effectively controlling the impact of various risk and return indicators on the model and thus improving its interpretability and generalization ability. In the experiments, we tested the model using China A-share data from October 2022 to April 2023 and conducted a series of robustness tests. The results indicate that: (1) OW-XGBoost outperforms the XGBoost Model with Yield as Label (YL-XGBoost) and the XGBoost Multi-Label Classification Model (MLC-XGBoost) in controlling risk or achieving returns. (2) OW-XGBoost performs better overall than the baseline models. (3) The robustness tests demonstrate that the model performs well under different market conditions, stock pools, and training set durations. The model performs best in moderately fluctuating stock markets, stock pools comprising high-market-value stocks, and training set durations measured in months. The methodology and results of this study provide a new perspective and approach for fundamental quantitative investment and create new possibilities for integrating AI, machine learning, and quantitative finance research.
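Fusing return and risk indicators into one training label can be illustrated with a simple min-max blend; the weighting scheme below is our own sketch of the label-fusion idea, not the paper's optimal-weight procedure:

```python
# Sketch of fusing a return objective (maximise) and a risk objective
# (minimise) into one regression label; the min-max blend is illustrative.
import numpy as np

def fused_label(returns, risks, w):
    """Blend normalised return and risk into one target, w = return weight."""
    r = (returns - returns.min()) / (np.ptp(returns) or 1.0)
    k = (risks - risks.min()) / (np.ptp(risks) or 1.0)
    return w * r - (1.0 - w) * k
```

The fused label would then be used as the regression target of an XGBoost model; sweeping `w` trades off the two objectives.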

Electroencephalography (EEG) based epilepsy diagnosis via multiple feature space fusion using shared hidden space-driven multi-view learning
https://peerj.com/articles/cs-1874
Published: 2024-03-07
Authors: Xiujian Hu, Yicheng Xie, Hui Zhao, Guanglei Sheng, Khin Wee Lai, Yuanpeng Zhang
Epilepsy is a chronic, non-communicable disease caused by paroxysmal, abnormally synchronized electrical activity of brain neurons, and is one of the most common neurological diseases worldwide. Electroencephalography (EEG) is currently a crucial tool for epilepsy diagnosis. With the development of artificial intelligence, multi-view learning-based EEG analysis has become an important method for automatic epilepsy recognition because EEG contains diverse types of features, such as time-frequency, frequency-domain, and time-domain features. However, current multi-view learning still faces challenges, such as cases where the difference between same-class samples from different views is greater than the difference between different-class samples from the same view. To address this, we propose a shared hidden space-driven multi-view learning algorithm. The algorithm uses kernel density estimation to construct a shared hidden space and combines it with the original space to obtain an expanded space for multi-view learning. By constructing the expanded space and using the information of both the shared hidden space and the original space for learning, the relevant information of samples within and across views can be fully utilized. Experimental results on an epilepsy dataset provided by the University of Bonn show that the proposed algorithm has promising performance, with an average classification accuracy of 0.9787, at least a 4% improvement over single-view methods.
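The two building blocks named above, kernel density estimation and the concatenation that forms the expanded space, can be sketched generically (a plain 1-D Gaussian KDE, not the authors' construction of the shared hidden space):

```python
# Generic sketch of the two ingredients: a Gaussian KDE and the
# original-plus-hidden concatenation forming an expanded feature space.
import numpy as np

def kde_density(samples, grid, bandwidth):
    """1-D Gaussian kernel density estimate evaluated on `grid`."""
    u = (grid[:, None] - samples[None, :]) / bandwidth
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).mean(axis=1) / bandwidth

def expanded_space(view, hidden):
    """Concatenate original view features with shared hidden features."""
    return np.hstack([view, hidden])
```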

AMCFCN: attentive multi-view contrastive fusion clustering net
https://peerj.com/articles/cs-1906
Published: 2024-03-05
Authors: Huarun Xiao, Zhiyong Hong, Liping Xiong, Zhiqiang Zeng
Advances in deep learning have propelled the evolution of multi-view clustering techniques, which strive to obtain a view-common representation from multi-view datasets. However, the contemporary multi-view clustering community confronts two prominent challenges. First, there is no guarantee that view-specific representations avoid introducing noise; second, the fusion process compromises view-specific representations, preventing useful information from being captured from multi-view data. This may negatively affect the accuracy of the clustering results. In this article, we introduce a novel technique named the “contrastive attentive strategy” to address these problems. Our approach effectively extracts robust view-specific representations from multi-view data with reduced noise while preserving view completeness. This yields consistent representations from multi-view data while preserving the features of view-specific representations. We integrate view-specific encoders, a hybrid attentive module, a fusion module, and deep clustering into a unified framework called AMCFCN. Experimental results on four multi-view datasets demonstrate that our method, AMCFCN, outperforms seven competitive multi-view clustering methods. Our source code is available at https://github.com/xiaohuarun/AMCFCN.
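A contrastive objective of the kind such a strategy builds on can be illustrated with the standard InfoNCE loss between paired view representations; the exact loss AMCFCN uses may differ, so this is a generic sketch:

```python
# Generic InfoNCE contrastive loss between two views; row i of z1 and z2
# form a positive pair, all other rows act as negatives.
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """Symmetric-style InfoNCE over L2-normalised representations."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                                  # scaled cosine sims
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                     # positives on diagonal
```

Aligned view pairs produce a lower loss than mismatched pairs, which is what drives the two views toward a consistent representation.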

Research on the evaluation method of English textbook readability based on the TextCNN model and its application in teaching design
https://peerj.com/articles/cs-1895
Published: 2024-02-29
Authors: Ying Qin, Azeem Irshad
English is a world language, and the ability to use English plays an important role in improving college students’ comprehensive quality and career development. However, many Chinese college students find English learning difficult: the learning materials are hard to understand, and they cannot effectively improve their English ability. This study uses a convolutional neural network to evaluate the readability of English reading materials, providing students with materials of suitable difficulty for their English reading ability so as to improve learning outcomes. To address the wide dispersion of students’ English reading levels, a deep learning-based text readability evaluation model for English reading textbooks is designed. First, a readability dataset is constructed from college English textbooks; second, a TextCNN text readability evaluation model is built; finally, model training is completed through parameter adjustment and optimization, reaching an evaluation accuracy of 90% on the self-built dataset. We then conducted experimental teaching with the TextCNN-based readability method, dividing students into two groups for a comparative experiment. The results showed that the reading level and reading interest of students in the experimental group improved significantly, indicating that the deep learning-based readability evaluation method is scientific and effective. In future work, we will expand the English readability dataset and invite more university classes and students to participate in comparative experiments to improve the generality of the model.
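The TextCNN backbone, convolution over token embeddings followed by ReLU and max-over-time pooling, can be sketched in a few lines; the filter widths and embedding dimension below are illustrative, not the paper's configuration:

```python
# Minimal sketch of the TextCNN feature extractor: 1-D convolution over
# the token axis, ReLU, then max-over-time pooling per filter.
import numpy as np

def textcnn_features(embeddings, filters):
    """embeddings: (seq_len, emb_dim); each filter: (width, emb_dim)."""
    feats = []
    for f in filters:
        w = f.shape[0]
        # valid convolution along the token axis
        conv = np.array([np.sum(embeddings[i:i + w] * f)
                         for i in range(embeddings.shape[0] - w + 1)])
        feats.append(np.maximum(conv, 0.0).max())   # ReLU + max-over-time
    return np.array(feats)
```

The pooled feature vector (one value per filter) would feed a fully connected softmax layer that predicts the readability class.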

Special issue on analysis and mining of social media data
https://peerj.com/articles/cs-1909
Published: 2024-02-29
Authors: Arkaitz Zubiaga, Paolo Rosso
This Editorial introduces the PeerJ Computer Science Special Issue on Analysis and Mining of Social Media Data. The special issue called for submissions with a primary focus on the use of social media data, for a variety of fields including natural language processing, computational social science, data mining, information retrieval and recommender systems. Of the 48 abstract submissions that were deemed within the scope of the special issue and were invited to submit a full article, 17 were ultimately accepted. These included a diverse set of articles covering, inter alia, sentiment analysis, detection and mitigation of online harms, analytical studies focused on societal issues and analysis of images surrounding news. The articles primarily use Twitter, Facebook and Reddit as data sources; English, Arabic, Italian, Russian, Indonesian and Javanese as languages; and over a third of the articles revolve around COVID-19 as the main topic of study. This article discusses the motivation for launching such a special issue and provides an overview of the articles published in the issue.

Integrating hybrid transfer learning with attention-enhanced deep learning models to improve breast cancer diagnosis
https://peerj.com/articles/cs-1850
Published: 2024-02-28
Authors: Sudha Prathyusha Jakkaladiki, Filip Maly
Cancer, with its high fatality rate, instills fear in countless individuals worldwide. However, effective diagnosis and treatment can often lead to a successful cure. Computer-assisted diagnostics, especially in the context of deep learning, have become prominent methods for primary screening of various diseases, including cancer. Deep learning, an artificial intelligence technique that enables computers to reason like humans, has recently gained significant attention. This study focuses on training a deep neural network to predict breast cancer. With advancements in medical imaging technologies such as X-ray, magnetic resonance imaging (MRI), and computed tomography (CT) scans, deep learning has become essential in analyzing and managing extensive image datasets. The objective of this research is to propose a deep-learning model for the identification and categorization of breast tumors. The system’s performance was evaluated using the breast cancer histopathology (BreakHis) classification dataset from the Kaggle repository and the Wisconsin Breast Cancer (WBC) dataset from the UCI repository. The study’s findings demonstrated an impressive accuracy rate of 100%, surpassing other state-of-the-art approaches. The suggested model was thoroughly evaluated using F1-score, recall, precision, and accuracy metrics. Training, validation, and testing were conducted using pre-processed datasets, leading to remarkable results of a 99.8% recall rate, 99.06% F1-score, and 100% accuracy rate on the BreakHis dataset. Similarly, on the WBC dataset, the model achieved a 99% accuracy rate, a 98.7% recall rate, and a 99.03% F1-score. These outcomes highlight the potential of deep learning models in accurately diagnosing breast cancer. Based on our research, the proposed system outperforms existing approaches in this field.
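The reported metrics all follow from a binary confusion matrix; for reference, a minimal implementation of accuracy, precision, recall, and F1 (the label convention below, 1 = positive class, is ours):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))   # true positives
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))   # true negatives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}
```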