PeerJ Computer Science: Data Science
https://peerj.com/articles/index.atom?journal=cs&subject=9600
Data Science articles published in PeerJ Computer Science

A secure fingerprint hiding technique based on DNA sequence and mathematical function
https://peerj.com/articles/cs-1847
Published: 2024-03-19
Authors: Wala’a Essa Al-Ahmadi, Asia Othman Aljahdali, Fursan Thabit, Asmaa Munshi
DNA steganography is a technique for securely transmitting important data using DNA sequences. It involves encrypting and hiding messages within DNA sequences to prevent unauthorized access and decoding of sensitive information. Biometric systems, such as fingerprinting and iris scanning, are used for individual recognition. Since biometric information cannot be changed if compromised, it is essential to ensure its security. This research aims to develop a secure technique that combines steganography and cryptography to protect fingerprint images during communication while maintaining confidentiality. The technique converts fingerprint images into binary data, encrypts them, and embeds them into a DNA sequence. It utilizes a Feistel network encryption process, along with a mathematical function and an insertion technique for hiding the data. The proposed method offers a low probability of being cracked, a high number of hiding positions, and efficient execution times. Four randomly chosen keys are used for hiding and decoding, providing a large key space and enhanced key sensitivity. The technique is evaluated using the NIST statistical test suite and compared against related work. It demonstrates resilience against various attacks, including known-plaintext and chosen-plaintext attacks. To enhance security, random ambiguous bits are introduced at random locations in the fingerprint image, increasing noise. However, this technique is limited to hiding small images within DNA sequences and cannot handle video, audio, or large images.
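The hiding pipeline the abstract describes (binary conversion, encryption, insertion into a DNA sequence) can be sketched in miniature. The 2-bit-per-base mapping and the fixed keyed insertion interval below are illustrative assumptions, not the paper's exact scheme, and the Feistel encryption step is omitted:

```python
# Illustrative DNA insertion steganography sketch (not the paper's algorithm).
# Assumption: 00->A, 01->C, 10->G, 11->T, and one payload base inserted after
# every `key` cover bases.

BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}

def bits_to_dna(bits):
    """Encode a binary string (even length) as DNA bases, two bits per base."""
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))

def hide(cover, payload, key):
    """Insert one payload base after every `key` cover bases."""
    assert len(payload) <= len(cover) // key, "payload must fit in the cover"
    out, p = [], 0
    for i, base in enumerate(cover, 1):
        out.append(base)
        if i % key == 0 and p < len(payload):
            out.append(payload[p])
            p += 1
    return "".join(out)

def extract(stego, n_payload, key):
    """Recover the payload by skipping `key` cover bases before each read."""
    payload, pos = [], 0
    for _ in range(n_payload):
        pos += key            # skip a run of cover bases
        payload.append(stego[pos])
        pos += 1
    return "".join(payload)
```

In the real scheme the payload would be the encrypted fingerprint bits and the insertion positions would be derived from the secret keys rather than a fixed interval.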

The reconstruction of equivalent underlying model based on direct causality for multivariate time series
https://peerj.com/articles/cs-1922
Published: 2024-03-18
Authors: Liyang Xu, Dezheng Wang
This article presents a novel approach for reconstructing an equivalent underlying model and deriving a precise equivalent expression through the use of direct causality topology. Central to this methodology is the transfer entropy method, which is instrumental in revealing the causality topology. The polynomial fitting method is then applied to determine the coefficients and intrinsic order of the causality structure, leveraging the foundational elements extracted from the direct causality topology. Notably, this approach efficiently discovers the core topology from the data, reducing redundancy without requiring prior domain-specific knowledge. Furthermore, it yields a precise equivalent model expression, offering a robust foundation for further analysis and exploration in various fields. Additionally, the proposed model for reconstructing an equivalent underlying framework demonstrates strong forecasting capabilities in multivariate time series scenarios.
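The transfer entropy that drives the causality topology can be estimated for discrete series with a plug-in histogram estimator. This is a generic sketch of the quantity (lag 1, discrete values), not the authors' implementation:

```python
# Plug-in estimator of transfer entropy TE_{Y->X} for discrete series,
# TE = sum p(x_{t+1}, x_t, y_t) * log2[ p(x_{t+1}|x_t, y_t) / p(x_{t+1}|x_t) ].
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Estimate TE from y to x (lag 1) with histogram probabilities."""
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_xx = Counter(zip(x[1:], x[:-1]))          # (x_{t+1}, x_t)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))         # (x_t, y_t)
    singles = Counter(x[:-1])                       # x_t
    n = len(x) - 1
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / n
        p_x1_given_x0y0 = c / pairs_xy[(x0, y0)]
        p_x1_given_x0 = pairs_xx[(x1, x0)] / singles[x0]
        te += p_joint * log2(p_x1_given_x0y0 / p_x1_given_x0)
    return te
```

A series that copies another with one step of lag yields a transfer entropy near 1 bit in that direction, while independent series yield a value near zero (up to small-sample bias).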

AutoSCAN: automatic detection of DBSCAN parameters and efficient clustering of data in overlapping density regions
https://peerj.com/articles/cs-1921
Published: 2024-03-14
Authors: Adil Abdu Bushra, Dongyeon Kim, Yejin Kan, Gangman Yi
Density-based clustering is considered a robust approach among unsupervised clustering techniques due to its ability to identify outliers, form clusters of irregular shapes, and automatically determine the number of clusters. These properties helped its pioneering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), become applicable to datasets in which varying numbers of clusters of different shapes and sizes can be detected without much interference from the user. However, the original algorithm exhibits limitations, especially its sensitivity to the user-supplied parameters minPts and ɛ. Additionally, the algorithm assigns inconsistent cluster labels to data objects found in overlapping density regions of separate clusters, lowering its accuracy. To alleviate these problems and increase clustering accuracy, we propose two methods that use statistics of a given dataset’s k-nearest-neighbor density distribution to determine optimal ɛ values. Our approach removes this burden from users and automatically detects the clusters of a given dataset. Furthermore, a method to accurately identify the border objects of separate clusters is proposed and implemented to resolve the unpredictability of the original algorithm. Finally, our experiments show that our efficient re-implementation of the original algorithm, which clusters datasets automatically and improves the clustering quality of adjoining cluster members, provides increased clustering accuracy and faster running times compared to earlier approaches.
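A common way to derive ɛ from the k-nearest-neighbor distance distribution is to read it off the sharpest bend of the sorted k-distance curve. The largest-jump heuristic below is a simple illustration of that idea, not the statistical criterion the paper proposes:

```python
# Sketch: estimate DBSCAN's eps from the sorted k-distance curve.
# The "largest jump" knee detector is an illustrative heuristic only.
import math

def kth_nn_distances(points, k):
    """Sorted distance of each point to its k-th nearest neighbor."""
    dists = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dists.append(d[k - 1])
    return sorted(dists)

def estimate_eps(points, k=4):
    """Pick eps just below the sharpest bend of the k-distance curve."""
    d = kth_nn_distances(points, k)
    jumps = [d[i + 1] - d[i] for i in range(len(d) - 1)]
    knee = max(range(len(jumps)), key=jumps.__getitem__)
    return d[knee]
```

On data with tight clusters and distant noise, the k-distance curve stays flat for cluster members and jumps for outliers, so the returned value separates the two regimes.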

Heart failure survival prediction using novel transfer learning based probabilistic features
https://peerj.com/articles/cs-1894
Published: 2024-03-12
Authors: Azam Mehmood Qadri, Muhammad Shadab Alam Hashmi, Ali Raza, Syed Ali Jafar Zaidi, Atiq ur Rehman
Heart failure is a complex cardiovascular condition characterized by the heart’s inability to pump blood effectively, leading to a cascade of physiological changes. Predicting survival in heart failure patients is crucial for optimizing patient care and resource allocation. This research aims to develop a robust survival prediction model for heart failure patients using advanced machine learning techniques. We analyzed data from 299 hospitalized heart failure patients, addressing the issue of imbalanced data with the Synthetic Minority Over-sampling Technique (SMOTE). Additionally, we proposed a novel transfer learning-based feature engineering approach that generates a new probabilistic feature set from patient data using ensemble trees. Nine fine-tuned machine learning models are built and compared to evaluate performance in patient survival prediction. Our novel transfer learning mechanism applied to the random forest model outperformed other models and state-of-the-art studies, achieving a remarkable accuracy of 0.975. All models underwent evaluation using 10-fold cross-validation and tuning through hyperparameter optimization. The findings of this study have the potential to advance the field of cardiovascular medicine by providing more accurate and personalized prognostic assessments for individuals with heart failure.
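The probabilistic-feature idea can be sketched as a stacking step: tree ensembles are fit on the raw features, and their class-probability outputs become the new feature set for a downstream model. The synthetic dataset, model choices, and hyperparameters below are illustrative assumptions, not the paper's configuration:

```python
# Sketch of transfer-learning-style probabilistic features via tree ensembles.
# Dataset and models here are stand-ins, not the paper's clinical setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=12, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: fit tree ensembles on the original features.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

# Step 2: their predicted class probabilities form the new feature space.
Ptr = np.hstack([rf.predict_proba(Xtr), et.predict_proba(Xtr)])
Pte = np.hstack([rf.predict_proba(Xte), et.predict_proba(Xte)])

# Step 3: train the final classifier on the probabilistic features.
clf = LogisticRegression().fit(Ptr, ytr)
acc = clf.score(Pte, yte)
```

In practice, out-of-fold probabilities (rather than probabilities predicted on the same training rows) are used for the meta-model to avoid leakage.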

Predicting Chinese stock market using XGBoost multi-objective optimization with optimal weighting
https://peerj.com/articles/cs-1931
Published: 2024-03-08
Authors: Jichen Liu
The application of artificial intelligence (AI) technology in various fields has been a recent research hotspot. As a representative technology of AI, the application of machine learning models in economics and finance undoubtedly holds significant research value. This article proposes the Extreme Gradient Boosting Multi-Objective Optimization Model with Optimal Weights (OW-XGBoost) to comprehensively balance the returns and risks of investment portfolios. The model fuses labels using optimal weights to handle multi-objective tasks, effectively controlling the impact of various risk and return indicators on the model and thus improving its interpretability and generalization ability. In the experiments, we tested the model using China A-share data from October 2022 to April 2023 and conducted a series of robustness tests. The results indicate that: (1) OW-XGBoost outperforms the XGBoost Model with Yield as Label (YL-XGBoost) and the XGBoost Multi-Label Classification Model (MLC-XGBoost) in controlling risk or achieving returns. (2) OW-XGBoost performs better overall than the baseline models. (3) The robustness tests demonstrate that the model performs well under different market conditions, stock pools, and training set durations. The model performs best in moderately fluctuating stock markets, stock pools comprising high-market-value stocks, and training set durations measured in months. The methodology and results of this study provide a new perspective and approach for fundamental quantitative investment and create new possibilities for integrating AI, machine learning, and quantitative finance research.
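Fusing return and risk indicators into one training label can be illustrated with a simple min-max blend; the weighting scheme below is our own sketch of the label-fusion idea, not the paper's optimal-weight procedure:

```python
# Sketch of fusing a return objective (maximise) and a risk objective
# (minimise) into one regression label; the min-max blend is illustrative.
import numpy as np

def fused_label(returns, risks, w):
    """Blend normalised return and risk into one target, w = return weight."""
    r = (returns - returns.min()) / (np.ptp(returns) or 1.0)
    k = (risks - risks.min()) / (np.ptp(risks) or 1.0)
    return w * r - (1.0 - w) * k
```

The fused label would then be used as the regression target of an XGBoost model; sweeping `w` trades off the two objectives.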

Electroencephalography (EEG) based epilepsy diagnosis via multiple feature space fusion using shared hidden space-driven multi-view learning
https://peerj.com/articles/cs-1874
Published: 2024-03-07
Authors: Xiujian Hu, Yicheng Xie, Hui Zhao, Guanglei Sheng, Khin Wee Lai, Yuanpeng Zhang
Epilepsy is a chronic, non-communicable disease caused by paroxysmal, abnormally synchronized electrical activity of brain neurons, and is one of the most common neurological diseases worldwide. Electroencephalography (EEG) is currently a crucial tool for epilepsy diagnosis. With the development of artificial intelligence, multi-view learning-based EEG analysis has become an important method for automatic epilepsy recognition because EEG contains diverse types of features, such as time-frequency, frequency-domain, and time-domain features. However, current multi-view learning still faces challenges, such as cases where the difference between same-class samples from different views is greater than the difference between different-class samples from the same view. To address this, we propose a shared hidden space-driven multi-view learning algorithm. The algorithm uses kernel density estimation to construct a shared hidden space and combines it with the original space to obtain an expanded space for multi-view learning. By constructing the expanded space and using the information of both the shared hidden space and the original space for learning, the relevant information of samples within and across views can be fully utilized. Experimental results on an epilepsy dataset provided by the University of Bonn show that the proposed algorithm has promising performance, with an average classification accuracy of 0.9787, at least a 4% improvement over single-view methods.
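The two building blocks named above, kernel density estimation and the concatenation that forms the expanded space, can be sketched generically (a plain 1-D Gaussian KDE, not the authors' construction of the shared hidden space):

```python
# Generic sketch of the two ingredients: a Gaussian KDE and the
# original-plus-hidden concatenation forming an expanded feature space.
import numpy as np

def kde_density(samples, grid, bandwidth):
    """1-D Gaussian kernel density estimate evaluated on `grid`."""
    u = (grid[:, None] - samples[None, :]) / bandwidth
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).mean(axis=1) / bandwidth

def expanded_space(view, hidden):
    """Concatenate original view features with shared hidden features."""
    return np.hstack([view, hidden])
```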

AMCFCN: attentive multi-view contrastive fusion clustering net
https://peerj.com/articles/cs-1906
Published: 2024-03-05
Authors: Huarun Xiao, Zhiyong Hong, Liping Xiong, Zhiqiang Zeng
Advances in deep learning have propelled the evolution of multi-view clustering techniques, which strive to obtain a view-common representation from multi-view datasets. However, the contemporary multi-view clustering community confronts two prominent challenges. First, there is no guarantee that view-specific representations avoid introducing noise; second, the fusion process compromises view-specific representations, preventing useful information from being captured from multi-view data. This may negatively affect the accuracy of the clustering results. In this article, we introduce a novel technique named the “contrastive attentive strategy” to address these problems. Our approach effectively extracts robust view-specific representations from multi-view data with reduced noise while preserving view completeness. This yields consistent representations from multi-view data while preserving the features of view-specific representations. We integrate view-specific encoders, a hybrid attentive module, a fusion module, and deep clustering into a unified framework called AMCFCN. Experimental results on four multi-view datasets demonstrate that our method, AMCFCN, outperforms seven competitive multi-view clustering methods. Our source code is available at https://github.com/xiaohuarun/AMCFCN.
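A contrastive objective of the kind such a strategy builds on can be illustrated with the standard InfoNCE loss between paired view representations; the exact loss AMCFCN uses may differ, so this is a generic sketch:

```python
# Generic InfoNCE contrastive loss between two views; row i of z1 and z2
# form a positive pair, all other rows act as negatives.
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """Symmetric-style InfoNCE over L2-normalised representations."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                                  # scaled cosine sims
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                     # positives on diagonal
```

Aligned view pairs produce a lower loss than mismatched pairs, which is what drives the two views toward a consistent representation.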

Research on the evaluation method of English textbook readability based on the TextCNN model and its application in teaching design
https://peerj.com/articles/cs-1895
Published: 2024-02-29
Authors: Ying Qin, Azeem Irshad
English is a world language, and the ability to use English plays an important role in improving college students’ comprehensive quality and career development. However, many Chinese college students find English learning difficult: the learning materials are hard to understand, and they cannot effectively improve their English ability. This study uses a convolutional neural network to evaluate the readability of English reading materials, providing students with materials of suitable difficulty for their English reading ability so as to improve learning outcomes. To address the wide dispersion of students’ English reading levels, a deep learning-based text readability evaluation model for English reading textbooks is designed. First, a readability dataset is constructed from college English textbooks; second, a TextCNN text readability evaluation model is built; finally, model training is completed through parameter adjustment and optimization, reaching an evaluation accuracy of 90% on the self-built dataset. We then conducted experimental teaching with the TextCNN-based readability method, dividing students into two groups for a comparative experiment. The results showed that the reading level and reading interest of students in the experimental group improved significantly, indicating that the deep learning-based readability evaluation method is scientific and effective. In future work, we will expand the English readability dataset and invite more university classes and students to participate in comparative experiments to improve the generality of the model.
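The TextCNN backbone, convolution over token embeddings followed by ReLU and max-over-time pooling, can be sketched in a few lines; the filter widths and embedding dimension below are illustrative, not the paper's configuration:

```python
# Minimal sketch of the TextCNN feature extractor: 1-D convolution over
# the token axis, ReLU, then max-over-time pooling per filter.
import numpy as np

def textcnn_features(embeddings, filters):
    """embeddings: (seq_len, emb_dim); each filter: (width, emb_dim)."""
    feats = []
    for f in filters:
        w = f.shape[0]
        # valid convolution along the token axis
        conv = np.array([np.sum(embeddings[i:i + w] * f)
                         for i in range(embeddings.shape[0] - w + 1)])
        feats.append(np.maximum(conv, 0.0).max())   # ReLU + max-over-time
    return np.array(feats)
```

The pooled feature vector (one value per filter) would feed a fully connected softmax layer that predicts the readability class.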

Special issue on analysis and mining of social media data
https://peerj.com/articles/cs-1909
Published: 2024-02-29
Authors: Arkaitz Zubiaga, Paolo Rosso
This Editorial introduces the PeerJ Computer Science Special Issue on Analysis and Mining of Social Media Data. The special issue called for submissions with a primary focus on the use of social media data, for a variety of fields including natural language processing, computational social science, data mining, information retrieval and recommender systems. Of the 48 abstract submissions that were deemed within the scope of the special issue and were invited to submit a full article, 17 were ultimately accepted. These included a diverse set of articles covering, inter alia, sentiment analysis, detection and mitigation of online harms, analytical studies focused on societal issues and analysis of images surrounding news. The articles primarily use Twitter, Facebook and Reddit as data sources; English, Arabic, Italian, Russian, Indonesian and Javanese as languages; and over a third of the articles revolve around COVID-19 as the main topic of study. This article discusses the motivation for launching such a special issue and provides an overview of the articles published in the issue.

Integrating hybrid transfer learning with attention-enhanced deep learning models to improve breast cancer diagnosis
https://peerj.com/articles/cs-1850
Published: 2024-02-28
Authors: Sudha Prathyusha Jakkaladiki, Filip Maly
Cancer, with its high fatality rate, instills fear in countless individuals worldwide. However, effective diagnosis and treatment can often lead to a successful cure. Computer-assisted diagnostics, especially in the context of deep learning, have become prominent methods for primary screening of various diseases, including cancer. Deep learning, an artificial intelligence technique that enables computers to reason like humans, has recently gained significant attention. This study focuses on training a deep neural network to predict breast cancer. With advancements in medical imaging technologies such as X-ray, magnetic resonance imaging (MRI), and computed tomography (CT) scans, deep learning has become essential in analyzing and managing extensive image datasets. The objective of this research is to propose a deep-learning model for the identification and categorization of breast tumors. The system’s performance was evaluated using the breast cancer histopathology (BreakHis) classification dataset from the Kaggle repository and the Wisconsin Breast Cancer (WBC) dataset from the UCI repository. The study’s findings demonstrated an impressive accuracy rate of 100%, surpassing other state-of-the-art approaches. The suggested model was thoroughly evaluated using F1-score, recall, precision, and accuracy metrics. Training, validation, and testing were conducted using pre-processed datasets, leading to remarkable results of a 99.8% recall rate, 99.06% F1-score, and 100% accuracy rate on the BreakHis dataset. Similarly, on the WBC dataset, the model achieved a 99% accuracy rate, a 98.7% recall rate, and a 99.03% F1-score. These outcomes highlight the potential of deep learning models in accurately diagnosing breast cancer. Based on our research, the proposed system outperforms existing approaches in this field.
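The reported metrics all follow from a binary confusion matrix; for reference, a minimal implementation of accuracy, precision, recall, and F1 (the label convention below, 1 = positive class, is ours):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))   # true positives
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))   # true negatives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}
```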