PeerJ:Data Mining and Machine Learninghttps://peerj.com/articles/index.atom?journal=peerj&subject=9500Data Mining and Machine Learning articles published in PeerJConstruction of a predictive model for bone metastasis from first primary lung adenocarcinoma within 3 cm based on machine learning algorithm: a retrospective studyhttps://peerj.com/articles/170982024-03-142024-03-14Yu ZhangLixia XiaoLan LYuLiwei Zhang
Background
Adenocarcinoma, the most prevalent histological subtype of non-small cell lung cancer, is associated with a significantly higher likelihood of bone metastasis compared to other subtypes. The presence of bone metastasis has a profound adverse impact on patient prognosis. However, to date, there is a lack of accurate bone metastasis prediction models. As a result, this study aims to employ machine learning algorithms for predicting the risk of bone metastasis in patients.
Method
We collected a dataset comprising 19,454 cases of solitary, primary lung adenocarcinoma with pulmonary nodules measuring less than 3 cm. These cases were diagnosed between 2010 and 2015 and were sourced from the Surveillance, Epidemiology, and End Results (SEER) database. Utilizing clinical feature indicators, we developed predictive models using seven machine learning algorithms, namely extreme gradient boosting (XGBoost), logistic regression (LR), light gradient boosting machine (LightGBM), Adaptive Boosting (AdaBoost), Gaussian Naive Bayes (GNB), multilayer perceptron (MLP) and support vector machine (SVM).
Results
The results demonstrated that XGBoost exhibited superior performance among the four algorithms (training set: AUC: 0.913; test set: AUC: 0.853). Furthermore, for convenient application, we created an online scoring system accessible at the following URL: https://www.xsmartanalysis.com/model/predict/?mid=731symbol=7Fr16wX56AR9Mk233917, which is based on the highest performing model.
Conclusion
XGBoost proves to be an effective algorithm for predicting the occurrence of bone metastasis in patients with solitary, primary lung adenocarcinoma featuring pulmonary nodules below 3 cm in size. Moreover, its robust clinical applicability enhances its potential utility.
Background
Adenocarcinoma, the most prevalent histological subtype of non-small cell lung cancer, is associated with a significantly higher likelihood of bone metastasis compared to other subtypes. The presence of bone metastasis has a profound adverse impact on patient prognosis. However, to date, there is a lack of accurate bone metastasis prediction models. As a result, this study aims to employ machine learning algorithms for predicting the risk of bone metastasis in patients.
Method
We collected a dataset comprising 19,454 cases of solitary, primary lung adenocarcinoma with pulmonary nodules measuring less than 3 cm. These cases were diagnosed between 2010 and 2015 and were sourced from the Surveillance, Epidemiology, and End Results (SEER) database. Utilizing clinical feature indicators, we developed predictive models using seven machine learning algorithms, namely extreme gradient boosting (XGBoost), logistic regression (LR), light gradient boosting machine (LightGBM), Adaptive Boosting (AdaBoost), Gaussian Naive Bayes (GNB), multilayer perceptron (MLP) and support vector machine (SVM).
Results
The results demonstrated that XGBoost exhibited superior performance among the four algorithms (training set: AUC: 0.913; test set: AUC: 0.853). Furthermore, for convenient application, we created an online scoring system accessible at the following URL: https://www.xsmartanalysis.com/model/predict/?mid=731symbol=7Fr16wX56AR9Mk233917, which is based on the highest performing model.
Conclusion
XGBoost proves to be an effective algorithm for predicting the occurrence of bone metastasis in patients with solitary, primary lung adenocarcinoma featuring pulmonary nodules below 3 cm in size. Moreover, its robust clinical applicability enhances its potential utility.Land potential assessment and trend-analysis using 2000–2021 FAPAR monthly time-series at 250 m spatial resolutionhttps://peerj.com/articles/169722024-03-132024-03-13Julia HackländerLeandro ParenteYu-Feng HoTomislav HenglRolf SimoesDavide ConsoliMurat ŞahinXuemeng TianMartin JungMartin HeroldGregory DuveillerMelanie WeynantsIchsani Wheeler
The article presents results of using remote sensing images and machine learning to map and assess land potential based on time-series of potential Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) composites. Land potential here refers to the potential vegetation productivity in the hypothetical absence of short–term anthropogenic influence, such as intensive agriculture and urbanization. Knowledge on this ecological land potential could support the assessment of levels of land degradation as well as restoration potentials. Monthly aggregated FAPAR time-series of three percentiles (0.05, 0.50 and 0.95 probability) at 250 m spatial resolution were derived from the 8-day GLASS FAPAR V6 product for 2000–2021 and used to determine long-term trends in FAPAR, as well as to model potential FAPAR in the absence of human pressure. CCa 3 million training points sampled from 12,500 locations across the globe were overlaid with 68 bio-physical variables representing climate, terrain, landform, and vegetation cover, as well as several variables representing human pressure including: population count, cropland intensity, nightlights and a human footprint index. The training points were used in an ensemble machine learning model that stacks three base learners (extremely randomized trees, gradient descended trees and artificial neural network) using a linear regressor as meta-learner. The potential FAPAR was then projected by removing the impact of urbanization and intensive agriculture in the covariate layers. The results of strict cross-validation show that the global distribution of FAPAR can be explained with an R2 of 0.89, with the most important covariates being growing season length, forest cover indicator and annual precipitation. From this model, a global map of potential monthly FAPAR for the recent year (2021) was produced, and used to predict gaps in actual vs. potential FAPAR. The produced global maps of actual vs. potential FAPAR and long-term trends were each spatially matched with stable and transitional land cover classes. The assessment showed large negative FAPAR gaps (actual lower than potential) for classes: urban, needle-leave deciduous trees, and flooded shrub or herbaceous cover, while strong negative FAPAR trends were found for classes: urban, sparse vegetation and rainfed cropland. On the other hand, classes: irrigated or post-flooded cropland, tree cover mixed leaf type, and broad-leave deciduous showed largely positive trends. The framework allows land managers to assess potential land degradation from two aspects: as an actual declining trend in observed FAPAR and as a difference between actual and potential vegetation FAPAR.
The article presents results of using remote sensing images and machine learning to map and assess land potential based on time-series of potential Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) composites. Land potential here refers to the potential vegetation productivity in the hypothetical absence of short–term anthropogenic influence, such as intensive agriculture and urbanization. Knowledge on this ecological land potential could support the assessment of levels of land degradation as well as restoration potentials. Monthly aggregated FAPAR time-series of three percentiles (0.05, 0.50 and 0.95 probability) at 250 m spatial resolution were derived from the 8-day GLASS FAPAR V6 product for 2000–2021 and used to determine long-term trends in FAPAR, as well as to model potential FAPAR in the absence of human pressure. CCa 3 million training points sampled from 12,500 locations across the globe were overlaid with 68 bio-physical variables representing climate, terrain, landform, and vegetation cover, as well as several variables representing human pressure including: population count, cropland intensity, nightlights and a human footprint index. The training points were used in an ensemble machine learning model that stacks three base learners (extremely randomized trees, gradient descended trees and artificial neural network) using a linear regressor as meta-learner. The potential FAPAR was then projected by removing the impact of urbanization and intensive agriculture in the covariate layers. The results of strict cross-validation show that the global distribution of FAPAR can be explained with an R2 of 0.89, with the most important covariates being growing season length, forest cover indicator and annual precipitation. From this model, a global map of potential monthly FAPAR for the recent year (2021) was produced, and used to predict gaps in actual vs. potential FAPAR. The produced global maps of actual vs. potential FAPAR and long-term trends were each spatially matched with stable and transitional land cover classes. The assessment showed large negative FAPAR gaps (actual lower than potential) for classes: urban, needle-leave deciduous trees, and flooded shrub or herbaceous cover, while strong negative FAPAR trends were found for classes: urban, sparse vegetation and rainfed cropland. On the other hand, classes: irrigated or post-flooded cropland, tree cover mixed leaf type, and broad-leave deciduous showed largely positive trends. The framework allows land managers to assess potential land degradation from two aspects: as an actual declining trend in observed FAPAR and as a difference between actual and potential vegetation FAPAR.EPI-SF: essential protein identification in protein interaction networks using sequence featureshttps://peerj.com/articles/170102024-03-132024-03-13Sovan SahaPiyali ChatterjeeSubhadip BasuMita Nasipuri
Proteins are considered indispensable for facilitating an organism’s viability, reproductive capabilities, and other fundamental physiological functions. Conventional biological assays are characterized by prolonged duration, extensive labor requirements, and financial expenses in order to identify essential proteins. Therefore, it is widely accepted that employing computational methods is the most expeditious and effective approach to successfully discerning essential proteins. Despite being a popular choice in machine learning (ML) applications, the deep learning (DL) method is not suggested for this specific research work based on sequence features due to the restricted availability of high-quality training sets of positive and negative samples. However, some DL works on limited availability of data are also executed at recent times which will be our future scope of work. Conventional ML techniques are thus utilized in this work due to their superior performance compared to DL methodologies. In consideration of the aforementioned, a technique called EPI-SF is proposed here, which employs ML to identify essential proteins within the protein-protein interaction network (PPIN). The protein sequence is the primary determinant of protein structure and function. So, initially, relevant protein sequence features are extracted from the proteins within the PPIN. These features are subsequently utilized as input for various machine learning models, including XGB Boost Classifier, AdaBoost Classifier, logistic regression (LR), support vector classification (SVM), Decision Tree model (DT), Random Forest model (RF), and Naïve Bayes model (NB). The objective is to detect the essential proteins within the PPIN. The primary investigation conducted on yeast examined the performance of various ML models for yeast PPIN. Among these models, the RF model technique had the highest level of effectiveness, as indicated by its precision, recall, F1-score, and AUC values of 0.703, 0.720, 0.711, and 0.745, respectively. It is also found to be better in performance when compared to the other state-of-arts based on traditional centrality like betweenness centrality (BC), closeness centrality (CC), etc. and deep learning methods as well like DeepEP, as emphasized in the result section. As a result of its favorable performance, EPI-SF is later employed for the prediction of novel essential proteins inside the human PPIN. Due to the tendency of viruses to selectively target essential proteins involved in the transmission of diseases within human PPIN, investigations are conducted to assess the probable involvement of these proteins in COVID-19 and other related severe diseases.
Proteins are considered indispensable for facilitating an organism’s viability, reproductive capabilities, and other fundamental physiological functions. Conventional biological assays are characterized by prolonged duration, extensive labor requirements, and financial expenses in order to identify essential proteins. Therefore, it is widely accepted that employing computational methods is the most expeditious and effective approach to successfully discerning essential proteins. Despite being a popular choice in machine learning (ML) applications, the deep learning (DL) method is not suggested for this specific research work based on sequence features due to the restricted availability of high-quality training sets of positive and negative samples. However, some DL works on limited availability of data are also executed at recent times which will be our future scope of work. Conventional ML techniques are thus utilized in this work due to their superior performance compared to DL methodologies. In consideration of the aforementioned, a technique called EPI-SF is proposed here, which employs ML to identify essential proteins within the protein-protein interaction network (PPIN). The protein sequence is the primary determinant of protein structure and function. So, initially, relevant protein sequence features are extracted from the proteins within the PPIN. These features are subsequently utilized as input for various machine learning models, including XGB Boost Classifier, AdaBoost Classifier, logistic regression (LR), support vector classification (SVM), Decision Tree model (DT), Random Forest model (RF), and Naïve Bayes model (NB). The objective is to detect the essential proteins within the PPIN. The primary investigation conducted on yeast examined the performance of various ML models for yeast PPIN. Among these models, the RF model technique had the highest level of effectiveness, as indicated by its precision, recall, F1-score, and AUC values of 0.703, 0.720, 0.711, and 0.745, respectively. It is also found to be better in performance when compared to the other state-of-arts based on traditional centrality like betweenness centrality (BC), closeness centrality (CC), etc. and deep learning methods as well like DeepEP, as emphasized in the result section. As a result of its favorable performance, EPI-SF is later employed for the prediction of novel essential proteins inside the human PPIN. Due to the tendency of viruses to selectively target essential proteins involved in the transmission of diseases within human PPIN, investigations are conducted to assess the probable involvement of these proteins in COVID-19 and other related severe diseases.High-resolution density assessment assisted by deep learning of Dendrophyllia cornigera (Lamarck, 1816) and Phakellia ventilabrum (Linnaeus, 1767) in rocky circalittoral shelf of Bay of Biscayhttps://peerj.com/articles/170802024-03-072024-03-07Alberto Gayá-VilarAdolfo CoboAlberto Abad-UribarrenAugusto RodríguezSergio SierraSabrina ClementeElena Prado
This study presents a novel approach to high-resolution density distribution mapping of two key species of the 1170 “Reefs” habitat, Dendrophyllia cornigera and Phakellia ventilabrum, in the Bay of Biscay using deep learning models. The main objective of this study was to establish a pipeline based on deep learning models to extract species density data from raw images obtained by a remotely operated towed vehicle (ROTV). Different object detection models were evaluated and compared in various shelf zones at the head of submarine canyon systems using metrics such as precision, recall, and F1 score. The best-performing model, YOLOv8, was selected for generating density maps of the two species at a high spatial resolution. The study also generated synthetic images to augment the training data and assess the generalization capacity of the models. The proposed approach provides a cost-effective and non-invasive method for monitoring and assessing the status of these important reef-building species and their habitats. The results have important implications for the management and protection of the 1170 habitat in Spain and other marine ecosystems worldwide. These results highlight the potential of deep learning to improve efficiency and accuracy in monitoring vulnerable marine ecosystems, allowing informed decisions to be made that can have a positive impact on marine conservation.
This study presents a novel approach to high-resolution density distribution mapping of two key species of the 1170 “Reefs” habitat, Dendrophyllia cornigera and Phakellia ventilabrum, in the Bay of Biscay using deep learning models. The main objective of this study was to establish a pipeline based on deep learning models to extract species density data from raw images obtained by a remotely operated towed vehicle (ROTV). Different object detection models were evaluated and compared in various shelf zones at the head of submarine canyon systems using metrics such as precision, recall, and F1 score. The best-performing model, YOLOv8, was selected for generating density maps of the two species at a high spatial resolution. The study also generated synthetic images to augment the training data and assess the generalization capacity of the models. The proposed approach provides a cost-effective and non-invasive method for monitoring and assessing the status of these important reef-building species and their habitats. The results have important implications for the management and protection of the 1170 habitat in Spain and other marine ecosystems worldwide. These results highlight the potential of deep learning to improve efficiency and accuracy in monitoring vulnerable marine ecosystems, allowing informed decisions to be made that can have a positive impact on marine conservation.Unsupervised AI reveals insect species-specific genome signatureshttps://peerj.com/articles/170252024-03-062024-03-06Yui SawadaRyuhei MineiHiromasa TabataToshimichi IkemuraKennosuke WadaYoshiko WadaHiroshi NagataYuki Iwasaki
Insects are a highly diverse phylogeny and possess a wide variety of traits, including the presence or absence of wings and metamorphosis. These diverse traits are of great interest for studying genome evolution, and numerous comparative genomic studies have examined a wide phylogenetic range of insects. Here, we analyzed 22 insects belonging to a wide phylogenetic range (Endopterygota, Paraneoptera, Polyneoptera, Palaeoptera, and other insects) by using a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions in their genomic fragments (100-kb or 1-Mb sequences), which is an unsupervised machine learning algorithm that can extract species-specific characteristics of the oligonucleotide compositions (genome signatures). The genome signature is of particular interest in terms of the mechanisms and biological significance that have caused the species-specific difference, and can be used as a powerful search needle to explore the various roles of genome sequences other than protein coding, and can be used to unveil mysteries hidden in the genome sequence. Since BLSOM is an unsupervised clustering method, the clustering of sequences was performed based on the oligonucleotide composition alone, without providing information about the species from which each fragment sequence was derived. Therefore, not only the interspecies separation, but also the intraspecies separation can be achieved. Here, we have revealed the specific genomic regions with oligonucleotide compositions distinct from the usual sequences of each insect genome, e.g., Mb-level structures found for a grasshopper Schistocerca americana. One aim of this study was to compare the genome characteristics of insects with those of vertebrates, especially humans, which are phylogenetically distant from insects. Recently, humans seem to be the “model organism” for which a large amount of information has been accumulated using a variety of cutting-edge and high-throughput technologies. Therefore, it is reasonable to use the abundant information from humans to study insect lineages. The specific regions of Mb length with distinct oligonucleotide compositions have also been previously observed in the human genome. These regions were enriched by transcription factor binding motifs (TFBSs) and hypothesized to be involved in the three-dimensional arrangement of chromosomal DNA in interphase nuclei. The present study characterized the species-specific oligonucleotide compositions (i.e., genome signatures) in insect genomes and identified specific genomic regions with distinct oligonucleotide compositions.
Insects are a highly diverse phylogeny and possess a wide variety of traits, including the presence or absence of wings and metamorphosis. These diverse traits are of great interest for studying genome evolution, and numerous comparative genomic studies have examined a wide phylogenetic range of insects. Here, we analyzed 22 insects belonging to a wide phylogenetic range (Endopterygota, Paraneoptera, Polyneoptera, Palaeoptera, and other insects) by using a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions in their genomic fragments (100-kb or 1-Mb sequences), which is an unsupervised machine learning algorithm that can extract species-specific characteristics of the oligonucleotide compositions (genome signatures). The genome signature is of particular interest in terms of the mechanisms and biological significance that have caused the species-specific difference, and can be used as a powerful search needle to explore the various roles of genome sequences other than protein coding, and can be used to unveil mysteries hidden in the genome sequence. Since BLSOM is an unsupervised clustering method, the clustering of sequences was performed based on the oligonucleotide composition alone, without providing information about the species from which each fragment sequence was derived. Therefore, not only the interspecies separation, but also the intraspecies separation can be achieved. Here, we have revealed the specific genomic regions with oligonucleotide compositions distinct from the usual sequences of each insect genome, e.g., Mb-level structures found for a grasshopper Schistocerca americana. One aim of this study was to compare the genome characteristics of insects with those of vertebrates, especially humans, which are phylogenetically distant from insects. Recently, humans seem to be the “model organism” for which a large amount of information has been accumulated using a variety of cutting-edge and high-throughput technologies. Therefore, it is reasonable to use the abundant information from humans to study insect lineages. The specific regions of Mb length with distinct oligonucleotide compositions have also been previously observed in the human genome. These regions were enriched by transcription factor binding motifs (TFBSs) and hypothesized to be involved in the three-dimensional arrangement of chromosomal DNA in interphase nuclei. The present study characterized the species-specific oligonucleotide compositions (i.e., genome signatures) in insect genomes and identified specific genomic regions with distinct oligonucleotide compositions.Tracking mosquito-borne diseases via social media: a machine learning approach to topic modelling and sentiment analysishttps://peerj.com/articles/170452024-03-012024-03-01Song-Quan OngHamdan Ahmad
Mosquito-borne diseases (MBDs) are a major threat worldwide, and public consultation on these diseases is critical to disease control decision-making. However, traditional public surveys are time-consuming and labor-intensive and do not allow for timely decision-making. Recent studies have explored text analytic approaches to elicit public comments from social media for public health. Therefore, this study aims to demonstrate a text analytics pipeline to identify the MBD topics that were discussed on Twitter and significantly influenced public opinion. A total of 25,000 tweets were retrieved from Twitter, topics were modelled using LDA and sentiment polarities were calculated using the VADER model. After data cleaning, we obtained a total of 6,243 tweets, which we were able to process with the feature selection algorithms. Boruta was used as a feature selection algorithm to determine the importance of topics to public opinion. The result was validated using multinomial logistic regression (MLR) performance and expert judgement. Important issues such as breeding sites, mosquito control, impact/funding, time of year, other diseases with similar symptoms, mosquito-human interaction and biomarkers for diagnosis were identified by both LDA and experts. The MLR result shows that the topics selected by LASSO perform significantly better than the other algorithms, and the experts further justify the topics in the discussion.
Mosquito-borne diseases (MBDs) are a major threat worldwide, and public consultation on these diseases is critical to disease control decision-making. However, traditional public surveys are time-consuming and labor-intensive and do not allow for timely decision-making. Recent studies have explored text analytic approaches to elicit public comments from social media for public health. Therefore, this study aims to demonstrate a text analytics pipeline to identify the MBD topics that were discussed on Twitter and significantly influenced public opinion. A total of 25,000 tweets were retrieved from Twitter, topics were modelled using LDA and sentiment polarities were calculated using the VADER model. After data cleaning, we obtained a total of 6,243 tweets, which we were able to process with the feature selection algorithms. Boruta was used as a feature selection algorithm to determine the importance of topics to public opinion. The result was validated using multinomial logistic regression (MLR) performance and expert judgement. Important issues such as breeding sites, mosquito control, impact/funding, time of year, other diseases with similar symptoms, mosquito-human interaction and biomarkers for diagnosis were identified by both LDA and experts. The MLR result shows that the topics selected by LASSO perform significantly better than the other algorithms, and the experts further justify the topics in the discussion.Enhancing medical image segmentation with a multi-transformer U-Nethttps://peerj.com/articles/170052024-02-292024-02-29Yongping DanWeishou JinXuebin YueZhida Wang
Various segmentation networks based on Swin Transformer have shown promise in medical segmentation tasks. Nonetheless, challenges such as lower accuracy and slower training convergence have persisted. To tackle these issues, we introduce a novel approach that combines the Swin Transformer and Deformable Transformer to enhance overall model performance. We leverage the Swin Transformer’s window attention mechanism to capture local feature information and employ the Deformable Transformer to adjust sampling positions dynamically, accelerating model convergence and aligning it more closely with object shapes and sizes. By amalgamating both Transformer modules and incorporating additional skip connections to minimize information loss, our proposed model excels at rapidly and accurately segmenting CT or X-ray lung images. Experimental results demonstrate the remarkable, showcasing the significant prowess of our model. It surpasses the performance of the standalone Swin Transformer’s Swin Unet and converges more rapidly under identical conditions, yielding accuracy improvements of 0.7% (resulting in 88.18%) and 2.7% (resulting in 98.01%) on the COVID-19 CT scan lesion segmentation dataset and Chest X-ray Masks and Labels dataset, respectively. This advancement has the potential to aid medical practitioners in early diagnosis and treatment decision-making.
Various segmentation networks based on Swin Transformer have shown promise in medical segmentation tasks. Nonetheless, challenges such as lower accuracy and slower training convergence have persisted. To tackle these issues, we introduce a novel approach that combines the Swin Transformer and Deformable Transformer to enhance overall model performance. We leverage the Swin Transformer’s window attention mechanism to capture local feature information and employ the Deformable Transformer to adjust sampling positions dynamically, accelerating model convergence and aligning it more closely with object shapes and sizes. By amalgamating both Transformer modules and incorporating additional skip connections to minimize information loss, our proposed model excels at rapidly and accurately segmenting CT or X-ray lung images. Experimental results demonstrate the remarkable, showcasing the significant prowess of our model. It surpasses the performance of the standalone Swin Transformer’s Swin Unet and converges more rapidly under identical conditions, yielding accuracy improvements of 0.7% (resulting in 88.18%) and 2.7% (resulting in 98.01%) on the COVID-19 CT scan lesion segmentation dataset and Chest X-ray Masks and Labels dataset, respectively. This advancement has the potential to aid medical practitioners in early diagnosis and treatment decision-making.Co-differential genes between DKD and aging: implications for a diagnostic model of DKDhttps://peerj.com/articles/170462024-02-292024-02-29Hongxuan DuKaiying HeJing ZhaoQicai YouXiaochun ZhouJianqin Wang
Objective
Diabetic kidney disease (DKD) is a serious complication of diabetes mellitus (DM) that is closely related to aging. In this study, we found co-differential genes between DKD and aging and established a diagnostic model of DKD based on these genes.
Methods
Differentially expressed genes (DEGs) in DKD were screened using GEO datasets. The intersection of the DEGs of DKD and aging-related genes revealed DKD and aging co-differential genes. Based on this, a genetic diagnostic model for DKD was constructed using LASSO regression. The characteristics of these genes were investigated using consensus clustering, WGCNA, functional enrichment, and immune cell infiltration. Finally, the expression of diagnostic model genes was analyzed using single-cell RNA sequencing (scRNA-seq) in DKD mice (model constructed by streptozotocin (STZ) injection and confirmed by tissue section staining).
Results
First, there were 159 common differential genes between DKD and aging, 15 of which were significant. These co-differential genes were involved in stress, glucolipid metabolism, and immunological functions. Second, a genetic diagnostic model (including IGF1, CETP, PCK1, FOS, and HSPA1A) was developed based on these genes. Validation of these model genes in scRNA-seq data revealed statistically significant variations in FOS, HSPA1A, and PCK1 gene expression between the early DKD and control groups. Validation of these model genes in the kidneys of DKD mice revealed that Igf1, Fos, Pck1, and Hspa1a had lower expression in DKD mice, with Igf1 expression being statistically significant.
Conclusion
Our findings suggest that DKD and aging co-differential genes are significant in DKD diagnosis, providing a theoretical basis for novel research directions on DKD.
Objective
Diabetic kidney disease (DKD) is a serious complication of diabetes mellitus (DM) that is closely related to aging. In this study, we found co-differential genes between DKD and aging and established a diagnostic model of DKD based on these genes.
Methods
Differentially expressed genes (DEGs) in DKD were screened using GEO datasets. The intersection of the DEGs of DKD and aging-related genes revealed DKD and aging co-differential genes. Based on this, a genetic diagnostic model for DKD was constructed using LASSO regression. The characteristics of these genes were investigated using consensus clustering, WGCNA, functional enrichment, and immune cell infiltration. Finally, the expression of diagnostic model genes was analyzed using single-cell RNA sequencing (scRNA-seq) in DKD mice (model constructed by streptozotocin (STZ) injection and confirmed by tissue section staining).
Results
First, there were 159 common differential genes between DKD and aging, 15 of which were significant. These co-differential genes were involved in stress, glucolipid metabolism, and immunological functions. Second, a genetic diagnostic model (including IGF1, CETP, PCK1, FOS, and HSPA1A) was developed based on these genes. Validation of these model genes in scRNA-seq data revealed statistically significant variations in FOS, HSPA1A, and PCK1 gene expression between the early DKD and control groups. Validation of these model genes in the kidneys of DKD mice revealed that Igf1, Fos, Pck1, and Hspa1a had lower expression in DKD mice, with Igf1 expression being statistically significant.
Conclusion
Our findings suggest that DKD and aging co-differential genes are significant in DKD diagnosis, providing a theoretical basis for novel research directions on DKD.Wood identification based on macroscopic images using deep and transfer learning approacheshttps://peerj.com/articles/170212024-02-282024-02-28Halime Ergun
Identifying forest types is vital for evaluating the ecological, economic, and social benefits provided by forests, and for protecting, managing, and sustaining them. Although traditionally based on expert observation, recent developments have increased the use of technologies such as artificial intelligence (AI). The use of advanced methods such as deep learning will make forest species recognition faster and easier. In this study, the deep network models RestNet18, GoogLeNet, VGG19, Inceptionv3, MobileNetv2, DenseNet201, InceptionResNetv2, EfficientNet and ShuffleNet, which were pre-trained with ImageNet dataset, were adapted to a new dataset. In this adaptation, transfer learning method is used. These models have different architectures that allow a wide range of performance evaluation. The performance of the model was evaluated by accuracy, recall, precision, F1-score, specificity and Matthews correlation coefficient. ShuffleNet was proposed as a lightweight network model that achieves high performance with low computational power and resource requirements. This model was an efficient model with an accuracy close to other models with customisation. This study reveals that deep network models are an effective tool in the field of forest species recognition. This study makes an important contribution to the conservation and management of forests.
Identifying forest types is vital for evaluating the ecological, economic, and social benefits provided by forests, and for protecting, managing, and sustaining them. Although traditionally based on expert observation, recent developments have increased the use of technologies such as artificial intelligence (AI). The use of advanced methods such as deep learning will make forest species recognition faster and easier. In this study, the deep network models RestNet18, GoogLeNet, VGG19, Inceptionv3, MobileNetv2, DenseNet201, InceptionResNetv2, EfficientNet and ShuffleNet, which were pre-trained with ImageNet dataset, were adapted to a new dataset. In this adaptation, transfer learning method is used. These models have different architectures that allow a wide range of performance evaluation. The performance of the model was evaluated by accuracy, recall, precision, F1-score, specificity and Matthews correlation coefficient. ShuffleNet was proposed as a lightweight network model that achieves high performance with low computational power and resource requirements. This model was an efficient model with an accuracy close to other models with customisation. This study reveals that deep network models are an effective tool in the field of forest species recognition. This study makes an important contribution to the conservation and management of forests.moSCminer: a cell subtype classification framework based on the attention neural network integrating the single-cell multi-omics dataset on the cloudhttps://peerj.com/articles/170062024-02-262024-02-26Joung Min ChoiChaelin ParkHeejoon Chae
Single-cell omics sequencing has rapidly advanced, enabling the quantification of diverse omics profiles at a single-cell resolution. To facilitate comprehensive biological insights, such as cellular differentiation trajectories, precise annotation of cell subtypes is essential. Conventional methods involve clustering cells and manually assigning subtypes based on canonical markers, a labor-intensive and expert-dependent process. Hence, an automated computational prediction framework is crucial. While several classification frameworks for predicting cell subtypes from single-cell RNA sequencing datasets exist, these methods solely rely on single-omics data, offering insights at a single molecular level. They often miss inter-omic correlations and a holistic understanding of cellular processes. To address this, the integration of multi-omics datasets from individual cells is essential for accurate subtype annotation. This article introduces moSCminer, a novel framework for classifying cell subtypes that harnesses the power of single-cell multi-omics sequencing datasets through an attention-based neural network operating at the omics level. By integrating three distinct omics datasets—gene expression, DNA methylation, and DNA accessibility—while accounting for their biological relationships, moSCminer excels at learning the relative significance of each omics feature. It then transforms this knowledge into a novel representation for cell subtype classification. Comparative evaluations against standard machine learning-based classifiers demonstrate moSCminer’s superior performance, consistently achieving the highest average performance on real datasets. The efficacy of multi-omics integration is further corroborated through an in-depth analysis of the omics-level attention module, which identifies potential markers for cell subtype annotation. To enhance accessibility and scalability, moSCminer is accessible as a user-friendly web-based platform seamlessly connected to a cloud system, publicly accessible at http://203.252.206.118:5568. Notably, this study marks the pioneering integration of three single-cell multi-omics datasets for cell subtype identification.
Single-cell omics sequencing has rapidly advanced, enabling the quantification of diverse omics profiles at a single-cell resolution. To facilitate comprehensive biological insights, such as cellular differentiation trajectories, precise annotation of cell subtypes is essential. Conventional methods involve clustering cells and manually assigning subtypes based on canonical markers, a labor-intensive and expert-dependent process. Hence, an automated computational prediction framework is crucial. While several classification frameworks for predicting cell subtypes from single-cell RNA sequencing datasets exist, these methods solely rely on single-omics data, offering insights at a single molecular level. They often miss inter-omic correlations and a holistic understanding of cellular processes. To address this, the integration of multi-omics datasets from individual cells is essential for accurate subtype annotation. This article introduces moSCminer, a novel framework for classifying cell subtypes that harnesses the power of single-cell multi-omics sequencing datasets through an attention-based neural network operating at the omics level. By integrating three distinct omics datasets—gene expression, DNA methylation, and DNA accessibility—while accounting for their biological relationships, moSCminer excels at learning the relative significance of each omics feature. It then transforms this knowledge into a novel representation for cell subtype classification. Comparative evaluations against standard machine learning-based classifiers demonstrate moSCminer’s superior performance, consistently achieving the highest average performance on real datasets. The efficacy of multi-omics integration is further corroborated through an in-depth analysis of the omics-level attention module, which identifies potential markers for cell subtype annotation. To enhance accessibility and scalability, moSCminer is accessible as a user-friendly web-based platform seamlessly connected to a cloud system, publicly accessible at http://203.252.206.118:5568. Notably, this study marks the pioneering integration of three single-cell multi-omics datasets for cell subtype identification.