PeerJ Preprints: Computational Science
https://peerj.com/preprints/index.atom?journal=peerj&subject=3670
Computational Science articles published in PeerJ Preprints

Data-driven classification of the certainty of scholarly assertions
https://peerj.com/preprints/27829
2019-12-20
Mario Prieto, Helena Deus, Anita De Waard, Erik Schultes, Beatriz García-Jiménez, Mark D Wilkinson
The grammatical structures scholars use to express their assertions are intended to convey various degrees of certainty or speculation. Prior studies have suggested a variety of categorization systems for scholarly certainty; however, these have not been objectively tested for their validity, particularly with respect to representing the interpretation by the reader, rather than the intention of the author. In this study, we use a series of questionnaires to determine how researchers classify various scholarly assertions, using three distinct certainty classification systems. We find that there are three distinct categories of certainty along a spectrum from high to low. We show that these categories can be detected in an automated manner, using a machine learning model, with a cross-validation accuracy of 89.2% relative to an author-annotated corpus, and 82.2% accuracy against a publicly-annotated corpus. This finding provides an opportunity for contextual metadata related to certainty to be captured as a part of text-mining pipelines, which currently miss these subtle linguistic cues. We provide an exemplar machine-accessible representation - a Nanopublication - where certainty category is embedded as metadata in a formal, ontology-based manner within text-mined scholarly assertions.

Digestiflow: from BCL to FASTQ with ease
https://peerj.com/preprints/27717
2019-11-11
Manuel Holtgrewe, Mikko Nieminen, Clemens Messerschmidt, Dieter Beule
Management of raw sequencing data and its preprocessing (conversion into sequences and demultiplexing) remains a challenging topic for groups running sequencing devices. Existing solutions range from manual management of spreadsheets to very complex, customized LIMS systems that handle much more than raw sequencing data. In this manuscript, we describe the software package DigestiFlow, which focuses on the management of Illumina flow cell sample sheets and raw data. It allows for automated extraction of information from flow cell data and management of sample sheets. Furthermore, it allows for the automated and reproducible conversion of Illumina base calls to sequences and their demultiplexing using bcl2fastq and Picard Tools, followed by quality control report generation.
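
To make the demultiplexing step concrete, here is a minimal Python sketch of the general idea: each read carries an index (barcode) sequence, and the read is assigned to the sample whose barcode it matches within a small number of mismatches. This is not the DigestiFlow, bcl2fastq, or Picard implementation. The function names, the `max_mismatches` parameter, and the `"undetermined"` bin are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def hamming_distance(a, b):
    """Count mismatching positions; return None when the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(1 for x, y in zip(a, b) if x != y)


def demultiplex(reads, sample_barcodes, max_mismatches=1):
    """Assign reads to samples by index sequence (illustrative only).

    reads: iterable of (index_sequence, read_id) pairs.
    sample_barcodes: dict mapping sample name -> expected barcode.
    A read goes to a sample only if exactly one barcode lies within
    max_mismatches; otherwise it lands in the "undetermined" bin.
    """
    bins = defaultdict(list)
    for index_seq, read_id in reads:
        matches = []
        for sample, barcode in sample_barcodes.items():
            distance = hamming_distance(index_seq, barcode)
            if distance is not None and distance <= max_mismatches:
                matches.append(sample)
        key = matches[0] if len(matches) == 1 else "undetermined"
        bins[key].append(read_id)
    return dict(bins)
```

Sending ambiguous reads (zero or several candidate samples) to a separate bin keeps the assignment conservative, which is one common design choice for this kind of step.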

Wave propagation in the biosonar organ of sperm whales using a finite difference time domain method
https://peerj.com/preprints/27995
2019-09-30
Maxence Ferrari, Ricard Marxer, Mark Asch, Hervé Glotin
The bio-sonar of sperm whales presents many specific characteristics, such as its size, its loudness and its vocalization abilities. Furthermore, it fulfills several roles in their foraging and social behaviour. However, our knowledge about its operation remains limited to the main acoustic path that the emitted pulse may take. We still do not know the precise mechanisms that shape the wave, nor which parts the sperm whale is able to act on. In this paper, we describe a technique to simulate sperm whale click generation from a physical perspective. Such an approach aims at unveiling the processes involved in their vocal production, as a stepping stone towards a better understanding of their interaction with peers and the environment.

Ten simple rules for a successful remote postdoc
https://peerj.com/preprints/27907
2019-08-18
Kevin R Burgio, Caitlin McDonough MacKenzie, Stephanie B Borrelle, S. K. Morgan Ernest, Jacquelyn L Gill, Kurt E Ingeman, Amy K Teffer, Ethan P White
Postdoctoral positions are temporary full-time positions typically taken between completion of a PhD and the start of a permanent position. Postdocs are expected to move for short-term positions which can often be problematic for early-career researchers, especially those from under-represented groups in STEM. However, the proliferation of computational research has changed how scientists can conduct science, opening the door to postdoctoral work being conducted remotely. Research activities primarily involving quantitative analysis, modeling, writing, and data collection can take place anywhere and therefore can all be conducted on a remote or semi-remote basis. We offer 10 simple rules for overcoming challenges and leveraging the unique opportunities presented by remote postdoc positions, derived from our experiences as either remote postdocs or the PIs who have mentored them. We believe that not only will these suggestions increase the desirability of remote postdoc positions whenever they are feasible, but that they also contain good practices for facilitating better communication both within labs more generally and in other long-distance collaborations.

The geometric formulas of the Lewis’s law and Aboav-Weaire’s law in two dimensions based on ellipse packing
https://peerj.com/preprints/27797
2019-06-18
Kai Xu
The two-dimensional (2D) Lewis’s law and Aboav-Weaire’s law are two simple formulas derived from empirical observations. Numerous attempts have been made to improve the empirical formulas. In this study, we simulated a series of Voronoi diagrams by randomly disordering the seed locations of a regular hexagonal 2D Voronoi diagram, and analyzed the cell topology based on ellipse packing. Then, we derived and verified the improved formulas for Lewis’s law and Aboav-Weaire’s law. Specifically, we found that the upper limit of the second moment of edge number is 3. In addition, we derived the geometric formula of the von Neumann-Mullins’s law based on the new formula of the Aboav-Weaire’s law. Our results suggested that the cell area, local neighbor relationship, and cell growth rate are closely linked to each other, and are mainly shaped by the effect of deformation from circle to ellipse and less influenced by the global edge distribution.
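
As a small illustration of the quantity mentioned above, the second moment of edge number can be computed from a list of per-cell edge counts as the mean squared deviation from the mean edge number. The sketch below assumes that definition (taken about the sample mean) and uses a hypothetical function name. It does not reproduce the authors' simulation or their ellipse-packing analysis.

```python
def edge_number_moments(edge_counts):
    """Return (mean edge number, second moment about that mean).

    edge_counts: one integer per cell, the number of edges of that cell.
    Assumption: the second moment is measured about the sample mean.
    """
    n_cells = len(edge_counts)
    mean = sum(edge_counts) / n_cells
    second_moment = sum((n - mean) ** 2 for n in edge_counts) / n_cells
    return mean, second_moment
```

For a perfectly regular hexagonal tiling every cell has six edges, so the second moment is zero; disordering the seeds spreads the edge counts and raises it.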

Joint predictive modeling for geospatial data at various locations
https://peerj.com/preprints/27795
2019-06-11
Xi Cheng, Harry Xie
Predictive modeling uses statistics to predict unknown outcomes. In general, there are two categories of predictive modeling: parametric and non-parametric. Predictive modeling has many applications. For example, it can be used to predict the risk score of a credit card transaction, and it can be used in health care to estimate the probability of having a certain disease. Geospatial data give the problem some unique characteristics. Predictive modeling of geospatial data naturally involves multiple response variables at various locations. The response variables are not independent of each other, so building a separate model for each individual response variable is not appropriate. In addition, much geospatial data has strong spatial auto-correlation, such that data from nearby locations are more similar to each other. Joint modeling takes into account both the correlation among response variables and the relationship among different locations, and can make predictions for locations with no training data. In this paper, we review work on joint predictive modeling for multiple response variables at various locations.
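
Spatial auto-correlation can be made concrete with a standard statistic. The sketch below computes Moran's I, which the abstract does not name and which is used here only as one common illustrative measure. The function names and the simple chain of neighbouring locations are assumptions for the example.

```python
def morans_i(values, weights):
    """Moran's I for values observed at n locations.

    values: list of observations, one per location.
    weights: dict mapping (i, j) -> spatial weight w_ij for i != j.
    I = (n / W) * sum_ij w_ij * d_i * d_j / sum_i d_i**2,
    where d_i is the deviation from the mean and W is the sum of weights.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    weight_total = sum(weights.values())
    numerator = sum(w * deviations[i] * deviations[j]
                    for (i, j), w in weights.items())
    denominator = sum(d * d for d in deviations)
    return (n / weight_total) * (numerator / denominator)


def chain_weights(n):
    """Hypothetical neighbourhood: locations on a line, adjacent ones linked."""
    weights = {}
    for i in range(n - 1):
        weights[(i, i + 1)] = 1.0
        weights[(i + 1, i)] = 1.0
    return weights
```

Values that change smoothly along the chain give a positive statistic, while values that alternate between neighbours give a negative one.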

Linking TPP2 to the protein interaction and signalling networks
https://peerj.com/preprints/27789
2019-06-07
Jarmila Nahálková
The present manuscript explores signalling and metabolic pathways which mediate functions of Tripeptidyl-peptidase 2 (TPP2) using the analysis of its protein-protein interaction network. The protein interaction data were retrieved partially from our previously published experimental work and partially from public databases. The interacting proteins were collected from the BioGRID 3.5.169, STRING and Agilent Literature Search 3.1.1 databases based on increased threshold criteria. In total, 13 interacting proteins were obtained from the public sources, and they were combined with TPP2 interacting proteins included in the interaction network PIN7, which comprises seven interacting proteins with pleiotropic functions in tumorigenesis, neurodegeneration and ageing. The interaction network of TPP2 was analysed by pathway enrichment, protein function prediction and protein node prioritisation using the GeneMania and Cytohubba applications run in the Cytoscape 3.7.0 environment. The most enriched signalling pathways were functionally linked to the regulation of the adaptive and innate immunity (ID, Kit Receptor, BCR, IL-2, the regulation of NFκB), the aerobic glycolysis (ID and IL-2), tumorigenesis (TGFβ, p53, the high priority nodes MAPKs, and the control of mTOR), diabetes (Kit receptor, the top priority node GSK3β) and neurodegeneration (the regulation of mTOR and Aβ peptide degradation). The BioGRID database mining also showed the interaction with two lung cancer suppressors (DOK3, DENND2D), a protein involved in the increased risk of lung cancer in smokers (CYP1A1) and a protein involved in asthmatic reactions (CHIA). The potentially unexplored functions of TPP2 in lung pathologies are also discussed with regard to these interactions. The interactions with the methyltransferase CARNMT1, which modifies di- and tripeptides, and with the xenobiotic-processing enzyme CYP1A1 might suggest additional interesting functions.

Interpreting and integrating big data in the life sciences
https://peerj.com/preprints/27603
2019-06-07
Serghei Mangul
Recent advances in omics technologies have led to the broad applicability of computational techniques across various domains of life science and medical research. These technologies provide an unprecedented opportunity to collect omics data from hundreds of thousands of individuals and to study gene-disease association without the aid of prior assumptions about the trait biology. Despite the many advantages of modern omics technologies, interpreting the big data they produce requires advanced computational algorithms. Below I outline key challenges that biomedical researchers are facing when interpreting and integrating big omics data. I discuss the reproducibility aspect of big data analysis in the life sciences and review current practices in reproducible research. Finally, I explain the skills which biomedical researchers need to acquire in order to independently analyze big omics data.

OpenMS for open source analysis of mass spectrometric data
https://peerj.com/preprints/27766
2019-05-29
Oliver Alka, Timo Sachsenberg, Leon Bichmann, Julianus Pfeuffer, Hendrik Weisser, Samuel Wein, Eugen Netz, Marc Rurik, Oliver Kohlbacher, Hannes Rost

Complexity of human walking: the attractor complexity index is sensitive to gait synchronization with visual and auditory cues
https://peerj.com/preprints/27711
2019-05-07
Philippe Terrier
Background. During steady walking, gait parameters fluctuate from one stride to another with complex fractal patterns and long-range statistical persistence. When a metronome is used to pace the gait (sensorimotor synchronization), long-range persistence is replaced by stochastic oscillations (anti-persistence). Fractal patterns present in gait fluctuations are most often analyzed using detrended fluctuation analysis (DFA). This method requires the use of a discrete time series, such as intervals between consecutive heel strikes, as an input. Recently, a new nonlinear method, the attractor complexity index (ACI), has been shown to respond to complexity changes like DFA. But in contrast to DFA, ACI can be applied to continuous signals, such as body accelerations. The aim of this study was to further compare DFA and ACI in a treadmill experiment that induced complexity changes through sensorimotor synchronization. Methods. Thirty-six healthy adults walked for 30 minutes on an instrumented treadmill under three conditions: no cueing, auditory cueing (metronome walking), and visual cueing (stepping stones). The center-of-pressure trajectory was discretized into time series of gait parameters, after which a complexity index (scaling exponent alpha) was computed via DFA. Continuous pressure position signals were used to compute the ACI. Correlations between ACI and DFA were then analyzed. The predictive ability of DFA and ACI to differentiate between cueing and no-cueing conditions was assessed using regularized logistic regressions and areas under the receiver operating characteristic curves (AUROC). Results. DFA and ACI were both significantly different among the cueing conditions. DFA and ACI were correlated (Pearson’s r = 0.78). Logistic regressions showed that DFA and ACI could differentiate between cueing and no-cueing conditions with a high degree of confidence (AUROC = 1.0 and 0.96, respectively). Conclusion. Both DFA and ACI responded similarly to changes in cueing conditions and had comparable predictive power. This supports the assumption that ACI could be used instead of DFA to assess the long-range complexity of continuous gait signals.
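
For readers unfamiliar with DFA, the following Python sketch shows the textbook first-order procedure for estimating the scaling exponent alpha from a discrete time series: integrate the mean-centred series, remove a linear trend in non-overlapping windows, and fit the slope of log fluctuation against log window size. It is a generic illustration, not the study's analysis code. The function names, the default window sizes, and the synthetic white-noise input are assumptions made for the example.

```python
import math
import random


def detrended_sq_error(segment):
    """Sum of squared residuals after removing a least-squares line."""
    n = len(segment)
    x_mean = (n - 1) / 2
    y_mean = sum(segment) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(segment))
    slope = sxy / sxx
    return sum((y - y_mean - slope * (i - x_mean)) ** 2
               for i, y in enumerate(segment))


def dfa_alpha(series, window_sizes=(16, 32, 64, 128, 256)):
    """Estimate the DFA scaling exponent (first-order detrending)."""
    mean = sum(series) / len(series)
    # Step 1: integrate the mean-centred series to obtain the profile.
    profile, running = [], 0.0
    for value in series:
        running += value - mean
        profile.append(running)
    # Step 2: root-mean-square detrended fluctuation for each window size.
    log_n, log_f = [], []
    for n in window_sizes:
        n_windows = len(profile) // n
        if n_windows == 0:
            continue  # window longer than the series
        total = sum(detrended_sq_error(profile[k * n:(k + 1) * n])
                    for k in range(n_windows))
        log_n.append(math.log(n))
        log_f.append(math.log(math.sqrt(total / (n_windows * n))))
    # Step 3: alpha is the least-squares slope of log F(n) against log n.
    mean_n = sum(log_n) / len(log_n)
    mean_f = sum(log_f) / len(log_f)
    covariance = sum((a - mean_n) * (b - mean_f) for a, b in zip(log_n, log_f))
    variance = sum((a - mean_n) ** 2 for a in log_n)
    return covariance / variance


def white_noise(length, seed=0):
    """Synthetic uncorrelated input, used only to exercise the sketch."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]
```

In this convention, uncorrelated noise is expected to give an exponent near 0.5, values above 0.5 indicate persistence, and values below 0.5 indicate anti-persistence, which is how the cueing effect described above shows up in the DFA result.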