PeerJ Computer Science Preprints: Scientific Computing and Simulation
https://peerj.com/preprints/index.atom?journal=cs&subject=11100
Scientific Computing and Simulation articles published in PeerJ Computer Science Preprints

A Dynamic Bayesian Network model for simulation of disease progression in Amyotrophic Lateral Sclerosis patients
https://peerj.com/preprints/3262 (2017-09-18)
Alessandro Zandonà, Matilde Francescon, Maya Bronfeld, Andrea Calvo, Adriano Chiò, Barbara Di Camillo
Background. Amyotrophic lateral sclerosis (ALS) is a progressive neurodegenerative disease primarily affecting upper and lower motor neurons in the brain and spinal cord. The heterogeneity in the course of ALS clinical progression and, ultimately, survival, coupled with the rarity of this disease, makes predicting disease outcome at the level of the individual patient very challenging. Moreover, stratification of ALS patients has long been recognized as a question of great importance for clinical practice, research, and drug development.
Methods. In this work, we present a Dynamic Bayesian Network (DBN) model of ALS progression that detects probabilistic relationships among variables included in the Pooled Resource Open-Access ALS Clinical Trials Database (PRO-ACT), which provides records of over 10,700 patients from different clinical trials, with over 2,869,973 longitudinally collected measurements.
Results. Our model unravels new dependencies among clinical variables in relation to ALS progression, such as the influence of basophil count and creatine kinase on patients' clinical status and respiratory functional state, respectively. Furthermore, it provides an indication of ALS temporal evolution, in terms of the most probable disease trajectories across time at the level of both the patient population and the individual patient.
Conclusions. The risk factors identified by our DBN model could allow stratification of patients based on the velocity of disease progression, as well as a sensitivity analysis of progression velocity in response to changes in input variables, i.e. variables measured at diagnosis.
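To illustrate the kind of temporal simulation a DBN enables, here is a minimal sketch of sampling disease trajectories from a first-order transition model over discrete severity states. The states and probabilities are hypothetical placeholders, not estimates from PRO-ACT or from the paper's model.

```python
import random

# Hypothetical discrete disease states and a one-step transition model.
# These probabilities are illustrative only, not fitted to PRO-ACT data.
STATES = ["mild", "moderate", "severe"]
TRANSITIONS = {
    "mild":     {"mild": 0.80, "moderate": 0.18, "severe": 0.02},
    "moderate": {"mild": 0.05, "moderate": 0.75, "severe": 0.20},
    "severe":   {"mild": 0.00, "moderate": 0.05, "severe": 0.95},
}

def simulate_trajectory(start="mild", steps=10, rng=None):
    """Sample one disease trajectory from the transition model."""
    rng = rng or random.Random(0)
    state, path = start, [start]
    for _ in range(steps):
        probs = TRANSITIONS[state]
        state = rng.choices(STATES, weights=[probs[s] for s in STATES])[0]
        path.append(state)
    return path

print(simulate_trajectory())
```

Repeating such draws for many virtual patients yields a distribution over trajectories, which is the population-level view of progression the abstract describes.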
Multiplication and Division over Extended Galois Field GF(p^q): A new Approach to find Monic Irreducible Polynomials over any Galois Field GF(p^q)
https://peerj.com/preprints/3259 (2017-09-17)
Sankhanil Dey, Ranjan Ghosh
Irreducible polynomials (IPs) are of utmost importance in the generation of substitution boxes in modern cryptographic ciphers. In this paper, an algorithm entitled the Composite Algorithm, which uses both multiplication and division over Galois fields, is demonstrated to generate all monic IPs over the extended Galois field GF(p^q) for large values of both p and q. Two more efficient algorithms, entitled the Multiplication Algorithm and the Division Algorithm, are also illustrated, each finding all monic IPs over the extended Galois field GF(p^q) for large values of both p and q. A time complexity analysis of the three algorithms, in comparison with Rabin's algorithm, is also presented.
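For concreteness, a brute-force multiplication-based sieve for monic IPs can be sketched as follows: multiply every pair of lower-degree monic polynomials over GF(p) and keep the degree-q monic polynomials that never appear as a product. This is only a small illustration of the idea, not the paper's Composite, Multiplication, or Division Algorithm, and it is practical only for small p and q.

```python
from itertools import product

def poly_mul(a, b, p):
    """Multiply coefficient tuples (lowest degree first) over GF(p)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % p
    return tuple(out)

def monic_polys(deg, p):
    """All monic polynomials of the given degree over GF(p)."""
    for coeffs in product(range(p), repeat=deg):
        yield tuple(coeffs) + (1,)

def monic_irreducibles(q, p):
    """Sieve: a monic degree-q polynomial is reducible iff it is a
    product of two lower-degree monic polynomials."""
    composite = set()
    for d1 in range(1, q // 2 + 1):
        for f in monic_polys(d1, p):
            for g in monic_polys(q - d1, p):
                composite.add(poly_mul(f, g, p))
    return [f for f in monic_polys(q, p) if f not in composite]

# Degree-2 monic irreducibles over GF(2): only x^2 + x + 1.
print(monic_irreducibles(2, 2))  # [(1, 1, 1)]
```

Since every reducible polynomial of degree q has a monic factor of degree at most q/2, looping d1 up to q//2 covers all products.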
Assessment of spectral properties of Apollo 12 landing site
https://peerj.com/preprints/2124 (2017-09-05)
Yann H Chemin, Ian A Crawford, Peter Grindrod, Louise Alexander
The geology and mineralogy of the Apollo 12 landing site have been the subject of recent studies that this research attempts to complement from a remote sensing point of view, using data from the Moon Mineralogy Mapper (M3) sensor onboard the Chandrayaan-1 lunar orbiter. The M3 is a higher spatial-spectral resolution sensor than the Clementine UVVis sensor and offers the opportunity to study the lunar surface at a comparatively more detailed spectral resolution.
The M3 signatures show a monotonic, featureless increase with very low reflectance, suggesting a mature regolith. Regolith maturity splits the landing site into a younger Northwest and an older Southeast. Mineral identification using lunar sample spectra from the RELAB database found some similarity to a basaltic rock/glass mix. The spectral features of clinopyroxene were found both in the Copernican rays and at the landing site. Lateral mixing increases FeO content away from the central part of the ray. The presence of clinopyroxene in the pigeonite basalt in the stratigraphy of the landing site complicates differentiating the Copernican ray's clinopyroxene from the local source, as the spectra are twins but for a vertical shift in reflectance, which decreases away from the central part of the ray.
Spatial variations in mineralogy were not found, mostly because of the pixel size relative to the landing site area. The contribution to stratigraphy is limited to the topmost layer, which is a clinopyroxene-dominated basalt belonging to the most remote tip of a Copernican ray and its resulting local regolith mix.
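One common way to compare a measured reflectance spectrum against a laboratory reference, while discounting the kind of overall brightness difference the abstract describes, is the spectral angle. The sketch below uses made-up reflectance values, not actual M3 or RELAB data, purely to show the computation.

```python
import math

def spectral_angle(s1, s2):
    """Spectral angle (radians) between two reflectance spectra.
    Insensitive to a multiplicative brightness difference."""
    dot = sum(a * b for a, b in zip(s1, s2))
    n1 = math.sqrt(sum(a * a for a in s1))
    n2 = math.sqrt(sum(b * b for b in s2))
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))

# Illustrative spectra (not real data): the second is a scaled copy of
# the first, mimicking two spectra of identical shape that differ only
# in overall reflectance level.
ray = [0.10, 0.12, 0.11, 0.14, 0.15]
local = [0.5 * r for r in ray]
print(spectral_angle(ray, local))  # near zero: same shape, different brightness
```

Because the angle depends only on spectral shape, "twin" spectra that differ mainly in overall reflectance score as nearly identical, which is precisely why separating the ray clinopyroxene from the local source is difficult.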
Wrangling categorical data in R
https://peerj.com/preprints/3163 (2017-08-30)
Amelia McNamara, Nicholas J Horton
Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This paper discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the ‘tidyverse.’ We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis.
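The paper's examples are in R (base factors versus the tidyverse); as an analogous illustration of the defensive style it advocates, here is a sketch in Python using pandas categoricals. The survey values and level names are hypothetical.

```python
import pandas as pd

# Hypothetical survey responses with inconsistent coding.
raw = pd.Series(["agree", "Agree", "disagree", "AGREE", "neutral"])

# Normalize case first, then impose an explicit, ordered category set --
# analogous to specifying factor levels in R rather than trusting defaults.
levels = ["disagree", "neutral", "agree"]
clean = pd.Categorical(raw.str.lower(), categories=levels, ordered=True)

# Values outside the declared categories become NaN rather than silently
# creating new levels, which surfaces coding errors early.
print(pd.Series(clean).value_counts().to_dict())
```

Declaring the category set up front is the defensive move: when a collaborator's updated data file introduces an unexpected code, it shows up as a missing value to investigate instead of a spurious new level.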
Lessons from between the white lines for isolated data scientists
https://peerj.com/preprints/3160 (2017-08-30)
Benjamin S Baumer
Many current and future data scientists will be "isolated"---working alone or in small teams within a larger organization. This isolation brings certain challenges as well as freedoms. Drawing on my considerable experience both working in the professional sports industry and teaching in academia, I discuss troubled waters likely to be encountered by newly-minted data scientists, and offer advice about how to navigate them. Neither the issues raised nor the advice given are particular to sports, and should be applicable to a wide range of knowledge domains.
Teaching stats for data science
https://peerj.com/preprints/3205 (2017-08-29)
Daniel T Kaplan
The familiar mathematical topics of introductory statistics --- means, proportions, t-tests, normal and t distributions, chi-squared, etc. --- are a product of the first half of the 20th century. Naturally, they reflect the statistical conditions of that era: scarce data (e.g. n < 10) originating in benchtop or agricultural experiments, and algorithms communicated via algebraic formulas. Today, applied statistics relates to a different environment: software is the means of algorithmic communication, observational and "unplanned" data are interpreted for causal relationships, and data are large both in n and in the number of variables. This change in situation calls for a thorough rethinking of the topics in, and approach to, statistics education. This paper presents a set of ten organizing blocks for intro stats that are better suited to today's environment.
Excuse me, do you have a moment to talk about version control?
https://peerj.com/preprints/3159 (2017-08-28)
Jennifer Bryan
Data analysis, statistical research, and teaching statistics have at least one thing in common: these activities all produce many files! There are data files, source code, figures, tables, prepared reports, and much more. Most of these files evolve over the course of a project and often need to be shared with others, for reading or edits, as a project unfolds. Without explicit and structured management, project organization can easily descend into chaos, taking time away from the primary work and reducing the quality of the final product. This unhappy result can be avoided by repurposing tools and workflows from the software development world, namely, distributed version control. This article describes the use of the version control system Git and the hosting site GitHub for statistical and data scientific workflows. Special attention is given to projects that use the statistical language R and, optionally, R Markdown documents. Supplementary materials include an annotated set of links to step-by-step tutorials, real-world examples, and other useful learning resources.
Human-Borg dynamics during a cybernetic alien invasion
https://peerj.com/preprints/3198 (2017-08-28)
Jaderick P Pabico
We propose a series of ODE systems to model the various dynamics of human-Borg interaction during a Borg invasion. These models, progressively developed one after the other, could provide humanity and other peace-loving intergalactic species the necessary mathematical tools to develop survival strategies in the event of a future alien invasion. The Borg is a race of technologically advanced cybernetic aliens and acts as a very powerful antagonist against the peace-loving human species in various Star Trek sci-fi story lines. Cybernetic organisms, also called cyborgs, are organic individuals implanted with intelligent electromechanical devices for the purpose of increasing the individuals' efficiency by several degrees (e.g., strength, speed, and intelligence), but at the expense of procreation. Thus, the "parasitic" Borg needs to assimilate other species in an epidemiological manner for the survival of its own race.
In these models, humans can be transformed into one of six types depending on their reaction to, or resistance against, Borg assimilation. These are Susceptible (\(S\)), Captured (\(C\)), Assimilated (\(A\)), Rescued (\(R\)), Educated (or Rehabilitated, \(E\)), and Defiant (\(D\)). {\em Susceptible} humans can be captured and then assimilated into being a Borg drone. The remaining humans can rescue those who were captured or assimilated. Once rescued, they undergo rehabilitation, after which they either end up (again) susceptible to, or strongly defiant against, being captured and assimilated.
We start by describing the SCA model, which has the same (analytical and/or numerical) solution as the Susceptible-Exposed-Infected model in epidemiology. Then we move on to the SCAR model, which incorporates the tendency of humans to fight back by rescuing the captured or assimilated. SCARE further models the propensity of humans to educate (or rehabilitate) those whom they have rescued. Finally, we present the SCARED model, which describes the natural inclination of humans to either "relapse" into being susceptible to assimilation or grow defiant against it after undergoing rehabilitation. The numerical solutions to all these models will be presented using popular yet simple computer software.
The SCA, SCAR, and SCARE models are reductions of the SCARED model, obtained when all the respective coefficients of the quantities not present in the reduced model are zero. In other words, SCA reduces from SCAR; SCA and SCAR both reduce from SCARE; and SCA, SCAR, and SCARE all reduce from SCARED. The bottom line is SCA \(\subset\) SCAR \(\subset\) SCARE \(\subset\) SCARED. The dynamics in the general SCARED model are governed by the system of ODEs shown (as a teaser) in the attached addendum.
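Since the SCARED equations themselves are deferred to the addendum, here is a minimal numerical sketch of only the base SCA model, under one plausible SEI-style reading: susceptibles are captured at a rate driven by contact with the assimilated, and the captured are then assimilated at a fixed rate. The rate constants and the exact functional forms are assumptions for illustration, not the paper's equations.

```python
def sca_step(s, c, a, beta=0.5, gamma=0.2, dt=0.01):
    """One forward-Euler step of the assumed SCA system:
    dS/dt = -beta*S*A,  dC/dt = beta*S*A - gamma*C,  dA/dt = gamma*C."""
    capture = beta * s * a        # susceptibles captured via Borg contact
    assimilate = gamma * c        # captured humans turned into drones
    return (s - dt * capture,
            c + dt * (capture - assimilate),
            a + dt * assimilate)

# Start with 1% of the population assimilated, integrate to t = 20.
s, c, a = 0.99, 0.0, 0.01
for _ in range(2000):
    s, c, a = sca_step(s, c, a)
print(round(s + c + a, 6))  # total population fraction is conserved: 1.0
```

Because the three right-hand sides sum to zero, the total population is conserved at every step, a quick sanity check for any compartment model of this shape.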
Infrastructure and tools for teaching computing throughout the statistical curriculum
https://peerj.com/preprints/3181 (2017-08-24)
Mine Cetinkaya-Rundel, Colin W Rundel
Modern statistics is fundamentally a computational discipline, but too often this fact is not reflected in our statistics curricula. With the rise of big data and data science it has become increasingly clear that students want, expect, and need explicit training in this area of the discipline. Additionally, recent curricular guidelines clearly state that working with data requires extensive computing skills and that statistics students should be fluent in accessing, manipulating, analyzing, and modeling with professional statistical analysis software. Much has been written in the statistics education literature about pedagogical tools and approaches to provide a practical computational foundation for students. This article discusses the computational infrastructure and toolkit choices that allow for these pedagogical innovations while minimizing frustration and improving adoption for both our students and instructors.
An algorithm for calculating top-dimensional bounding chains
https://peerj.com/preprints/3151 (2017-08-14)
J. Frederico Carvalho, Mikael Vejdemo-Johansson, Danica Kragic, Florian T. Pokorny
We describe the \textsc{Coefficient-Flow} algorithm for calculating the bounding chain of an $(n-1)$--boundary on an $n$--manifold-like simplicial complex $S$. We prove its correctness and show that it has a computational time complexity of $O(|S^{(n-1)}|)$ (where $S^{(n-1)}$ is the set of $(n-1)$--faces of $S$). We estimate the big-$O$ coefficient which depends on the dimension of $S$ and the implementation. We present an implementation, experimentally evaluate the complexity of our algorithm, and compare its performance with that of solving the underlying linear system.
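To make the bounding-chain problem concrete, here is a toy version of the baseline the paper compares against (not the Coefficient-Flow algorithm itself): solving the linear system \(\partial_2 x = b\) over GF(2) on a tiny complex. Two triangles triangulate a square, and we recover the 2-chain whose boundary is the square's outer edge cycle.

```python
# Toy complex: two triangles (0,1,2) and (0,2,3) triangulating a square.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]
triangles = [(0, 1, 2), (0, 2, 3)]

def boundary_matrix():
    """Column-major GF(2) boundary matrix: one column per triangle,
    one row per edge, 1 where the edge is a face of the triangle."""
    cols = []
    for (i, j, k) in triangles:
        faces = {(j, k), (i, k), (i, j)}
        cols.append([1 if e in faces else 0 for e in edges])
    return cols

def solve_gf2(cols, b):
    """Brute-force GF(2) solve: try every 0/1 combination of columns.
    (Real solvers use Gaussian elimination; this is fine for 2 columns.)"""
    n = len(cols)
    for mask in range(2 ** n):
        acc = [0] * len(b)
        for t in range(n):
            if mask >> t & 1:
                acc = [(x + y) % 2 for x, y in zip(acc, cols[t])]
        if acc == b:
            return [mask >> t & 1 for t in range(n)]
    return None

# Target 1-boundary: the square's outer edges (0,1),(1,2),(2,3),(0,3).
b = [1, 1, 0, 1, 1]
print(solve_gf2(boundary_matrix(), b))  # [1, 1]: both triangles
```

The shared diagonal edge (0,2) cancels mod 2, so the sum of both triangles has exactly the outer cycle as its boundary; the contribution of Coefficient-Flow is to find such chains in time linear in the number of top-dimensional faces rather than by solving the full system.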