PeerJ: Statistics
https://peerj.com/articles/index.atom?journal=peerj&subject=7900
Statistics articles published in PeerJ

Making inference from wildlife collision data: inferring predator absence from prey strikes
https://peerj.com/articles/3014 (published 2017-02-22)
Peter Caley, Geoffrey R. Hosack, Simon C. Barry
Wildlife collision data are ubiquitous, though challenging for making ecological inference due to typically irreducible uncertainty relating to the sampling process. We illustrate a new approach that is useful for generating inference from predator data arising from wildlife collisions. By simply conditioning on a second prey species sampled via the same collision process, and by using a biologically realistic numerical response function, we can produce a coherent numerical response relationship between predator and prey. This relationship can then be used to make inference on the population size of the predator species, including the probability of extinction. The statistical conditioning enables us to account for unmeasured variation in factors influencing the runway strike incidence for individual airports and to enable valid comparisons. A practical application of the approach for testing hypotheses about the distribution and abundance of a predator species is illustrated using the hypothesized red fox incursion into Tasmania, Australia. We estimate that conditional on the numerical response between fox and lagomorph runway strikes on mainland Australia, the predictive probability of observing no runway strikes of foxes in Tasmania after observing 15 lagomorph strikes is 0.001. We conclude there is enough evidence to safely reject the null hypothesis that there is a widespread red fox population in Tasmania at a population density consistent with prey availability. The method is novel and has potential wider application.
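As a toy illustration of this kind of predictive calculation, the sketch below assumes fox strikes are Poisson with a rate proportional to lagomorph strikes; the airport counts, the linear response, and the plug-in ratio are all simplifying assumptions for illustration, not the paper's data or model.

```python
import math

# Hypothetical mainland airport counts (NOT the paper's data): fox and
# lagomorph runway strikes recorded at five airports.
fox_strikes = [3, 1, 4, 2, 5]
lagomorph_strikes = [6, 2, 9, 5, 11]

# Crude linear numerical response: fox strikes ~ Poisson with rate
# proportional to lagomorph strikes (the paper uses a more flexible,
# biologically motivated response with full uncertainty).
ratio = sum(fox_strikes) / sum(lagomorph_strikes)

def p_zero_foxes(n_lagomorph_strikes, ratio):
    """Predictive probability of zero fox strikes under the Poisson model."""
    return math.exp(-ratio * n_lagomorph_strikes)

# A small probability argues against a widespread fox population.
p = p_zero_foxes(15, ratio)
```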
Modeling using clinical examination indicators predicts interstitial lung disease among patients with rheumatoid arthritis
https://peerj.com/articles/3021 (published 2017-02-21)
Yao Wang, Wuqi Song, Jing Wu, Zhangming Li, Fengyun Mu, Yang Li, He Huang, Wenliang Zhu, Fengmin Zhang
Interstitial lung disease (ILD) is a severe extra-articular manifestation of rheumatoid arthritis (RA) that is well-defined as a chronic systemic autoimmune disease. A proportion of patients with RA-associated ILD (RA-ILD) develop pulmonary fibrosis (PF), resulting in poor prognosis and increased lifetime risk. We investigated whether routine clinical examination indicators (CEIs) could be used to identify RA patients with high PF risk. A total of 533 patients with established RA were recruited in this study for model building and 32 CEIs were measured for each of them. To identify PF risk, a new artificial neural network (ANN) was built, in which inputs were generated by calculating Euclidean distance of CEIs between patients. Receiver operating characteristic curve analysis indicated that the ANN performed well in predicting the PF risk (Youden index = 0.436) by only incorporating four CEIs including age, eosinophil count, platelet count, and white blood cell count. A set of 218 RA patients with healthy lungs or suffering from ILD and a set of 87 RA patients suffering from PF were used for independent validation. Results showed that the model successfully identified ILD and PF with a true positive rate of 84.9% and 82.8%, respectively. The present study suggests that model integration of multiple routine CEIs contributes to identification of potential PF risk among patients with RA.
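The Youden index reported above can be computed from any classifier's scores by scanning thresholds for the best sensitivity/specificity trade-off; the scores and labels below are invented for illustration, not the study's ANN outputs.

```python
import numpy as np

# Toy classifier scores and true labels (1 = pulmonary fibrosis).
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,    1,   0,    1,   0,   0,   0,   0])

def youden_index(scores, labels):
    """Max over thresholds of sensitivity + specificity - 1 (Youden's J)."""
    best = -1.0
    for t in np.unique(scores):
        pred = scores >= t
        sens = np.sum(pred & (labels == 1)) / np.sum(labels == 1)
        spec = np.sum(~pred & (labels == 0)) / np.sum(labels == 0)
        best = max(best, sens + spec - 1)
    return best

J = youden_index(scores, labels)
```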
Risk analysis of colorectal cancer incidence by gene expression analysis
https://peerj.com/articles/3003 (published 2017-02-15)
Wei-Chuan Shangkuan, Hung-Che Lin, Yu-Tien Chang, Chen-En Jian, Hueng-Chuen Fan, Kang-Hua Chen, Ya-Fang Liu, Huan-Ming Hsu, Hsiu-Ling Chou, Chung-Tay Yao, Chi-Ming Chu, Sui-Lung Su, Chi-Wen Chang
Background
Colorectal cancer (CRC) is one of the leading cancers worldwide. Several studies have performed microarray data analyses for cancer classification and prognostic analyses. Microarray assays also enable the identification of gene signatures for molecular characterization and treatment prediction.
Objective
Microarray gene expression data from the online Gene Expression Omnibus (GEO) database were used to distinguish colorectal cancer from normal colon tissue samples.
Methods
We collected microarray data from the GEO database to establish colorectal cancer microarray gene expression datasets for a combined analysis. Using the Prediction Analysis for Microarrays (PAM) method and the GSEA MSigDB resource, we analyzed the 14,698 genes that were identified through an examination of their expression values between normal and tumor tissues.
Results
Ten genes (ABCG2, AQP8, SPIB, CA7, CLDN8, SCNN1B, SLC30A10, CD177, PADI2, and TGFBI) were identified as candidate genes that correlate with CRC. From these selected genes, an average of six significant genes were obtained using the PAM method, with an accuracy rate of 95%. The results demonstrate the potential of utilizing a model with the PAM method for data mining. After a detailed review of the published reports, the results confirmed that the screened candidate genes are good indicators for cancer risk analysis using the PAM method.
Conclusions
Six genes were selected with 95% accuracy to effectively classify normal and colorectal cancer tissues. We hope that these results will provide the basis for new research projects in clinical practice that aim to rapidly assess colorectal cancer risk using microarray gene expression analysis.
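The core of the PAM method, nearest shrunken centroids, can be sketched as follows. This simplified version omits PAM's per-gene variance standardisation, and the toy expression matrix is simulated, not the GEO data.

```python
import numpy as np

# Simulated expression matrix: 40 samples x 5 genes, with gene 0
# up-regulated in the tumor class.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(20, 5))
tumor = rng.normal(0.0, 1.0, size=(20, 5))
tumor[:, 0] += 3.0
X = np.vstack([normal, tumor])
y = np.array([0] * 20 + [1] * 20)

def shrunken_centroids(X, y, delta):
    """Shrink each class centroid toward the overall centroid (soft threshold);
    genes whose shrunken offset hits zero drop out of the classifier."""
    overall = X.mean(axis=0)
    cents = {}
    for c in np.unique(y):
        d = X[y == c].mean(axis=0) - overall
        cents[c] = overall + np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    return cents

def classify(x, cents):
    return min(cents, key=lambda c: np.linalg.norm(x - cents[c]))

cents = shrunken_centroids(X, y, delta=0.5)
acc = np.mean([classify(x, cents) == y_i for x, y_i in zip(X, y)])
```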
Pattern analysis of total item score and item response of the Kessler Screening Scale for Psychological Distress (K6) in a nationally representative sample of US adults
https://peerj.com/articles/2987 (published 2017-02-09)
Shinichiro Tomitaka, Yohei Kawasaki, Kazuki Ide, Maiko Akutagawa, Hiroshi Yamada, Ono Yutaka, Toshiaki A. Furukawa
Background
Several recent studies have shown that total scores on depressive symptom measures in a general population approximate an exponential pattern except for the lower end of the distribution. Furthermore, we confirmed that the exponential pattern is present for the individual item responses on the Center for Epidemiologic Studies Depression Scale (CES-D). To confirm the reproducibility of such findings, we investigated the total score distribution and item responses of the Kessler Screening Scale for Psychological Distress (K6) in a nationally representative study.
Methods
Data were drawn from the National Survey of Midlife Development in the United States (MIDUS), which comprises four subsamples: (1) a national random digit dialing (RDD) sample, (2) oversamples from five metropolitan areas, (3) siblings of individuals from the RDD sample, and (4) a national RDD sample of twin pairs. K6 items are scored using a 5-point scale: “none of the time,” “a little of the time,” “some of the time,” “most of the time,” and “all of the time.” The patterns of the total score distribution and item responses were analyzed using graphical analysis and an exponential regression model.
Results
The total score distributions of the four subsamples exhibited an exponential pattern with similar rate parameters. The item responses of the K6 approximated a linear pattern from “a little of the time” to “all of the time” on log-normal scales, while “none of the time” response was not related to this exponential pattern.
Discussion
The total score distribution and item responses of the K6 showed exponential patterns, consistent with other depressive symptom scales.
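An exponential pattern in the total score distribution means log-counts fall on a straight line, so the rate parameter can be recovered with a log-linear fit; the histogram below is synthetic, not MIDUS data.

```python
import numpy as np

# Synthetic K6 total-score histogram: counts decay exponentially with score.
scores = np.arange(1, 11)
counts = np.round(5000 * np.exp(-0.35 * scores))

# Fit log(count) = intercept - rate * score by least squares; the rate is the
# parameter compared across the four MIDUS subsamples in the paper.
slope, intercept = np.polyfit(scores, np.log(counts), 1)
rate = -slope
```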
Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets
https://peerj.com/articles/2969 (published 2017-02-09)
Alex D. Washburne, Justin D. Silverman, Jonathan W. Leff, Dominic J. Bennett, John L. Darcy, Sayan Mukherjee, Noah Fierer, Lawrence A. David
Marker gene sequencing of microbial communities has generated big datasets of microbial relative abundances varying across environmental conditions, sample sites and treatments. These data often come with putative phylogenies, providing unique opportunities to investigate how shared evolutionary history affects microbial abundance patterns. Here, we present a method to identify the phylogenetic factors driving patterns in microbial community composition. We use the method, “phylofactorization,” to re-analyze datasets from the human body and soil microbial communities, demonstrating how phylofactorization is a dimensionality-reducing tool, an ordination-visualization tool, and an inferential tool for identifying edges in the phylogeny along which putative functional ecological traits may have arisen.
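The quantity phylofactorization screens across phylogeny edges is an isometric log-ratio (ILR) balance between the taxa on either side of an edge. A minimal sketch of that balance, computed on an invented relative-abundance table:

```python
import numpy as np

# Invented relative-abundance table: 3 samples (rows) x 4 taxa (columns).
X = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.40, 0.35, 0.15, 0.10],
    [0.10, 0.15, 0.45, 0.30],
])

def ilr_balance(X, group_r, group_s):
    """ILR balance contrasting the two taxon groups split by a phylogeny edge:
    a scaled log-ratio of geometric mean abundances."""
    r, s = len(group_r), len(group_s)
    gm_r = np.exp(np.log(X[:, group_r]).mean(axis=1))  # geometric means
    gm_s = np.exp(np.log(X[:, group_s]).mean(axis=1))
    return np.sqrt(r * s / (r + s)) * np.log(gm_r / gm_s)

# Balance for an edge separating taxa {0, 1} from taxa {2, 3};
# phylofactorization regresses such balances on sample metadata and
# iteratively picks the edge with the strongest association.
b = ilr_balance(X, [0, 1], [2, 3])
```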
The use of gene interaction networks to improve the identification of cancer driver genes
https://peerj.com/articles/2568 (published 2017-01-26)
Emilie Ramsahai, Kheston Walkins, Vrijesh Tripathi, Melford John
Bioinformaticians have implemented different strategies to distinguish cancer driver genes from passenger genes. One of the more recent advances uses a pathway-oriented approach. Methods that employ this strategy are highly dependent on the quality and size of the pathway interaction network employed, and require a powerful statistical environment for analyses. A number of genomic libraries are available in R. DriverNet and DawnRank employ pathway-based methods that use gene interaction graphs in matrix form. We investigated the benefit of combining data from three different sources on the prediction outcome of cancer driver genes by DriverNet and DawnRank. An enriched dataset was derived comprising 13,862 genes with 372,250 interactions, which increased prediction accuracy by 17% and 28%, respectively, compared to the original networks. The study identified 33 new candidate driver genes. Our study highlights the potential of combining networks and weighting edges to provide greater accuracy in the identification of cancer driver genes.
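The network-merging step can be sketched as a union of canonicalised undirected edge sets. The gene symbols below are real but the edges are toy examples, not the enriched 372,250-interaction network.

```python
# Toy interaction networks from three hypothetical sources.
net_a = {("TP53", "MDM2"), ("TP53", "ATM")}
net_b = {("MDM2", "TP53"), ("ATM", "CHEK2")}   # partially overlaps net_a
net_c = {("KRAS", "BRAF")}

def canonical(edge):
    return tuple(sorted(edge))  # treat A-B and B-A as the same undirected edge

# Union of all sources; duplicates across networks collapse to one edge.
merged = {canonical(e) for net in (net_a, net_b, net_c) for e in net}
genes = {g for e in merged for g in e}
```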
A critical issue in model-based inference for studying trait-based community assembly and a solution
https://peerj.com/articles/2885 (published 2017-01-12)
Cajo J.F. ter Braak, Pedro Peres-Neto, Stéphane Dray
Statistical testing of trait-environment association from data is a challenge as there is no common unit of observation: the trait is observed on species, the environment on sites and the mediating abundance on species-site combinations. A number of correlation-based methods, such as the community weighted trait means method (CWM), the fourth-corner correlation method and the multivariate method RLQ, have been proposed to estimate such trait-environment associations. In these methods, valid statistical testing proceeds by performing two separate resampling tests, one site-based and the other species-based and by assessing significance by the largest of the two p-values (the pmax test). Recently, regression-based methods using generalized linear models (GLM) have been proposed as a promising alternative with statistical inference via site-based resampling. We investigated the performance of this new approach along with approaches that mimicked the pmax test using GLM instead of fourth-corner. By simulation using models with additional random variation in the species response to the environment, the site-based resampling tests using GLM are shown to have severely inflated type I error, of up to 90%, when the nominal level is set at 5%. In addition, predictive modelling of such data using site-based cross-validation very often identified trait-environment interactions that had no predictive value. The problem that we identify is not an “omitted variable bias” problem as it occurs even when the additional random variation is independent of the observed trait and environment data. Instead, it is a problem of ignoring a random effect. In the same simulations, the GLM-based pmax test controlled the type I error in all models proposed so far in this context, but still gave slightly inflated error in more complex models that included both missing (but important) traits and missing (but important) environmental variables.
For screening the importance of single trait-environment combinations, the fourth-corner test is shown to give almost the same results as the GLM-based tests in far less computing time.
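A minimal sketch of the pmax idea: compute one association statistic, derive one p-value by permuting sites and another by permuting species, and take the larger. The simple cross-product statistic below is a stand-in for the fourth-corner correlation, and the data are simulated noise (so no real association exists).

```python
import numpy as np

rng = np.random.default_rng(1)
n_sites, n_species = 30, 20
env = rng.normal(size=n_sites)                    # environmental value per site
trait = rng.normal(size=n_species)                # trait value per species
L = rng.poisson(2.0, size=(n_sites, n_species))   # abundance table (sites x species)

def stat(L, env, trait):
    """Fourth-corner-style statistic: abundance-weighted cross-product."""
    return (env - env.mean()) @ L @ (trait - trait.mean())

obs = stat(L, env, trait)

def perm_p(axis, n_perm=999):
    hits = 1  # add-one correction
    for _ in range(n_perm):
        if axis == "sites":
            s = stat(L, rng.permutation(env), trait)   # site-based resampling
        else:
            s = stat(L, env, rng.permutation(trait))   # species-based resampling
        hits += abs(s) >= abs(obs)
    return hits / (n_perm + 1)

p_sites = perm_p("sites")
p_species = perm_p("species")
p_max = max(p_sites, p_species)  # reject only if BOTH resampling tests reject
```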
The fishing and natural mortality of large, piscivorous Bull Trout and Rainbow Trout in Kootenay Lake, British Columbia (2008–2013)
https://peerj.com/articles/2874 (published 2017-01-10)
Joseph L. Thorley, Greg F. Andrusak
Background
Estimates of fishing and natural mortality are important for understanding, and ultimately managing, commercial and recreational fisheries. High reward tags combined with fixed station acoustic telemetry provide a promising approach to monitoring mortality rates in large lake recreational fisheries. Kootenay Lake is a large lake which supports an important recreational fishery for large Bull Trout and Rainbow Trout.
Methods
Between 2008 and 2013, 88 large (≥500 mm) Bull Trout and 149 large (≥500 mm) Rainbow Trout were marked with an acoustic transmitter and/or high reward ($100) anchor tags in Kootenay Lake. The subsequent detections and angler recaptures were analysed using a Bayesian individual state-space Cormack–Jolly–Seber (CJS) survival model with indicator variable selection.
Results
The final CJS survival model estimated that the annual interval probability of being recaptured by an angler was 0.17 (95% CRI [0.11–0.23]) for Bull Trout and 0.14 (95% CRI [0.09–0.19]) for Rainbow Trout. The annual interval survival probability for Bull Trout was estimated to have declined from 0.91 (95% CRI [0.76–0.97]) in 2009 to just 0.46 (95% CRI [0.24–0.76]) in 2013. Rainbow Trout survival was most strongly affected by spawning. The annual interval survival probability was 0.77 (95% CRI [0.68–0.85]) for a non-spawning Rainbow Trout compared to 0.41 (95% CRI [0.30–0.53]) for a spawner. The probability of spawning increased with the fork length for both species and decreased over the course of the study for Rainbow Trout.
Discussion
Fishing mortality was relatively low and constant while natural mortality was relatively high and variable. The results indicate that angler effort is not the primary driver of short-term population fluctuations in Rainbow Trout abundance. Variation in the probability of Rainbow Trout spawning suggests that the spring escapement at the outflow of Trout Lake may be a less reliable index of abundance than previously assumed. Multi-species stock assessment models need to account for the fact that large Bull Trout are more abundant than large Rainbow Trout in Kootenay Lake.
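A stripped-down Cormack-Jolly-Seber log-likelihood with constant survival phi and detection probability p illustrates the model family; the Bayesian model in the paper adds year, species, length and spawning effects plus indicator variable selection, and the capture histories below are invented.

```python
import math

def cjs_loglik(histories, phi, p):
    """CJS log-likelihood, conditioning on each animal's first capture."""
    ll = 0.0
    for h in histories:                       # h is a tuple of 0/1 detections
        first = h.index(1)
        last = len(h) - 1 - h[::-1].index(1)
        for t in range(first + 1, last + 1):  # known alive between sightings
            ll += math.log(phi) + math.log(p if h[t] else 1 - p)
        chi = 1.0                             # prob of never being seen again
        for _ in range(len(h) - 1 - last):
            chi = (1 - phi) + phi * (1 - p) * chi
        ll += math.log(chi)
    return ll

histories = [(1, 1, 0, 1), (1, 0, 0, 0), (1, 1, 1, 1), (0, 1, 0, 1)]
ll = cjs_loglik(histories, phi=0.8, p=0.5)
```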
An extensive comparison of species-abundance distribution models
https://peerj.com/articles/2823 (published 2016-12-22)
Elita Baldridge, David J. Harris, Xiao Xiao, Ethan P. White
A number of different models have been proposed as descriptions of the species-abundance distribution (SAD). Most evaluations of these models use only one or two models, focus on only a single ecosystem or taxonomic group, or fail to use appropriate statistical methods. We use likelihood and AIC to compare the fit of four of the most widely used models to data on over 16,000 communities from a diverse array of taxonomic groups and ecosystems. Across all datasets combined the log-series, Poisson lognormal, and negative binomial all yield similar overall fits to the data. Therefore, when correcting for differences in the number of parameters, the log-series generally provides the best fit to data. Within individual datasets some other distributions performed nearly as well as the log-series even after correcting for the number of parameters. The Zipf distribution is generally a poor characterization of the SAD.
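The likelihood/AIC comparison can be sketched with stdlib Python. For simplicity this stand-in compares the log-series against a geometric distribution (not the Poisson lognormal or negative binomial used in the paper), on an invented abundance vector for a single community.

```python
import math

# Hypothetical abundance vector (individuals per species) for one community;
# the paper's analyses span >16,000 communities.
abund = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15, 25, 40]

def logseries_loglik(p, xs):
    c = -1.0 / math.log(1.0 - p)      # normalising constant of the log-series
    return sum(math.log(c * p ** k / k) for k in xs)

def fit_logseries(xs):
    """Crude 1-D maximum likelihood by grid search over the parameter p."""
    return max(((i / 10000, logseries_loglik(i / 10000, xs))
                for i in range(1, 10000)), key=lambda t: t[1])

def geometric_loglik(xs):
    q = len(xs) / sum(xs)             # MLE for a geometric on k = 1, 2, ...
    return sum(math.log(q) + (k - 1) * math.log(1 - q) for k in xs)

p_hat, ll_ls = fit_logseries(abund)
ll_geo = geometric_loglik(abund)
aic_ls = 2 * 1 - 2 * ll_ls            # both models have one free parameter
aic_geo = 2 * 1 - 2 * ll_geo          # lower AIC = better fit after penalty
```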
Simplified large African carnivore density estimators from track indices
https://peerj.com/articles/2662 (published 2016-12-22)
Christiaan W. Winterbach, Sam M. Ferreira, Paul J. Funston, Michael J. Somers
Background
The range, population size and trend of large carnivores are important parameters to assess their status globally and to plan conservation strategies. One can use linear models to assess population size and trends of large carnivores from track-based surveys on suitable substrates. The conventional approach of a linear model with an intercept may not pass through zero, but may fit the data better than a linear model through the origin. We assess whether a linear regression through the origin is more appropriate than a linear regression with intercept to model large African carnivore densities and track indices.
Methods
We did simple linear regression with intercept analysis and simple linear regression through the origin, and used the confidence interval for β in the linear model y = αx + β, the Standard Error of the Estimate, the Mean Square Residual and the Akaike Information Criterion to evaluate the models.
Results
The Lion on Clay and Low Density on Sand models with intercept were not significant (P > 0.05). The other four models with intercept and the six models through the origin were all significant (P < 0.05). The models using linear regression with intercept all included zero in the confidence interval for β, and the null hypothesis that β = 0 could not be rejected. All models showed that the linear model through the origin provided a better fit than the linear model with intercept, as indicated by the Standard Error of the Estimate and Mean Square Residuals. The Akaike Information Criterion showed that linear models through the origin were better and that none of the linear models with intercept had substantial support.
Discussion
Our results showed that linear regression through the origin is justified over the more typical linear regression with intercept for all models we tested. A general model can be used to estimate large carnivore densities from track densities across species and study areas. The formula observed track density = 3.26 × carnivore density can be used to estimate densities of large African carnivores using track counts on sandy substrates in areas where carnivore densities are 0.27 carnivores/100 km² or higher. To improve the current models, we need independent data to validate the models and data to test for a non-linear relationship between track indices and true density at low densities.
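The through-the-origin versus with-intercept comparison via AIC can be sketched as follows; the survey numbers are invented, seeded with the paper's 3.26 ratio, and the Gaussian AIC formula counts the error variance as a parameter.

```python
import numpy as np

# Invented survey data, roughly following the paper's fitted relationship
# (track density ~ 3.26 x carnivore density per 100 km^2).
dens = np.array([0.3, 0.8, 1.5, 2.2, 3.0, 4.1])
tracks = 3.26 * dens + np.array([0.2, -0.3, 0.4, -0.2, 0.3, -0.1])

def gaussian_aic(X, y):
    """OLS fit plus Gaussian AIC (k coefficients + 1 for the error variance)."""
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    n, k = len(y), X.shape[1]
    return n * np.log(float(rss[0]) / n) + 2 * (k + 1), beta

aic_origin, b_origin = gaussian_aic(dens[:, None], tracks)        # y = a*x
aic_intercept, b_int = gaussian_aic(
    np.column_stack([np.ones_like(dens), dens]), tracks)          # y = b + a*x
```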