Collinearity does not affect Procrustes analysis outputs: directions for plant and soil ecologists

Statistics Division (ESS), Food and Agriculture Organization of the United Nations, Rome, Italy
Ecological Sciences, James Hutton Institute, Aberdeen, Scotland - UK
Biomathematics & Statistics Scotland, Aberdeen, Scotland - UK
Soil Science, Universidade Federal Rural do Rio de Janeiro, Seropedica, Brazil
DOI
10.7287/peerj.preprints.3272v1
Subject Areas
Biodiversity, Ecology, Plant Science, Soil Science, Statistics
Keywords
Procrustes association metric, multivariate data, correlation, ANOVA, ecology
Copyright
© 2017 Lisboa et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Lisboa F, Mitchell R, Chapman SJ, Potts J, Berbara R. 2017. Collinearity does not affect Procrustes analysis outputs: directions for plant and soil ecologists. PeerJ Preprints 5:e3272v1

Abstract

Background. The Procrustean residual vector derived from Procrustes analysis, also called the Procrustean association metric (PAM), can be seen as a univariate summary of the relationship between two or more data tables. It gives ecologists a way to make multivariate relationships the central object of investigation in more familiar statistical approaches such as ANOVA and post hoc tests. However, several aspects need to be clarified before ecologists can confidently use Procrustes analysis beyond simple comparisons. We addressed two questions: 1) How does an increasing number of correlated columns within a data table affect the Procrustes results? 2) Can the PAM be used to detect how the correlation is partitioned across treatment levels within the original data table?

Methods. For question 1, two data tables, X and Y, from previous research were used. Four levels of correlation between variables (0.9, 0.7, 0.5, and 0.2) were imposed on an increasing number of variables (6, 9, 12, and 15) within the X data table to assess their effects on the Procrustes relationship and its significance. For question 2, two simulated data tables covering four hypothetical treatments (A, B, C, D) were created, varying the strength of the relationship between the tables for treatment A (0.2, 0.5, 0.7, 0.9), in order to assess how Procrustes analysis combines with multiple mean comparison methods.

Results. For the first question, increasing the number of correlated variables at each imposed correlation level (0.9, 0.7, 0.5, and 0.2) in the X data table, i.e. the table not subject to the Procrustean linear transformation (translation and rotation), had no effect either on the classical Procrustes outcomes describing the fit between data tables (the R statistic and its P value) or on the significance of the ANOVA that used the PAM, which summarizes the multivariate correlation between two data tables, as the response variable. For the second question, increasing the correlation between the X and Y data tables for the set of rows corresponding to a hypothetical treatment A produced PAMs that, when used in multiple mean comparisons, showed treatment A to be different from treatments B, C, and D, for which the relationship between X and Y did not exceed 0.1.

Discussion. Our results indicate that the Procrustes fit depends only on the information between data tables rather than on the information within a data table. We also showed that the PAM reflects differences in multivariate correlation between data tables, which can be useful for ecological questions addressing how the multivariate correlation is partitioned among categorical levels (e.g. plots, time, land use type).
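For readers less familiar with how the PAM is obtained, the following is a minimal Python sketch, not the authors' code. It assumes a symmetric Procrustes fit in which both tables are centred and scaled to unit sum of squares, that X and Y share the same rows and the same number of columns, and that significance is judged with a simple row-permutation test in the spirit of PROTEST. The function names `standardize`, `procrustes_pam`, and `protest_p_value` are hypothetical.

```python
import numpy as np


def standardize(table):
    """Centre each column and scale the whole table to unit sum of squares."""
    centred = np.asarray(table, dtype=float)
    centred = centred - centred.mean(axis=0)
    return centred / np.linalg.norm(centred)


def procrustes_pam(X, Y):
    """Fit Y to X by rotation (reflection allowed) and uniform scaling.

    Returns the Procrustes correlation r and the PAM, i.e. one residual
    (distance between a row of X and the matching row of fitted Y)
    per observation.
    """
    Xs, Ys = standardize(X), standardize(Y)
    # The optimal orthogonal matrix comes from the SVD of the cross-product.
    U, singular_values, Vt = np.linalg.svd(Ys.T @ Xs)
    rotation = U @ Vt
    scale = singular_values.sum()  # for standardized tables this equals r
    Y_fitted = scale * (Ys @ rotation)
    pam = np.linalg.norm(Xs - Y_fitted, axis=1)
    return scale, pam


def protest_p_value(X, Y, n_permutations=999, seed=0):
    """Permutation P value: how often shuffled rows fit at least as well."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    r_observed, _ = procrustes_pam(X, Y)
    exceed = 0
    for _ in range(n_permutations):
        r_perm, _ = procrustes_pam(X, Y[rng.permutation(len(Y))])
        if r_perm >= r_observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)


# Illustrative data only: Y is a noisy copy of X.
rng = np.random.default_rng(42)
X_demo = rng.standard_normal((30, 4))
Y_demo = X_demo + 0.5 * rng.standard_normal((30, 4))
r_demo, pam_demo = procrustes_pam(X_demo, Y_demo)
print("r:", r_demo, "PAM length:", pam_demo.size)
```

Under these assumptions the squared residuals sum to 1 - r², so the PAM simply distributes the overall lack of fit across rows. That per-row vector is what can then serve as the response variable in an ANOVA.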

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Code Question 1

Incorporating correlation structure into dataset.

DOI: 10.7287/peerj.preprints.3272v1/supp-1
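The supplemental code itself is not reproduced on this page. As a rough illustration of what imposing a correlation level on a growing number of columns can look like, the Python sketch below overwrites the first k columns of a table with simulated columns whose pairwise Pearson correlation equals a chosen value, while keeping each column's original mean and standard deviation. The Cholesky-based construction and the names `correlated_columns` and `impose_correlation` are assumptions made for illustration; the authors' procedure may differ.

```python
import numpy as np


def correlated_columns(n_rows, n_columns, rho, seed=0):
    """Simulate zero-mean columns with pairwise sample correlation rho.

    Requires n_rows > n_columns and a positive-definite target matrix
    (true for the levels 0.2 to 0.9 used in the abstract).
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_rows, n_columns))
    noise -= noise.mean(axis=0)
    # QR yields orthonormal, zero-mean columns (sample correlation = identity).
    q, _ = np.linalg.qr(noise)
    target = np.full((n_columns, n_columns), float(rho))
    np.fill_diagonal(target, 1.0)
    chol = np.linalg.cholesky(target)  # target = chol @ chol.T
    return q @ chol.T


def impose_correlation(table, n_columns, rho, seed=0):
    """Return a copy of table whose first n_columns are correlated at rho."""
    out = np.array(table, dtype=float)
    means = out[:, :n_columns].mean(axis=0)
    sds = out[:, :n_columns].std(axis=0)
    block = correlated_columns(out.shape[0], n_columns, rho, seed)
    # Rescaling and shifting columns leaves their correlations unchanged.
    out[:, :n_columns] = means + sds * block / block.std(axis=0)
    return out


# Illustrative stand-in for the X table (random numbers, not the raw data).
X_demo = np.random.default_rng(7).standard_normal((40, 15))
for k in (6, 9, 12, 15):
    X_k = impose_correlation(X_demo, k, rho=0.9)
    print(k, np.corrcoef(X_k[:, :k], rowvar=False)[0, 1])
```

Looping over the four correlation levels and the four column counts would give the 16 versions of X whose Procrustes fit against Y can then be compared.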

Code Question 2

Simulating data tables covering treatments (A, B, C, D), varying the strength of the relationship between the tables for treatment A.

DOI: 10.7287/peerj.preprints.3272v1/supp-2
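A simulation of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the supplemental code: each row of Y is built as `rho * X + sqrt(1 - rho**2) * noise`, with a high `rho` for treatment A and 0.1 for treatments B, C, and D; the sample size per treatment is deliberately large to keep the illustration stable; and the names `simulate_tables` and `pam_by_treatment` are hypothetical. The fit uses `scipy.spatial.procrustes`, and the ANOVA uses `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.spatial import procrustes
from scipy.stats import f_oneway


def simulate_tables(rho_a, rho_other=0.1, n_per_treatment=100,
                    n_columns=6, seed=0):
    """Simulate X and Y so that only treatment A rows are strongly related."""
    rng = np.random.default_rng(seed)
    treatments = np.repeat(["A", "B", "C", "D"], n_per_treatment)
    X = rng.standard_normal((treatments.size, n_columns))
    noise = rng.standard_normal(X.shape)
    # Row-wise strength of the X-Y relationship (assumed construction).
    rho = np.where(treatments == "A", rho_a, rho_other)[:, None]
    Y = rho * X + np.sqrt(1.0 - rho ** 2) * noise
    return X, Y, treatments


def pam_by_treatment(X, Y, treatments):
    """Fit Y to X over all rows, then split the residuals by treatment."""
    X_std, Y_fitted, _ = procrustes(X, Y)
    pam = np.linalg.norm(X_std - Y_fitted, axis=1)
    return {str(level): pam[treatments == level]
            for level in np.unique(treatments)}


X, Y, treatments = simulate_tables(rho_a=0.9)
groups = pam_by_treatment(X, Y, treatments)
f_stat, p_value = f_oneway(*groups.values())

for level, values in groups.items():
    print(level, "mean PAM:", values.mean())
print("ANOVA F:", f_stat, "P:", p_value)
```

Smaller residuals indicate a closer match between the tables, so in this setup treatment A is expected to show the lowest mean PAM. A post hoc comparison such as Tukey's HSD on the same groups would correspond to the multiple mean comparison step described in the abstract.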

Supplemental Figures

Supplemental Figures.

DOI: 10.7287/peerj.preprints.3272v1/supp-3

Raw Dataset

Raw dataset from http://www.sciencedirect.com/science/article/pii/S0038071714002570, used to address the first question of the paper: "How does the increasing number of correlated columns within an entire data table affect the Procrustes results?"

The X raw dataset is soil fertility; the Y raw dataset represents the soil microbial community (PLFA analysis).

DOI: 10.7287/peerj.preprints.3272v1/supp-4