Creating functional groups of marine fish from categorical traits

Monique A Ladds; Nokuthaba Sibanda; Richard Arnold; Matthew R Dunn

doi:10.7287/peerj.preprints.27148v1

Monique A Ladds¹, Nokuthaba Sibanda¹, Richard Arnold¹, Matthew R Dunn²

1 School of Mathematics and Statistics, Victoria University of Wellington, Kelburn, Wellington, New Zealand

2 Population Modelling Group, National Institute of Water and Atmospheric Research, Wellington, New Zealand

Published: 2018-08-28
Accepted: 2018-08-28

Subject Areas: Ecology, Marine Biology, Statistics, Data Mining and Machine Learning, Data Science
Keywords: clustering, traits, fish, missing data, stability, compactness, separation, connectedness, teleost, morphology

Copyright: © 2018 Ladds et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Ladds MA, Sibanda N, Arnold R, Dunn MR. 2018. Creating functional groups of marine fish from categorical traits. PeerJ Preprints 6:e27148v1 https://doi.org/10.7287/peerj.preprints.27148v1

Abstract

Background. Functional groups serve two important functions in ecology: they allow for simplification of ecosystem models and can aid in understanding diversity. Despite these important applications, there is no universally accepted method for defining them. A common approach is to cluster species on a set of traits and then validate the resulting groups by visual inspection, based primarily on expert opinion. The goal of this research is to determine a suitable procedure for creating and evaluating functional groups that arise from clustering nominal traits.

Methods. To do so, we produced a species-by-trait matrix of 22 traits for 116 fish species from Tasman Bay and Golden Bay, New Zealand. Data collected from photographs and published literature were predominantly nominal, and a small number of continuous traits were discretized. Some data were missing, so the benefit of imputation was assessed by applying four approaches to data with known missing values. Hierarchical clustering was used to search for underlying structure in the data that may represent functional groups. Within this clustering paradigm a number of distance matrices and linkage methods are available, and we tested several combinations of them. The resulting clusters were evaluated using internal metrics developed specifically for nominal clustering. This evaluation revealed that the choice of number of clusters, distance matrix and linkage method greatly affected the overall within- and between-cluster variability. We visualised the clustering in two dimensions and assessed the stability of clusters through bootstrapping.

Results. Imputation of missing data reached up to 90% accuracy with polytomous imputation, so this method was used to impute the real missing data. A division of the species into three functional groups was the most separated, compact and stable result. Increasing the number of clusters increased the inconsistency of group membership, and selection of the appropriate distance matrix and linkage method improved the fit.

Discussion. We show that the methodologies commonly used to create functional groups are fraught with subjectivity, ultimately causing significant variation in the composition of the resulting groups. The appropriate strategy for selecting the number of groups and the combination of distance matrix and clustering algorithm depends on the research goal.
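
Illustrative sketch (not part of the original abstract). The clustering step described in the Methods can be pictured with a minimal pure-Python example. It assumes one particular combination, a simple matching distance with average linkage, which is only one of the several combinations the abstract says were tested and is not claimed to be the authors' choice. The species labels, trait values and function names are all hypothetical and invented for illustration.

```python
from itertools import combinations


def simple_matching_distance(a, b):
    """Fraction of shared, non-missing nominal traits on which two species differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 1.0  # assumption: no comparable traits means maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def average_linkage_clusters(traits, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    names = list(traits)
    dist = {
        frozenset((p, q)): simple_matching_distance(traits[p], traits[q])
        for p, q in combinations(names, 2)
    }
    clusters = [[n] for n in names]

    def between(c1, c2):
        # Average linkage: mean pairwise distance between members of two clusters.
        total = sum(dist[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: between(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy species-by-trait table (body shape, habitat, diet).
# None marks a missing value; these values are invented, not from the study.
toy_traits = {
    "sp_A": ("fusiform", "pelagic", "piscivore"),
    "sp_B": ("fusiform", "pelagic", "planktivore"),
    "sp_C": ("flat", "demersal", "benthivore"),
    "sp_D": ("flat", "demersal", "piscivore"),
    "sp_E": ("elongate", "reef", "benthivore"),
    "sp_F": ("elongate", "reef", None),
}

groups = average_linkage_clusters(toy_traits, k=3)
for members in groups:
    print(members)
```

Swapping `simple_matching_distance` for another nominal distance, or replacing the mean in `between` with a minimum or maximum (single or complete linkage), changes which clusters merge first. That sensitivity to the distance and linkage choice is what the abstract reports.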

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Histograms of the raw data with the corresponding bar plot of the discretized variable for trophic level (a,b), maximum depth (m) (c,d), common maximum depth (m) (e,f) and length (cm) (g,h)

List of diet and life history traits to include in clustering

Missing data imputation methods

Distance matrix calculations

Alternative solution for evaluating reliability using the Gini coefficient