Creating functional groups of marine fish from categorical traits

School of Mathematics and Statistics, Victoria University of Wellington, Kelburn, Wellington, New Zealand
Population Modelling Group, National Institute of Water and Atmospheric Research, Wellington, New Zealand
DOI
10.7287/peerj.preprints.27148v1
Subject Areas
Ecology, Marine Biology, Statistics, Data Mining and Machine Learning, Data Science
Keywords
clustering, traits, fish, missing data, stability, compactness, separation, connectedness, teleost, morphology
Copyright
© 2018 Ladds et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Ladds MA, Sibanda N, Arnold R, Dunn MR. 2018. Creating functional groups of marine fish from categorical traits. PeerJ Preprints 6:e27148v1

Abstract

Background. Functional groups serve two important functions in ecology: they simplify ecosystem models and can aid in understanding diversity. Despite these applications, there is no universally accepted method for defining them. A common approach is to cluster species on a set of traits and to validate the resulting groups visually, relying primarily on expert opinion. The goal of this research is to determine a suitable procedure for creating and evaluating functional groups that arise from clustering nominal traits.

Methods. We produced a species-by-trait matrix of 22 traits for 116 fish species from Tasman Bay and Golden Bay, New Zealand. Data collected from photographs and published literature were predominantly nominal, and a small number of continuous traits were discretized. Some data were missing, so the benefit of imputing data was assessed using four approaches on data with known missing values. Hierarchical clustering was used to search for underlying structure in the data that may represent functional groups. Within this clustering paradigm a number of distance matrices and linkage methods are available, and we tested several combinations. The resulting clusters were evaluated using internal metrics developed specifically for nominal clustering. This evaluation revealed that the choice of number of clusters, distance matrix and linkage method greatly affected the overall within- and between-cluster variability. We visualised the clustering in two dimensions and assessed the stability of clusters through bootstrapping.
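As a rough illustration of the clustering step, the sketch below groups species described by nominal traits. It assumes a simple matching dissimilarity (the fraction of traits on which two species differ) and average linkage; these are only one of the many possible distance and linkage combinations, and not necessarily the ones tested in the study. The species names, trait values and function names are hypothetical.

```python
from itertools import combinations


def simple_matching_distance(a, b):
    """Fraction of nominal traits on which two species differ."""
    if len(a) != len(b):
        raise ValueError("trait vectors must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(traits, n_clusters):
    """Agglomerative clustering with average linkage.

    traits: dict mapping species name -> tuple of nominal trait values.
    Returns a list of sets of species names.
    """
    names = list(traits)
    # Pairwise dissimilarities between all species.
    dist = {
        frozenset((p, q)): simple_matching_distance(traits[p], traits[q])
        for p, q in combinations(names, 2)
    }
    clusters = [{name} for name in names]

    def linkage(c1, c2):
        # Average linkage: mean dissimilarity over all cross-cluster pairs.
        total = sum(dist[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical toy trait matrix (habitat, diet, body shape); not data from the study.
toy_traits = {
    "species_a": ("benthic", "carnivore", "elongate"),
    "species_b": ("benthic", "carnivore", "flat"),
    "species_c": ("pelagic", "planktivore", "fusiform"),
    "species_d": ("pelagic", "planktivore", "elongate"),
}
groups = average_linkage(toy_traits, n_clusters=2)
print(sorted(sorted(g) for g in groups))
```

Swapping the distance function or the linkage rule changes which clusters merge first, which is the sensitivity the evaluation step is meant to expose.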

Results. Polytomous imputation recovered missing values with up to 90% accuracy, so it was used to impute the real missing data. A division of the species information into three functional groups was the most separated, compact and stable result. Increasing the number of clusters increased the inconsistency of group membership, and selection of the appropriate distance matrix and linkage method improved the fit.
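One way to measure imputation accuracy of this kind is to hide values that are actually known, impute them, and count how many are recovered. The sketch below shows that idea under loud assumptions: it substitutes a simple most-frequent-category (mode) imputer for the polytomous imputation named above, and the trait column, masked positions and function names are all hypothetical.

```python
from collections import Counter


def mode_impute(column):
    """Replace None entries with the most frequent observed category.

    This is a stand-in baseline, not polytomous imputation.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


def imputation_accuracy(column, masked_idx):
    """Hide known values at masked_idx, impute, and return the fraction recovered."""
    hidden = set(masked_idx)
    masked = [None if i in hidden else v for i, v in enumerate(column)]
    imputed = mode_impute(masked)
    hits = sum(imputed[i] == column[i] for i in hidden)
    return hits / len(hidden)


# Hypothetical fully observed nominal trait; not data from the study.
habitat = ["benthic", "benthic", "pelagic", "benthic", "pelagic", "benthic"]
acc = imputation_accuracy(habitat, masked_idx=[0, 2])
print(acc)
```

Here the imputer recovers the masked "benthic" value but not the masked "pelagic" one. Repeating the masking over many random positions and comparing several imputers would give accuracy figures that can be compared across methods.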

Discussion. We show that the methodologies commonly used to create functional groups are fraught with subjectivity, ultimately causing significant variation in the composition of the resulting groups. The appropriate strategy for selecting the number of groups, distance matrix and clustering algorithm depends on the research goal.

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

List of diet and life history traits to include in clustering

DOI: 10.7287/peerj.preprints.27148v1/supp-1

Missing data imputation methods

DOI: 10.7287/peerj.preprints.27148v1/supp-2

Distance matrix calculations

DOI: 10.7287/peerj.preprints.27148v1/supp-3

Histograms of the raw data with the corresponding bar plot of the discretized variable for trophic level (a,b), maximum depth (m) (c,d), common maximum depth (m) (e,f) and length (cm) (g,h)

DOI: 10.7287/peerj.preprints.27148v1/supp-4

Alternative solution for evaluating reliability using the Gini coefficient

DOI: 10.7287/peerj.preprints.27148v1/supp-5