WhoseEgg: classification software for invasive carp eggs

Katherine Goode; Michael J. Weber; Philip M. Dixon

doi:10.7717/peerj.14787

Katherine Goode¹, Michael J. Weber², Philip M. Dixon¹

¹Department of Statistics, Iowa State University, Ames, Iowa, United States

²Natural Resource Ecology and Management, Iowa State University, Ames, Iowa, United States

Published: 2023-02-27
Accepted: 2023-01-03
Received: 2022-10-20

Academic Editor: Eric Ward

Subject Areas: Aquaculture, Fisheries and Fish Science, Bioinformatics, Zoology, Freshwater Biology, Natural Resource Management
Keywords: Bigheaded carp, Invasive species, Machine learning, Morphometrics, R Shiny, Random forests, Reproduction

Copyright: © 2023 Goode et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

Cite this article: Goode K, Weber MJ, Dixon PM. 2023. WhoseEgg: classification software for invasive carp eggs. PeerJ 11:e14787 https://doi.org/10.7717/peerj.14787

The authors have chosen to make the review history of this article public.

Abstract

The collection of fish eggs is a commonly used technique for monitoring invasive carp. Genetic identification is the most trusted method for identifying fish eggs but is expensive and slow. Recent work suggests random forest models could provide an inexpensive method for identifying invasive carp eggs based on morphometric egg characteristics. While random forests provide accurate predictions, they do not produce a simple formula for obtaining new predictions. Instead, users must know the R programming language, which limits who can use the random forests for resource management. We present WhoseEgg: a web-based point-and-click application that allows non-R users to access random forests and rapidly identify fish eggs, with the objective of detecting invasive carp (Bighead, Grass, and Silver Carp) in the Upper Mississippi River basin. This article provides an overview of WhoseEgg, an example application, and future research directions.

Introduction

The collection and identification of fish eggs is a common practice for monitoring invasive aquatic species. By collecting eggs, it is possible to relate environmental conditions to the timing of reproduction, estimate spawning locations, identify potential recruitment bottlenecks between early life stages, and estimate adult spawning biomass (Leggett & Deblois, 1994; Takasuka, Yoneda & Oozeki, 2019; Camacho et al., In press). As a result, egg identification is a useful tool for understanding the reproductive processes of invasive species, which helps to monitor the range of the species and inform management decisions (MICRA, 2017).

Egg collection is one method used to monitor the spread of invasive Grass Carp (Ctenopharyngodon idella), Silver Carp (Hypophthalmichthys molitrix), and Bighead Carp (H. nobilis; Deters, Chapman & McElroy, 2013; Coulter et al., 2016; Embke et al., 2016). Hereafter, Grass Carp, Silver Carp, and Bighead Carp are collectively referred to as “invasive carp”. Invasive carp were introduced to the United States in the 1960s (Freeze & Henderson, 1982; Wittmann et al., 2014) and have spread throughout the Mississippi River basin via natural and anthropogenic means (Chick & Pegg, 2001; Hinterthuer, 2012). Invasive carp alter food webs and fish communities through alterations in nutrient cycling (Collins & Wahl, 2017), reductions in plankton resources, competition with native planktivorous fishes, and reductions in native fish recruitment (Irons et al., 2007; Chick et al., 2020; Tillotson, Weber & Pierce, 2022). Identifying when and where invasive carp are reproducing could inform management efforts seeking to limit their spread into new habitats (e.g., installation of deterrents).

Taxonomic keys are available to identify fish larvae (e.g., Auer, 1982). Fish eggs can also have distinguishing morphological features but are much more difficult to visually identify (Kelso, Kaller & Rutherford, 2012), making the visual identification of invasive carp eggs unreliable (USGS, 2014; Larson et al., 2016). The visual identification of invasive carp eggs is challenging for two reasons. First, morphological features of invasive carp eggs overlap with native species (Chapman, 2006; George & Chapman, 2015; Camacho et al., 2019). Second, invasive carp egg characteristics are not only plastic between their native and invaded regions (Mack et al., 2000; Peterson & Vieglais, 2001; Lenaerts et al., 2015) but can also vary within invaded areas (Lenaerts et al., In press). Currently, genetic identification is the most accurate method for identifying fish eggs (Becker et al., 2015; Coulter et al., 2016; Embke et al., 2016) but it is time intensive and expensive, limiting the number of eggs that can be processed. Consequently, eDNA techniques are being developed to determine if target species are present within a sample (e.g., Fritts et al., 2018), but additional work is needed to identify large numbers of individual fish eggs more easily, quickly, and inexpensively.

Camacho et al. (2019) provide one solution by using random forest machine learning models (Breiman, 2001) to predict the family, genus, and species levels of fish eggs collected in pools 17–20 of the Upper Mississippi River basin during 2014 and 2015. Genetic identifications of eggs were used as response variables in the models (one model for each taxonomic level) with invasive carp treated as one prediction class within each taxonomic level. The predictor variables were 17 egg and environmental characteristics associated with egg collection. Random forests predicted at least 97% of invasive carp eggs correctly at all three taxonomic levels from out-of-bag samples. Goode et al. (In press) validated the models from Camacho et al. (2019) using a set of eggs collected from a third year (2016) across a larger geographic area (pools 14–20). The validation revealed that the models from Camacho et al. (2019) predicted at least 89% of invasive carp eggs correctly. Additionally, Goode et al. (In press) trained new random forests using the same structure as the models in Camacho et al. (2019) but with all 3 years of egg data. These models returned predictive accuracies for invasive carp between 96% and 98% on the out-of-bag samples. Performance of the models on the validation data suggests random forests can be a useful tool for identifying invasive carp eggs.

Random forests are a relatively new type of model (Breiman, 2001) that has been applied to a wide range of ecological questions (e.g., Cutler et al., 2007; Evans & Cushman, 2009; Darling et al., 2012). Random forests are desirable because they often produce more accurate predictions than more traditional statistical approaches (e.g., logistic regression; Cutler et al., 2007). Yet, a downside of random forests is that the algorithm is too complicated to be written as a practical predictive equation for new observations. Instead, a saved version of the model must be accessed directly to obtain predictions. Camacho et al. (2019) and Goode et al. (In press) trained their random forests using the R statistical programming language (R Core Team, 2021). As a result, anyone interested in using the random forests from either Camacho et al. (2019) or Goode et al. (In press) must be familiar with R, which limits who can access and apply the models for the identification of invasive carp eggs. While random forests could be used as an inexpensive tool to classify invasive carp eggs, there is a need to make the models more broadly available.

We developed the online application WhoseEgg to allow users unfamiliar with the R programming language to use random forest models to classify invasive carp eggs in the Upper Mississippi River basin. Users can upload their own fish egg characteristics and compute family, genus, and species taxonomic level predictions using random forests based on those from Camacho et al. (2019) and Goode et al. (In press). This article introduces and provides an overview of the capabilities of WhoseEgg. In particular, the article (1) describes how WhoseEgg is accessed and structured, (2) provides details about the training data and random forests used by WhoseEgg, (3) describes the processes for measuring egg characteristics, (4) includes an example demonstrating WhoseEgg, (5) discusses the limitations of WhoseEgg, and (6) suggests directions for future work.

App access and architecture

WhoseEgg is free and available online at https://whoseegg.stat.iastate.edu/. The app is accessible from any device with a web browser but was developed to perform best when used on a laptop or desktop computer. The app was built using R code (R Core Team, 2021) and the R package Shiny (Chang et al., 2021). WhoseEgg is hosted on an R server that allows the app to connect to R to perform the necessary computations. Data uploaded to WhoseEgg will not be saved or redistributed in any manner to protect the privacy of users’ data. The code, random forests, and training data associated with WhoseEgg are available on GitHub (https://github.com/goodekat/WhoseEgg) and in the Iowa State University digital repository (https://doi.org/10.25380/iastate.15046578; version 1.0.0).

WhoseEgg is divided into six pages listed in the top panel of its ‘Home’ page (Fig. 1). The pages are organized so that users begin at the ‘Home’ page and progress left to right through the other pages. The ‘Home’ page contains information to familiarize users with WhoseEgg, including its purpose and instructions. The ‘Home’ page also includes details about the collection locations and species in the training data to help users determine whether the models in WhoseEgg are appropriate for their data.

Figure 1: Homepage of WhoseEgg.
The homepage contains a description of the app and instructions on how to use the app to obtain fish egg taxonomic predictions.

DOI: 10.7717/peerj.14787/fig-1

The ‘Data Input’, ‘Predictions’, and ‘Downloads’ pages contain interactive tools that allow users to provide their own data and acquire predictions. Each of these pages is divided into two panels: an instruction panel and a main panel (e.g., Fig. 2). The main panels contain additional information and interactive features to assist users. The flowchart included on the WhoseEgg ‘Home’ page describes the steps to obtain predictions (Fig. 1).

Figure 2: ‘Data Input’ page.
The page is shown as it appears after the example spreadsheet of fish egg characteristics was uploaded.

DOI: 10.7717/peerj.14787/fig-2

Data Input: The user uploads a spreadsheet with the necessary egg characteristics (Table 1) via the ‘Data Input’ page. The spreadsheet must be an Excel or csv file and formatted appropriately. The ‘Data Input’ page describes the necessary spreadsheet format and provides a downloadable Excel template (included in the Supplemental Material). The template has data validation helpers to further assist users with formatting (Figs. 3A and 3B). Additionally, informative errors and warnings appear in WhoseEgg if the uploaded data are not formatted correctly.
Predictions: The user obtains predictions from the WhoseEgg random forests for the uploaded egg data on the ‘Predictions’ page.
Downloads: The user downloads a spreadsheet from the ‘Downloads’ page containing the uploaded data, additional egg characteristics computed by WhoseEgg, and the random forest predictions.

Table 1:

WhoseEgg random forest predictor variables.

The table contains the WhoseEgg random forest predictor variables with definitions and training data means (and standard deviations) or levels (and proportion of eggs per level).

Variable	Definition	Mean (standard deviation) or levels (proportion)
Compact or diffuse	Whether the embryo is compact or diffuse	Compact (0.87); Diffuse (0.13)
Conductivity (µS/cm)	Conductivity of the water at the time of collection	462.21 (103.03)
Deflated membrane	Whether the membrane is deflated or not	Yes (0.59); No (0.41)
Egg stage	Stage of the egg when collected (based on Kelso & Rutherford (1996))	1 (0.15); 2 (0.01); 3 (0.09); 4 (0.22); 5 (0.10); 6 (0.12); 7 (0.10); 8 (0.08); Broken (<0.01); Diffuse (0.13)
Embryo diameter average (mm)	Average of four measurements of the embryo diameter	1.36 (0.42)
Embryo diameter coefficient of variation	Coefficient of variation of four measurements of the embryo diameter	0.1 (0.08)
Embryo diameter standard deviation (mm)	Standard deviation of four measurements of the embryo diameter	0.14 (0.14)
Embryo to membrane ratio	Ratio of the embryo diameter average to the membrane diameter average	0.67 (0.2)
Julian day	Julian day when the egg was collected	167.99 (27.21)
Larval length (mm)	Length along the midline for all eggs in stages 6–8 (otherwise set to 0)	0.49 (1.12)
Membrane diameter average (mm)	Average of four measurements of the membrane diameter	2.27 (1.04)
Membrane diameter coefficient of variation	Coefficient of variation of four measurements of the membrane diameter	0.07 (0.07)
Membrane diameter standard deviation (mm)	Standard deviation of four measurements of the membrane diameter	0.17 (0.17)
Month	Month when the egg was collected	5.85 (0.98)
Pigment presence	Whether there is pigment present on the egg	Yes (0.29); No (0.71)
Sticky debris	Whether there is debris on the egg	Yes (0.23); No (0.77)
Temperature (°C)	Temperature of the water when the egg was collected	23.39 (2.95)

DOI: 10.7717/peerj.14787/table-1

Figure 3: Examples of the validation helpers in the spreadsheet template that help users correctly format the egg characteristic data.
(A) When a column is selected, a description of the variable and necessary format appears. (B) If an observation is entered incorrectly or falls outside of the range of the WhoseEgg training data, an error/warning appears.

DOI: 10.7717/peerj.14787/fig-3

The ‘Help’ and ‘References’ pages are designed to be accessed at any time. The ‘Help’ page contains the details of how to measure the egg characteristics, an overview of random forests, and answers to frequently asked questions. The ‘References’ page lists citations mentioned throughout the app.

Training data and random forests

WhoseEgg uses three random forests to separately predict the family, genus, and species of a fish egg based on egg characteristics. We trained the random forests using a compilation of the training data from Camacho et al. (2019; 734 and 541 fish eggs from 2014 and 2015, respectively) and the validation data from Goode et al. (In press; 703 fish eggs from 2016). The eggs in both studies were sampled from locations in the Upper Mississippi River basin (Fig. 4). The data collection was approved by the Iowa State University Institutional Animal Care and Use Committee Protocol (7-13-7599-I), and the Iowa DNR gave permission for field sampling (SC1037). The data sets contain genetic identifications and egg characteristics. See Camacho et al. (2019) and Goode et al. (In press) for additional details about the egg sampling, subsampling, genetic identification, and egg characteristic measurement procedures. We identified 29 eggs from 2016 with incorrect data entries. We were able to correct 23 of the observations and removed six of the eggs. Thus, the WhoseEgg random forests were trained on a total of 1,972 eggs but provide comparable estimates to the Goode et al. (In press) combined validation models (analysis and results included in the Supplemental Material).

Figure 4: Upper Mississippi River and tributary rivers in Iowa and Illinois, USA where eggs were collected.
The symbols indicate the year(s) of collection: 2014–2015 (plus), 2016 (star), or 2014–2016 (diamond). Map of sampling locations acquired from Goode et al. (In press).

DOI: 10.7717/peerj.14787/fig-4

The predictor variables in the WhoseEgg random forests were the same 17 egg and environmental characteristics (Table 1) used in the random forests from Camacho et al. (2019) and Goode et al. (In press). The response variables were the genetically identified family, genus, and species levels of the eggs. For all three taxonomic levels, Grass Carp, Silver Carp, and Bighead Carp were grouped into the category of “invasive carp” due to similar egg characteristics among these species. The training data contained other species in the same family as invasive carp (Cyprinidae), so the family of Cyprinidae excluding invasive carp was treated as a separate category in the family level response variable. The distribution of eggs per species was imbalanced with invasive carp, Freshwater Drum (Aplodinotus grunniens), and Emerald Shiner (Notropis atherinoides) comprising most of the eggs in the training data (Table 2).

Table 2:

Training data taxonomic levels.

The table includes the taxonomic levels and number of eggs per species in the training data collected from pools 14–20 of the Upper Mississippi River during 2014–2016. The eggs with a label of “species unidentified” were eggs where the genetic analysis was able to identify a genus but not a species.

Family	Genus	Common name (species)	Number of eggs in training data
Catostomidae	Carpiodes	Carpsuckers species unidentified	1
		Quillback (cyprinus)	1
		River Carpsucker (carpio)	8
	Ictiobus	Bigmouth Buffalo (cyprinellus)	7
		Black Buffalo (niger)	1
		Buffalo species unidentified	10
		Smallmouth Buffalo (bubalus)	2
Clupeidae	Alosa	Skipjack Shad (chrysochloris)	1
Clupeidae	Dorosoma	Gizzard Shad (cepedianum)	2
Cyprinidae	Cyprinella	Spotfin Shiner (spiloptera)	6
	Luxilus	Common Shiner (cornutus)	1
	Macrhybopsis	Silver Chub (storeriana)	36
	Macrhybopsis	Speckled Chub (aestivalis)	28
	Notropis	Channel Shiner (wickliffi)	32
		Emerald Shiner (atherinoides)	201
		River Shiner (blennius)	16
		Sand Shiner (stramineus)	1
		Shiner species unidentified	69
	Pimephales	Fathead Minnow (promelas)	5
Hiodontidae	Hiodon	Goldeye (alosoides)	7
Invasive Carp	Invasive Carp	Invasive Carp	782
Moronidae	Morone	Striped Bass (saxatilis)	17
Moronidae	Morone	White Bass (chrysops)	1
Percidae	Etheostoma	Banded Darter (zonale)	1
	Percina	Common Logperch (caprodes)	1
	Sander	Walleye (vitreus)	2
Sciaenidae	Aplodinotus	Freshwater Drum (grunniens)	733

DOI: 10.7717/peerj.14787/table-2

We trained the WhoseEgg random forests using the randomForest R package (Liaw & Wiener, 2002). Each model was trained with 1,000 trees. The other tuning parameters were set to the default values in randomForest. Parameters were specified to be consistent with Camacho et al. (2019) and Goode et al. (In press). WhoseEgg returns several values from the random forests for each egg observation in the uploaded data:

Random forest probabilities: Random forests are ensembles of many trees (1,000 in the case of WhoseEgg), and each tree returns a prediction. The proportion of trees that return a prediction of a particular level (on out-of-bag observations) is interpreted as the probability that a randomly generated tree (under the conditions used by a random forest) will predict an observation to be in a specific level (Cutler et al., 2007). WhoseEgg returns the random forest probability for each level within the family, genus, and species levels contained in the training data.
Random forest prediction: The response variable level with the highest random forest probability is considered the random forest prediction. WhoseEgg returns the random forest prediction for the family, genus, and species levels.
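
The two quantities above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the app's R code; the function names and the 10-tree vote vector are hypothetical, whereas WhoseEgg uses 1,000 trees per forest.

```python
from collections import Counter


def forest_probabilities(tree_votes):
    """Return the proportion of trees that voted for each class."""
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return {label: count / n_trees for label, count in counts.items()}


def forest_prediction(probabilities):
    """Return the class with the highest random forest probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical votes from a 10-tree forest for a single egg.
votes = ["Invasive Carp"] * 6 + ["Freshwater Drum"] * 3 + ["Speckled Chub"]
probs = forest_probabilities(votes)
pred = forest_prediction(probs)
```

In this toy example, six of the ten trees vote for invasive carp, so the random forest probability for invasive carp is 0.6 and invasive carp is the random forest prediction.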

Egg characteristic collection

WhoseEgg requires users to collect and provide 15 variables associated with an egg: the year and day of egg collection and all variables listed in Table 1 except for Julian day, embryo diameter coefficient of variation, membrane diameter coefficient of variation, and the embryo to membrane ratio. WhoseEgg internally computes the four excluded variables from the provided variables.
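
A minimal sketch of how the four computed variables could be derived from the uploaded ones is given below. It is written in Python for illustration and is not the app's R code. The field names are hypothetical, Julian day is assumed to mean the day of the year, and the coefficient of variation is assumed to be the standard deviation divided by the average.

```python
from datetime import date


def derive_predictors(egg):
    """Compute the four derived predictors for one egg record (a dict)."""
    # Julian day is assumed here to be the day of the year (1-366).
    julian_day = date(egg["Year"], egg["Month"], egg["Day"]).timetuple().tm_yday
    return {
        "Julian_Day": julian_day,
        # Coefficient of variation assumed to be SD / mean.
        "Embryo_CV": egg["Embryo_SD"] / egg["Embryo_Ave"],
        "Membrane_CV": egg["Membrane_SD"] / egg["Membrane_Ave"],
        # Ratio of the embryo diameter average to the membrane diameter average.
        "Embryo_to_Membrane_Ratio": egg["Embryo_Ave"] / egg["Membrane_Ave"],
    }


# Hypothetical egg collected on 15 June 2016.
egg = {
    "Year": 2016, "Month": 6, "Day": 15,
    "Embryo_Ave": 1.5, "Embryo_SD": 0.15,
    "Membrane_Ave": 3.0, "Membrane_SD": 0.3,
}
derived = derive_predictors(egg)
```

The year is needed alongside the month and day because leap years shift the Julian day of every date after February.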

The WhoseEgg ‘Help’ page contains detailed descriptions of these variables to assist the user in the data collection process. For each egg characteristic, the page provides a definition of the variable including units, the name of the variable and required format for the WhoseEgg spreadsheet, and whether the variable is required for upload or computed after upload (Fig. 5A). Additional information is provided with some of the variables. For example, the range of the variable in the training data is provided for continuous variables to allow users to determine if their observations fall within the training data range, and example photos are provided to help with measurements that require additional measurement details (Fig. 5B).

The environmental variables of temperature and conductivity should be collected at the time of egg collection. The egg morphological variables are best measured in a laboratory. The morphological variables in the WhoseEgg training data were collected by first placing an egg in a petri dish with just enough ethanol to cover the egg and help hold it stationary. Then a photograph was taken of the egg using an Olympus SZX7 microscope (Image Pro 7.0 software; Media Cybernetics, Bethesda, MD, USA) at two times magnification. Camacho et al. (2019) state that, “For eggs with an embryo, the pictures were taken in the dorsal, ventral, and lateral positions in relation to the embryo. If an embryo was not identifiable, a picture was taken after a quarter rotation of the egg on its y-axis, x-axis, and again on its y-axis.” The quantitative measurements (e.g., larval length and membrane diameter average) were obtained using Image Pro software. The qualitative measurements (e.g., pigment presence and egg stage) were determined visually based on the criteria described on the WhoseEgg ‘Help’ page. Users with questions about variable collection are encouraged to reach out to the app developers for clarifications.

Example

Here, we present an example using WhoseEgg to obtain predictions on a set of fish eggs from the WhoseEgg training data collected in 2016 at the mouth of the Iowa River in Pool 18 of the Upper Mississippi River. This location was selected since it is an area that has been actively monitored to observe invasive carp reproduction along the invasion front (Camacho et al., In press). The data set contains the egg characteristics measured on 215 fish eggs.

Data Input Page. We first upload the spreadsheet ‘example-data.csv’ (available in the Supplemental Material) containing the egg characteristics (gold short-dashed box in Fig. 2). The spreadsheet has the same format as the downloadable template (red solid box in Fig. 2) except that we added variables for site and river of collection. If the uploaded spreadsheet has no formatting errors, WhoseEgg prints two interactive tables under ‘Egg Characteristics’ (blue long-dashed box in Fig. 2). The table in the ‘Input Data’ tab contains the uploaded data (Fig. 2) and the table in the ‘Processed Data’ tab contains a dataset created by WhoseEgg with the predictor variables. Note that the variables of Year and Day are included in the input data table but will not be in the processed data table since they are not used as predictor variables by the models. The variable of Julian Day, however, will be computed based on these variables and added to the processed data. The user can filter and sort the data in the tables as desired.

Predictions Page. After uploading the data, we navigate to the ‘Predictions’ page. We click on the ‘Get Predictions’ button (solid red box in Fig. 6) and a ‘Table of Predictions’ (gold short-dashed box in Fig. 6) and ‘Visualizations of Predictions’ (blue long-dashed box in Fig. 6) appear. The table contains seven variables. The first variable is the Egg ID provided in the uploaded spreadsheet. The remainder of the variables are the random forest predictions (variables ending with ‘Pred’) and corresponding random forest probabilities (variables ending with ‘Prob’) for the family, genus, and species.

The plots on the ‘Summary of Predictions’ tab summarize the predictions of all observations in the uploaded data. A bar chart is created for each taxonomic level showing the levels included in the random forest predictions and the number of predictions per level (blue long-dashed box in Fig. 6). In our example data, most predictions fall in the family of Sciaenidae, the genus of Aplodinotus, and the species of Freshwater Drum. The second most frequent category in each taxonomic level is invasive carp. Note the number of invasive carp predictions varies from 55 at both the family and genus levels to 57 at the species level. Additional work outside of WhoseEgg could be done after the predictions are downloaded to investigate which observations are predicted differently across the taxonomies by the random forests. This could provide insight into why the random forests made different predictions for these observations.
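
One way such a follow-up could be carried out on the downloaded predictions is sketched below in Python. This is an illustration rather than part of WhoseEgg; the column names (`Egg_ID`, `Family_Pred`, `Genus_Pred`, `Species_Pred`) and the example rows are hypothetical.

```python
LEVELS = ("Family_Pred", "Genus_Pred", "Species_Pred")


def inconsistent_carp_predictions(rows, target="Invasive Carp"):
    """Return IDs of eggs predicted as the target at some, but not all, levels."""
    flagged = []
    for row in rows:
        is_target = [row[level] == target for level in LEVELS]
        if any(is_target) and not all(is_target):
            flagged.append(row["Egg_ID"])
    return flagged


# Hypothetical downloaded predictions for three eggs.
rows = [
    {"Egg_ID": 1, "Family_Pred": "Invasive Carp",
     "Genus_Pred": "Invasive Carp", "Species_Pred": "Invasive Carp"},
    {"Egg_ID": 2, "Family_Pred": "Sciaenidae",
     "Genus_Pred": "Aplodinotus", "Species_Pred": "Invasive Carp"},
    {"Egg_ID": 3, "Family_Pred": "Sciaenidae",
     "Genus_Pred": "Aplodinotus", "Species_Pred": "Freshwater Drum"},
]
flagged_ids = inconsistent_carp_predictions(rows)
```

Eggs returned by such a check are natural candidates for a closer look at their egg characteristics and random forest probabilities.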

The ‘Individual Egg Predictions’ tab allows users to select an egg of interest by clicking on a row in the ‘Table of Predictions’ (Fig. 7A). Then bar charts are created showing the random forest probabilities corresponding to the selected egg for all categories within family, genus, and species (Fig. 7B). In our example, 56 eggs are predicted to have a species of invasive carp. After the predictions were downloaded, it was found that 47 (84%) of the eggs have a random forest probability greater than 80% for invasive carp. While the model mostly returns invasive carp predictions with high random forest probabilities, it may be of interest to further explore the eggs with lower random forest probabilities. Here, we consider the egg with the lowest random forest probability for invasive carp out of the eggs with a random forest species prediction of invasive carp (Fig. 7A; egg 77). Only 37% of the trees voted for egg 77 to be an invasive carp. Since this is a low percentage, we are interested in knowing what other species received votes from the random forest trees. The bar chart of random forest species probabilities for egg 77 shows that approximately 27% of the trees in the species random forest voted for Speckled Chub (Macrhybopsis aestivalis) and approximately 17% of trees in the species random forest voted for Freshwater Drum.

Figure 7: ‘Individual Egg Predictions’ visualizations.
(A) The ‘Table of Predictions’ from the example data was filtered to only include rows with at least one prediction of invasive carp and sorted from lowest to highest species probability. The egg with the lowest random forest probability of invasive carp (the first row) was selected. (B) The ‘Individual Egg Predictions’ tab then generates bar charts of the random forest probabilities for all categories within a taxonomic level corresponding to the egg selected.

DOI: 10.7717/peerj.14787/fig-7

Downloads Page. With the predictions obtained, we move to the ‘Downloads’ page. We first click on the ‘Preview Data’ button (solid red box in Fig. 8) to preview the spreadsheet available for download. The table is too wide to show all the columns at once, but a horizontal scrolling option allows the user to see all columns. The table includes all initial variables uploaded to WhoseEgg, additional predictor variables computed by WhoseEgg, and random forest predictions and probabilities for all categories within the three taxonomic levels. We then select a file type (xlsx, xls, or csv) for download (gold short-dashed box in Fig. 8) and click the ‘Download Predictions’ button (blue long-dashed box in Fig. 8).

Figure 9: Violin plots of species random forest probabilities for invasive carp.
The plots are separated by the species random forest predictions.

DOI: 10.7717/peerj.14787/fig-9

After Download. The downloaded results may be used for further investigation. For example, we explore the relationship between the invasive carp random forest probabilities and predictions for the species model. We create a violin plot of the species predictions vs. the random forest probabilities that an egg is an invasive carp (Fig. 9). Most of the eggs predicted to be invasive carp have high probabilities (above 0.8) of being invasive carp and most of the eggs predicted to be a species other than invasive carp have low invasive carp random forest probabilities (below 0.25). We could elect to genetically identify a handful of eggs predicted to be invasive carp with low probabilities, which would allow us to confirm or correct those predictions. The results from WhoseEgg suggest there are invasive carp eggs present at the mouth of the Iowa River and help us to identify a possible subset of eggs for genetic identification if deemed necessary.
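
The selection of candidate eggs for genetic identification could be sketched as follows. The code is an illustration in Python and not part of WhoseEgg. The column names, the example rows, and the 0.8 cutoff used as the default are assumptions made for the sketch.

```python
def summarize_carp_confidence(rows, threshold=0.8, target="Invasive Carp"):
    """Summarize species-level predictions of the target class.

    Returns the share of target predictions with probability above the
    threshold and the IDs of the remaining target predictions, ordered
    from lowest to highest probability.
    """
    carp = [r for r in rows if r["Species_Pred"] == target]
    confident = [r for r in carp if r["Species_Prob"] > threshold]
    uncertain = sorted(
        (r for r in carp if r["Species_Prob"] <= threshold),
        key=lambda r: r["Species_Prob"],
    )
    share = len(confident) / len(carp) if carp else None
    return share, [r["Egg_ID"] for r in uncertain]


# Hypothetical downloaded predictions.
rows = [
    {"Egg_ID": 1, "Species_Pred": "Invasive Carp", "Species_Prob": 0.95},
    {"Egg_ID": 2, "Species_Pred": "Invasive Carp", "Species_Prob": 0.37},
    {"Egg_ID": 3, "Species_Pred": "Freshwater Drum", "Species_Prob": 0.88},
    {"Egg_ID": 4, "Species_Pred": "Invasive Carp", "Species_Prob": 0.62},
    {"Egg_ID": 5, "Species_Pred": "Invasive Carp", "Species_Prob": 0.91},
]
share, candidates = summarize_carp_confidence(rows)
```

Ordering the uncertain eggs by probability puts the least certain predictions first, which is convenient when only a few eggs can be sent for genetic identification.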

Limitations and user responsibility

WhoseEgg is a powerful tool for classifying fish eggs without the need to obtain costly genetic identifications, but as with all models, there are assumptions and limitations users must be aware of. First, the random forests used by WhoseEgg at the time of writing this article were trained and validated on data from the Upper Mississippi River basin. Due to possible variation in fish egg characteristics across regions, additional validation is required to know how the models will perform in other regions. One option for users in other regions is to perform their own validation similar to the one in Goode et al. (In press) by first applying WhoseEgg to genetically identified eggs from the region of interest. If the models perform well on invasive carp, it provides evidence that WhoseEgg will return trustworthy predictions on future eggs from that region. If the models do not perform well, new random forest models could be developed using similar approaches as Camacho et al. (2019) and Goode et al. (In press) to identify fish eggs in different regions or across multiple regions.
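
A regional validation of this kind reduces to comparing WhoseEgg predictions with genetic identifications. The sketch below, written in Python for illustration, computes the proportion of genetically identified invasive carp eggs that were predicted to be invasive carp. The function name and the example labels are hypothetical, and a full validation would also examine eggs of other species that were incorrectly predicted to be invasive carp.

```python
def invasive_carp_accuracy(genetic_ids, predictions, target="Invasive Carp"):
    """Proportion of genetically confirmed target eggs predicted as the target."""
    if len(genetic_ids) != len(predictions):
        raise ValueError("genetic_ids and predictions must have equal length")
    target_preds = [p for g, p in zip(genetic_ids, predictions) if g == target]
    if not target_preds:
        raise ValueError("no genetically identified target eggs in the data")
    return sum(p == target for p in target_preds) / len(target_preds)


# Hypothetical validation set of five eggs from a new region.
genetic = ["Invasive Carp", "Invasive Carp", "Freshwater Drum",
           "Invasive Carp", "Invasive Carp"]
predicted = ["Invasive Carp", "Freshwater Drum", "Freshwater Drum",
             "Invasive Carp", "Invasive Carp"]
accuracy = invasive_carp_accuracy(genetic, predicted)
```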

A second limitation of WhoseEgg is that the models can only return predictions that are in the training data taxonomic levels (Table 2). If a new collection of eggs contains a different family, genus, or species, WhoseEgg will not be able to correctly predict the egg. If other species are likely to be present in egg collections, WhoseEgg should be used with caution. If the other species have characteristics that differ from those of invasive carp, WhoseEgg could still be a useful tool for identifying invasive carp. However, if the other species have similar characteristics to invasive carp, WhoseEgg may incorrectly predict these eggs as invasive carp.

A third limitation is that the validation of the random forests in WhoseEgg was focused on the classification of invasive carp and not other species present in the training data. The random forests were generally successful at predicting the identity of other fish eggs, but because the success of identifying other species was not specifically assessed, we urge users to be cautious if there is an interest in focusing on the identification of different species. As with the regional limitation, users who are interested in applying WhoseEgg to identify other species could perform a validation focusing on the other species of interest. If the WhoseEgg training data contain a large number of eggs from the species of interest, the validation could be performed on the WhoseEgg training data. Otherwise, a new dataset with more observations from the species of interest should be used.

The three limitations discussed indicate that a user of WhoseEgg has the responsibility to acknowledge if their data are not appropriate to use with the WhoseEgg models. In addition to considering the location of data collection and possible fish species present in the data, users should also consider whether the egg characteristics in their data fall within the range of the training data egg characteristics (see the app ‘Help’ page). If egg characteristics fall outside of the training data ranges, the random forests will be forced to extrapolate, which could lead to untrustworthy predictions. WhoseEgg alerts users if the uploaded egg characteristics fall outside of the training data values, but the final check of data correctness is the responsibility of the user.
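
A per-variable range check of this kind can be sketched as follows. The code is a Python illustration, not the app's implementation, and the ranges shown are placeholder values rather than the actual training data ranges listed on the ‘Help’ page.

```python
def out_of_range_variables(egg, training_ranges):
    """Return the variables whose values fall outside the training range.

    Each variable is checked independently against its own (min, max) pair.
    """
    flagged = []
    for variable, (low, high) in training_ranges.items():
        if not low <= egg[variable] <= high:
            flagged.append(variable)
    return flagged


# Placeholder ranges for illustration only (not the WhoseEgg training ranges).
training_ranges = {"Temperature": (15.0, 30.0), "Conductivity": (250.0, 700.0)}

# Hypothetical egg with an unusually high water temperature.
egg = {"Temperature": 32.5, "Conductivity": 460.0}
warnings = out_of_range_variables(egg, training_ranges)
```

An empty result indicates only that each value lies within its own range; it does not guarantee that the combination of values resembles the training data.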

Future work

There are many possibilities for updates to WhoseEgg. As additional eggs are collected, the random forests within WhoseEgg can be updated. This could include adding eggs from other species (e.g., Black Carp; Mylopharyngodon piceus) and regions (e.g., Ohio River Valley) not currently included in the training data. Models for predicting species other than invasive carp (e.g., Walleye; Sander vitreus) could also be trained. With regard to model extrapolation outside of the training data values, WhoseEgg already returns a warning message if an observation falls outside of the range of any one variable. However, this check is done independently for each variable, which ignores correlation between variables. A procedure could be developed and implemented to determine whether an observation falls within the joint range of multiple variables. Another possibility for improving the ability of the random forests to predict well on unseen data is feature selection. Camacho et al. (2019) explored models with a reduced set of input variables, which performed well, but these models were not implemented in WhoseEgg. Furthermore, additional resources for app usability could be developed (e.g., video demonstrations of variable measurements and app use).
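One possible form of a joint-range check is sketched below using the squared Mahalanobis distance for two correlated variables. This is an assumption chosen for illustration, not a procedure proposed or implemented by the authors. The training values, the function names, and the cutoff (the 0.99 quantile of a chi-square distribution with two degrees of freedom, which is only a rough guide for so few points) are all hypothetical.

```python
from statistics import mean


def within_marginal_ranges(train, point):
    """True if each coordinate of `point` lies inside that variable's training range."""
    return all(
        min(p[i] for p in train) <= point[i] <= max(p[i] for p in train)
        for i in range(len(point))
    )


def mahalanobis_sq_2d(train, point):
    """Squared Mahalanobis distance of `point` from a two-variable training cloud."""
    xs = [p[0] for p in train]
    ys = [p[1] for p in train]
    mx, my = mean(xs), mean(ys)
    n = len(train)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in train) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2 x 2 covariance matrix
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form using the closed-form inverse of the covariance matrix.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Made-up, strongly correlated training values (e.g., two diameters in mm).
train = [(2.0, 1.0), (3.0, 1.4), (4.0, 2.1), (5.0, 2.5), (6.0, 3.0)]
CUTOFF = 9.21  # illustrative: approx. 0.99 quantile of chi-square with 2 df

typical_egg = (4.5, 2.3)  # follows the training correlation
odd_egg = (2.5, 2.8)      # inside both marginal ranges, far from the joint cloud

odd_passes_marginal_check = within_marginal_ranges(train, odd_egg)
odd_flagged_jointly = mahalanobis_sq_2d(train, odd_egg) > CUTOFF
typical_flagged_jointly = mahalanobis_sq_2d(train, typical_egg) > CUTOFF
```

In this example the second egg passes the independent per-variable check but is flagged by the joint check, which is the situation that a per-variable procedure cannot detect.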

WhoseEgg also provides inspiration for the development of other tools for egg classification. For example, future work could explore the use of convolutional neural networks to classify egg species given an image of an egg. This would provide a more streamlined approach for classification that removes the process of taking manual measurements of morphological variables. A convolutional neural network could be incorporated in a phone app that allows users to take a picture of a fish egg with a phone camera and return a prediction. Such a tool could possibly be used in the field for immediate data-based evidence of the classification of the egg.

Beyond the identification of fish eggs, this web-based application demonstrates the potential for fisheries scientists to make their work more accessible to other professionals in our field. We encourage others to develop similar web-based tools for other complicated models to make them more accessible and to facilitate their use and application.

Supplemental Information

Template provided by WhoseEgg for inputting egg data.

The spreadsheet contains “helpers” to assist users in correctly formatting the data.

DOI: 10.7717/peerj.14787/supp-1

Code used to prepare data for WhoseEgg (R markdown version).

DOI: 10.7717/peerj.14787/supp-2

Code used to prepare data for WhoseEgg (PDF version).

A PDF generated from the ‘preparing-data-for-app.Rmd’ file with the code used to prepare the egg data for the WhoseEgg app, along with descriptions and figures.

DOI: 10.7717/peerj.14787/supp-3

Example egg data used in the article.

These example data contain egg characteristics for a set of 215 fish eggs collected in 2016 at the mouth of the Iowa River in Pool 18 of the Upper Mississippi River.

DOI: 10.7717/peerj.14787/supp-4

[1] Auer NA. 1982. Identification of larval fishes of the Great Lakes basin with emphasis on the Lake Michigan drainage. Ann Arbor: Great Lakes Fishery Commission.

[2] Becker RA, Sales NG, Santos GM, Santos GB, Carvalho DC. 2015. DNA barcoding and morphological identification of neotropical ichthyoplankton from the upper Paraná and São Francisco. Journal of Fish Biology 87(1):159-168

[3] Breiman L. 2001. Random forests. Machine Learning 45(1):5-32

[4] Camacho CA, Sullivan CJ, Weber MJ, Pierce CL. 2019. Morphological identification of Bighead Carp, Silver Carp, and Grass Carp eggs using random forests machine learning classification. North American Journal of Fisheries Management 39(6):1373-1384

[5] Camacho CA, Sullivan CJ, Weber MJ, Pierce CL. In press. Invasive Carp reproduction phenology in tributaries of the Upper Mississippi River. North American Journal of Fisheries Management 26(8):41

[6] Chang W, Cheng J, Allaire JJ, Sievert C, Schloerke B, Xie Y, Allen J, McPherson J, Dipert A, Borges B. 2021. shiny: web application framework for R. R package version 1.6.0.

[7] Chapman DC. 2006. Early development of four cyprinids native to the Yangtze River, China. Data Series 239. Reston, Virginia: U.S. Geological Survey

[8] Chick JH, Gibson-Reinemer DK, Soeken-Gittinger L, Casper AF. 2020. Invasive Silver Carp is empirically linked to declines of native sport fish in the Upper Mississippi River System. Biological Invasions 22(2):723-734

[9] Chick JH, Pegg MA. 2001. Invasive Carp in the Mississippi River Basin. Science 292(5525):2250-2251

[10] Collins SF, Wahl DH. 2017. Invasive planktivores as mediators of organic matter exchanges within and across ecosystems. Oecologia 184(2):521-530

[11] Coulter AA, Keller D, Bailey EJ, Goforth R. 2016. Predictors of Bigheaded Carp drifting egg density and spawning activity in an invaded, free-flowing river. Journal of Great Lakes Research 42(1):83-89

[12] Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ. 2007. Random forests for classification in ecology. Ecology 88(11):2783-2792

[13] Darling ES, Alvarez-Filip L, Oliver TA, McClanahan TR, Côté IM. 2012. Evaluating life-history strategies of reef corals from species traits. Ecology Letters 15:1378-1386

[14] Deters JE, Chapman DC, McElroy B. 2013. Location and timing of invasive carp spawning in the Lower Missouri River. Environmental Biology of Fishes 96:617-629

[15] Embke HS, Kocovsky PM, Richter CA, Pritt JJ, Mayer CM, Qian SS. 2016. First direct confirmation of Grass Carp spawning in a Great Lakes tributary. Journal of Great Lakes Research 42:899-903

[16] Evans JS, Cushman SA. 2009. Gradient modeling of conifer species using random forests. Landscape Ecology 24:673-683

[17] Freeze M, Henderson S. 1982. Distribution and status of the Bighead Carp and Silver Carp in Arkansas. North American Journal of Fisheries Management 2:197-200

[18] Fritts AK, Knights BC, Amberg J, Larson JH, Amberg JJ, Merkes C, Tajioui T, Butler SE, Diana MJ, Wahl DH, Weber MJ, Waters JD. 2018. Development of a quantitative PCR method for screening ichthyoplankton samples for bigheaded carps. Biological Invasions 21:1143-1153

[19] George AE, Chapman DC. 2015. Embryonic and larval development and early behavior in Grass Carp, Ctenopharyngodon idella: implications for recruitment in rivers. PLOS ONE 10(3):e0119023

[20] Goode K, Weber MJ, Matthews A, Pierce CL. In press. Evaluation of a random forest model to identify invasive carp eggs based on morphometric features. North American Journal of Fisheries Management

[21] Hinterthuer A. 2012. The explosive spread of Asian Carp: can the Great Lakes be protected? Does it matter? BioScience 62:220-224

[22] Irons KS, Sass G, McClelland M, Stafford J. 2007. Reduced condition factor of two native fish species coincident with invasion of non-native Asian carps in the Illinois River, USA. Is this evidence for competition and reduced fitness? Journal of Fish Biology 71:258-273

[23] Kelso WE, Kaller MD, Rutherford DA. 2012. Collecting, processing, and identification of fish eggs and larvae and zooplankton. In: Zale AV, Parrish DL, Sutton TM, eds. Fisheries Techniques (3rd Edition). Bethesda, Maryland: American Fisheries Society. 363-451

[24] Kelso WE, Rutherford DA. 1996. Collection, preservation, and identification of fish eggs and larvae. In: Murphy BR, Willis DW, eds. Fisheries Techniques (2nd Edition). Bethesda, Maryland: American Fisheries Society. 250-302

[25] Larson JH, McCalla SG, Chapman DC, Rees C, Knights BC, Vallazza JM, George AE, Richardson WB, Amberg J. 2016. Genetic analysis shows that morphology alone cannot distinguish Asian carp eggs from those of other cyprinid species. North American Journal of Fisheries Management 36:1053-1058

[26] Leggett WC, Deblois E. 1994. Recruitment in marine fishes: is it regulated by starvation and predation in the egg and larval stages? Netherlands Journal of Sea Research 32:119-134

[27] Lenaerts A, Coulter A, Feiner Z, Goforth R. 2015. Egg size variability in an establishing population of invasive Silver Carp Hypophthalmichthys molitrix. Aquatic Invasions 10:449-461

[28] Lenaerts AW, Coulter AA, Irons KS, Lamer JT. In press. Plasticity in reproductive potential of Bigheaded Carp along an invasion front. North American Journal of Fisheries Management

[29] Liaw A, Wiener M. 2002. Classification and regression by randomForest. R News 2(3):18-22

[30] Mack RN, Simberloff D, Lonsdale WM, Evans H, Clout M, Bazzaz FA. 2000. Biotic invasions: causes, epidemiology, global consequences, and control. Ecological Applications 10:689-710

[31] MICRA. 2017. Monitoring and response plan for Asian carp in the Mississippi River Basin.

[32] Peterson AT, Vieglais DA. 2001. Predicting species invasions using ecological niche modeling: new approaches from bioinformatics attack a pressing problem: a new approach to ecological niche modeling based on new tools drawn from biodiversity informatics is applied to the challenge of predicting potential species’ invasions. BioScience 51:363-371