Species-specific audio detection: a comparison of three template-based classification algorithms using random forests

We developed a web-based, cloud-hosted system that allows users to archive, listen to, visualize, and annotate recordings. The system also provides tools to convert these annotations into datasets that can be used to train a computer to detect the presence or absence of a species. The algorithm used by the system was selected after comparing the accuracy and efficiency of three variants of a template-based classification. The algorithm computes a similarity vector by comparing a template of a species call with time increments across the spectrogram. Statistical features are extracted from this vector and used as input for a Random Forest classifier that predicts presence or absence of the species in the recording. The fastest algorithm variant had the highest average accuracy and specificity; therefore, it was implemented in the ARBIMON web-based system.


INTRODUCTION
Monitoring fauna is an important task for ecologists, natural resource managers, and conservationists.
Historically, most data were collected manually by scientists who went to the field and annotated their observations (Terborgh et al., 1990). This generally limited the spatial and temporal extent of the data. Furthermore, given that the data were based on an individual's observations, the information is difficult to verify, reducing its utility for understanding long-term ecological processes (Acevedo and Villanueva-Rivera, 2006).
To understand the impacts of climate change and deforestation on the fauna, the scientific community needs long-term, widespread and frequent data (Walther et al., 2002). Passive acoustic monitoring (PAM) can contribute to this need because it facilitates the collection of large amounts of data from many sites simultaneously, and with virtually no impact on the fauna and environment (Brandes, 2008; Lammers et al., 2008; Tricas and Boyle, 2009; Celis-Murillo et al., 2012). In general, PAM systems include a microphone or a hydrophone connected to a self-powered system with enough memory to store weeks or months of recordings, but there are also permanent systems that use solar panels and an Internet connection to upload recordings in real time to a cloud-based analytical platform (Aide et al., 2013).
Passive recorders can easily create a very large dataset (e.g., 100,000s of recordings) that is overwhelming to manage and analyze. Although researchers often collect recordings twenty-four hours a day for weeks or months (Acevedo and Villanueva-Rivera, 2006; Brandes, 2008; Lammers et al., 2008; Sueur et al., 2008; Marques et al., 2013; Blumstein et al., 2011), in practice, most studies have only analyzed a small percentage of the total number of recordings.
Web-based applications have been developed to facilitate data management of these increasingly large datasets (Aide et al., 2013; Villanueva-Rivera and Pijanowski, 2012), but the biggest challenge is to develop efficient and accurate algorithms for detecting the presence or absence of a species in many recordings. Algorithms for species identification have been developed using spectrogram matched filtering (Clark et al., 1987; Chabot, 1988), statistical feature extraction (Taylor, 1995; Grigg et al., 1996), the k-nearest neighbor algorithm (Hana et al., 2011; Gunasekaran and Revathy, 2010), Support Vector Machines (Fagerlund, 2007; Acevedo et al., 2009), tree-based classifiers (Adams et al., 2010; Henderson and Hildebrand, 2011) and template-based classification (Anderson et al., 1996; Mellinger and Clark, 2000), but most of these algorithms were built for a specific species, and no infrastructure was provided for the user to create models for other species.
The main objective of this study was to compare the performance (i.e., efficiency and accuracy) of three variants of a template-based classification algorithm and incorporate the best one into the ARBIMON II bioacoustic platform.

Passive acoustic data acquisition
We gathered recordings from five locations: four in Puerto Rico and one in Peru. Some of the recordings were acquired using the Automated Remote Biodiversity Monitoring Network (ARBIMON) data acquisition system described in Aide et al. (2013), while others were acquired using the newest version of the ARBIMON permanent recording station, which uses an Android cell phone and transmits the recorded data through a cellular network. All recordings have a sampling rate of 44.1 kHz, a sampling depth of 16 bits and an approximate duration of 60 seconds (±0.5 s). The locations in Puerto Rico were the Sabana Seca permanent station in Toa Baja, the Casa la Selva station in the Carite Mountains (Patillas), El Yunque National Forest in Rio Grande and Mona Island (see Figure 1). The location in Peru was the Amarakaeri Communal Reserve in the Madre de Dios Region (see Figure 2). In all the locations, the recorders were programmed to record 1 minute of audio every 10 minutes. The complete dataset has more than 100,000 1-minute recordings. We randomly chose 362 recordings from Puerto Rico and 547 recordings from Peru for comparing the three algorithm variants. We used the ARBIMON II web application interface to annotate the presence or absence of 21 species in all the recordings. Regions in the recordings where a species emits a sound were also marked using the web interface. Each region of interest (ROI) is a rectangle delimited by starting time, ending time, lowest frequency and highest frequency, along with a species id and sound type. The species included in the analysis are listed in Table 1, along with the total number of recordings and the number of recordings where the species is present or absent.

Algorithm
The recognition process is divided into three phases: 1) Template Computation, 2) Model Training and 3) Classification (see Figure 3). In Template Computation, all ROIs submitted by the user in the training set are aggregated into a template. In Model Training, the template is used to compute recognition functions from validated audio recordings, and features are extracted from the resulting vector V. These features are used to train a random forest model. In the Classification phase, the template is again used to compute the features, but this time the features are fed to the trained random forest model to predict presence or absence. In the following sections, the Template Computation process is explained first, then the process of using the template to extract features from a recording is presented and, finally, the procedures to train the model and classify recordings are discussed.

Template Computation
The template refers to the combination of all ROIs in the training data. To create a template we start with the examples of the specific call of interest (i.e., ROIs) that were annotated from a set of recordings for a given species and a specific call type (e.g., common, alarm). Each ROI encompasses an example of the call, and is an instance of time between times t_1 and t_2 of a given recording with low and high boundary frequencies f_1 and f_2, where t_1 < t_2 and f_1 < f_2. Specifically, for each recording that has an annotated ROI, a spectrogram matrix (SM) is computed using the Short Time Fourier Transform with a frame size of 1024 samples, 512 samples of overlap and a Hanning analysis window; thus the matrices have 512 rows. For a recording with a sampling rate of 44,100 Hz, the matrix bin bandwidth is approximately 43.06 Hz. The SM is arranged so that the row with index 0 represents the lowest frequency and the row with index 511 represents the highest frequency of the spectrum. The columns c_1 to c_2 and the rows r_1 to r_2 of SM, i.e., the rows and columns that represent the ROI in the recording (between frequencies f_1 and f_2 and between times t_1 and t_2), are extracted. The submatrix of SM that contains only the area bounded by the ROI is defined as SM_ROI and referred to in the manuscript as the ROI matrix.
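The extraction step above can be sketched as follows. This is an illustrative reconstruction, not the published implementation: it assumes SciPy's `stft` (which returns 513 frequency rows for a 1024-sample frame, one more than the 512 stated in the text) and a simple `searchsorted` convention for mapping the ROI bounds to bin indices.

```python
import numpy as np
from scipy.signal import stft

def roi_matrix(samples, fs, t1, t2, f1, f2):
    """Extract the ROI submatrix SM_ROI from a recording's magnitude spectrogram.

    Frame size 1024, overlap 512, Hann window, as described in the text.
    The rounding convention for mapping (t1, t2, f1, f2) to bins is an
    assumption for illustration.
    """
    freqs, times, SM = stft(samples, fs=fs, window='hann',
                            nperseg=1024, noverlap=512)
    SM = np.abs(SM)  # magnitude spectrogram; rows are frequencies, columns are times
    r1, r2 = np.searchsorted(freqs, [f1, f2])  # rows bounded by f1..f2
    c1, c2 = np.searchsorted(times, [t1, t2])  # columns bounded by t1..t2
    return SM[r1:r2, c1:c2]
```

For a 44,100 Hz recording this yields the ~43.06 Hz bin bandwidth mentioned above (44,100 / 1024).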
Since the ROI matrices can vary in size, to aggregate them we have to take into account the differences in the numbers of rows and columns of the matrices. All recordings have the same sampling rate, 44,100 Hz. Thus the rows from different SMs, computed with the same parameters, will represent the same frequencies, i.e., rows with the same indexes represent the same frequency.
After the ROI matrix, SM_ROI, has been extracted from SM, the rows of SM_ROI will also represent specific frequencies. Thus, if we were to perform an element-wise matrix sum between two ROI matrices with potentially different numbers of rows, we should only sum rows that represent the same frequency.
To take into account the difference in the number of columns of the ROI matrices, we use the Frobenius norm to optimize the alignment of the smaller ROI matrices and perform element-wise sums between rows that represent the same frequency. We present the algorithm in the following section and a flowchart of the process in Figure 4.

Template Computation Algorithm:
1. Generate the set of SM_ROI matrices by computing the Short Time Fourier Transform of all the user-generated ROIs.

2. Define SM_max as the SM_ROI matrix with the largest number of columns.
3. Set c_max as the number of columns in SM_max.
4. Create matrix T_temp with the same dimensions as SM_max and all entries equal to 0. This matrix will contain the element-wise addition of all the extracted SM_ROI matrices.
5. Create matrix W with the same dimensions as SM_max and all entries equal to 0. This matrix will hold the count of the number of SM_ROI matrices that participate in the calculation of each element of T_temp.
6. For each one of the SM_i matrices in the SM_ROI set:
(a) If SM_i has the same number of columns as T_temp:
i. Align the rows of SM_i and T_temp so they represent equivalent frequencies, perform an element-wise addition of the matrices and put the result in T_temp.
ii. Add one to all the elements of the W matrix where the previous addition participated.
(b) If the number of columns differs between SM_i and T_temp, then find the optimal alignment with SM_max as follows:
i. Set c_i as the number of columns in SM_i.
ii. Define {Sub_k} as the set of all submatrices of SM_max with the same dimensions as SM_i (there are c_max − c_i + 1 such submatrices).
iii. For each Sub_k, compute d_k = NORM(Sub_k − SM_i), where NORM is the Frobenius norm, defined as NORM(A) = sqrt(Σ_i Σ_j |a_{i,j}|²), where the a_{i,j} are the elements of matrix A.
iv. Define Sub_min{d_k} as the Sub_k matrix with the minimum d_k. This is the optimal alignment of SM_i with SM_max.
v. At the column offset of Sub_min{d_k}, align the rows of SM_i and T_temp so they represent equivalent frequencies, perform an element-wise addition of the matrices and put the result in T_temp.
vi. Add one to all the elements of the W matrix where the previous addition participated.
7. Define the matrix T_template as the element-wise division of the T_temp matrix by the W matrix.
The resulting T_template matrix summarizes the information available in the ROI matrices submitted by the user, and it will be used to extract information from the audio recordings to be analyzed. In this article, each species' T_template was created using five ROIs.
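The aggregation steps above can be condensed into a short sketch. This is an illustrative simplification assuming the ROI matrices already share row (frequency) alignment; the real system aligns rows by frequency before summing, as described above.

```python
import numpy as np

def build_template(rois):
    """Average a list of ROI magnitude matrices into one template.

    Matrices narrower than the widest one are slid across it and placed
    at the column offset minimizing the Frobenius norm of the difference
    (step 6b), then the element-wise sum T is divided by the count W.
    """
    widest = max(rois, key=lambda m: m.shape[1])
    T = np.zeros_like(widest, dtype=float)  # running element-wise sum
    W = np.zeros_like(widest, dtype=float)  # per-element contribution count
    for m in rois:
        if m.shape[1] == T.shape[1]:
            off = 0
        else:
            # optimal alignment: minimize ||widest[:, k:k+c] - m||_F over k
            c = m.shape[1]
            dists = [np.linalg.norm(widest[:, k:k + c] - m)
                     for k in range(T.shape[1] - c + 1)]
            off = int(np.argmin(dists))
        T[:, off:off + m.shape[1]] += m
        W[:, off:off + m.shape[1]] += 1.0
    return T / np.maximum(W, 1.0)  # element-wise mean; guard uncovered cells
```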
Figure 5a presents a training set for Eleutherodactylus coqui and Figure 5b the resulting template. This tool is very useful because the user can immediately see the effect of adding or removing a specific sample from the training set.

Model Training
The goal of this phase is to train a random forest model. The inputs to train the random forest are a series of statistical features extracted from vectors V_i that are created by computing a recognition function (similarity measure) between the computed T_template and submatrices of the spectrogram matrices of a series of recordings.
In the following section we present the details of the algorithm that processes a recording to create the recognition function vector and in Figure 6, we present a flowchart of the process.

Recognition Function
We used three variations of a pattern-matching procedure to define the similarity measure vector V. The first is the Structural Similarity Index described in Wang et al. (2004) and implemented in van der Walt et al. (2014) as compare_ssim, with the default window size of seven unless the generated pattern is smaller. It will be referred to in the rest of the manuscript as the SSIM variant. For the SSIM variant we define

meas_i = SSIM(T_template, SPEC_i),

where SPEC_i is the submatrix of SPEC that spans the columns from i × step to i × step + c_template and has the same number of rows as T_template.

Second, the dynamic thresholding method (threshold_adaptive; van der Walt et al., 2014) with a block size of 127 and an arithmetic mean filter is applied to both T_template and SPEC_i before multiplying them element-wise, taking the Frobenius norm and normalizing by the norm of a matrix with the same dimensions as T_template and all elements equal to one. Therefore, meas_i for the NORM variant is defined as

meas_i = FN(DTM(T_template) .* DTM(SPEC_i)) / FN(U),

where again SPEC_i is the submatrix of SPEC that spans the columns from i × step to i × step + c_template, FN is the Frobenius norm, DTM is the dynamic thresholding method, U is a matrix with the same dimensions as T_template with all elements equal to one, and .* denotes element-wise multiplication of the matrices.
Finally, for the CORR variant we first apply OpenCV's matchTemplate procedure (Bradski, 2000) with the Normalized Correlation Coefficient option to SPEC_i, the submatrix of SPEC that spans the columns from i × step to i × step + c_template. However, for this variant, SPEC_i includes two additional rows above and below; thus it is slightly larger than T_template. With these we can define

meas_{j,i} = NCC(T_template, SPEC_{j,i}),

where SPEC_{j,i} is the submatrix of SPEC_i that starts at row j (note that there are 5 such SPEC_{j,i} matrices). We then select 5 points at random from all the points above the 98.5 percentile of meas_{j,i} and apply the Structural Similarity Index to the neighborhoods of the 5 selected points. Each neighborhood is 266% of the length of T_template: 133% before and 133% after. Let FilterSPEC be the matrix that contains these 5 neighborhoods and FilterSPEC_i the submatrix of FilterSPEC that spans the columns from i to i + c_template; then the similarity measure for this variant is defined as

meas_i = SSIM(T_template, FilterSPEC_i),

and the resulting vector is V = [meas_1, meas_2, meas_3, ..., meas_n], but this time with n determined by the number of columns of FilterSPEC rather than of SPEC. It is important to note that no matter which variant is used to calculate the similarity measures, the result will always be a vector of measurements V. The idea is that the statistical properties of these computed recognition functions carry enough information to distinguish between a recording where the target species is present and one where it is not. However, notice that since c_SPEC, the length of SPEC, is much larger than c_template, the length of the vector V for the CORR variant is much smaller than for the other two. After calculating V for many recordings, we can train a random forest model. First, we need a set of validated recordings with the specific species vocalization present in some recordings and absent in others.
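The common pattern shared by all three variants, a recognition function that slides the template across the spectrogram and scores each offset, can be sketched with a numpy-only stand-in. Here a zero-mean normalized correlation plays the role of the similarity measure; the actual variants use SSIM, thresholded Frobenius norms and OpenCV's matchTemplate as described above.

```python
import numpy as np

def recognition_vector(template, spec, step=1):
    """Slide the template across the spectrogram, scoring each column offset.

    Uses a zero-mean normalized correlation coefficient as an illustrative
    similarity measure; each variant in the paper substitutes its own.
    """
    r, c = template.shape
    t = template - template.mean()
    tn = np.linalg.norm(t) or 1.0  # avoid division by zero for flat templates
    V = []
    for i in range(0, spec.shape[1] - c + 1, step):
        w = spec[:r, i:i + c]
        w = w - w.mean()
        wn = np.linalg.norm(w) or 1.0
        V.append(float((t * w).sum() / (tn * wn)))
    return np.array(V)
```

A window that exactly matches the template scores 1, so the peak of V marks the best candidate position for the call.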
Then, for each recording, we compute a vector V_i as described in the previous section and extract the statistical features presented in Table 2. These statistical features constitute the dataset used to train the random forest model, which will be used to classify recordings for the presence or absence of a species call event. These 12 features, along with the species presence information, are used as input to a random forest classifier with 1,000 trees.
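As an illustration of this step, the sketch below extracts a handful of summary statistics from each recognition vector (Table 2 lists the 12 published features; the ones here are stand-ins) and trains a 1,000-tree random forest on synthetic, hypothetical recognition vectors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def feature_vector(V):
    """Statistical summary of a recognition vector V.

    Illustrative subset only; the paper's Table 2 defines the actual
    12 features used.
    """
    return [V.mean(), V.std(), V.min(), V.max(),
            np.median(V), np.percentile(V, 75) - np.percentile(V, 25)]

# hypothetical training data: one recognition vector per validated recording
rng = np.random.default_rng(0)
present = [rng.normal(0.6, 0.2, 100) for _ in range(20)]   # species present
absent = [rng.normal(0.1, 0.05, 100) for _ in range(20)]   # species absent
X = np.array([feature_vector(v) for v in present + absent])
y = np.array([1] * 20 + [0] * 20)

# 1,000 trees, as in the text
clf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)
```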

Recording Classification
To classify a recording with the trained model, we compute the statistical features from the similarity vector V of the selected recording. This is performed in the same way as described in the previous section. These features are then used as input to the previously trained random forest classifier, and a label indicating presence or absence of the species in the recording is given as output.

The Experiment
To decide which of the three variants to select, we ran the algorithm described in the previous section with each of the similarity measures. We computed 10-fold validations for each of the variants to obtain measurements of the performance of the algorithm. In each validation, 90% of the data is used for training and 10% as validation data. Each algorithm variant used the same 10-fold validation partition for each species. The measures calculated were accuracy or correct classification rate (Ac), negative predictive value (Npv), precision or positive predictive value (Pr), sensitivity, recall or true positive rate (Se) and specificity or true negative rate (Sp), defined as follows:

Ac = (tp + tn) / (tp + tn + fp + fn), Npv = tn / (tn + fn), Pr = tp / (tp + fp), Se = tp / (tp + fn) and Sp = tn / (tn + fp),

with tp the number of true positives (number of times both the expert and the algorithm agree that the species is present), tn the number of true negatives (number of times both the expert and the algorithm agree that the species is not present), fp the number of false positives (number of times the algorithm states that the species is present while the expert states it is absent) and fn the number of false negatives (number of times the algorithm states that the species is not present while the expert states it is present).
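These definitions translate directly to code; a quick check with a hypothetical confusion matrix:

```python
def confusion_measures(tp, tn, fp, fn):
    """The five performance measures as defined in the text."""
    total = tp + tn + fp + fn
    return {
        'Ac': (tp + tn) / total,   # accuracy: correct classification rate
        'Npv': tn / (tn + fn),     # negative predictive value
        'Pr': tp / (tp + fp),      # precision: positive predictive value
        'Se': tp / (tp + fn),      # sensitivity / recall: true positive rate
        'Sp': tn / (tn + fp),      # specificity: true negative rate
    }

# illustrative counts only, not results from the study
m = confusion_measures(tp=8, tn=7, fp=3, fn=2)
# Ac = 15/20 = 0.75, Se = 8/10 = 0.80, Sp = 7/10 = 0.70
```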
Although we present and discuss all measures, we gave accuracy more importance since it is fundamentally a weighted average of sensitivity and specificity and therefore contains the information of the true positive rate as well as the true negative rate. Also notice that when the number of positive cases equals the number of negative cases, the accuracy measure becomes the area below the line formed by the three points (0, 0), (1 − mean(Sp), mean(Se)), (1, 1) in a receiver operating characteristic (ROC) graph, and is therefore proportional to the area under the ROC curve but much simpler to calculate.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.2713v1 | CC BY 4.0 Open Access | rec: 11 Jan 2017, publ: 11 Jan 2017

The experiment was performed on an Intel i7 4790K 4-core computer with 32 GB of RAM running Ubuntu Linux. The execution time needed to classify each recording was registered, and the mean and standard deviation of the execution times were calculated for each variant of the algorithm. We also computed the number of pixels in all the T_template matrices and correlated it with the execution time of each of the variants.
A global one-way analysis of variance (ANOVA) was performed on the five calculated measures across all of the 10-fold validations to identify whether there was a significant difference between the variants of the algorithm. A post-hoc Tukey HSD comparison test was then performed to identify which of the variants was significantly different at the 95% confidence level. Additionally, an ANOVA was performed locally between the 10-fold validations of each species, and on the mean execution time for each species across the algorithm variants, to identify any significant execution time difference at the 95% confidence level. Similarly, a post-hoc Tukey HSD comparison test was performed on the execution times.
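This kind of comparison can be sketched with SciPy. The fold scores below are synthetic stand-ins for the real 10-fold results, not data from the study; for the post-hoc step, SciPy (≥ 1.8) also provides scipy.stats.tukey_hsd.

```python
import numpy as np
from scipy.stats import f_oneway

# synthetic 10-fold accuracy scores for the three variants (illustrative only)
rng = np.random.default_rng(1)
ssim_acc = rng.normal(0.76, 0.05, 10)
norm_acc = rng.normal(0.65, 0.05, 10)
corr_acc = rng.normal(0.81, 0.05, 10)

# one-way ANOVA across the three groups of fold accuracies
F, p = f_oneway(ssim_acc, norm_acc, corr_acc)
```

A small p-value indicates at least one variant's mean accuracy differs; the Tukey HSD test then identifies which pairs differ.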

RESULTS
The five measurements (accuracy, negative predictive value, precision, sensitivity and specificity) computed to compare the models across the three variants varied greatly among the 21 species. The lowest scores were among bird species, while most of the highest scores came from amphibian species. Table 3 presents a summary of the results of the measurements comparing the three variants of the algorithm (for a detailed presentation see Table 6 in Appendix 1). The NORM variant did not achieve the best value in any of the measures summarized in Table 3, while the CORR variant had the greatest number of species with 80% or greater for all the measures and an overall median accuracy of 81%. We considered these two facts fundamental for a generic, non-species-specific system.
The local species ANOVA suggested that there are significant accuracy differences at the 95% confidence level for 6 of the 21 species studied, as well as 4 in terms of precision and 3 in terms of specificity (see supplemental materials). Algorithm variants SSIM and CORR have higher mean accuracy than the NORM variant. Nevertheless, the CORR variant has the highest median accuracy, 81%, which is slightly higher than the median accuracy of the SSIM variant at 76%. In addition, the CORR variant had more species with an accuracy of 80% or greater.
In terms of median precision, the three variants had similar values, although in terms of mean precision variants SSIM and CORR have greater values than the NORM variant. Moreover, the median and mean precision of the SSIM variant were only 1% higher than those of the CORR variant. In terms of sensitivity, variants SSIM and CORR have greater values than the NORM variant. It is only in terms of specificity that the CORR variant has greater values than both other variants. Figure 8 presents a summary of these results with whisker graphs.
In terms of execution times, an ANOVA on the mean execution times suggests a difference between the variants (F = 9.9341e+30, df = 3, p < 2.2e−16). The CORR variant has the lowest mean execution time at 0.255 s, followed very closely by the NORM variant at 0.271 s, while the SSIM variant was noticeably worse, with a mean execution time of 2.269 s (Figure 9). The Tukey HSD test suggests that there was no statistically significant difference between the mean execution times of the NORM and CORR variants (p = 0.999). However, there was a statistically significant difference at the 95% confidence level between the mean execution times of all other pairs of variants, specifically variants SSIM and CORR (p < 2.2e−16). Moreover, the mean execution time of the SSIM variant increased as the number of pixels in the T_template matrix increased (Figure 9b). There was no statistically significant relationship between the T_template pixel size and the execution time for the other two variants (Table 4).
In summary, variants SSIM and CORR outperform the NORM variant in most of the statistical measures computed, each having statistically significantly higher accuracy for three species. However, the CORR variant has much lower mean execution times than the SSIM variant (Table 3). Furthermore, the mean execution time of the CORR variant did not increase with increasing size of the T_template (Table 4).

DISCUSSION
The algorithm used by the ARBIMON system was selected by comparing three variants of a template-based method for the detection of the presence or absence of a species vocalization in recordings. The most important requirements for the selection were that the algorithm has to provide a generic, non-species-specific system that can detect species and, given that it will have to process hundreds of thousands of recordings, that it can do so in a reasonable amount of time. The CORR algorithm was selected because of its speed and its comparable performance in terms of detection efficiency with the SSIM variant. It achieved an accuracy of 0.80 or better in 12 of the 21 species and a sensitivity of 0.80 or more in 11 of the 21 species, and its average execution time of 0.26 s per 1-minute recording means that it can process around 14,000 minutes of recordings per hour.

The difference in execution time between the SSIM variant and the other two was due to a memory management issue in the SSIM algorithm. An analysis reveals that all the algorithms have order O(c_SPEC × c_template × r_template), where c_SPEC and c_template are the numbers of columns in SPEC and T_template, respectively, and r_template is the number of rows in T_template. The only explanation we can give is that the SSIM function uses a uniformly distributed filter (uniform_filter) that has a limit on the size of the memory buffer it handles (4000 64-bit doubles divided by the number of elements in the dimension being processed). Therefore, as the size of T_template increases, the number of calls to allocate the buffer, free it and allocate it again can become a burden, since it has a smaller locality of reference even when the machine has enough memory and cache to handle the process. Further investigation is required to confirm this.
Another interesting result is that the SSIM variant provides more stable results. The boxes for the SSIM variant in all the whisker plots in Figures 7 and 8 are smaller, and the standard deviation is also smaller, in all the cases. However, although this variant appears to perform better in terms of false positives, in terms of false negatives it performs worse than the CORR variant. This is interesting because the CORR variant is a "lite version" of the SSIM variant. We set out to achieve comparable performance in terms of detection effectiveness with a much better performance in terms of execution time. The idea was to run the SSIM function over a selected number of elements to maintain reasonable execution times.
This is what we achieved with the pattern matching phase: a function that by itself did not provide good results, but one that as a ranker provided enough information for the SSIM to decide on the presence or absence of a species. It would seem that for some species this filtering also helps in obtaining fewer false negatives than the SSIM variant.

CONCLUSIONS
Now that passive autonomous acoustic recorders are readily available, the amount of data is growing exponentially. For example, one permanent station recording 1 minute out of every 10 minutes every day of the year generates 52,560 one-minute recordings. Multiply that by the need to monitor thousands of locations across the planet and one can understand the magnitude of the task at hand.
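The yearly figure follows directly:

```python
# one 1-minute recording every 10 minutes -> 6 recordings per hour
recordings_per_hour = 60 // 10
recordings_per_year = recordings_per_hour * 24 * 365
print(recordings_per_year)  # 52560
```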
We have shown how the algorithm used in the ARBIMON II web-based, cloud-hosted system was selected. The ease of management of this system, as well as the options to create playlists based on many different parameters, including user-created tags, allows users to analyze large quantities of recordings (see Table 5). A generic, non-species-specific system for detecting the presence or absence of a species in recordings is therefore fundamental. For example, the system currently holds 1,749,551 recordings uploaded by 453 users, and 659 species-specific models have been created and run over 3,780,552 minutes of recordings. As a society, it is fundamental that we study the effects of climate change and deforestation on the fauna.

Table 5. Summary of the usage of the ARBIMON2 system and its model creation feature.

Figure 1. Recording locations in Puerto Rico.

Figure 3. The three phases of the algorithm to create the species-specific models. In the Model Training phase, Rec_i is a recording and V_i is the vector generated by the recognition function on Rec_i; in the Classification phase, V is the vector generated by the recognition function on the incoming recording.

Figure 4. Flowchart of the algorithm to generate the template of each species.
Figure 5. (a) A training set for Eleutherodactylus coqui; (b) the resulting template.

Figure 6. Flowchart of the algorithm to generate the similarity vector of each recording.

Figure 7. Whisker boxes of the 10-fold validations for the three variants of the presented algorithm for the accuracy measure.

Figure 8. Whisker boxes of the 10-fold validations for the three variants of the presented algorithm.

Figure 9. (a) Whisker boxes of the execution times of the three algorithms. (b) Execution time as a function of the size of the template in number of pixels.

Table 1. Species, class, location and count of recordings with validated data.

Table 3. Summary of the measures of the three variants of the algorithm. Best values are in bold.

Table 4. Summary of the execution times of the three variants of the algorithm. Best values are in bold. PPMCC is the Pearson product-moment correlation coefficient.