Biodiversity assessment is a central and urgent task, not only for research in the biological sciences, but also in applied conservation biology, including major multi-lateral initiatives for promoting and protecting biodiversity. At the governmental level biodiversity needs to be incorporated into national accounting by 2020 (Aichi Biodiversity targets A2) (http://www.cbd.int/sp/targets/) and the cost-effective tools necessary to achieve this remain elusive. Operating within the conceptual and methodological framework of the burgeoning field of ecoacoustics (Sueur & Farina, 2015), we are interested in the potential for investigating the acoustic environment, or soundscape, as a resource from which to infer ecological information. The main contribution of this paper is to highlight a disjunct between a founding premise of ecoacoustics (that the acoustic environment is structured through spectro-temporal partitioning) and the fact that community level indices to date are derived from representations of the acoustic signal in the time or frequency domain and are therefore limited in accessing and evaluating structures across spectro-temporal dimensions. We consider approaches to decomposition which preserve time-frequency structure and propose sparse-coding as a possible solution. Ecoacoustic applications are illustrated with example analyses from a recent acoustic survey. Results are illustrative rather than conclusive but point to possibilities for analyses of community level acoustic structures which are impervious to investigation with current tools.
Ecoacoustic approaches to biodiversity assessment
In ongoing work, we are exploring cost-effective solutions, including remote sensing (camera traps and aerial photography of canopy) and identification of ‘ecological-disturbance indicator species’ (Caro, 2010). Remote sensors are an attractive choice for data collection in that they are noninvasive, scalable in both space and time and remove the bias and cost associated with programs which require either experts (All taxa biodiversity inventory, Gewin, 2002) or even non-specialists (Rapid Biodiversity Assessment, Oliver & Beattie, 1993), in situ.
Various forms of remote visual sensing technologies have been explored. Global satellite imaging has been investigated to monitor biophysical characteristics of the earth’s surface by assessing species ranges and richness patterns indirectly (e.g., Wang et al., 2010). These methods are attractive, but rely on expensive equipment, are difficult to adapt to small spatial scales and require a time-consuming validation step. It is possible, for example, to infer valid species-level identification of canopy trees from high-resolution aerial imagery, providing a means of remote sensing to assess forest status (Peck et al., 2012). However, the principal weakness of this and other existing visual remote sensing methods is that they cannot provide direct information on the status of taxa other than plants: they cannot detect ‘silent forests.’ The need for innovative remote sensing methods to monitor the status of wildlife remains and acoustic, rather than visual, sensors have many attractive characteristics.
Acoustic surveys have most obvious relevance for the identification of vocal animals. In terrestrial habitats, bird species in particular are of interest as their importance as indicator species of environmental health has been demonstrated in temperate (Gregory & Strien, 2010) and tropical (Peck et al., 2015) climates. One approach is to focus on automatic species call identification, but current methods are far from reliable (e.g., Skowronski & Harris, 2006, for bats), increasingly difficult in complex environments such as tropical forest soundscapes, where tens of signals mix and many species still remain unknown (Riede, 1993) and notoriously difficult to generalize across locations due to natural geographic variation in species’ calls (Towsey et al., 2013).
Rather than focusing on individual species, there is a growing interest in monitoring high-level structure within the emerging field of Soundscape Ecology (Pijanowski et al., 2011) in which systematic interactions between animals, humans and their environment are studied at the landscape level. From this emerging perspective, the landscape’s acoustic signature–the soundscape–is seen as a unique component in the evaluation of its function, and therefore potential indicator of its status (Krause, 1987; Schafer, 1977). We can consider similar processes occurring at the community level: vocalising species establish an acoustic community when they sing at the same time at a particular place. The potential for estimation of acoustic community dynamics as key to understanding what drives change in community composition and species abundance is being recognised (Lellouch et al., 2014). The nascent discipline of ecoacoustics unites theoretical and practical research which aims to infer ecological information from the acoustic environment across levels (Sueur & Farina, 2015) and habitats. In this paper we focus on terrestrial habitats, although the discussion is equally applicable to aquatic environments.
The motivations of the ecoacoustic approach can be understood in evolutionary terms: the same competitive forces which drive organisms to partition and therefore structure dimensions of their shared biophysical environment (food supply, nesting locations etc.) apply in the shared sonic environment; the soundscape is seen as a finite resource in which organisms (including humans) compete for spectro-temporal space. These ideas were first explicitly captured in Krause’s Acoustic Niche Hypothesis (ANH) (Krause, 1987). Referring directly to Hutchinson’s original niche concept (Hutchinson, 1957), the ANH suggests that vocalising organisms have evolved to occupy unique spectro-temporal ‘niches,’ minimising competition and optimising intraspecific communication mechanisms. Having formulated the hypothesis after countless hours recording in pristine habitats, Krause goes so far as to posit that this spectro-temporal partitioning structures the global soundscape, such that the global compositional structure is indicative of the ‘health’ of a habitat. Crudely put, in ancient, stable ecosystems, the soundscape is expected to comprise a complex of non-overlapping signals well dispersed across spectro-temporal niches; a newly devastated area might be characterised by gaps in the spectro-temporal structure; and an area of regrowth may comprise competing, overlapping signals due to invasive species.
Krause’s ANH can be understood in terms of several theories of the evolution of bird species, which are supported by field studies. Avian mating signals are thought to diverge via several processes: (1) as a by-product of morphological adaptation, the Morphological Adaptation Hypothesis; (2) through direct adaptation to physical features of the signalling environment, the Acoustic Adaptation Hypothesis; and (3) to facilitate species recognition, the Species Recognition Hypothesis. Field studies of the Neotropical suboscine antbird (Thamnophilidae) provide direct evidence that species recognition and ecological adaptation operate in tandem, and that the interplay between these factors drives the evolution of mating signals in suboscine birds (Seddon, 2005). Although the ANH is challenged by various studies in terrestrial and marine environments (Amézquita et al., 2011; Chek, Bogart & Lougheed, 2003; Tobias et al., 2014), it is tenable in evolutionary terms and acoustic partitioning in both time and frequency domains has been observed (Sinsch et al., 2012; Schmidt & Balakrishnan, 2015; Ruppé et al., 2015).
The constraints of existing acoustic indices for automated ecoacoustics
This emerging framework, coupled with the technical feasibility of remote acoustic sensing and pressure to meet strategic biodiversity targets, fuels a growing research interest in ecological applications of acoustic indices; several dozen have been proposed over the last 6 years (see Sueur et al., 2014; Towsey et al., 2013; Lellouch et al., 2014, for good overviews in terrestrial habitats and Parks, Miksis-Olds & Denes, 2014, for marine acoustic diversity). These are motivated by different approaches to measuring the variations in acoustic activity and predominantly derived from statistical summaries of amplitude variation in time domain or magnitude differences between frequency bands of a spectrogram.
The simplest indices provide summaries of the Sound Pressure Level (e.g., peaks, or specific times of day). In Rodriguez et al. (2013), for example, root mean square values of raw signals from a network of recorders are used to create maps of amplitude variation to reveal spatiotemporal dynamics in a neotropical forest.
Under the assumption that anthropogenic noise contribution is band-limited to a frequency range (anthropophony: 0.2–2 kHz) below that of the rest of the biological world (biophony: 2–8 kHz), the Normalized Difference Soundscape Index (NDSI) (Kasten et al., 2012) seeks to describe the ‘health’ of the habitat in terms of the level of anthropogenic disturbance by calculating the ratio (biophony − anthropophony) / (biophony + anthropophony). In long term studies, the NDSI has been shown to reflect assumed seasonal and diurnal variation in a landscape and may prove useful for observing high level, long term interactions between animals and human populations (Kasten et al., 2012). However, it does not give an estimation of local diversity within the range of biophony, or provide a means to investigate short term interactions in detail. Further, assumptions about frequency ranges may not generalize. For example in non-industrialized tropical climes (arguably the most precious in ecological terms) animals vocalize outside the 2–8 kHz range (Sueur et al., 2014), and industrial anthropophony is minimal. Similarly, in marine habitats these frequency ranges may not be relevant.
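The NDSI ratio described above can be sketched in a few lines. This is a minimal illustration, not the published implementation: the function name `ndsi` and the input format (pairs of centre frequency and power) are assumptions made here, and published implementations estimate band power in their own way.

```python
def ndsi(band_power, anthro=(200.0, 2000.0), bio=(2000.0, 8000.0)):
    """Normalized difference of biophony and anthropophony power.

    band_power: iterable of (centre_frequency_hz, power) pairs
    (hypothetical input format). Band limits follow the ranges
    quoted in the text and are half-open: low <= f < high.
    """
    band_power = list(band_power)
    a = sum(p for f, p in band_power if anthro[0] <= f < anthro[1])
    b = sum(p for f, p in band_power if bio[0] <= f < bio[1])
    if a + b == 0:
        raise ValueError("no energy in either band")
    return (b - a) / (b + a)


# Toy spectrum: most of the energy sits in the 2-8 kHz band.
spectrum = [(500.0, 1.0), (1500.0, 1.0), (3000.0, 4.0), (5000.0, 4.0)]
print(ndsi(spectrum))
```

The index is bounded between -1 (only anthropophony) and +1 (only biophony), which is why a site with minimal low-frequency noise saturates near the maximum regardless of what is calling.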
A range of entropy indices are based on the assumption that the acoustic output of a community will increase in complexity with the number of vocalising individuals and species. A summary of the complexity of the sound is assumed to give a proxy of animal acoustic activity. Complexity here is used as a synonym of heterogeneity and many indices derive from classical ecological biodiversity indices. Shannon Entropy (Shannon & Weaver, 1949), Eq. 1, is favoured by ecologists as a measure of species diversity, where $p_i$ is the proportion of individuals belonging to the $i$th of $R$ species in the data set of interest; it quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset.
$$H' = -\sum_{i=1}^{R} p_i \ln p_i \qquad (1)$$
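As a small worked example of the Shannon index, the sketch below computes it from raw counts per species. The function name `shannon_index` and the use of the natural logarithm are choices made for this illustration.

```python
import math


def shannon_index(counts):
    """Shannon index from a list of non-negative counts per species."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one individual")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Four equally abundant species give the maximum for R = 4, ln(4).
print(shannon_index([10, 10, 10, 10]))
# A single dominant species leaves no uncertainty.
print(shannon_index([40, 0, 0, 0]))
```

The value depends only on how evenly individuals are spread across categories, which is the property the acoustic entropy indices borrow when species are replaced by frequency bands or time frames.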
The Acoustic Entropy Index, H (Sueur et al., 2008b) is described as the product of spectral (sh) and temporal (th) entropies which are calculated on the mean spectrum and Hilbert amplitude envelope of a time wave respectively. H ranges from 0 for pure tones to 1 for high-energy, evenly distributed sound. The index was first tested against simulated choruses, generated by mixing together samples of avian vocalisations and systematically varying the number of species in each track. H values increased with species richness S following a logarithmic model. Field trials were carried out in pristine and degraded African coastal forests and H was shown to reflect assumed variation in species richness (Sueur et al., 2008b). The study was in an area where animal acoustic activity was high and background noise low. When background noise (such as traffic) or broadband signals (such as rain, cicada or tropical cricket choruses) are higher, spectral entropy measures may give counter-intuitive results: values for times of low acoustic activity with relatively loud background noise, for example, would approach 1. This is a real issue in passive acoustic monitoring in both terrestrial and marine environments as the low sensitivity of rugged outdoor microphones tends to create a high background noise level.
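The noise sensitivity of H can be seen in a minimal sketch. It assumes the mean spectrum and the amplitude envelope have already been computed and are passed in as lists of non-negative values; the function names are hypothetical and the normalisation by the log of the number of values is the assumption that makes each entropy lie between 0 and 1.

```python
import math


def normalised_entropy(values):
    """Shannon entropy of non-negative values, scaled to [0, 1]."""
    n = len(values)
    total = sum(values)
    if n < 2 or total <= 0:
        raise ValueError("need at least two values with some energy")
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(n)


def acoustic_entropy(mean_spectrum, amplitude_envelope):
    """Product of spectral (sh) and temporal (th) entropies."""
    sh = normalised_entropy(mean_spectrum)
    th = normalised_entropy(amplitude_envelope)
    return sh * th


# Quiet, featureless broadband noise: flat spectrum, flat envelope.
print(acoustic_entropy([0.01] * 64, [0.01] * 100))
# A steady pure tone: all spectral energy in one bin.
print(acoustic_entropy([0.0] * 63 + [1.0], [1.0] * 100))
```

The first case approaches 1 although nothing is vocalising, because the entropies are computed on normalised distributions and ignore absolute level.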
The Acoustic Diversity Index (ADI) (Villanueva-Rivera et al., 2011) is a spectral entropy measure which summarises the distribution of the proportion of signals across the spectrum. The FFT spectrogram is divided into a number of bins (default 10) and the proportion of the signals in each bin above a threshold (default = −50 dBFS) is calculated. The Shannon Index (Eq. 1) is then applied, where $p_i$ is the fraction of sound in the $i$th of $R$ frequency bands. An evenness metric, the Acoustic Evenness Index (AEI), is similarly derived by calculating the Gini index (Gini, 1912) (commonly used by ecologists to estimate species evenness) on the spectrum. These relatively simple indices are shown to effectively reflect observed distinctions in gross acoustic activity, for example between dawn choruses and night activity, or between diverse habitats (mature oak forest, secondary forest, wetland and agricultural land).
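The two band-occupancy indices can be sketched as follows. This is an illustration under stated assumptions rather than the package code: the spectrogram is assumed to be already grouped into frequency bands of dBFS cell values, the threshold is taken to be -50 dBFS, and all function names are hypothetical.

```python
import math


def band_occupancy(bands_db, threshold_db=-50.0):
    """Fraction of spectrogram cells above the threshold in each band."""
    return [sum(1 for v in cells if v > threshold_db) / len(cells)
            for cells in bands_db]


def adi(occupancy):
    """Shannon index over the band occupancies, normalised to sum to 1."""
    total = sum(occupancy)
    if total == 0:
        return 0.0
    return -sum((s / total) * math.log(s / total) for s in occupancy if s > 0)


def aei(occupancy):
    """Gini coefficient of the band occupancies (0 = perfectly even)."""
    xs = sorted(occupancy)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


occ = band_occupancy([[-40, -60], [-40, -40], [-70, -70]])
print(occ, adi(occ), aei(occ))
```

Evenly occupied bands maximise the ADI and drive the AEI towards 0; activity concentrated in one band does the opposite.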
The spectral indices provide a statistical summary of the distribution of energy across the sample; typically 1–10 min are analysed at a time. These prove useful in long term studies or for observing gross changes in time or space. Seeking to capture subtler changes in behaviour and composition of vocalising communities, and to counter the noise-sensitivity of the entropy indices, the Acoustic Complexity Index (ACI) was developed specifically to capture the dynamic changes in the soundscape: “many biotic sounds, such as bird songs, are characterised by an intrinsic variability of intensities, while some types of human generated noise (such as car passing or airplane transit) present very constant intensity values” (Pieretti, Farina & Morri, 2011). The ACI is derived from measures of absolute difference in adjacent bins in a spectrogram and was shown to correlate with the number of bird vocalisations in a small scale spatial study in an Apennine National Park, Italy (Pieretti, Farina & Morri, 2011).
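A minimal sketch of the idea behind the ACI follows. It treats the whole spectrogram as a single temporal step, which is a simplification assumed here; the published index subdivides the recording into steps and its exact windowing is not reproduced. The function name `aci` is hypothetical.

```python
def aci(spectrogram):
    """Sum over frequency bins of (total absolute change between
    adjacent time frames) / (total intensity in that bin).

    spectrogram: list of rows, one per frequency bin, each a list of
    non-negative intensities over time.
    """
    total = 0.0
    for row in spectrogram:
        energy = sum(row)
        if energy == 0:
            continue  # silent bins contribute nothing
        change = sum(abs(b - a) for a, b in zip(row, row[1:]))
        total += change / energy
    return total


steady_hum = [1.0, 1.0, 1.0, 1.0]   # constant intensity, e.g. an engine drone
pulsed_call = [0.0, 1.0, 0.0, 1.0]  # intensity-modulated, e.g. a bird song
print(aci([steady_hum]), aci([pulsed_call]))
```

A constant-intensity source scores 0 however loud it is, while an intensity-modulated source scores highly, which is the behaviour the quotation describes. Because the differences are taken between adjacent frames, the result depends directly on the frame length chosen for the spectrogram.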
The Bioacoustic Index (Boelman et al., 2007) is presented as a measure of avian abundance and is calculated simply as the area under the mean frequency spectrum (minus the value of the lowest bin), providing a measure of both the sound level and the number of frequency bands used by the avifauna. It was used to investigate differences between exotic and native species in Hawaii and shown to be strongly correlated with counts from direct ornithological survey when calculated for single samples taken across a six week period.
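The area calculation can be sketched as below. The reading of "minus the value of the lowest bin" as subtracting the minimum spectral value is an assumption, as are the function name and the dB input format.

```python
def bioacoustic_index(mean_spectrum_db, bin_width_hz=1.0):
    """Area under the mean spectrum after subtracting its minimum value."""
    floor = min(mean_spectrum_db)
    return sum(v - floor for v in mean_spectrum_db) * bin_width_hz


quiet_call = [-60.0, -40.0, -60.0]
loud_call = [-60.0, -30.0, -60.0]   # same call, 10 dB louder at the microphone
print(bioacoustic_index(quiet_call), bioacoustic_index(loud_call))
```

The same call recorded 10 dB louder raises the index, so proximity to the microphone or an intrinsically loud species can change the value without any change in abundance.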
Initial studies are encouraging: indices have been shown to correlate with aurally identified changes in bird species richness (Depraetere et al., 2012) and reveal dynamic variation across landscape; however, there are many open questions, both methodological and theoretical. Existing indices are inherently likely to be affected by several factors including transitory or permanent background noise, variation in distance of the animal to the microphone and relative intensity of particular species call patterns. Theoretically, we are still far from understanding exactly what aspects of biodiversity these indices might represent (Pijanowski et al., 2011; Sueur et al., 2008b; Servick, 2014). This is highlighted in a recent temporal study of dissimilarity indices (Lellouch et al., 2014) in which indices were shown to correlate well with diversity of simulated communities, but did not track community composition changes in the wild, raising the question of what, if any, aspect of compositional diversity such indices represent.
By virtue of being based on either time-averaged spectrograms or amplitude changes in the time domain, we argue here that such indices are fundamentally limited in their ability to detect patterns across the spectro-temporal domain, which may be key to evaluating the acoustic dynamics of specific communities. Frequency-based indices can pick up on crude differences in gross frequency range; time domain indices can pick up changes in amplitude; both are inherently constrained in their ability to detect global spectro-temporal patterns created by cohabiting species interacting in an acoustic community. As the motivational premise of the community level approach assumes that spectro-temporal partitioning is responsible for structuring the soundscape (in both marine and terrestrial habitats), this is a significant constraint. Although existing indices are very cheap in computational terms, the fear is that what is gained in computational efficiency is lost in ecological efficacy. Rather than collapsing the signal into one or other domain, we propose that the theoretical and practical strands of ecoacoustic research could be advanced by developing tools in which time-frequency structure is preserved.
Sparse coding and latent component analysis
Time-frequency tradeoffs are an important issue in all signal processing tasks. Sparse coding is gaining popularity in brain imaging, image analysis and audio classification tasks as an alternative to vector-based feature representations in part because it is seen to have more time-frequency flexibility than standard Fourier transform representations. Sparse coding aims to construct efficient representations of data as a combination of a few typical patterns (atoms) learned from the data itself. For a given set of input signals, a number of atoms are sought, such that each input signal can be approximated sparsely by a linear combination of a relatively small number of this set of atoms (the dictionary). The dictionary size is higher than the dimensionality of the signal such that a subset of atoms can span the whole signal space, i.e., the dictionary is overcomplete (Scholler & Purwins, 2011). Sparse approximations of the signal are then constructed by finding the “best matching” projections of multidimensional data onto an over-complete dictionary, Matching Pursuit (MP) (Mallat & Zhang, 1993) being a popular choice.
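The greedy selection at the heart of Matching Pursuit can be sketched on a toy overcomplete dictionary. This is plain MP on one-dimensional vectors, written for illustration only; it is not the orthogonal variant and not the code of any particular package, and the helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalise(atom):
    """Scale an atom to unit Euclidean norm."""
    norm = math.sqrt(dot(atom, atom))
    return [a / norm for a in atom]


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse approximation over unit-norm atoms.

    Returns (coefficients, residual). At each step the atom with the
    largest absolute projection onto the residual is selected and its
    contribution is subtracted.
    """
    residual = list(signal)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_atoms):
        projections = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(projections[k]))
        coeffs[best] += projections[best]
        residual = [r - projections[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return coeffs, residual


# Three atoms in a two-dimensional space: an overcomplete dictionary.
D = [[1.0, 0.0], [0.0, 1.0], normalise([1.0, 1.0])]
x = [3.0 * a for a in D[2]]
coeffs, residual = matching_pursuit(x, D, n_atoms=1)
print(coeffs, residual)
```

With only the two axis-aligned atoms the signal would need two coefficients; the extra diagonal atom lets a single coefficient describe it, which is the sense in which an overcomplete dictionary supports sparser codes.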
Sparse decomposition using dictionaries of biologically informed time-frequency atoms such as Gabor and Gammatone functions, which are seen to resemble characteristics of cochlear filters, is intuitively attractive as it can provide a feature set which is oriented in a two dimensional time-frequency space with which to approximate the original signal. This has been shown to be more efficient than Fourier or wavelet representations (Smith & Lewicki, 2005) and to provide effective and efficient input features in a range of audio discrimination tasks in everyday sounds (Adiloglu et al., 2012), drum samples (Scholler & Purwins, 2011) and similarity matching of bioacoustic data (Glotin et al., 2013).
Probabilistic Latent Component Analysis (PLCA) is one of a family of techniques used for source separation, which similarly provides a tool for extracting components according to common frequency-amplitude statistics. PLCA is a probabilistic variant of non-negative matrix factorization (NMF) (Lee & Seung, 2001). It decomposes a non-negative matrix $V$ into the product of two multinomial probability distributions, $W$ and $H$, and a mixing weight, $Z$. In the auditory domain, $V$ would be a matrix representing the time-frequency content of an audio signal:
$$V \approx W Z H = \sum_{k=1}^{K} z_k \, w_k h_k \qquad (2)$$
where each column $w_k$ of $W$ can be thought of as a recurrent frequency template and each row $h_k$ of $H$ as the excitations in time of the corresponding basis. $Z = \mathrm{diag}(z)$ is a diagonal matrix of mixing weights $z$ and $K$ is the number of bases in $W$ (Weiss & Bello, 2010). Each of $V$, $w_k$, $z$ and $h_k$ corresponds to a probability distribution and is normalized to sum to 1.
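The decomposition into frequency templates W, mixing weights Z and time activations H can be sketched with a small expectation-maximisation loop. This is basic PLCA without shift invariance or sparsity priors, written in plain Python for illustration; the function names, the random initialisation and the iteration count are assumptions, not the settings used in this study.

```python
import random


def _normalise(vec):
    """Scale a non-negative vector to sum to 1 (uniform if all zeros)."""
    s = sum(vec)
    if s == 0:
        return [1.0 / len(vec)] * len(vec)
    return [x / s for x in vec]


def plca(V, K, n_iter=100, seed=0):
    """Fit V[f][t] ~ sum_k z[k] * W[k][f] * H[k][t] by EM.

    W[k] is the k-th frequency template, H[k] its activation in time.
    """
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    total = sum(sum(row) for row in V)
    P = [[v / total for v in row] for row in V]  # V as a joint distribution

    W = [_normalise([rng.random() + 0.1 for _ in range(F)]) for _ in range(K)]
    H = [_normalise([rng.random() + 0.1 for _ in range(T)]) for _ in range(K)]
    z = [1.0 / K] * K

    for _ in range(n_iter):
        acc_W = [[0.0] * F for _ in range(K)]
        acc_H = [[0.0] * T for _ in range(K)]
        for f in range(F):
            for t in range(T):
                joint = [z[k] * W[k][f] * H[k][t] for k in range(K)]
                s = sum(joint)
                if P[f][t] == 0 or s == 0:
                    continue
                for k in range(K):
                    # E-step: posterior of component k, weighted by observed mass
                    mass = P[f][t] * joint[k] / s
                    acc_W[k][f] += mass
                    acc_H[k][t] += mass
        # M-step: renormalise the accumulated mass
        z = _normalise([sum(row) for row in acc_W])
        W = [_normalise(row) for row in acc_W]
        H = [_normalise(row) for row in acc_H]
    return W, z, H


def reconstruct(W, z, H):
    """Model estimate of V, normalised to sum to 1."""
    K, F, T = len(z), len(W[0]), len(H[0])
    return [[sum(z[k] * W[k][f] * H[k][t] for k in range(K))
             for t in range(T)] for f in range(F)]


# Toy spectrogram made of two non-overlapping time-frequency blocks.
V = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
W, z, H = plca(V, K=2)
print(z)
```

Because every factor is a probability distribution, the components stay non-negative and can be read directly as spectral shapes and their activity over time.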
PLCA can be compared to more familiar component analysis tools such as Principal or Independent Component Analyses (PCA, ICA) and can be used to perform dimensionality reduction, feature extraction or to explore structure in a data set. The non-negativity constraint is a valuable property for audio and image decompositions, where non-negative representations are prevalent, as the non-negative elements are often perceptually meaningful decompositions which can be easily interpreted. By comparison, methods which do not enforce non-negativity are bound to return bases that contain negative elements and then employ cross-cancellation between them in order to approximate the input (Smaragdis, Raj & Shashanka, 2008). Such components are harder to interpret in a positive only setting and although useful for their statistical properties provide little insight.
Sparse and shift-invariant PLCA (SI-PLCA) extends PLCA to enable the extraction of multiple shift-invariant features from analysis of non-negative data of arbitrary dimensionality and was first demonstrated as an effective unsupervised tool for extracting shift-invariant features in images, audio and video (Smaragdis, Raj & Shashanka, 2008). The algorithm provides a very precise and perceptually meaningful description of content. A series of piano notes, for example, is automatically decomposed into a kernel distribution representing the harmonic series common to all notes, the peaks of the impulse distribution representing the fundamental frequency of each note and its location in time (Smaragdis, Raj & Shashanka, 2008). Weiss & Bello (2010) demonstrated its application in a segmentation task, showing SI-PLCA to be competitive with Hidden Markov Models and self-similarity matrices. More recently, Sarroff & Casey (2013) developed a shift and time-scale invariant PLCA which performed well against subjective human ratings of musical ‘groove’, a multi-dimensional rhythmic musical feature correlated with the induction of bodily movement.
A common strategy used throughout the NMF literature is to favour sparse settings in order to learn parsimonious, parts-based decompositions of the data. Sparse solutions can be encouraged when estimating the parameters of the convolution matrix by imposing constraints using an appropriate prior distribution (Smaragdis, Raj & Shashanka, 2008). Under the Dirichlet distribution for example, hyper-parameters can be set to favour a sparse distribution. In these cases, the algorithm will attempt to use as few bases as possible, providing an ‘automatic relevance determination strategy’ (Weiss & Bello, 2010): the algorithm can be initialised to use many bases; the sparse prior then prunes out those that do not contribute significantly to the reconstruction of the original signal. In the context of pop song segmentation, this enables the algorithm to automatically learn the number and length of repeated patterns in a song. In soundscape analysis, might this provide an ecologically-relevant indicator of the compositional complexity of an acoustic community?
In Music Information Retrieval (MIR) and composition tasks, SI-PLCA provides a tool for accessing perceptually meaningful decompositions: time-frequency shifted patterns in a dynamic signal. From the perspective of community level ecoacoustics we are not necessarily concerned with the identification of specific species, so much as achieving a numerical description of the qualitative patterns of interaction between them. By way of musical analogy, we don’t care what the specific instruments of the orchestra are, rather we wish to assess characteristics of the arrangement and how the voices interact as an ensemble toward a coherent global composition through time, timbre and frequency space. Frequency-based indices may succeed in tracking species richness in simulated communities by measuring gross changes in frequency band occupancy. Perhaps their failure to track variation in species richness in the wild is because the defining features of acoustic communities are global patterns of interaction across a more complex spectro-temporal domain, rather than frequency band occupancy or amplitude variation alone. As outlined above, current indices based on frequency or amplitude statistics inherently throw away information crucial to the analysis of spectro-temporal patterns: SI-PLCA provides a tool for extracting dynamic sound components grouped by common frequency-amplitude statistics, even when pitch or time shifted. That it has been demonstrated to be effective in extracting the perceptually-meaningful but nebulous concept of ‘groove’ (Sarroff & Casey, 2013) suggests potential as a tool for beginning to interrogate the multi-dimensional dynamic complexities of acoustic communities.
In this paper we take a first look at how these methods might provide a complementary approach to current acoustic indices for investigation of soundscape dynamics and ultimately for biodiversity assessment. Taking a small sample of field recordings across different terrestrial habitats in an Ecuadorian cloud forest reserve we compare existing spectral and temporal indices with sample analyses of a number of approaches to sparse approximation, including dictionaries built using mini-batch gradient descent, Gabor functions and SI-PLCA2D. The potential value of this approach is illustrated with example reconstructions from a new variant of SI-PLCA using dual dictionaries.
Methods and Materials
Study area and acoustic survey methods
The data reported here is a subset of that collected during an 8 week field survey (June–August 2014) in the Ecuadorian Andean cloud forest at the Santa Lucia Cloud Forest Reserve (SLR). The SLR (0°07′30″N, 78°40′3″W) is situated on the western (Pacific) slopes of the Andes in northwestern Ecuador and spans an elevational range of 1,400–2,560 m. The forest is lower montane rain forest (cloud forest). The area has a humid subtropical climate and is composed of fragmented forest reserves surrounded by a matrix of cultivation and pasture lands. It lies within the Tropical Andes biodiversity hotspot and exhibits high plant species endemism and diversity. Topography is defined by steep-sloping valley systems of varying aspect.
The SLR was awarded reserve status 20 years ago, prior to which areas of Primary Forest had been cleared for fruit farming. The SLR therefore consists of a complex mosaic of habitat types: Ancient Primary Forest (FP) punctuated by small areas of secondary regrowth of around 20 years (FS) and silvopasture (S), typically elephant grass pastures used as grazing paddocks for the mules, which provide local transport. These areas are less than 5 ha. In contrast to other studies where dramatically different sites have been used to validate indices, this complex patchy habitat provides subtle habitat gradients.
Acoustic data was collected using nine Song Meter SM2+ digital audio field recorders (Wildlife Acoustics), giving three replicates of each of the three habitat types. Minimum distance between recorders was 300 m to avoid pseudo sampling (the sound of most species being attenuated over this distance in cloud forest conditions). Altitudinal range was minimised. Recording schedules captured the full dawn (150 min) and dusk (90 min) choruses plus midday (60 min) activity; throughout the rest of the period 3 min recordings were taken every 15 min. Schedules ran for a minimum of 14 days at each study site.
The SM2+ is a schedulable, off-line, weatherproof recorder, with two channels of omni-directional microphone (flat frequency response between 20 Hz and 20 kHz). Gains were set experimentally at 36 dB and recordings made at 16 bit with a sampling rate of 44.1 kHz. All recordings were pre-processed with a high pass filter at 500 Hz (12 dB) to attenuate the impact of the occasional aircraft and local generator noise.
A local expert ornithologist carried out point count surveys (Ralph, Sauer & Droege, 1995), noting all birds seen and heard at each survey point for 10 min periods during the dawn chorus. A record was made for each individual, rather than individual vocalisations, and distance estimates given, providing species presence–absence and abundance measures.
For the purposes of this illustrative exercise, analyses were carried out on dawn chorus recordings from just one day at one recording station for each of the three habitat types sampled. A range of frequency and time domain indices were calculated: NDSI, H, ADI, AEI, ACI and BI. Indices were calculated for the same 10 min periods during which point counts were made at each site. Calculations were made using the seewave (Sueur, Aubin & Simonis, 2008a; Available at: http://cran.r-project.org/web/packages/seewave/index.html) and soundecology (Available at: http://cran.r-project.org/web/packages/soundecology/index.html) packages in R.
Audio spectrum approximation methods
Three approaches to audio decomposition are illustrated using the Bregman Media Labs Audio Patch Approximation Python package (https://github.com/bregmanstudio/audiospectrumpatchapproximation): dictionary learning using mini-batch gradient descent, a Gabor field dictionary, and shift-invariant 2D Probabilistic Latent Component Analysis (SI-PLCA2D). Each uses Orthogonal Matching Pursuit (OMP) to build the component reconstructions. Samples were extracted from analyses of 1 min sections of the field recordings. These examples are aimed at illustrating the potential of a two-dimensional (2D) atomic rather than 1D vector approach in general, rather than experimental validation of any particular algorithm. Default parameters were used in all cases.
A potential future direction is illustrated using a SI-PLCA variant (SI-PLCA2) using 2D dual dictionaries (Smaragdis & Raj, 2007; Weiss & Bello, 2010; Sarroff & Casey, 2013) based on frequency * local time functions and frequency-shift * global time-activations. The expectation-maximisation (EM) algorithm (Smaragdis & Raj, 2007) is used to build component reconstructions.
As described in ‘Sparse Coding and Latent Component Analysis’, the algorithm returns a set of k components from a maximum of K: independent component reconstructions, time-frequency kernels and shift-time activation functions. Entropies of each are also returned. These example analyses are used to illustrate SI-PLCA as a potentially rich tool for future research in investigating the complex quasi-periodic signals of wild soundscapes.
Results and Discussion
Species composition of acoustic communities
The species observations for each site, shown in Table 1, reveal little variation in overall abundance or species number between the sites when seen and heard records are considered together. Several species are observed in all sites; others are observed only in one habitat type. Discounting the seen-only counts, the highest number of species, and individuals, was recorded at S, with least heard at FP. The spectrograms and mean spectrum profiles (Fig. 1) for these recordings suggest that this information is present in the soundscape. The number of shared species between sites results in acoustic communities with an overall similar overlap, differentiated by calls of ‘keynote’ species. Each acoustic community occupies a broadly similar frequency range, with variation in the peaks of spectral profiles according to the prevalence of calls of habitat-specific species. FP appears to have the lowest overall activity, in line with the relatively small number of species observed.
Despite occupying an overall similar frequency range and not differing dramatically in abundance, each site is distinctly characterised by differing quasi-periodic patterns of calls. The patterns observed in the 1 min excerpts shown continued for the full 10 min sample. The soundscape is structured, not just by repetitions of specific species calls, but by turn taking, i.e., interactions between species. This is most evident in listening, and can be observed visually as an interplay of periodic gestures in the spectrogram. It is precisely this complex of interacting periodic structures which we wish to evaluate under the soundscape approach, but which is impervious to analysis by current indices.
Values for each of the acoustic indices calculated for the three habitats are given in Table 2 and shown as bar plots in Fig. 2. As we might expect given the minimal anthropogenic noise and broadly similar spectral profile, the NDSI reports near maximum values for each site. The global complexity of each scene is high; it is no surprise then that entropy indices approach 1 and differences between sites are minimal. The ADI reports a small variation, following the rank-order pattern of species heard at each site. Differences between sites in Sueur’s spectral and temporal entropies, and therefore in overall H, are minimal. ACI similarly shows small variation between sites. This index in particular is very sensitive to the size of the analysis window and requires further exploration to establish which aspects of community composition it may be assessing. BI values report the differences in overall acoustic energy, observable in the mean spectrum plot (Fig. 1D), with the highest value at FS, FP being slightly higher than site S. These basic features of the acoustic recordings are at odds with the field observations of abundance and species numbers. An increase in overall energy could be due to certain individuals having intrinsically louder calls, calling more frequently, or simply being closer to the microphone. In validation studies the latter could be countered by factoring in field-based point count distance measures (recorded, but not included here) and call frequencies, as well as tallies of individual vocalisations, the latter being expedited by the use of automatic segmentation software (as in Pieretti, Farina & Morri, 2011).
The key issue raised here, however, is that in providing summaries of frequency or temporal amplitude profiles and magnitude differences, current indices are not only sensitive to these largely irrelevant variations in overall amplitude, but are all insensitive to the periodic structures which uniquely characterise the three soundscapes.
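The insensitivity described above can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the index implementations used in this study: it builds a synthetic spectrogram in which two callers take turns, then shuffles its frames in time, and computes a normalised Shannon entropy of the mean spectrum and of the amplitude envelope (in the spirit of, but not identical to, Sueur's spectral and temporal entropies). The function name normalised_entropy and all data values are hypothetical.

```python
import numpy as np

def normalised_entropy(values):
    """Shannon entropy of a non-negative vector, scaled to the range 0..1."""
    p = values / values.sum()
    p = p[p > 0]  # skip empty bins; 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum() / np.log2(values.size))

n_freq, n_time = 32, 200

# Toy scene (hypothetical data): two callers in separate bands take turns,
# each calling for 10 frames while the other is silent.
periodic = np.full((n_freq, n_time), 0.01)
for t in range(n_time):
    if (t // 10) % 2 == 0:
        periodic[5:8, t] += 1.0
    else:
        periodic[20:23, t] += 0.5

# Same frames in random order: the turn-taking pattern is destroyed.
rng = np.random.default_rng(0)
shuffled = periodic[:, rng.permutation(n_time)]

for name, spec in [("periodic", periodic), ("shuffled", shuffled)]:
    h_f = normalised_entropy(spec.mean(axis=1))  # frequency-domain summary
    h_t = normalised_entropy(spec.sum(axis=0))   # time-domain envelope summary
    print(f"{name}: Hf={h_f:.6f} Ht={h_t:.6f}")
```

Both summaries depend only on the set of frames (or the set of envelope values), not on their order, so the two scenes receive the same scores even though only one of them contains the alternating call pattern.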
Dictionaries and sparse approximations for FP using mini-batch gradient descent, Gabor atoms and SI-PLCA2D are shown in Fig. 3. The input for each is a log-frequency spectrogram (constant-Q transform) of samples from the field recordings, as shown in Fig. 1. Example dictionaries (Figs. 3A, 3C, 3E) and sparse approximations of the input spectrogram (Figs. 3B, 3D, 3F) are shown for each method (component reconstructions not shown). Comparing the sparse approximations with the original spectrogram for FP (see Fig. 1A), the superior performance of SI-PLCA2D over the other two methods is evident.
The Gabor field dictionary has an intuitive advantage over vector descriptors in representing features oriented in time-frequency space. The dictionary learned under mini-batch gradient descent similarly exhibits time-frequency atoms differing subtly in orientation. The SI-PLCA2D dictionary, however, comprises a collection of spectrum patches with a variety of micro-structures across a range of orientations and spreads. In terms of the filter model which motivates the use of Gabor atoms, the Gabor and mini-batch dictionaries could be described as having relatively homogeneous widths across the dictionary; the SI-PLCA2D dictionary, by contrast, contains atoms differing not only in time-frequency orientation but also in spectral width, atoms 0, 2, 3 and 4 being considerably more focused than 1 and 5 (Fig. 3E). This is an appealing property for the analysis of broad-spectrum versus pitched soundscape elements.
Full outputs for all three sites using the SI-PLCA2D algorithm with dual 2D dictionaries are shown in Figs. 4, 5 and 6. Each 10 min site recording is sampled by taking 16 time windows of around 4 s each from across the file, arranged in order. The input is the log-frequency spectrogram of these samples, as before. Extensive analysis of larger data sets across more diverse soundscapes is needed before we can begin to evaluate the ecological significance or application of this approach, but a number of promising observations can be made. As can be seen in Figs. 4A, 5A, 6A, the component reconstructions appear faithful to the original spectrogram. The individual component reconstructions (Figs. 4C, 5C, 6C) pull out clearly distinct components. This is clearest in S3 (Fig. 6C) where the first component is broadband ambient noise, and each of components 1–5 appears as a distinct ‘voice’ grouped according to both spectral range and spectro-temporal periodic gesture.
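The sampling step can be sketched as follows. This is an illustrative assumption rather than the procedure used here: the text states only that 16 windows of around 4 s are taken from across each 10 min file, so the even spacing, the exact 4 s duration and the function name window_bounds are all hypothetical.

```python
def window_bounds(total_s, n_windows, window_s):
    """Return (start, end) times in seconds for evenly spaced windows.

    Even spacing is an assumption; the first window starts at 0 and the
    last one ends at total_s.
    """
    if n_windows < 1 or n_windows * window_s > total_s:
        raise ValueError("windows do not fit in the recording")
    if n_windows == 1:
        return [(0.0, float(window_s))]
    step = (total_s - window_s) / (n_windows - 1)
    return [(i * step, i * step + window_s) for i in range(n_windows)]

# 10 min recording, 16 windows of 4 s each, kept in temporal order.
bounds = window_bounds(total_s=600.0, n_windows=16, window_s=4.0)
for start, end in bounds[:3]:
    print(f"{start:7.2f} s to {end:7.2f} s")
```

The spectrogram frames falling inside each window would then be concatenated in order to form the input to the decomposition.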
The time-frequency kernels provide a lower-dimensional representation of components with apparently similar characteristics: compare each component in Figs. 4C and 4D, for example. The automatic relevance determination feature deserves further investigation as a quick and dirty proxy for community composition assessment. In this example, k = 4 for FP, k = 7 for FS and k = 6 for S. Does k increase with the number of vocalising species? Might it reflect the complexity or ‘decomposability’ of a scene in some way?
The entropies of each distribution are given in the subfigure captions of Figs. 4, 5 and 6. Whether these can provide useful information as a difference measure, either between components within a particular reconstruction or between kernels extracted from different soundscapes, deserves further investigation. No conclusions can be drawn from this illustrative analysis, but it raises a number of questions for future research: (1) Are the component reconstructions meaningful soundscape objects in ecological terms? Are vocalising species separated in any ecologically meaningful way, either by soundscape component (geophony, biophony, anthrophony) or acoustic community (species, functionality etc.)? (2) Might the statistics generated be meaningful? Does the number of components (k) returned reflect ‘complexity’ or ‘decomposability’ in a way which may reflect the status of the acoustic community? Could the entropy summaries of each component be used as a measure of diversity within or between communities?
The ability of PLCA to separate streams of distinct sonic components is well recognised. Within the conceptual framework of ecoacoustics, such techniques provide a means to investigate the composition of the acoustic community as a whole in terms of dynamic interactions between the spectro-temporal patterns of vocalising species. They offer a new tool with which to begin to interrogate the concept of the acoustic niche experimentally, in order to develop the understanding necessary to create more ecologically meaningful monitoring tools.
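The separation of components by PLCA can be sketched with a deliberately simplified model. The code below is not SI-PLCA2D: it is plain two-factor PLCA fitted by expectation-maximisation, with no shift invariance, no 2D kernels, no sparsity priors and no automatic relevance determination. The function name plca, the toy spectrogram and all parameter values are hypothetical.

```python
import numpy as np

def plca(V, n_components, n_iter=200, seed=0):
    """Fit V(f,t) ~ sum_z P(z) P(f|z) P(t|z) by EM (plain, not shift-invariant)."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    V = V / V.sum()  # treat the spectrogram as a joint distribution
    eps = 1e-12
    pz = np.full(n_components, 1.0 / n_components)
    pf = rng.random((n_freq, n_components))
    pf /= pf.sum(axis=0)
    pt = rng.random((n_time, n_components))
    pt /= pt.sum(axis=0)
    for _ in range(n_iter):
        # E-step: posterior P(z | f, t), shape (n_freq, n_time, n_components)
        joint = pz * pf[:, None, :] * pt[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: re-estimate the distributions from posterior-weighted energy
        weighted = V[:, :, None] * post
        mass = weighted.sum(axis=(0, 1))
        pf = weighted.sum(axis=1) / (mass + eps)
        pt = weighted.sum(axis=0) / (mass + eps)
        pz = mass / mass.sum()
    return pz, pf, pt

# Toy scene (hypothetical data): a low and a high caller alternate every 5 frames.
n_freq, n_time = 16, 40
low = np.zeros(n_freq)
low[2:5] = 1.0
high = np.zeros(n_freq)
high[10:13] = 1.0
turn = (np.arange(n_time) // 5) % 2
V = np.outer(low, (turn == 0).astype(float)) + np.outer(high, (turn == 1).astype(float))

pz, pf, pt = plca(V, n_components=2)
recon = (pf * pz) @ pt.T  # model reconstruction of the normalised input
l1_error = np.abs(V / V.sum() - recon).sum()
# Shannon entropy (bits) of each component's spectral distribution P(f|z)
kernel_entropy = [float(-(p[p > 0] * np.log2(p[p > 0])).sum()) for p in pf.T]
print("P(z):", np.round(pz, 3), "L1 error:", round(float(l1_error), 4))
print("kernel entropies (bits):", np.round(kernel_entropy, 3))
```

In this toy case each component captures one caller together with its activation pattern in time, and the entropy of each spectral distribution summarises how concentrated that component is in frequency. Whether such summaries carry ecological meaning for real recordings remains an open question.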
Summary and Future Work
Monitoring subtle changes in complex ecosystems is crucial for ecological research and conservation but far from straightforward. Acoustic indices hold promise as a rapid assessment tool, but are subject to the same trade-off between quality and quantity as traditional ecological research: any metric necessarily throws away some information. In this paper we have provided an overview of the motivational premises of community-level ecoacoustics, including the concept that acoustic communities may be structured according to competition across acoustic niches through spectro-temporal partitioning. We suggest that existing indices operating in the time or frequency domain may be insensitive to the dynamic patterns of interaction in the soundscape which characterise specific acoustic communities, and propose SI-PLCA2D as a promising new tool for research. This was illustrated with example analyses of tropical dawn chorus recordings along a gradient of habitat degradation. It seems likely that if acoustic niches exist, they do not lie neatly along 1D vectors in the frequency or time domain but dance dynamically across pitch-timbre-time space. SI-PLCA2D and related sparse-coding methods are computationally expensive and do not offer an instant, ready-to-use proxy for biodiversity monitoring. What they do provide is a tool for extracting shift-invariant spectro-temporal patterns in a dynamic soundscape, structures which are impervious to analysis with current tools. In future work we will test the approach on more extensive data sets to establish the ecological meaning of the extracted components.