Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on January 28th, 2016 and was peer-reviewed by 2 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on March 1st, 2016.
  • The first revision was submitted on May 9th, 2016 and was reviewed by the Academic Editor.
  • The article was Accepted by the Academic Editor on May 23rd, 2016.

Version 0.2 (accepted)

· May 23, 2016 · Academic Editor

Accept

Thank you for your resubmission. I have examined your manuscript and rebuttal and believe that the manuscript is now suitable for publication.

Version 0.1 (original submission)

· Mar 1, 2016 · Academic Editor

Major Revisions

On the basis of the comments of the two reviewers, I recommend careful changes to this manuscript. I look forward to your revision.

Reviewer 1 ·

Basic reporting

My only "Basic Reporting" criticism given the guidelines is that the data has not been made clearly available to readers, which seems to go against Peer J guidelines.The subsection "Data" in "DATA AND METHODOLOGY" presents no indication of where others can find the data sets used.

Experimental design

No comments

Validity of the findings

No comments

Additional comments

I found this to be a well-written and interesting paper in which the authors propose a new way to screen for generic "audio events" in acoustic recordings. Sometimes I felt that words were used somewhat arbitrarily, and I provide further comments on this below.
While screening “recordings” for “audio events” is an interesting question, and one likely to become more and more important as recordings become more and more common, it is at odds with the perhaps more usual task found in the literature (that I am most familiar with), where one tries to identify specific signals (e.g. a coqui sound, a bird song, a whale sound) in recordings. Therefore, I believe the authors must bring this distinction forward and, in particular, must find a precise way to define what an “acoustic event” is. If that definition is unclear, how can we judge whether a given algorithm is useful or efficient? One needs to define unambiguously what the “audio events” (i.e. the signal) to be detected are in order to quantify true positives, false positives, etc. This must be done from the start, around line 43 or before, when the term is first used.
When one reads the abstract sentence “Our goal is to develop an algorithm that is not sensitive to noise, does not need any prior training data and works with any type of audio event.”, or the even more optimistic introduction sentence, “What is needed is an algorithm that works for any recording, is not targeted to a specific type of audio event, does not need any prior training data, is not sensitive to noise, is fast and requires as little user intervention as possible.”, one has to wonder: what is the catch? One usually says there are no free lunches in statistics, so what is the price to pay? The price you pay is specificity. When you choose to detect “audio events” without clearly defining them a priori, you actually lose the ability to do many of the analyses you could do when targeting a specific signal. This is fine if it is intended, but that must be stated explicitly, and the authors should discuss clearly the notion that this is a precursor step to other, more elaborate and fine-tuned procedures.
This is to me an important point: all of these discussions regarding the distribution of variables at the times events were detected are not really possible unless we also know the distribution of those variables over the entire duration of the recordings. It would mean very different things to have many events at 18:00 hours if most of the recordings were made at 18:00 hours versus if few recordings were made at 18:00! (See e.g. the discussion around lines 292-294.)
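To make this concrete, here is a minimal sketch (not from the manuscript; all names are hypothetical) of how event counts could be normalized by recording effort before interpreting a peak at a given hour:

```python
import numpy as np

def effort_normalized_density(event_hours, recording_hours, n_bins=24):
    """Events per unit of recording effort, by hour of day.

    event_hours: hour-of-day of each detected event.
    recording_hours: hour-of-day of each recorded unit of effort
    (e.g. one entry per recorded minute). Both are hypothetical inputs.
    """
    bins = np.linspace(0, 24, n_bins + 1)
    events, _ = np.histogram(event_hours, bins=bins)
    effort, _ = np.histogram(recording_hours, bins=bins)
    # A raw spike of events at 18:00 means little if most of the
    # recording effort was also at 18:00; divide to correct for that.
    return np.where(effort > 0, events / np.maximum(effort, 1), np.nan)
```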
More detailed/specific comments that might be used to improve the paper follow below.
Line 10-11 – I would say that “thousands of recordings” is wording to be avoided. A recording is a meaningless unit. It should be easy to replace with something that is unambiguous to readers. The same applies on line 48, where “recordings” is again used as a unit; it is not a sensible one. Please check the entire text for the use of the word, as sometimes it makes sense and other times it does not. In particular, when you describe the data, “recordings” is again not a useful description. I know you are thinking about Arbimon, but this must be general. How long is a recording, say?
Line 46-47 – the claim that one can “easily draw a boundary around any audio event” is strictly not true, especially at low signal-to-noise ratios, or when a given frequency band is saturated and many overlapping events occur continuously over a long time period. Please reword.
While I understand the purpose of the last sentence in the introduction, I believe it includes several bits of information that belong in the methods. It makes sense to end the introduction by laying out the paper that lies ahead, but one should avoid technical details like “2051 manually labelled audio events” or “20 recordings”. One also gets confused because the text says this is the workflow of the article, while the figure states it is the workflow of the AED methodology. These can naturally be closely related, but they are not one and the same.
Figure 2 legend – to me “Yen threshold” means nothing, and legends should be self-explanatory. I suggest removing it, as it is mostly diversionary here; it is covered in the text after line 153.
Line 89 – here you refer to Fig. 2C, but Fig. 2B has not been mentioned yet. It would flow better if the description matched the figure order.
Property A1 – what is tau? And eta? I mean, I know, but it would be useful to define these rigorously.
Line 118 – I believe the notation needs tweaking, as S_dB(t, f) contains no index i, which is what it needs to be summed over?
Line 103-109 – I think you cannot extend this to infinity. Would the algorithm not break down if an entire recording had a given frequency band saturated? Please discuss, or change the wording.
Line 124 – “valuation”?
Line 132 – Surely this is over a given time frame in practice? What is the time frame considered? Will you discuss the sensitivity to different time frames?
Line 133 – There is something missing, like an “at least” before the “1-\rho(f) proportion”, right?
Lines 141-142 – I find this confusing… would it not be clearer if you deleted just “by estimating it as” and replaced it with “:”?
Just before line 146 – the “r > 0”: is this in time, in frequency, or in both?
Line 146 – Equations should read just as text. Therefore, the “Where” must be “where”, not indented, and there should be no full stop after the previous equation. The same applies in line 201. Check the remaining instances.
Line 148 – reword the arbitrary “the estimator should have a small response”. What are you referring to? What is an estimator response?
Line 155 – I understand what you mean, but clarify “image values”, since strictly there are no images here.
Line 156 – “Th” should be “T”?
Line 157 – the contiguous here refers to both time and frequency?
Line 168 – not sure why you need the descriptor “the sites dataset”? The first data set was also collected at sites…
Line 175 – as I said, here is a good example. You set out to find “events” here. What is an event? Is this a circular definition to some extent, since events are sounds you are able to classify as events? In particular, how do these relate, e.g., to the three types of sound you mention in the first paragraph of the introduction?

Line 179 – 21 by 21 and alpha = 5. These are fundamental details. Are these values optimal in any way? Why? What is the sensitivity of the method to changing them? What are the recommendations for users?
Line 188 and 191 (2 times) – I think the word “automatically” needs to be added for clarity, e.g. “were detected over the total count” becomes “were automatically detected over the total count”.
Line 187-192 – I would introduce the wording true and false positives here, and note explicitly that this is a typical 2-by-2 confusion matrix, but with one of the cells absent (i.e. there are no true negatives).
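A small sketch of the point (hypothetical helper, not the authors' code): with the true-negative cell absent, precision and recall are still well defined, but accuracy is not.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts.

    There is no true-negative cell here: in a detection task the set of
    "non-events" is not countable, so accuracy is undefined, while
    precision and recall remain meaningful.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```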
Line after 200 – you need more rigorous wording. “To measure the degree of separation of each variable on the audio event density”… there is no such characteristic as “variable separation”. What do you mean exactly?
Line 201 – i-th should be i^{th} (superscript).
Line 203 – reword “The 2-variable marginal distributions” to “bivariate”?
Line 205 – why was this one chosen? Just as an example? If so, say so.
Line 208 – words missing or a plural vs. singular mistake.
Table 1 legend – last sentence – you say “arbitrary number of false negative examples”. Don’t you mean “true negatives”?
Figure 3 legend – need to label the columns, since each column is a different type of image, right?
Figures 4, 5, 6 – you need at the very least to state explicitly that dark is less and white is more, but a scale would be helpful!
Figure 6 – the y_max by tod plot. What is the reason for the weird vertical bands? It is hard for me to imagine what would create these tod “discontinuities”.
Line 228 – so does “cov”, in fact with a larger H value than y_max. Why do you not mention it?
Line 236-237 – explanation? Is this a feature or an artefact? If you raised the question you can’t leave it unanswered.
Line 243 – it seems inconsistent to me to do this “separation” (and I do not like the word, as I said above) only visually in 2D but with an H statistic in 1D. Why this choice?
Line 247 and Figure 7 – 6? Which ones? I cannot see this. Please mark them all in Figure 7.
Figure 7 legend – “Close-up on the first area”… why the first?
Line 251 and several others – all Latin names must be italicized, including in the references.
Line 262-264 – Discuss problems with intense chorus on a given noise band.
Line 281 – the use of the word “recordings” here is inconsistent (cf. lines 168-171); you say 6, and these are presumably what you referred to as the dataset?
Line 282 – remove the two instances of the word “any”; they are not useful. But this needs added clarification; I am not sure what is meant here.
Lines 300-303 – I would like to see comments on whether these have also sometimes been missed or not.
Line 334 – “University Press”
Line 344 – incomplete ref?
Line 347 – “Conference on…”- incomplete
Line 355 – “pages”???
Line 366 – species name needs italics

Reviewer 2 ·

Basic reporting

The English language is good.
The mathematical description has not helped the presentation. It took time to understand and did not add much more than could have been said quickly in words. In particular, symbols are used which are not defined, e.g. eta, n, tau. The numeral '1' is used for the indicator function, but it is not stated to be an indicator function, and it is doubly confusing because bold or blackboard-bold font is not used.

T in the equation just above line 154 is not defined. I had to go online to see what it was. It is the threshold that is later referred to using the letter "Th". All this suggests that not a lot of care was put into proofreading.
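For readers who also have to look it up: Yen's method is an entropy-based automatic threshold. A minimal sketch using scikit-image's implementation (the spectrogram here is synthetic; this is an illustration, not the authors' code):

```python
import numpy as np
from skimage.filters import threshold_yen

# Synthetic stand-in for a dB spectrogram, shape (freq bins, time frames).
spectrogram_db = np.random.default_rng(0).normal(loc=-60, scale=10,
                                                 size=(256, 1000))

T = threshold_yen(spectrogram_db)  # the quantity the text calls T (later "Th")
mask = spectrogram_db > T          # candidate event pixels above threshold
```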

The images are just of sufficient resolution to support the text.

Figures 4, 5 and 6 are puzzling in that the images at the top and bottom of the left column have different time scales. Also, it appears from the top left image that more events happen at night than in the day. The reader needs a lot more help to understand these figures.

Experimental design

The method for subtracting the baseline value is valid, although it is based on an important assumption. The signal model is not that different from an additive noise model. The authors state that they do not make an assumption about the distribution of the noise (symbol epsilon in the text), and this is a nice feature. (Although later the use of 5-percentile tails to establish cutoffs implies that something like Gaussian noise is assumed.)
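As an illustration of the kind of distribution-free baseline subtraction being described (a sketch under my own assumptions, not the authors' exact estimator; the percentile choice here is arbitrary):

```python
import numpy as np

def subtract_band_baseline(spec_db, q=50.0):
    """Remove a per-frequency-band baseline from a (freq, time) spectrogram.

    The baseline is a percentile of each band over time, so no parametric
    form (Gaussian or otherwise) is assumed for the noise distribution.
    """
    baseline = np.percentile(spec_db, q, axis=1, keepdims=True)
    return spec_db - baseline
```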

The nicest feature of the method is the way the Range estimator is calculated and the use of entropic correlation. I am not aware of this being done elsewhere, and it is the main interesting result of the paper.

The authors report their accuracy based on the overlap of an observed rectangle with a predicted rectangle. This criterion is far too liberal, because even a slight overlap can lead to a correct prediction for the wrong reasons. I would suggest that at least a 50% overlap be required, which is indeed the case in some of their images.
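The stricter criterion could be stated precisely along these lines (hypothetical code, with rectangles as (t0, f0, t1, f1) boxes in the spectrogram plane):

```python
def overlap_fraction(label, pred):
    """Fraction of the labelled rectangle's area covered by the prediction."""
    w = min(label[2], pred[2]) - max(label[0], pred[0])
    h = min(label[3], pred[3]) - max(label[1], pred[1])
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / ((label[2] - label[0]) * (label[3] - label[1]))

def is_true_positive(label, pred, min_overlap=0.5):
    # Require at least 50% overlap rather than any overlap at all.
    return overlap_fraction(label, pred) >= min_overlap
```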

Validity of the findings

The authors imply that there are few thresholds or critical parameters that must be tuned in their system. However, the 21x21 window size is surely important and must have been determined by trial and error. If an event were too large in area, this window would leave "holes" in the event, due to the way the range estimator is calculated. In general I felt the authors were too uncritical of their method. The important assumption of rho(f) < 0.5 is hidden in extensive and not very helpful mathematics. It is a reasonable assumption (other methods have to make similar assumptions), but the authors claim some superiority for their method.
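To illustrate the windowing concern (one plausible reading of a 21x21 Range estimator, not the authors' exact implementation):

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def local_range(spec_db, size=21):
    """Local range (max minus min) over a size x size window, per pixel.

    For an event much larger than the window, interior pixels see little
    local variation, so the range is low there: the "holes" noted above.
    """
    return maximum_filter(spec_db, size=size) - minimum_filter(spec_db, size=size)
```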

Additional comments

This could be a nice paper. The method of using the Range estimator is very nice and something I am sure others will emulate when the paper is published. However the paper is inadequate in three respects:
1) The authors have been too uncritical in promoting the advantages of their method.
2) The estimates of accuracy are based on a very easy success criterion.
3) There is no comparison with another method. I totally agree that a fixed threshold technique is not useful but there are better techniques to compare their method with.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.