To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
We thank you for another thorough revision of the manuscript, which now appears to be in an acceptable form.
The revised version of the paper is a significant improvement compared to the original submission. Yet, there are still some comments and suggestions by one of the reviewers, which should be taken into account prior to acceptance.
Basic reporting looks great. I'd suggest the authors consider renaming the section "Transcription Evaluation" to "Experimental Method", "Experimentation", or something to that effect. Parameter sweeps and tuning are still part of finding the solution that's being evaluated, which roll up to experimentation.
Overall, the experimental design is reasonably solid. I do, however, have two smaller comments.
First, regarding L396 on page 12: The text states that "recordings" are partitioned into two sets; is this split conditional on the source songs? For example, imagine two GuitarPro files A and B. A is passed through three guitar models producing A1, A2, and A3, and B is rendered to B1, B2, and B3. Can A and B both occur in the test set? In the spirit of true scientific rigor, audio rendered from the same symbolic source really shouldn't fall across partitions, and I'm curious whether or not this is considered here.
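To make the partitioning concern concrete, here is a minimal Python sketch of a split that is conditional on the symbolic source. The function name, the `(source_id, recording_id)` pairs, and the 50% test fraction are hypothetical and not taken from the manuscript; the sketch only illustrates keeping every rendering of one GuitarPro file on the same side of the split.

```python
import random
from collections import defaultdict


def split_by_source(recordings, test_fraction=0.2, seed=0):
    """Split (source_id, recording_id) pairs so no source spans both partitions."""
    by_source = defaultdict(list)
    for source_id, recording_id in recordings:
        by_source[source_id].append(recording_id)

    # Shuffle the *sources*, not the individual renderings.
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)
    n_test = max(1, round(test_fraction * len(sources)))
    test_sources = set(sources[:n_test])

    train = [r for s in sources if s not in test_sources for r in by_source[s]]
    test = [r for s in sources if s in test_sources for r in by_source[s]]
    return train, test


# Hypothetical example: two GuitarPro files, each rendered with three guitar models.
recordings = [("A", "A1"), ("A", "A2"), ("A", "A3"),
              ("B", "B1"), ("B", "B2"), ("B", "B3")]
train, test = split_by_source(recordings, test_fraction=0.5)
```

With this grouping, A1, A2, and A3 always land together, so the test set never contains audio rendered from a source seen in training.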
Second, regarding L527-528 on page 16: The authors intend to measure the effect of model depth, i.e. the number of layers, independently, but keep the width, i.e. number of nodes, of each layer constant. This isn't truly an independent assessment, because the number of parameters (and thus model complexity) is certainly increasing. To truly measure the effect of layers, the number of *parameters* should be held constant, as this would offer insight into what may be gained / lost with depth. Otherwise, one would expect that over-fitting will almost certainly happen as the number of parameters increases, especially here given the small size of the dataset and the minimal variance of the sound fonts considered.
That said, I don't think this is an irrelevant experiment, but it's not testing the hypothesis set out by the authors. Rather, I'd be willing to wager that training-set and test-set performance are going in opposite directions here. Reporting performance on the training set (not given) would provide some insight into what's happening here.
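One way to set up the constant-parameter comparison suggested above is sketched below in Python. It assumes a plain fully connected network with equal-width hidden layers. The input size, output size, and reference width are hypothetical placeholders, not values from the manuscript.

```python
def mlp_param_count(n_in, n_out, width, depth):
    """Weights plus biases of a fully connected net with `depth` equal-width hidden layers."""
    first = (n_in + 1) * width                   # input -> first hidden layer
    hidden = (depth - 1) * (width + 1) * width   # hidden -> hidden layers
    last = (width + 1) * n_out                   # last hidden layer -> output
    return first + hidden + last


def width_for_budget(n_in, n_out, depth, budget):
    """Largest hidden width whose parameter count does not exceed `budget`.

    Returns 1 if even a width-1 network is over budget.
    """
    width = 1
    while mlp_param_count(n_in, n_out, width + 1, depth) <= budget:
        width += 1
    return width


# Hypothetical sizes: 1024 input features, 50 output units, reference net of 2 x 400.
N_IN, N_OUT = 1024, 50
budget = mlp_param_count(N_IN, N_OUT, width=400, depth=2)
matched = {d: width_for_budget(N_IN, N_OUT, d, budget) for d in (2, 3, 4)}
```

Deeper networks then get narrower layers, so any change in accuracy can be attributed to depth rather than to a growing parameter count.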
The majority of the findings reported by the authors are substantiated and insightful. There are a few, though, that I would suggest the authors revisit.
L502-503: The authors offer that "steel samples are generally louder than the electric or nylon acoustic samples", and perhaps that's to account for a difference in performance between the reference system (Zhou) and the one proposed here. I'm curious why one system would be more or less affected by the gain of a signal than the other. How could this hypothesis be tested? After all, symbolic MIDI / GuitarPro files still encode a range of velocity values, no? I'd suspect it has more to do with the timbre of the sounds than the loudness, in that the nylon and electric guitar are "closer" to each other than to the steel sound font (damped overtone series?).
L522-525: In motivating the exploration of multiple layers in the network (2-4), a parallel is drawn between deep networks and neurobiology. I would strongly advise against trying to make this link. Not only is it debatable, it's an unnecessary distraction that undermines the good work around it. It is sufficient to say "deeper models afford greater representational power and can better model complex acoustic signals" without bringing brains into the mix, and no one can take issue with the claim. Similar comments hold for L536-538.
L548-553: As a conclusion to the same section named above, the authors offer three explanations for an increase in depth leading to decreases in performance: "First, increasing the complexity of the model could have resulted in overfitting the network to the training data. Second, the issue of “vanishing gradients” Bengio et al. (1994) could be occurring in the network fine-tuning training procedure, whereby the training signal passed to lower layers gets lost in the depth of the network. Yet another potential cause of this result is that the pretraining procedure may have found insufficient initial edge weights for networks with increasing numbers of hidden layers."
Based on past experience, I'd bet it is exclusively due to the first reason named. All speculation would be easily resolved by including performance numbers over the training set, as well as the test set. If training accuracy increases with model complexity, then we have our answer. In a similar manner, I am skeptical that the vanishing gradient problem is to blame, or that pre-training yielded poor parameters. Again, performance on the training set would shed some light on this. Also, I'd suggest that some of the narrative be adjusted to reflect that model complexity, not just depth, is being varied here.
Overall the article is in good shape, and I commend the authors for their diligence in continuing to improve the work. My most important feedback (regarding science and whatnot) is named above, but I've a number of much smaller notes to share. Do with them as you will.
L34: It would be slightly more convincing to find a more modern reference than (Klapuri, 2004) when referring to the state of the art being so far behind human experts, since it's almost 10 years older than the (Benetos, 2012) citation, which is used as evidence that monophonic transcription is solved.
L81: It's a minor misrepresentation that Humphrey et al, 2012 & 2013 advocate the use of "deep belief networks", which are a specific kind of neural network (RBM pre-training followed by supervised fine tuning). Rather, the articles argue for feature learning and deep architectures generically, e.g. CNNs, DBNs, LSTMs, autoencoders, etc.
L119-127: It's somewhat unclear from this passage that the two systems presented are doing different things on different data, i.e. extensive guitar fingerings from solo guitar recordings versus guitar chord shapes over polyphonic pop/rock music.
L380-381: Why would MFCCs be a good feature for polyphonic pitch tracking?
L485-489: Both seem like reasonable kinds of errors. Do you have any insight into the frequency or prevalence of one over the other? Personally I'd expect the duration merging kind to be more common than the thresholding issue, but that's just a hunch.
L473: Perhaps consider using the Constant-Q transform in future work, which provides a more reasonable trade-off between frequency resolution and time resolution than the DFT.
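To illustrate the trade-off mentioned for L473, the Python sketch below computes the geometrically spaced center frequencies, the quality factor Q, and the per-bin window lengths of a textbook Constant-Q transform. The minimum frequency (roughly the low E string), the 12 bins per octave, and the 44.1 kHz sample rate are illustrative assumptions, not settings from the manuscript.

```python
import math


def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    """Geometrically spaced center frequencies: f_k = f_min * 2^(k / bins_per_octave)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def cqt_quality_factor(bins_per_octave):
    """Ratio of center frequency to bandwidth, identical for every bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


def window_length(f_k, sample_rate, q):
    """Analysis window in samples: long for low bins, short for high bins."""
    return math.ceil(q * sample_rate / f_k)


# Illustrative settings: start near the low E string, one bin per semitone, 4 octaves.
freqs = cqt_center_frequencies(f_min=82.41, bins_per_octave=12, n_bins=49)
q = cqt_quality_factor(12)
lengths = [window_length(f, 44100, q) for f in freqs]
```

Low bins get long windows (fine frequency resolution) and high bins get short ones (fine time resolution), whereas the DFT uses one window length for all bins.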
L560: It's more accurate to say "faster than real-time", right? It's my understanding that the HMM decoding is non-causal, which means the full signal must be processed before an output can be given at t=0. This would be different from an "on-line" system, as in "as one plays music."
L565: Similar to the previous comment, it's a bit of a stretch to claim that the algorithm could be implemented on a microcontroller. It's doubtful that the processing speed seen on a personal computer (with an Intel or AMD processor in the GHz range) will translate well to smaller processors with less / slower RAM.
L568: Also in the realm of tempering optimistic claims, the sentence "All that is required is a set of audio files" is a tad ironic, given how difficult it sounds to obtain data in L355-363 and L590-609. Perhaps something along the lines of "When it's possible to find, curate, or synthesize data, this approach is great."
L587: The approach used in this paper is not early stopping, but a fixed number of iterations. Early stopping requires that some measure (typically over a validation set) is computed as a function of iteration / parameters [https://en.wikipedia.org/wiki/Early_stopping].
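For contrast with a fixed number of iterations, a minimal Python sketch of patience-based early stopping follows. It takes a precomputed list of per-epoch validation losses for simplicity; in real training the loss would be evaluated inside the loop. The function name, the patience value, and the loss values are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` epochs have passed without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Patience never ran out: training used every available epoch.
    return len(validation_losses) - 1, best_epoch


# Hypothetical validation curve that bottoms out at epoch 2 and then rises.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

The parameters saved at `best_epoch` would be the ones kept, which is what distinguishes this from simply training for a preset number of iterations.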
L590: What about sample-based synthesizers? These are at least real sound recordings, rather than algorithmically defined synthesis equations.
Dear Authors, the reviewers express an overall positive opinion about your paper and confirm that it holds promise. Nevertheless, they also raise a number of concerns and suggest several points that should be addressed in a thorough revision of the paper. Therefore, we kindly ask you to prepare such a revision, complying with the reviewers' comments as much as possible.
By and large, this article checks all the main boxes of a well-prepared submission: the writing is thoughtful and clear; the literature review is reasonably thorough (a couple suggestions below); the work presented asks the primary scientific questions, and makes a solid effort to address them comprehensively; experimental data is made freely available online for subsequent inquiry; and, overall, I quite enjoyed the read.
I've also two pieces of relevant work the authors may consider including (I'd leave it to their discretion though):
A. Barbancho, A. Klapuri, L. J. Tardon, and I. Barbancho, “Automatic transcription of guitar chords and fingering from audio,” IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 3, pp. 915–921, 2012
E. J. Humphrey and J. P. Bello. "From music audio to chord tablature: Teaching deep convolutional networks to play guitar." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
The experimental design is generally sound. I've migrated my primary concern (re: data) to the following experimental validity bucket, though it arguably fits into either. All other facets are well motivated and / or standard in this course of work.
There are a handful of tiny nits I've noted in the annotated manuscript (attached), but a major concern I'd like to express is with regard to the limited data variance used for experimentation. Specifically, the authors generate a collection of audio files for training and evaluation, but use one guitar sound font for all data. As a result, the train / test distinction of the different folds is rather undermined, and the reader has minimal real insight into how this method may or may not generalize in realistic scenarios, or how much data is required to achieve these results. Realistically, train / test splits should at least occur at the level of the sound font.
This issue is further exacerbated as the algorithm with which the authors compare performance leverages no machine learning (as far as I can tell by reviewing the previous article). Whereas the baseline algorithm is rather generic, the deep learning model has been exposed and optimized to this one sound font. The only fair comparison might be with respect to run-time; however, processing time is only loosely correlated with complexity, e.g. big O, and perhaps the previous algorithm could be optimized, also via GPUs, such that it exhibits comparable computational performance.
Realistically, the remedies I see are the following:
- One or more additional guitar sound fonts are obtained, the same GuitarPro files are re-rendered for the test set, and used to recompute Tables 1 & 2. Though probably time consuming, this should (hopefully?) not be especially difficult, and would greatly improve experimental validity.
- Alternatively, the authors should thoroughly acknowledge this limitation in the article as a means of better contextualizing results, and identify more extensive evaluation as a key future work item.
As a final comment, I both understand and empathize with the authors' need to leverage synthesized data, having worked with GuitarPro data a great deal in the past, but it is my feeling that, while correctable, this is the only substantive deficiency of this submission.
Complementing the notes I've scribbled in the annotated version, there are two points I wanted to revisit for emphasis:
- I really like the idea of optimizing the cross entropy loss jointly over the polyphony and pitch estimators. I find it a bit odd though that each logistic is given equal weight, regardless of what it represents, and wonder if this might be biasing the estimators toward better pitch estimates than polyphony (higher contribution).
- There are a variety of modern deep learning "tricks" you may find useful, especially rectified linear units (for hidden layers), dropout, and data augmentation.
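Regarding the first point above, one way to control each estimator's contribution is to average the cross entropy within each output group and then weight the groups explicitly. The Python sketch below is only an illustration: the grouping into "pitch" and "polyphony" units, the unit counts, and the `poly_weight` parameter are assumptions, not the authors' formulation.

```python
import math


def binary_cross_entropy(targets, probs, eps=1e-12):
    """Mean binary cross entropy over a list of independent logistic outputs."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


def joint_loss(pitch_t, pitch_p, poly_t, poly_p, poly_weight=1.0):
    """Per-group mean losses, so group size alone does not set the balance."""
    pitch_loss = binary_cross_entropy(pitch_t, pitch_p)
    poly_loss = binary_cross_entropy(poly_t, poly_p)
    return pitch_loss + poly_weight * poly_loss


# Hypothetical frame: 4 pitch units and 3 polyphony units, all predicted at 0.5.
pitch_t, pitch_p = [1, 0, 0, 1], [0.5, 0.5, 0.5, 0.5]
poly_t, poly_p = [0, 1, 0], [0.5, 0.5, 0.5]
equal = joint_loss(pitch_t, pitch_p, poly_t, poly_p)
emphasized = joint_loss(pitch_t, pitch_p, poly_t, poly_p, poly_weight=2.0)
```

Summing every logistic with equal weight instead would let the larger group dominate the gradient, which is the bias the comment above asks about.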
The paper is well written and structured. It was a pleasure to read. The authors seem to be on top of the relevant literature and review it adequately, and the Introduction is well presented.
The problem is well defined and technical descriptions are clear and unambiguous almost all the time. See my "General comments for the author" for a few places where I think clarification is needed.
What I was missing in this paper was a comparison of different DNN configurations. The authors present an entire system composed of many parts, but the main original contribution regards the DNN (deep neural network) used. The decisions made regarding the DNN structure are just given, without telling how they ended up using those. More specifically, I would have been interested to see a comparison between a) sigmoid and rectified linear unit non-linearity, b) a deep NN versus a shallow one, c) using unsupervised pre-training versus simply applying backpropagation from scratch on the entire network (with a sufficient amount of training data). Also, the so-called dropout regularization technique is not used here although it has become very popular in DNNs. Is there a reason why it was not used?
Comparing at least some of the above-mentioned configurations would greatly increase the value of the paper.
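For readers less familiar with the options named above, the following Python sketch shows the two non-linearities and a common "inverted" form of dropout. It is a generic illustration with hypothetical function names and is not meant to describe the authors' network or any particular library.

```python
import math
import random


def sigmoid(x):
    """Logistic non-linearity, saturating towards 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` (rate < 1) and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [relu(x) for x in (-2.0, 0.5, 3.0)]
rng = random.Random(0)
train_out = dropout(hidden, rate=0.5, rng=rng, training=True)
eval_out = dropout(hidden, rate=0.5, rng=rng, training=False)
```

At evaluation time the layer is left untouched, so comparing configurations a) to c) with and without dropout only changes the training pass.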
The main shortcoming of the paper regards the evaluation, more specifically:
1) The developed methods are evaluated using only synthesized recordings. I cannot accept the given reasons for that. It is totally reasonable to hire a session musician to perform at least a few songs (from a given musical notation). Perfect temporal alignment with the reference is not necessary in order to use that for evaluation (recall/precision, etc.). Optionally, one can also use a six-channel microphone that records each string individually, which makes temporal alignment with the reference possible too.
2) Only one guitar make and model was used in the synthesis, causing the neural network to learn the sound of that particular guitar. Even worse, the same guitar model was used to generate both train and test data. The results do not generalize.
3) Because of point 2 above, the main conclusion and claim of the paper is not valid ("When applied to the problem of polyphonic guitar transcription, deep belief networks outperform state-of-the-art"). The reference method was generic, not even intended for guitar specifically. That makes comparison with it unfair. The method presented here is overfitted to the particular guitar used.
When comparing different configurations of their DNN, even synthetic recordings could be used, provided several different guitar models would be used for the synthesis (different for training and testing). However there should be at least a small set of actual recordings from a human performer to validate the final performance of the proposed methods.
Detailed minor comments:
* smallest/highest diameter string -> thinnest / thickest string
* at the bottom of page 6 (in the paragraph starting "For training and testing the algorithm..."): How did you define the offset of a note? Guitar sounds are exponentially decaying, so they become inaudible even without an explicit offset. Also there should be some non-pitched transition between notes/chords in continuous playing. How were those dealt with?
* page 10: please define "one error", "hamming loss" and "polyphony recall" exactly.
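As an example of the kind of exact definition requested for page 10, the Python sketch below implements the usual multi-label definitions of Hamming loss and one-error. These are the standard textbook forms and may differ from what the authors computed; "polyphony recall" is specific to the manuscript and is not attempted here. The example frames and scores are made up.

```python
def hamming_loss(true_labels, predicted_labels):
    """Fraction of label positions that differ, pooled over all frames."""
    mismatches = 0
    total = 0
    for t_row, p_row in zip(true_labels, predicted_labels):
        mismatches += sum(t != p for t, p in zip(t_row, p_row))
        total += len(t_row)
    return mismatches / total


def one_error(true_labels, scores):
    """Fraction of frames whose top-scoring label is not among the true labels."""
    errors = 0
    for t_row, s_row in zip(true_labels, scores):
        top = max(range(len(s_row)), key=s_row.__getitem__)
        errors += int(t_row[top] == 0)
    return errors / len(true_labels)


# Made-up example: two frames, three candidate pitches each.
truth = [[1, 0, 1], [0, 1, 0]]
predicted = [[1, 0, 0], [0, 1, 0]]           # thresholded network outputs
scores = [[0.9, 0.2, 0.4], [0.6, 0.3, 0.1]]  # raw network outputs

h_loss = hamming_loss(truth, predicted)  # 1 mismatch out of 6 positions
o_err = one_error(truth, scores)         # frame 2 ranks a wrong pitch first
```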
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.