Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on March 26th, 2018 and was peer-reviewed by 4 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on May 2nd, 2018.
  • The first revision was submitted on July 10th, 2018 and was reviewed by 1 reviewer and the Academic Editor.
  • A further revision was submitted on August 16th, 2018 and was reviewed by the Academic Editor.
  • The article was Accepted by the Academic Editor on September 5th, 2018.

Version 0.3 (accepted)

· Sep 5, 2018 · Academic Editor

Accept

Thanks for addressing most of the previously raised issues.

Version 0.2

· Aug 6, 2018 · Academic Editor

Minor Revisions

Thank you for addressing many of the issues raised by the reviewers. However, a few minor issues remain, as detailed below.

Reviewer 4 ·

Basic reporting

The response to my question "What are the complications associated with MH, if left undiagnosed?" is missing the point somewhat. It might well be obvious in the field of ophthalmology... but this is not an ophthalmology journal. Non-experts should come away knowing *why* this work is important. Not a massive issue, but a tad dismissive in my opinion.

I also still have issue with the "Statistical analysis" section. This sentence makes no sense to me: "The model was fitted to only 90% of the test data. We created 100 ROC curves by making 100 patterns, and 10% were thinned out." You should not have fitted *anything* to the test data, please correct this typo or clarify what you mean.

"Correct answer rate" is still used in several places - to me this should be "accuracy".

Fully connection layers --> Fully connected layers

The remaining comments have been addressed satisfactorily.

Experimental design

The comments have been addressed satisfactorily.

Validity of the findings

The comments have been addressed satisfactorily.

Additional comments

Thank you for addressing the comments from the original review. With the exception of a few lingering issues from the original (and a few typos here and there) I think that it merits publication.

Version 0.1 (original submission)

· May 2, 2018 · Academic Editor

Major Revisions

Thank you for submitting your article to PeerJ. It is an interesting article. However, a number of important items were raised by the reviewers, most importantly by Reviewer 4. Please address these.

Reviewer 1 ·

Basic reporting

1. line 61-
Please correct Optus to Optos and rewrite the sentence as “The study dataset included 910 Optos color images obtained at Tsukazaki Hospital (Himeji, Japan) and Tokushima University Hospital (715 normal images and 195 MH images).” because the original one was difficult to read.

2. line 111, ROV should be ROC.

3. line 125-
I don’t know what this sentence meant.

4. line 203
The authors wrote that "If surgical treatment is performed at an appropriate time in MH patients, a good prognosis can be obtained".
How would the Optos-based telemedicine system be used to determine the appropriate timing?

Experimental design

1. As the authors commented, the limitation of this study was the inclusion of only normal and MH eyes.

2. line 81-,
When did the authors obtain informed consent from each subject? Were all images used in this study collected for the purpose of this study, after informed consent had been obtained from each subject? According to clinical research ethical guidelines, researchers can include existing data after they disclose the research information.

Validity of the findings

Although I acknowledge the accuracy of the AI, the ophthalmologists' scores for the diagnosis of MH were low, especially the sensitivity. Were those ophthalmologists told about the 1:1 ratio of normal to MH images in the data set?

Additional comments

1. There are eyes having “pseudo” MH. Please discuss whether the AI can differentiate true and pseudo MH.

2. The results are discouraging for "real" ophthalmologists: in addition to speed, the diagnostic accuracy of the AI was superior to that of the ophthalmologists. Please discuss the future role of ophthalmologists.

Reviewer 2 ·

Basic reporting

Good enough.

Experimental design

Concise and good.

Validity of the findings

Data is robust, statistically sound.

Additional comments

Nagasawa et al. report on the potential of deep learning for the detection of idiopathic macular holes in ultra-wide-field fundus images. This study is unique and new. I think this paper has sufficient priority to be published in PeerJ. However, I also have several minor concerns about this manuscript.

1. line 39; the authors emphasize that Optos does not require mydriasis. In the current study, it is not clear whether all the Optos images were taken under non-mydriatic conditions.
2. line 75; "Images from patients with complications, such as vitreous hemorrhage, asteroid hyalosis, intense cataract, and retinal photocoagulation scars, and other conditions, such as fundus diseases, were excluded. Additionally, images with poor clarity were excluded. Moreover, images from patients with stage 1 MHs and those with retinal detachment were excluded." The authors need to describe how many Optos images were excluded from all images.
3. Table 2; it is unclear what 32:80±7:36 and 13:58:00±3:19:16 actually mean.
4. I am not sure why the authors use Optos to detect MH. OCT should be more accurate, easier, and more common.

Reviewer 3 ·

Basic reporting

In the table, it is not clear what format and units the time is reported in.

The figure legends should allow the figure to be read without referring to the original article – they may need to be made slightly more descriptive.

I’m not sure ROC curve is necessary or helpful when the AUC is essentially 1.

The image preprocessing is not well described. The images appear to have a circular crop applied to the original image – this should be described.

Was any effort made to center the images, align the optic disc, or flip left/right eyes so that the images appear more consistent to the CNN?
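A minimal sketch of the kind of standardization being asked about, assuming a NumPy image array and an is_left_eye metadata flag (both hypothetical; the manuscript does not describe its preprocessing in these terms):

```python
import numpy as np

def normalize_laterality(image: np.ndarray, is_left_eye: bool) -> np.ndarray:
    """Mirror left-eye fundus images horizontally so that the optic disc
    falls on the same side of every image presented to the CNN."""
    # image is assumed to have shape (height, width, channels)
    return image[:, ::-1, :] if is_left_eye else image
```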

Experimental design

no comment

Validity of the findings

no comment

Additional comments

From an image processing perspective, assuming a good-quality fundus image, the detection of a macular hole (a small dark circle in a larger, fairly homogeneous image) is not that complicated, so it is not surprising that the CNN works as well as it does; the results are nonetheless impressive.

From a clinical utility perspective, it is not clear that this is a solution to an existing clinical problem: since macular holes always cause visual loss in the stages included in this study, the rationale for creating a screening program to detect them is less compelling. While perhaps not necessary for publication here, it may strengthen the paper to add some discussion as to how such a program might be used in the real world.

Reviewer 4 ·

Basic reporting

Introduction
For me, the intro is far too short and doesn’t really describe the problem in enough detail. There is almost no clinical background, and the discussion around deep learning is too brief. I would suggest expanding the Introduction to cover the following topics:

- What is a macular hole? How does it appear in a fundus photo vs. OCT?
- What is the prevalence of macular holes? Some statistics might be helpful
- What are the complications associated with MH, if left undiagnosed?
- Deep learning is not a machine learning algorithm; it’s a sub-field of research within ML
- You state that DL is good generally, but you should give details of why DL is a good approach specifically to your problem. Have other methods been tried previously for MH? Are they inadequate?
- Please cite some other recent DL papers in the context of ophthalmology, especially this one: https://www.nature.com/articles/s41551-018-0195-0

Methods

- It probably makes more sense to describe the FC dropout layer in the section “Deep learning model”, rather than the “Training…” section.
- Lines 132-135 do not make sense - please revise these sentences to be more clear
- A citation is needed for the Grad-CAM method on line 141
- What cost function did you use? Cross-entropy, or something else?
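As a point of reference, here is a hedged Keras sketch of a typical two-class setup using a cross-entropy cost function; this is an assumption about a common configuration, not the authors' actual model (the tiny architecture is only a placeholder):

```python
from tensorflow import keras

# Placeholder two-class network; the authors' CNN architecture is not reproduced here.
model = keras.Sequential([
    keras.layers.Input(shape=(224, 224, 3)),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation="sigmoid"),   # probability of MH
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # SGD with momentum
    loss="binary_crossentropy",   # cross-entropy cost for the two-class problem
    metrics=["accuracy"],
)
```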

Results

I am quite confused about the methods on page 9. Specifically, how (or why) is “deep-learning response time” calculated by the ophthalmologists (line 125)? The description on lines 132-135 about data entry is also unclear, particularly the sentence: “In deep learning, a series of tasks was performed for all presented numbers as follows…”. My best guess is that the authors are trying to fairly compare the DL computation time with the time the ophthalmologists took to record the same information. Please revise this section to be clearer.

Regarding the figures, I think there are a few things that can be improved:

  - In my opinion, the legends are too short. I personally try to provide enough information in the figure legends so that a reader could get the gist of the whole paper by reading the legends alone.
  - For Figure 2, I’d suggest zooming in on the ROC curve figure, perhaps with the x and y axes starting at 0.5 or so - you really can’t make anything out otherwise. I’d also suggest including curves from several runs - perhaps the best, worst and average - to give readers a better sense of the variability (see the plotting sketch after this list).
- Figure 3 showing the heat map is not all that informative without a colorbar. It also might be useful to include a few examples rather than just one.
  - Table 2: “Accuracy” is a better term than “correct answer rate”. Also, please state the units in which time was measured.
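To make the Figure 2 point concrete, here is a hedged plotting sketch; y_true and score_runs (one score vector per trained model) are illustrative names, not the authors' variables:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc_runs(y_true, score_runs):
    """Plot ROC curves for the best and worst of several training runs."""
    aucs = [roc_auc_score(y_true, scores) for scores in score_runs]
    for idx, label in [(int(np.argmax(aucs)), "best run"),
                       (int(np.argmin(aucs)), "worst run")]:
        fpr, tpr, _ = roc_curve(y_true, score_runs[idx])
        plt.plot(fpr, tpr, label=f"{label} (AUC = {aucs[idx]:.3f})")
    plt.xlim(0.0, 0.5)   # zoom toward the top-left corner, since the curves
    plt.ylim(0.5, 1.0)   # hug it when the AUC is close to 1
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend(loc="lower right")
    plt.show()
```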

Other points:

- What is the “first” curve? The first experiment you ran? Why not the best curve?
- On line 162, is this 13 minutes per image?

Conclusions

- There’s no need to repeat that deep learning is an ML technology.
- What are you going to do next?

Grammar, spelling and formatting
Overall the language is very good, though there are a few spelling/grammatical errors:
- Missing space after “macular holes” (line 19)
- Optus → Optos (line 61)
- Lots of unnecessary hyphens in the terms deep-learning and machine-learning (line 71, 88 and various other places)
- “...using a CNN” (line 88)
- “The rectified linear unit (ReLU) activation function…” (line 89)
- What is meant by a ‘tie layer’? Not sure what this means (line 92, 100)
- “The network weights were optimized using stochastic gradient descent (SGD) with momentum…“ (line 101-102)
- ROV → ROC, and various grammatical errors afterward (line 111 onwards)
- Background data (line 147)
- Probably better to describe the eye in terms of left/right or OD/OS (line 148)
- Resions → regions (line 201)

Experimental design

The research question is not all that well defined in the introduction. Ultimately, the goal was to evaluate the performance of a DL algorithm for detecting MH. However, the authors also do a good job of comparing the algorithm to multiple experts, something many papers do not do. I would therefore suggest adding a couple of sentences at the end of the introduction to state that this was also part of the study.

My main issue with the overall experimental design relates to how the final model for evaluation was selected. You shouldn’t use test accuracy as the basis, but instead use a validation set. Later, in the “Statistical analysis” section, I don’t really understand the authors’ description of the ROC analysis. I get that there should be one curve per model (100 overall) but I do not understand what is meant by: “We created 100 ROC curves by making 100 patterns, and 10% were thinned out”. Some clarification is needed. The authors also state the model was fitted to only 90% of the test data. Presumably this is an error, and the authors mean training data. This would suggest that the authors did indeed use a 10% validation set, but this is unclear. Please revise this section to better describe how the model was tested and evaluated.
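As a hedged sketch of the workflow being requested here: train on the training split, select among the 100 runs using a held-out validation split, and touch the test split only once at the end. build_cnn and evaluate_auc are hypothetical placeholders, and images/labels are assumed arrays; none of this is the authors' code.

```python
from sklearn.model_selection import train_test_split

# Hold out a test set first, then carve a validation set out of the remainder.
X_devel, X_test, y_devel, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_devel, y_devel, test_size=0.1, stratify=y_devel, random_state=0)

best_model, best_val_auc = None, -1.0
for run in range(100):                               # the 100 repeated runs
    model = build_cnn(seed=run)                      # hypothetical model constructor
    model.fit(X_train, y_train, validation_data=(X_val, y_val))
    val_auc = evaluate_auc(model, X_val, y_val)      # hypothetical helper
    if val_auc > best_val_auc:
        best_model, best_val_auc = model, val_auc

test_auc = evaluate_auc(best_model, X_test, y_test)  # reported once, at the very end
```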

Some other points:

- It’s very common to pre-train networks on bigger datasets to boost performance and reduce the time needed to train. Did you try this out? (A sketch of this idea follows this list.)
- Images from patients with various complications were excluded. How many? Why?
- What were the criteria for “clarity”?
- What is “stage 1” MH and why are they not included?
- You state that 100 models were trained - what was the variability over the 100 runs?
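On the pre-training point, here is a hedged Keras sketch of the usual transfer-learning recipe: start from ImageNet weights and train only a small MH/normal head on top. This is illustrative, not the authors' pipeline.

```python
from tensorflow import keras

# Start from ImageNet-pre-trained VGG16 features and train only a small head.
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False                        # freeze the pre-trained layers
model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss="binary_crossentropy", metrics=["accuracy"])
```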

Validity of the findings

As described above, I am concerned about how the final model performance was evaluated. Running the algorithm 100 times and picking the best one based on test accuracy is cheating a little bit, although it’s not clear from the methods whether this is how it was actually done. Furthermore, given the size of the dataset and near-perfect performance on the test set, I think that cross-validation is necessary to get a true sense of the algorithm’s generalizability. Given that the authors can afford to do 100 repeats on the same data, performing K-fold cross-validation should be feasible and would strengthen the reporting.
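A minimal sketch of the suggested stratified K-fold cross-validation, again with build_cnn and evaluate_auc as hypothetical placeholders:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(images, labels, k=5):
    """Estimate generalizability as the mean and SD of per-fold AUCs."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    fold_aucs = []
    for train_idx, test_idx in skf.split(images, labels):
        model = build_cnn()                                   # hypothetical
        model.fit(images[train_idx], labels[train_idx])
        fold_aucs.append(
            evaluate_auc(model, images[test_idx], labels[test_idx]))
    return np.mean(fold_aucs), np.std(fold_aucs)
```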

Given that the authors went to the effort of making an application to capture six experts’ gradings, it would be good if the authors could report metrics of interrater agreement (e.g. kappa coefficients). Furthermore, it’s important that the authors discuss the limitations of the reference standard used, especially if it was based on the diagnosis of a single grader. This paper gives some good insight into why this is important: https://research.google.com/pubs/pub46802.html
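To illustrate the inter-rater suggestion, a sketch of pairwise Cohen's kappa across the six graders, assuming a hypothetical gradings dict mapping each rater to their labels for the same set of images:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(gradings):
    """Cohen's kappa for every pair of raters; gradings maps rater -> label list."""
    return {
        (a, b): cohen_kappa_score(gradings[a], gradings[b])
        for a, b in combinations(sorted(gradings), 2)
    }
```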

Additional comments

Overall, the paper is well written and demonstrates that DL is a powerful approach for classifying MH in wide-angle retinal photographs. The choice of metric is appropriate and the comparison with multiple raters is great to see. Some more background information about MH would be useful for readers outside of the discipline, and would help to contextualize why this work is important.

Ultimately, I am concerned that the results do not reflect the realistic real-world performance of the proposed method. With further clarification of how the model was tested and evaluated, I think that most of my concerns will be addressed. However, I maintain that cross-validation would be a more appropriate way of assessing the generalizability of the CNN.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.