All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
The reviewers are satisfied with the recent changes, and therefore I can recommend this article for acceptance.
[# PeerJ Staff Note - this decision was reviewed and approved by Mehmet Cunkas, a PeerJ Section Editor covering this Section #]
The revised manuscript significantly improves upon the original version. The introduction now includes up-to-date references, covering recent YOLO variants (YOLOv9–YOLOv12), enhancing the discussion of related work.
Dataset annotation details are explicitly described in Section 2.2.4, including annotation tools (LabelImg), labeling protocols, and inter-annotator agreement. This enhances the transparency and reproducibility of the study.
The data split methodology is now clearly explained in Section 2.1 and Table 1, indicating a standard 80/10/10 train-validation-test split across more than 20,000 images.
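For illustration, an 80/10/10 split of a dataset of that size can be produced deterministically with a fixed seed; this is a generic sketch, not the authors' code:

```python
import random

def split_dataset(items, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Shuffle deterministically, then cut into train/val/test partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# With 20,000 image IDs this yields 16,000 / 2,000 / 2,000.
train, val, test = split_dataset(range(20000))
```

Fixing the seed is what makes the split reproducible, which is the property the reviewers asked the authors to document.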
The authors provide a comprehensive justification for using Canny and Sobel filters in preprocessing (Section 2.2.1), emphasizing their complementary nature in enhancing edge features for YOLOv8 input.
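To make the edge-enhancement idea concrete, below is a minimal pure-NumPy sketch of the Sobel gradient magnitude (a full Canny stage would add Gaussian smoothing, non-maximum suppression, and hysteresis thresholding; in practice one would call OpenCV's `cv2.Sobel` and `cv2.Canny`). This is an illustration, not the paper's pipeline:

```python
import numpy as np

# Standard Sobel kernels for horizontal and vertical gradients.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = KX.T

def convolve2d(img, kernel):
    """Valid-mode 2D correlation (no padding) -- enough for a demo."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sobel_magnitude(img):
    gx = convolve2d(img, KX)
    gy = convolve2d(img, KY)
    return np.hypot(gx, gy)

# A vertical step edge: the response peaks at the boundary columns.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
mag = sobel_magnitude(img)
```

The gradient magnitude highlights object boundaries (here, the step edge), which is the feature-enhancement effect the authors cite for the YOLOv8 input.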
A comparative analysis has been added, covering YOLOv5, SSD, and RetinaNet, which strengthens the rationale for choosing YOLOv8.
The paper now clearly differentiates the roles of YOLOv8 (used for object detection) and other models like VGG16, ResNet, and MobileNet (used for classification tasks based on extracted features) as explained in Section 3.4.
Training methodology has been fully detailed in Section 3.2, including hyperparameters such as learning rate, batch size, and optimizer settings (see Table 2).
Evaluation metrics (mAP, precision, and recall) are justified in Section 3.1, with each metric’s role contextualized for object detection evaluation.
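As a reference for how these metrics relate, the following sketch computes precision, recall, and a simple (unsmoothed) average precision from a confidence-ranked list of detections; mAP is then the mean of AP over classes. This is a generic illustration, not the paper's evaluation code:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(scored_hits, n_gt):
    """AP from detections sorted by descending confidence.
    scored_hits is a list of booleans (True = detection matched a
    ground-truth box at the IoU threshold); n_gt is the number of
    ground-truth boxes. Uses the raw area under the P-R curve."""
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for hit in scored_hits:
        tp += hit
        fp += (not hit)
        p = tp / (tp + fp)
        r = tp / n_gt
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# Four ranked detections against four ground-truth apples.
ap = average_precision([True, True, False, True], n_gt=4)
```

Precision penalizes false alarms, recall penalizes missed apples, and AP summarizes the trade-off across confidence thresholds, which is why all three are standard for this task.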
Statistical significance testing has been added (Table 5), including confidence intervals and p-values, which reinforce the robustness of YOLOv8’s superior performance over baseline models.
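One common way to obtain such confidence intervals is a percentile bootstrap over per-image score differences; the sketch below uses only the standard library and hypothetical numbers, and is not the authors' analysis:

```python
import random
import statistics

def bootstrap_ci(paired_diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean paired difference
    (e.g., per-image AP of model A minus model B)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(paired_diffs) for _ in paired_diffs]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-image AP differences (YOLOv8 minus a baseline).
diffs = [0.03, 0.05, 0.01, 0.04, 0.06, 0.02, 0.05, 0.03, 0.04, 0.02]
lo, hi = bootstrap_ci(diffs)
# If the interval excludes zero, the improvement is unlikely to be chance.
```

A paired test (e.g., a Wilcoxon signed-rank test) over the same per-image differences would supply the accompanying p-value.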
Section 3.4 also includes computational analysis, detailing inference speeds and memory requirements across models to highlight real-time applicability.
The authors have strengthened the manuscript substantially. Improvements include:
• Detailed preprocessing and annotation procedures.
• Clear articulation of training methodology and model design.
• Inclusion of visual and quantitative results.
• Explanation of hardware requirements and inference performance.
These changes enhance both the technical depth and the practical relevance of the study.
**PeerJ Staff Note:** Please ensure that all review and editorial comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
**Language Note:** The review process has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff
The paper introduces a sophisticated approach to apple recognition using YOLO v8, combined with various image preprocessing techniques. The integration of HSV color space transformation to assess apple ripeness and the use of 3D modeling for weight estimation are particularly commendable. However, there are a couple of significant areas where the discussion could be expanded to fully contextualize the study within the current and future landscape of agricultural automation technologies.
Future Exploration with Advanced YOLO Models: The paper provides a thorough analysis using the YOLO v8 model, demonstrating impressive precision and efficiency. However, the discussion would benefit significantly from an exploration of more recent developments in the YOLO series, such as YOLO v9, v10, YOLO11, and YOLOv12. Recent papers utilizing these models have shown varied improvements in speed and accuracy, which are crucial for real-time applications like apple-picking robots. Incorporating insights from these papers could provide a forward-looking perspective and suggest how the integration of newer YOLO models might further enhance the performance metrics of apple recognition systems. Discussing these models would not only update the paper's relevance but also expand its applicability and potential for future experimental setups.
Incorporation of Multimodality and Synthetic Image Generation: While the paper effectively uses traditional image processing techniques, it overlooks the burgeoning field of synthetic image generation and multimodality in training deep learning models. Recent advancements in large language models (LLMs) that generate synthetic imagery could significantly augment the dataset, especially in scenarios lacking diverse training samples. Papers leveraging LLM-generated images for YOLO-based object detection have demonstrated how these techniques can mitigate common issues like overfitting to a limited dataset and improve the model's robustness in diverse operational environments. Discussing the potential of incorporating LLM-generated synthetic images and multimodal data inputs could provide a pathway to substantially boosting the accuracy and generalizability of apple detection algorithms.
At this stage, the authors can only discuss the potential of these approaches.
It is presented well.
Please revise.
1) Exploration of Newer YOLO Models: The paper should discuss more recent YOLO models like YOLO v9 to YOLOv12, as these could offer improvements in speed and precision for apple recognition, providing a broader perspective on future enhancements in agricultural robotics.
2) Use of Multimodality and Synthetic Images: Incorporating discussions on synthetic image generation and multimodality using large language models could improve dataset diversity and robustness, enhancing the model's performance in complex agricultural environments.
These revisions to the literature review and the discussion of future work will strengthen the paper.
• The introduction provides sufficient background, but adding more recent references could enhance the discussion of related work.
• The dataset annotations are not explicitly detailed. It is unclear if manual annotation was performed or if pre-annotated datasets were used. Providing this information would enhance reproducibility.
• The data split methodology is not specified. It should be clarified whether a standard split (e.g., 70%-20%-10% for training, validation, and testing) was used.
• The reason for using Canny and Sobel edge detection algorithms is not well justified. The paper should clearly explain why these methods were chosen over other preprocessing techniques and how they contribute to object detection performance.
• The study lacks a comparison between YOLOv8 and other one-stage object detection models such as YOLOv5, SSD, or RetinaNet. Including such a comparison would strengthen the argument for using YOLOv8.
• The paper claims that VGG16, ResNet, and YOLOv8 produced better results, but it is unclear whether VGG16 and ResNet were used for object detection or just feature extraction. This needs clarification.
• The experimental methodology does not specify how different models were trained and tested, nor does it mention whether transfer learning was used.
• The Hyperparameter Tuning Table appears to be missing from the document. You may need to include details about the specific hyperparameters tuned for your model (e.g., learning rate, batch size, optimizer settings) to enhance the experimental clarity.
• The evaluation metrics (e.g., mAP, precision, recall) should be explicitly justified in the context of the specific object detection task.
• If YOLOv8 performed better, the paper should include statistical significance testing (e.g., confidence intervals, p-values) to validate the findings.
• The computational efficiency and inference speed of YOLOv8 should be compared against other models to assess real-world applicability.
• The impact of using Canny and Sobel filtering on detection accuracy should be analyzed. It is unclear whether these methods improve performance or are necessary for preprocessing.
• The study would benefit from providing more details on dataset preprocessing, augmentation techniques, and model training procedures.
• Including qualitative results (visual comparisons of detections) along with quantitative results would improve clarity.
• If real-world applications are a focus, deployment feasibility and hardware requirements should be discussed.
The manuscript presents interesting work, but there are several areas that need improvement to meet the journal’s standards.
The English language should be revised for clarity and accuracy. Some terms are used incorrectly or imprecisely (e.g., “workstations” instead of “workers,” or “recognition accuracy” instead of “mAP”), and a few sentences are hard to follow.
The introduction must clearly distinguish between different tasks such as object detection, classification, and recognition.
The description of Average Precision is not accurate, and the method used to estimate apple volume from 2D images lacks depth data, making the results questionable.
Several preprocessing steps (like binarization and edge detection) are said to improve performance, but no experimental evidence is provided to support this.
While the general pipeline (YOLOv8 training and evaluation) is described, important details are missing or vague. For example, it is unclear how the dataset is split into training, validation, and test sets, or whether the evaluation is performed on unseen data. Claims regarding the benefit of preprocessing steps (like Sobel, Canny, and morphological operations) are not supported by ablation studies or comparative experiments.
The explanation of the number of "80 categories" appears to be a leftover from a COCO-pretrained configuration and does not reflect the actual classes in the dataset.
Comparisons with other models like VGG16, ResNet50, and MobileNetV2 are not clearly described and may not be valid unless those models were adapted for detection tasks in the same way as YOLOv8.
The paper makes strong claims about improvements in performance and generalizability, but these are not always supported by sufficient evidence. There are no ablation studies to show the impact of preprocessing steps, and comparisons to other models are not adequately justified or explained. In particular, the use of 2D bounding box dimensions to estimate 3D volume without any depth information undermines the reliability of the apple volume and ripeness estimations.
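The geometric concern can be made concrete: any sphere-based volume estimate from a 2D bounding box needs a pixels-to-centimetres scale, and that scale depends on the unknown camera-to-apple distance. The hypothetical sketch below shows how the same box yields different volumes under different assumed scales:

```python
import math

def naive_volume_from_bbox(w_px, h_px, px_per_cm):
    """Treat the apple as a sphere whose diameter is the mean bbox side.
    The result hinges entirely on px_per_cm, which varies with distance."""
    d_cm = (w_px + h_px) / 2 / px_per_cm
    return (math.pi / 6) * d_cm ** 3

# The same 80x80 px box gives very different volumes depending on the
# assumed scale -- the ambiguity the review points out.
near = naive_volume_from_bbox(80, 80, px_per_cm=10)  # d = 8 cm, ~268 cm^3
far = naive_volume_from_bbox(80, 80, px_per_cm=8)    # d = 10 cm, ~524 cm^3
```

Without depth data (or a reference object of known size in the scene), `px_per_cm` cannot be fixed, which is why the volume and ripeness estimates are questioned.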
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.