Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on November 5th, 2024 and was peer-reviewed by 2 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on January 8th, 2025.
  • The first revision was submitted on March 1st, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • A further revision was submitted on June 30th, 2025 and was reviewed by the Academic Editor.
  • A further revision was submitted on August 7th, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • The article was Accepted by the Academic Editor on September 4th, 2025.

Version 0.4 (accepted)

· Sep 4, 2025 · Academic Editor

Accept

All the comments have been addressed.

The paper is ready for publication.

[# PeerJ Staff Note - this decision was reviewed and approved by Mehmet Cunkas, a PeerJ Section Editor covering this Section #]

**PeerJ Staff Note:** Although the Academic and Section Editors are happy to accept your article as being scientifically sound, a final check of the manuscript shows that it would benefit from further editing. Therefore, please identify necessary edits and address these while in proof stage.

Reviewer 1 ·

Basic reporting

The article is executed at a high professional level. The authors' language is clear, concise, and unambiguous, which makes the work easy to understand. The introduction contains a justification of the relevance of the problem and outlines the main objectives of the research, such as the complexity of hand gesture recognition against complex backgrounds and the need to create high-precision methods for such tasks. The literature review includes references to relevant studies, emphasizing the relationship of the presented work with existing approaches. The structure of the paper fully meets the requirements of scientific publications: there is a clear division into introduction, methodology, results, and discussion. The illustrations and tables accompanying the text visualize the key data well and contribute to a better perception of the results. The authors emphasize the use of state-of-the-art YOLOv8 models (the YOLOv8n and YOLOv8s versions) and analyze their performance when trained on different datasets: Oxford Hand and EgoHand. The introduction concludes with a clear statement of the research goal and a list of the main innovations, such as the optimization of the YOLO architecture and the introduction of new approaches for model training.
Responses: Thanks to the reviewer for the comment.

New comment: +

Experimental design

The study was performed within current scientific standards. The authors provided a clear description of the experiments, including:
- The hardware used (Intel Core i9, NVIDIA RTX 3060 and Google Colab Pro+).
- Two large and diverse datasets (Oxford Hand Dataset and EgoHand Dataset), including images of hand gestures in complex backgrounds.
- Details of training YOLOv8n and YOLOv8s models with different settings for the number of epochs (50 and 100).
  - The metrics used to evaluate the performance of the models (mAP, precision, recall, and GFLOPS).
The methodology is described with a high level of detail, allowing the experiments to be fully reproduced. However, to enhance the practical value of the work, it would have been worthwhile to describe in more detail which data preprocessing techniques were used to improve recognition performance on complex backgrounds, such as blurred images or scenes with multiple objects. In addition, the authors could have supplemented the description with the rationale for their hyperparameter choices, for example, explaining why a particular learning rate or network structure was chosen.
Responses:
Thanks to the reviewer for the comment. We do not apply any data preprocessing beyond what the base YOLOv8 pipeline provides. This is explained in lines 237-253.
In this study, hand detection was performed using images from the Oxford hand dataset (Mittal et al., 2011) as the input data. We then trained on this dataset using the YOLOv8n and YOLOv8s architectures, with 50 and 100 epochs for each model, and examined and discussed the outcomes of the YOLOv8 training and testing stages. This process includes computing the final bounding boxes using Non-Maximum Suppression (NMS).
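
Since NMS is mentioned above without further detail, here is a minimal sketch of the standard greedy algorithm, assuming axis-aligned (x_min, y_min, x_max, y_max) boxes; it illustrates the common technique, not the authors' exact implementation, and the function names are ours.

```python
# Minimal greedy Non-Maximum Suppression (NMS) sketch.

def iou(a, b):
    """Intersection-over-Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Repeatedly keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep  # indices of the boxes that survive suppression
```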

The YOLO labeling format is commonly produced as output by annotation programs. This format collects the annotations for each image into a single text file. Each graphical element in an image is annotated in the corresponding text file with a bounding box, commonly abbreviated "BBox". The annotation values are scaled to remain proportional to the image dimensions, so they range from 0 to 1 (Long et al., 2021). Equations (1) through (6) define this conversion to the YOLO format.
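
To make the normalization concrete, here is a minimal sketch of the standard pixel-to-YOLO conversion (class id, then x-center, y-center, width, and height, each scaled to the 0-1 range); it follows the common convention and is not a reproduction of the paper's Equations (1)-(6).

```python
def to_yolo(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space BBox into one normalized YOLO annotation line."""
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a hand at pixels (120, 80)-(360, 400) in a 640x480 image.
print(to_yolo(0, 120, 80, 360, 400, 640, 480))
# -> 0 0.375000 0.500000 0.375000 0.666667
```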

New comments: The paper does not consider or describe important hyperparameters such as batch size and learning rate, nor other optimisation strategies.
The study also lacks an analysis of the causes of model errors (e.g., a detailed analysis of false-positive and false-negative cases). This would have allowed a more precise identification of weaknesses and potential areas for improvement.

Validity of the findings

The experimental results are convincing and valid. The authors demonstrated that the YOLOv8n model trained on 100 epochs significantly outperforms previous methods:
- For the Oxford Hand Dataset, a mAP of 86.7% was achieved, outperforming previous work (e.g., YOLOv7x with a mAP of 86.3%).
- For the EgoHand Dataset, a mAP of 98.9% was achieved, which is also significantly higher than Faster R-CNN (mAP 96%).
Particular attention is paid to the visualization of the results through Precision-Recall curves and training efficiency plots, which confirm the model's stability at different training stages. A detailed analysis of the computational cost (GFLOPS) is also performed, emphasizing the proposed architecture's effectiveness. The conclusions are data-driven and logically summarised. The authors pointed out that increasing the number of epochs improves the quality of training but is accompanied by an increase in processing time. This is an essential practical observation that can be useful for researchers and engineers when weighing the trade-off between accuracy and speed.
Responses:
Thanks to the reviewer for the comment.

New comment: +

Additional comments

The phrase “Integrate the model with software APIs (Python, TensorFlow, OpenCV)” sounds incorrect because Python is a programming language, not an API, and TensorFlow and OpenCV are usually referred to as frameworks or libraries, not just APIs.

It would be more correct to write:
Integrate the model with Python-based frameworks and libraries (TensorFlow, OpenCV) via their APIs.
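
As a hedged illustration of that corrected phrasing, a minimal inference loop might look like the sketch below, assuming the Ultralytics YOLOv8 package and an OpenCV webcam capture; the weight filename is hypothetical.

```python
# Sketch: run a trained YOLOv8 hand detector on webcam frames with OpenCV.
import cv2
from ultralytics import YOLO

model = YOLO("hand_yolov8n.pt")  # hypothetical path to trained weights

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, verbose=False)   # detect hands in one frame
    annotated = results[0].plot()           # draw the predicted boxes
    cv2.imshow("hand detection", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()
```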

It is necessary to add the latest publications from the last 2 years, such as:

1) Feature-Fused Deep Learning Approach for Hand Gesture Recognition in Intelligent Myoelectric Hand (DOI: 10.1109/JSEN.2025.3570236)
2) Enhancing dynamic hand gesture recognition through the HMM-RB/LR algorithm (DOI: 10.1007/s11042-025-20972-2)
3) Battle royale sparse bayesian extreme learning machine for electromyographic signals based hand gesture recognition (DOI: 10.1007/s11042-025-21071-y)
4) Gesture Transformer: A Hybrid CNN-Transformer Model for Hand Gesture Recognition in Smart Educational Environments (DOI: 10.53964/mit.2025001)

etc.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.


Version 0.3

· Jul 25, 2025 · Academic Editor

Minor Revisions

Dear authors,

Thank you for your revised submission and your detailed responses to the reviewers' comments. The manuscript has been significantly improved and the experimental framework is clearly explained. However, a few minor revisions are still required before final acceptance.

Please revise lines 529–549, which currently contain over-general explanations likely generated by AI. These should be rewritten or removed.

The abstract should be adjusted to reflect the actual scope and results of the study.

We recommend relocating your discussion on model generalizability (currently in the "Author Contributions" section) to the Discussion or Conclusion section.

Please improve the clarity and rationale in the "Research Gap and Work Intention" section and strengthen the related work comparison.

Consider including a brief discussion or table highlighting false positive and false negative examples, or at least limitations in model accuracy related to such cases.

Version 0.2

· May 2, 2025 · Academic Editor

Major Revisions

The manuscript addresses a timely and potentially valuable topic, but it requires substantial revision to reach the scientific maturity expected for publication. The research objectives are unclear and oscillate between hand detection and gesture recognition, compromising the coherence of the work. The claimed contributions are either weak or improperly stated—for example, the use of YOLOv8, which in itself does not constitute an innovation. Several sections—such as the abstract, related work, and conclusion—need to be rewritten to improve clarity, logical flow, and scientific rigor. The structure of the manuscript is inconsistent, with misplaced content and potentially unnecessary equations. Finally, the text suffers from numerous linguistic issues and includes passages that appear AI-generated, which undermines the overall quality of the presentation. For these reasons, a major revision is considered necessary.

**PeerJ Staff Note:** Your submission appears to have been at least partially authored or edited by a generative AI/large language model. When you submit your revision, please detail whether (and if so, how) AI was used in the construction of your manuscript in your response letter, AND mention it in the Acknowledgements section of your manuscript.

Reviewer 1 ·

Basic reporting

The article is executed at a high professional level. The authors' language is clear, concise, and unambiguous, which makes the work easy to understand. The introduction contains a justification of the relevance of the problem and outlines the main objectives of the research, such as the complexity of hand gesture recognition against complex backgrounds and the need to create high-precision methods for such tasks. The literature review includes references to relevant studies, emphasizing the relationship of the presented work with existing approaches. The structure of the paper fully meets the requirements of scientific publications: there is a clear division into introduction, methodology, results, and discussion. The illustrations and tables accompanying the text visualize the key data well and contribute to a better perception of the results. The authors emphasize the use of state-of-the-art YOLOv8 models (the YOLOv8n and YOLOv8s versions) and analyze their performance when trained on different datasets: Oxford Hand and EgoHand. The introduction concludes with a clear statement of the research goal and a list of the main innovations, such as the optimization of the YOLO architecture and the introduction of new approaches for model training.
Responses: Thanks to the reviewer for the comment.

New comment: +

Experimental design

The study was performed within current scientific standards. The authors provided a clear description of the experiments, including:
- The hardware used (Intel Core i9, NVIDIA RTX 3060 and Google Colab Pro+).
- Two large and diverse datasets (Oxford Hand Dataset and EgoHand Dataset), including images of hand gestures in complex backgrounds.
- Details of training YOLOv8n and YOLOv8s models with different settings for the number of epochs (50 and 100).
  - The metrics used to evaluate the performance of the models (mAP, precision, recall, and GFLOPS).
The methodology is described with a high level of detail, allowing the experiments to be fully reproduced. However, to enhance the practical value of the work, it would have been worthwhile to describe in more detail which data preprocessing techniques were used to improve recognition performance on complex backgrounds, such as blurred images or scenes with multiple objects. In addition, the authors could have supplemented the description with the rationale for their hyperparameter choices, for example, explaining why a particular learning rate or network structure was chosen.
Responses:
Thanks to the reviewer for the comment. We do not apply any data preprocessing beyond what the base YOLOv8 pipeline provides. This is explained in lines 237-253.
In this study, hand detection was performed using images from the Oxford hand dataset (Mittal et al., 2011) as the input data. We then trained on this dataset using the YOLOv8n and YOLOv8s architectures, with 50 and 100 epochs for each model, and examined and discussed the outcomes of the YOLOv8 training and testing stages. This process includes computing the final bounding boxes using Non-Maximum Suppression (NMS).

The YOLO labeling format is commonly produced as output by annotation programs. This format collects the annotations for each image into a single text file. Each graphical element in an image is annotated in the corresponding text file with a bounding box, commonly abbreviated "BBox". The annotation values are scaled to remain proportional to the image dimensions, so they range from 0 to 1 (Long et al., 2021). Equations (1) through (6) define this conversion to the YOLO format.

New comments: The paper does not consider or describe important hyperparameters such as batch size and learning rate, nor other optimisation strategies.
The study also lacks an analysis of the causes of model errors (e.g., a detailed analysis of false-positive and false-negative cases). This would have allowed a more precise identification of weaknesses and potential areas for improvement.

Validity of the findings

The experimental results are convincing and valid. The authors demonstrated that the YOLOv8n model trained on 100 epochs significantly outperforms previous methods:
- For the Oxford Hand Dataset, a mAP of 86.7% was achieved, outperforming previous work (e.g., YOLOv7x with a mAP of 86.3%).
- For the EgoHand Dataset, a mAP of 98.9% was achieved, which is also significantly higher than Faster R-CNN (mAP 96%).
Particular attention is paid to the visualization of the results through Precision-Recall curves and training efficiency plots, which confirm the model's stability at different training stages. A detailed analysis of the computational cost (GFLOPS) is also performed, emphasizing the proposed architecture's effectiveness. The conclusions are data-driven and logically summarised. The authors pointed out that increasing the number of epochs improves the quality of training but is accompanied by an increase in processing time. This is an essential practical observation that can be useful for researchers and engineers when weighing the trade-off between accuracy and speed.
Responses:
Thanks to the reviewer for the comment.

New comment: +

Additional comments

The work gives the impression of a significant contribution to computer vision. The use of recent advances in neural networks (YOLOv8) to solve the problem of hand gesture recognition in complex backgrounds demonstrates the promise of the approach. Significantly, the paper discusses the application of YOLOv8 to two specific datasets and provides a comparative analysis with other methods, such as YOLOv7x and Faster R-CNN. This highlights the novelty and competitiveness of the proposed approach. However, there are several additional aspects that the authors could have considered:
- The universality of the model. How will the model perform on data differing from Oxford Hand and EgoHand? For example, when processing real-time or highly distorted video?
Responses:
Thanks to the reviewer for the comment. The universality of a YOLOv8-based hand recognition model depends on its generalization ability across different datasets and real-world conditions. While trained on Oxford Hand and EgoHand, the model may face challenges when processing real-time video or highly distorted inputs due to variations in lighting, hand appearance, occlusions, and motion blur. We will focus on real-time video datasets in our future work. This is explained in lines 615-618.

NEW comments: Lines 615-618 contain the Author Contributions statement rather than the promised explanation.

In our upcoming research, we are going to investigate combining hand detection with real-time video datasets and explainable artificial intelligence (XAI). Further, we plan to couple hand detection with a logic-based framework to make automatic inferences about the detected scene (Calimeri et al., 2019).
- Potential applications. The article mentions applications such as human-computer interaction or rehabilitation. It would be helpful to discuss the steps needed to integrate the proposed model into real systems.
Responses:
Thanks to the reviewer for the comment. We discussed the steps needed to integrate the proposed model into real systems in lines 578-585.
The steps to integrate YOLOv8 hand recognition into real systems are as follows:
(1) Data Collection & Preprocessing: Gather diverse hand gesture datasets in real-world conditions. Apply data augmentation (lighting changes, occlusion simulation, motion blur).
(2) Model Training & Optimization: Fine-tune YOLOv8 on domain-specific gestures (rehabilitation, gaming, security). Use quantization & pruning for real-time performance on edge devices (a sketch follows after this list).
(3) System Deployment & Real-World Testing: Integrate the model with software APIs (Python, TensorFlow, OpenCV). Deploy on hardware platforms (PC, mobile, Jetson Nano). Conduct real-world trials in the intended application environment.
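
As a sketch of step (2), fine-tuning with the Ultralytics package typically looks like the following; the dataset YAML path and hyperparameter values are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: fine-tune YOLOv8n on a custom hand dataset, then export for deployment.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # start from pretrained weights
model.train(
    data="hand_dataset.yaml",     # hypothetical dataset config path
    epochs=100,                   # the paper trains for 50 and 100 epochs
    imgsz=640,
)
metrics = model.val()             # reports precision, recall, and mAP
model.export(format="onnx")       # e.g., for edge deployment beyond Python
```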

NEW comments: Lines 529-549 need to be redone, as they appear to be AI-generated. Also, please clarify what you mean by a "Python API".


Reviewer 2 ·

Basic reporting

1. Abstract – Method Subsection
The authors mention: “This research investigates a wide range of object detection algorithms, including those specifically employed for hand identification purposes.” However, the study only compares these algorithms with YOLO models rather than investigating them. The term “investigates” implies a deeper analysis, which is not reflected in the content.

2. Abstract – Results Subsection
The results subsection in the abstract needs improvement to clearly reflect key findings and the significance of the work.

3. Page 6, Lines 58–66
The authors describe challenges in gesture recognition using deep learning in the presence of complex backgrounds, such as variations in skin tone. While this is a well-written explanation of potential difficulties, the experimental section does not demonstrate how these challenges were addressed or how they impacted the performance of the proposed YOLO models. It would strengthen the paper if such analysis were included.

4. Page 6, Lines 66–68
The authors state: “To improve results on the hand identification task, we can use the common information supplied in the training signal for the hand appearance reconstruction task as an inductive bias (Alam et al., 2022).” This sentence is unclear. The authors should clarify what “hand appearance reconstruction task” entails and how it contributes to the main objective.

5. Page 6, Line 69
The sentence “The YOLOv8 model is named after the maxim ‘You Only Look Once’” could be improved by removing “v8” in this context, since the sentence refers to YOLO as a general concept rather than a specific version.

6. Page 7, Line 90
The heading “Research Gap and Work Intention” suggests that the section should outline the research gap and how this paper addresses it. However, the authors primarily describe their contributions. This section should be revised to better position the work within the existing research landscape.

7. Page 7, Lines 91–101
While the contributions are listed, citing the use of YOLOv8 is not, in itself, a novel contribution, as the model is publicly available and widely documented. Additionally, comparing existing models is more of an evaluation than a contribution. The authors should emphasize how they tackled the previously mentioned challenges in hand detection and highlight any unique methods or innovations introduced.
The following sentence is also problematic:
“The motivation of this study is to investigate a better approach to recognizing human hand gestures, and the motivation for using the proposed method is that we need to continue our conducted studies further.”
It is unclear whether the focus is on hand detection or gesture recognition. Also, the justification “we need to continue our conducted studies further” does not provide a strong motivation for the research. This should be rewritten for clarity and purpose.

8. Related Work Section
This section lacks specificity. The referenced studies are discussed in general terms without describing the models used or their outcomes. A proper review should include details of previous work and how it differs from or relates to the current study.

9. Page 11, Line 275
The authors state: “We execute the necessary data preparation on this dataset before exporting it in Yolo format.” It would be helpful to describe what kind of data preparation was performed.

10. Model Evaluation Subsection
This subsection should appear earlier in the Experiments section, preferably before it is referenced. Reordering would improve the logical flow of the paper.

11. Equations (1–6 and 13–14)
The inclusion of these YOLO loss function equations may not be necessary. Their relevance should be clearly justified, or they could be omitted to maintain clarity and conciseness.

12. Conclusion Section
The sentence “The purpose of this research publication is to offer a comprehensive review of the CNN-based object recognition algorithms that are currently available” is misleading, as the paper is not structured as a review article. The conclusion section should be rewritten to summarize the findings, contributions, and potential future work more effectively.

13. General Remarks
Numerous typos and grammatical errors were found throughout the manuscript. A thorough proofreading is recommended to enhance readability and professionalism.

Experimental design

.

Validity of the findings

.

Additional comments

.


Version 0.1 (original submission)

· Jan 8, 2025 · Academic Editor

Major Revisions

Although the work is recognized for its scientific value, clarity of language and sound methodological approach, several issues requiring attention have been identified:

Structure and flow of ideas: The organization of the manuscript needs to be improved. It is recommended to add a section dedicated to related works to better position the contribution with respect to the state of the art and clarify the research gap addressed. This includes a more detailed analysis of the specific problem that the work intends to solve.

Consistency between title and content: The title of the manuscript refers to gesture recognition, but the content focuses mainly on hand identification. Authors should clarify the focus of the research or adjust the title to better reflect the objectives of the study.

Description of techniques and methodological choices: It is important to provide more details on data preprocessing techniques, especially to handle complex backgrounds and scenes with skin-like features. Furthermore, authors should better justify the choices of learning parameters and explain their impact on the model performance.

Experimental analysis and metrics: The experimental section requires further elaboration. Specific analyses should be included on the difficulties related to the nature of the datasets used (e.g., the problem of skin and complex backgrounds). In addition, metrics such as Accuracy and F1, mentioned in the text, should be integrated or, alternatively, removed if not relevant.

Applications and generalizability: It is recommended to discuss the generalizability of the model, considering different datasets and practical applications, as well as the steps needed to integrate the system in real-world contexts such as human-computer interaction or rehabilitation.

These improvements can significantly strengthen the manuscript, increasing both the clarity and the scientific impact of the work.

Reviewer 1 ·

Basic reporting

The article is executed at a high professional level. The authors' language is clear, concise, and unambiguous, which makes the work easy to understand. The introduction contains a justification of the relevance of the problem and outlines the main objectives of the research, such as the complexity of hand gesture recognition against complex backgrounds and the need to create high-precision methods for such tasks. The literature review includes references to relevant studies, emphasizing the relationship of the presented work with existing approaches. The structure of the paper fully meets the requirements of scientific publications: there is a clear division into introduction, methodology, results, and discussion. The illustrations and tables accompanying the text visualize the key data well and contribute to a better perception of the results. The authors emphasize the use of state-of-the-art YOLOv8 models (the YOLOv8n and YOLOv8s versions) and analyze their performance when trained on different datasets: Oxford Hand and EgoHand. The introduction concludes with a clear statement of the research goal and a list of the main innovations, such as the optimization of the YOLO architecture and the introduction of new approaches for model training.

Experimental design

The study was performed within current scientific standards. The authors provided a clear description of the experiments, including:
- The hardware used (Intel Core i9, NVIDIA RTX 3060 and Google Colab Pro+).
- Two large and diverse datasets (Oxford Hand Dataset and EgoHand Dataset), including images of hand gestures in complex backgrounds.
- Details of training YOLOv8n and YOLOv8s models with different settings for the number of epochs (50 and 100).
  - The metrics used to evaluate the performance of the models (mAP, precision, recall, and GFLOPS).
The methodology is described with a high level of detail, allowing the experiments to be fully reproduced. However, to enhance the practical value of the work, it would have been worthwhile to describe in more detail which data preprocessing techniques were used to improve recognition performance on complex backgrounds, such as blurred images or scenes with multiple objects. In addition, the authors could have supplemented the description with the rationale for their hyperparameter choices, for example, explaining why a particular learning rate or network structure was chosen.

Validity of the findings

The experimental results are convincing and valid. The authors demonstrated that the YOLOv8n model trained on 100 epochs significantly outperforms previous methods:
- For the Oxford Hand Dataset, a mAP of 86.7% was achieved, outperforming previous work (e.g., YOLOv7x with a mAP of 86.3%).
- For the EgoHand Dataset, a mAP of 98.9% was achieved, which is also significantly higher than Faster R-CNN (mAP 96%).
Particular attention is paid to the visualization of the results through Precision-Recall curves and training efficiency plots, which confirm the model's stability at different training stages. A detailed analysis of the computational cost (GFLOPS) is also performed, emphasizing the proposed architecture's effectiveness. The conclusions are data-driven and logically summarised. The authors pointed out that increasing the number of epochs improves the quality of training but is accompanied by an increase in processing time. This is an essential practical observation that can be useful for researchers and engineers when weighing the trade-off between accuracy and speed.

Additional comments

The work gives the impression of a significant contribution to computer vision. The use of recent advances in neural networks (YOLOv8) to solve the problem of hand gesture recognition in complex backgrounds demonstrates the promise of the approach. Significantly, the paper discusses the application of YOLOv8 to two specific datasets and provides a comparative analysis with other methods, such as YOLOv7x and Faster R-CNN. This highlights the novelty and competitiveness of the proposed approach. However, there are several additional aspects that the authors could have considered:
- The universality of the model. How will the model perform on data differing from Oxford Hand and EgoHand? For example, when processing real-time or highly distorted video?
- Potential applications. The article mentions applications such as human-computer interaction or rehabilitation. It would be helpful to discuss the steps needed to integrate the proposed model into real systems.
Recommendations for improvement
1. Enhance the description of data processing. Specify the particular techniques (e.g., augmentation, filtering) and their impact on the results.
2. Describe the functioning of the model on complex backgrounds. What limitations are seen when working with real video or dynamic scenes?
3. Consider other datasets. Use additional open datasets to increase the generalisability of the study.


Reviewer 2 ·

Basic reporting

This paper proposes to use YOLOv8 for hand gesture recognition using the Oxford Hand Dataset and the EgoHand Dataset.
What I found, and would like to bring to the authors' attention, is that the paper needs to be restructured, since there is no clear flow of ideas. For example:
* In the introduction section, you mention the problem of hand identification, especially in cases where the hand's skin is similar to the background. However, you do not identify the gap in the research that others could not fill and that your paper fills.
* The literature review is unclear and scattered throughout the whole paper; it should be gathered into a separate Related Works section.
* In the title of the paper, you state that you are going to utilize YOLOv8 for hand gesture recognition, but in the paper only hand identification is addressed, and nothing is said about the type of gestures you focus on.
* In the abstract section, you state that your research 'investigates a wide range of object detection algorithms, including those specifically employed for hand identification', but in the experimental section you only mention YOLOv8n and YOLOv8s, with no other competitors.

Experimental design

Further to what I mentioned earlier, the experimental section does not present any analysis of the skin problem in either of the datasets used, and there is no mention of the level of difficulty these images present with respect to this problem. You only mention the following for the EgoHand dataset:
"This dataset consists of complex backgrounds including : (1) Dark background and incomplete gesture, (2) Skin-like background and incomplete gesture, (3) Motion blur, and (4) Extremely small gesture. Figure 4 exhibits the example of EgoHand Dataset. When dealing with a dataset that consists of complex backgrounds for hand gesture recognition, it’s important to preprocess the data effectively to improve model performance.", in other words not many details about this problem especially when you focus on it in this paper.
You mention the Accuracy and F1 metrics, but they are not actually used. The same applies to IoU.

Also, the computing infrastructure described in lines 118-122 would normally appear in the Experiments section.

Validity of the findings

We could not judge the novelty of the work, since there is no clear Related Works section describing where it stands.

Additional comments

No comment


All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.