All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
I hope this message finds you well. After carefully reviewing the revisions you have made in response to the reviewers' comments, I am pleased to inform you that your manuscript has been accepted for publication in PeerJ Computer Science.
Your efforts to address the reviewers’ suggestions have significantly improved the quality and clarity of the manuscript. The changes you implemented have successfully resolved the concerns raised, and the content now meets the high standards of the journal.
Thank you for your commitment to enhancing the paper. I look forward to seeing the final published version.
[# PeerJ Staff Note - this decision was reviewed and approved by Massimiliano Fasi, a PeerJ Section Editor covering this Section #]
The authors have already revised the paper according to the reviewer comments.
No comments.
Thank you for submitting your manuscript to PeerJ Computer Science. After careful review, the reviewers have raised some concerns regarding the methodology and experimentation that need to be addressed before we can proceed with the publication.
We kindly request that you revise your manuscript in light of the reviewers' comments and make the necessary adjustments. Please also provide a detailed response letter addressing each of the reviewers' suggestions and observations.
We are confident that, with these revisions, your manuscript will be considered for publication.
Thank you again for your contribution, and we look forward to receiving your revised submission.
The authors have satisfactorily responded to all my previous comments. I do not have any more concerns.
• The authors have already revised the paper according to the reviewer comments.
• The authors do not address the following comment:
o Authors report their results on a self-collected dataset only. They should evaluate the proposed methods on public datasets to ensure the generalizability of their approach.
• Authors should keep their tables of results consistent. Use the same metrics wherever possible, and arrange the metrics in the same order.
• There is still a conflict in the results tables. An example is presented in the PDF.
• The authors do not address the following comment:
o In the conclusion, authors emphasize that the model should be a “real-time crash prediction model”. However, the time cost of the proposed model is not studied, neither training time nor inference time.
I hope this email finds you well. After a thorough review of your manuscript by the assigned reviewers, I would like to inform you that, while there is potential in your work, several significant concerns have been raised regarding the experimentation and methodology.
The reviewers have pointed out that certain aspects of the experimental setup lack sufficient clarity and justification. In particular, they believe that more detailed explanations and stronger validations are necessary to support your findings. Additionally, methodological improvements have been recommended to ensure the robustness and reliability of the results.
In light of these concerns, we are requesting major revisions to the manuscript. We kindly ask that you carefully address each of the reviewers' comments in your revised submission, providing additional detail and supporting evidence where necessary.
Use the full form of abbreviations in the abstract.
Add keywords.
Lines 96 to 126 could be better placed before the authors present previous findings about driver actions.
Lines 228-231: The numbers can be put into a table for better representation.
There should be some description about XGBoost Random Forest algorithms.
Instead of saying "Balanced Approach" the authors should use the term "Balancing Approach".
The "Discussion" section does not contain any real discussion. It is just pointing out the main findings.
Table 7 is missing.
Did all the drivers go through the same conditions, driving time, scenarios, etc.?
How did the simulator take into account the tire conditions?
How would the authors justify having more male drivers in the participants than female?
So the simulation is done to observe conflicts, and not crashes? The model development will also be for the same purpose.
Figure 5: All features have the same importance? How is this possible? Furthermore, there is no discussion of each variable's impact on the output, which is a very important element of the prediction model. If the results are as shown in the figure, this should be stated as a limitation of the proposed model.
What are non-accidents?
Overall, the paper proposes an intriguing hybrid resampling technique and a GAANN-based framework to tackle imbalanced crash data and identify critical driver actions. Strengthening real-world validation, detailing feature importance, and clarifying model constraints would substantially enhance the paper’s contribution to the field of traffic safety and machine learning.
The study relies on a simulator-based dataset, which, while helpful for controlled experimentation, may not fully capture the complexity of real-world driving behaviors and crash scenarios. Validation with actual on-road data or naturalistic driving studies would strengthen the claims and improve generalizability.
While the paper highlights driver actions, vehicle telemetry, tire conditions, and weather, it would be useful to offer a clearer explanation of why these features are selected and how they interrelate. Deeper insight into each feature’s direct impact on crash likelihood would enhance interpretability.
The newly proposed DBSTLink approach for handling imbalance (DBSCAN + SMOTE-TL) is interesting, but the paper would benefit from a more thorough justification of hyperparameters and a clearer comparison to other state-of-the-art methods. Demonstrating how parameter choices influence performance would solidify confidence in the technique.
The article presents a significant contribution to the field by proposing a hybrid GAANN machine learning model with a novel DBSTLink data balancing approach for driving action detection and road crash prediction. Yet, several critical aspects require attention to enhance the scientific rigor and clarity of the manuscript.
The article is well-written, yet there are instances where the technical language is imprecise or ambiguous, which can hinder the comprehension of an international audience. The frequent use of non-standard abbreviations and technical jargon without immediate definitions detracts from readability. For example, the acronym DBSTLink is introduced early in the abstract without an adequate explanation of how it combines DBSCAN and SMOTETL. Such omissions could confuse readers unfamiliar with these methods.
The introduction provides a solid overview of the research problem and its relevance in the context of traffic safety and machine learning applications. Nevertheless, the background section lacks depth in explaining the broader context of previous work. Although various machine learning models and balancing techniques are mentioned, the manuscript fails to critically assess how the proposed DBSTLink method improves upon existing methods beyond reporting higher metrics. For example, it is stated that DBSTLink outperforms other models with an F1 score of 99%, but there is no in-depth explanation of why this result is statistically significant or how it addresses gaps in the literature.
The structure of the article generally follows a standard scientific format, but there are some issues that could be improved for better clarity and coherence. The transition between sections is occasionally abrupt, and some details that should be in the methods section appear in the introduction and vice versa. For example, the description of the driving simulation setup, which should be a part of the experimental design, is partially included in the introduction, causing unnecessary repetition.
The submission is self-contained to an extent, yet some results and discussions remain underdeveloped. The experimental results are comprehensive, but the manuscript does not always connect them back to the original research question. For instance, the performance of the hybrid GAANN classifier with different resampling techniques is reported extensively, but there is limited discussion on the practical implications of these findings for real-world applications. This omission reduces the broader impact and utility of the research.
Formal results should be presented with more precision and detailed definitions of terms and algorithms. While some terms such as SMOTE, DBSCAN, and Tomek Link are mentioned, their operational details are insufficiently defined. Moreover, there is a lack of detailed proof or mathematical justification for the superiority of DBSTLink compared to other techniques. For instance, the description of how synthetic samples are generated through SMOTE and refined with Tomek Link lacks a rigorous mathematical explanation, which would be expected in a high-quality scientific manuscript.
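As an indication of the level of formality meant here, the standard textbook definitions of the two building blocks can be written compactly. This is only a sketch of the commonly cited formulations, not the authors' own derivation; the symbols ($x_i$, $\lambda$, $d$, $k$) are notation assumed for illustration, and variants differ on whether both members of a Tomek link or only the majority-class member is removed.

```latex
% SMOTE: synthesize a minority-class sample by linear interpolation.
% x_i is a minority sample; \tilde{x}_i is one of its k nearest
% minority-class neighbours, chosen at random.
\[
  x_{\text{new}} = x_i + \lambda \,(\tilde{x}_i - x_i),
  \qquad \lambda \sim \mathcal{U}(0, 1).
\]

% Tomek link: a pair (x_i, x_j) with different labels, y_i \neq y_j,
% is a Tomek link if and only if no third sample is closer to either member:
\[
  \not\exists\, x_k :\quad
  d(x_i, x_k) < d(x_i, x_j)
  \;\text{ or }\;
  d(x_j, x_k) < d(x_i, x_j).
\]
```

A formal treatment of DBSTLink would then state where the DBSCAN step enters relative to these two operations and which samples each step adds or removes.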
The article does well in using figures and tables to present data, but there are instances where figures are not adequately described or discussed in the main text. For example, one of the figures depicts the DBSTLink algorithm’s flowchart, but the text does not fully explain the relevance of each step in relation to the overall objective of the study. Such omissions leave the reader without a complete understanding of how the algorithm contributes to the improved performance metrics reported later.
In terms of citations, the article references relevant literature, but there are instances where citations are sparse or missing for key claims. For example, when discussing the accuracy and recall values of various machine learning models, no citation is provided to validate these benchmarks. Besides, the claim that the hybrid GAANN classifier significantly outperforms other classifiers should be supported by a more extensive comparison with state-of-the-art techniques from recent literature.
The manuscript would benefit greatly from a more robust conclusion that synthesizes the findings and emphasizes their implications for future research. While the conclusion section briefly mentions future directions, it lacks a clear articulation of how this research could influence developments in the field. Given the importance of this work in predicting road crashes, a well-rounded conclusion would help position it within the broader academic and practical context.
The authors propose a combination of DBSCAN and SMOTE Tomek Link as a novel balancing strategy to mitigate data imbalance issues, which is an important and meaningful contribution to the field. Unfortunately, the manuscript does not fully articulate how this specific combination offers a significant improvement over similar methods described in prior studies. Explicit comparative discussions are minimal, and the extent to which this work advances existing knowledge beyond incremental improvement is not always clear.
The investigation demonstrates a reasonable effort to maintain a high technical standard by incorporating a driving simulator setup and collecting data under varying weather conditions to simulate real-world scenarios. The description of participant demographics and driving experience supports the robustness of the experimental setup. Yet, the absence of detailed statistical analysis of the participants' background raises concerns about potential biases that might affect the generalizability of the findings. The study’s reliance on simulation-based data, while understandable for safety reasons, limits the extent to which results can be directly applied to real-world driving environments. The authors acknowledge this limitation but fail to provide sufficient discussion on how this could influence the model's predictive accuracy in real-world contexts.
The methods section provides a detailed account of the machine learning workflow, covering data preprocessing, the DBSTLink balancing approach, and the GAANN classifier. The proposed algorithm for the DBSTLink strategy is outlined, but some critical elements are missing that affect reproducibility. The manuscript does not specify key hyperparameters for the genetic algorithm and artificial neural network, such as mutation rate, crossover probability, number of hidden layers, and activation functions. These omissions hinder full replication of the study.
While the flowcharts and descriptions of the DBSTLink algorithm provide a high-level overview, they lack the granular detail necessary for another researcher to implement the process without significant additional effort. The authors describe the evaluation metrics, such as accuracy, F1-score, MCC, and G-mean, which are appropriate for assessing performance on imbalanced datasets. However, there is no clear explanation of the rationale behind the selection of specific threshold values used for evaluation. This could lead to inconsistent results when attempting to replicate the study under similar conditions.
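For readers less familiar with these metrics, the following minimal sketch shows how F1-score, MCC, and G-mean follow from a binary confusion matrix and why they are preferred over accuracy on imbalanced data. The function names and the convention that label 1 marks the crash class are assumptions made for illustration; they are not taken from the manuscript.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary task; `positive` is the minority label."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# A classifier that never predicts the minority class looks accurate
# on imbalanced data, yet MCC and G-mean expose that it is useless.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
counts = confusion_counts(y_true, y_pred)
accuracy = (counts[0] + counts[2]) / len(y_true)
print(accuracy, mcc(*counts), g_mean(*counts))
```

All of these values depend on the decision threshold used to turn scores into labels, which is why the threshold choice needs to be reported alongside them.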
The manuscript claims that the proposed hybrid GAANN classifier outperforms other machine learning models across various performance metrics. Nevertheless, the results section does not always present comprehensive evidence to support these claims. While the tables compare the proposed method with other techniques, the explanations are sometimes insufficient. For instance, the text does not detail how the statistical significance of differences between models was assessed. The lack of confidence intervals or significance tests makes it difficult to judge whether the reported improvements are meaningful or due to random variation.
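One common way to supply the missing uncertainty estimate is a paired bootstrap over the test set, sketched below. This is an illustrative approach under stated assumptions, not a procedure described in the manuscript: the function names, the choice of F1 as the compared metric, the 2000 resamples, and the percentile interval are all assumptions, and other tests (for example McNemar's test) would serve equally well.

```python
import random


def f1(y_true, y_pred, positive=1):
    """F1-score of the positive class; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_ci(y_true, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for F1(model A) - F1(model B).

    The same resampled indices are used for both models, so the
    comparison is paired on identical test samples.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f1(yt, a) - f1(yt, b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the difference between the two models is unlikely to be due to resampling noise alone; if it straddles zero, the reported improvement should be described more cautiously.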
The manuscript addresses a timely and significant topic, yet there is no explicit assessment of the broader impact and novelty of this contribution within the existing body of literature. While the authors provide comparisons with related techniques like SMOTE, SMOTE-TL, and DBSM, the rationale for why the proposed combination of GAANN and DBSTLink represents a groundbreaking advancement is not well-articulated. The manuscript would benefit from a more detailed discussion on how this approach moves beyond existing methods and its potential implications for future research or real-world applications.
Encouraging meaningful replication is important, yet the current presentation does not provide a strong rationale for how replicating this method could broadly benefit the field.
The data on which the conclusions are based appear to be comprehensive in terms of the driving simulator setup and the collection of multiple variables, including driver inputs, vehicle kinematics, and weather conditions. Nevertheless, it is unclear whether all underlying data have been made publicly available or deposited in a suitable repository, as required for robust scientific validation. The authors mention that 75900 samples were recorded during simulations with a highly imbalanced distribution of crash events. While this large dataset is commendable, the description of how the data were cleaned and controlled for potential sources of bias is not provided. This raises questions about the robustness and statistical soundness of the data. The reported accuracy and F1 scores suggest high model performance, but without access to the raw data, it is difficult for readers to verify these claims independently.
The conclusions presented in the manuscript are generally consistent with the stated objectives and are linked to the original research question. The authors claim that the proposed DBSTLink method outperforms other balancing techniques and that the hybrid GAANN classifier delivers superior results. Nonetheless, some conclusions appear overstated given the evidence provided. For instance, the assertion that DBSTLink is the most robust balancing approach across all models lacks the necessary statistical validation to substantiate such a broad claim. The absence of confidence intervals and statistical significance tests makes it challenging to determine whether the observed improvements are genuinely meaningful or simply due to random variation. The manuscript would benefit from a more cautious interpretation of the results by focusing on what the data genuinely support without implying causation where only correlation has been established.
The study also highlights that driving actions, vehicular telemetry, tire conditions, and weather factors are critical contributors to crash prediction. Yet, no experimental interventions were conducted to establish a direct causal relationship between these factors and the likelihood of crashes. While correlational analyses are a valid starting point, any claim of causality requires a more controlled experimental approach. This distinction is essential to ensure that the conclusions remain within the boundaries of what the data can support. A more detailed explanation of potential confounding factors and their impact on the results would also enhance the strength of the conclusions.
The following factors can be considered to improve the study:
1. Introduce a more comprehensive analysis of the novelty and broader impact of the proposed GAANN-DBSTLink approach within the context of existing literature.
2. Provide a detailed explanation of the data cleaning and control procedures to enhance the statistical soundness and robustness of the dataset.
3. Ensure the raw data are deposited in a suitable public repository to facilitate independent validation and reproducibility.
4. Include the exact hyperparameter configurations for the genetic algorithm and artificial neural network to enable full replication of the study.
5. Add confidence intervals and statistical significance tests to validate performance claims and avoid potential overstatements of the results.
6. Clarify the rationale behind the selection of evaluation thresholds to support consistency and reliability in future replications.
7. Discuss the limitations of the simulation-based data and its implications for real-world applicability in greater detail.
8. Distinguish correlation from causation in the conclusions to maintain scientific accuracy and prevent overgeneralization of findings.
9. Incorporate a more cautious interpretation of the results by focusing only on what the presented data directly support.
10. Expand the discussion on the potential real-world applications and future directions to give context to the practical significance of the findings.
The primary goal of the manuscript is to enhance road safety by developing a hybrid machine learning model GAANN that can predict crash occurrences by taking into account the most important features of driving actions. In addition, the study introduces a novel data balancing method, DBSTLink, which combines SMOTE Tomek Link and DBSCAN techniques, significantly improving the accuracy and efficacy of crash prediction models. We think this idea is interesting. However, we have the following notes:
• The manuscript needs proofreading for grammatical errors. There are some typing errors, such as the duplication of the word “Road” in line 40, “is” in line 415, and the word “result” in line 472. Please check the entire manuscript.
Some sentences are overly complex and lengthy, making it difficult for readers to follow the argument. There are instances of repetitive information like “the aim of this work …” “The goal of this study …”.
The terms "driving action" and "driver behavior" are used interchangeably, whereas I think “driver behavior” is a subset of “driving action”.
• The introduction section is very long. It contains an excessive amount of statistical data; keep only the most important statistics. Consider condensing the information extracted from the related work articles.
• The manuscript structure meets the PeerJ standards.
• Most of the figures are relevant, good quality, well labelled & described. However, there are some notes:
• The caption of Figure 3 should be rewritten to reflect all the algorithms included.
• Figure 4: the arrows should be redrawn.
• Figure 5 is not referenced in the text.
• Table 6 and Table 7 captions are “Table 5” and “Table 6”. Correct them.
• In Tables 4, 5, 6, and 7, make the best result bold so that it stands out.
• Raw data is supplied.
• The research is purely a computer science topic, so it is within the scope of the journal.
• Research contributions are well defined.
• Most of the methods are described in sufficient detail, such as DBSCAN, SMOTE, Tomek Link, and GA with its hyperparameters. However, authors should explain the architecture of the ANN used and its hyperparameters.
• Authors report their results on a self-collected dataset only. They should evaluate the proposed methods on public datasets to ensure the generalizability of their approach.
• The authors do not include in their results the performance of the tested models on the raw data, then on the data after feature selection and before balancing.
• Comparing Table 6 and Table 4, there is a great conflict. The results in Table 6 do not match the results in Table 4 using the DBSTLink balancing technique. If these results are on the raw data, where is the benefit of the DBSTLink method?
• In the conclusion, authors emphasize that the model should be a “real-time crash prediction model”. However, the time cost of the proposed model is not studied, neither training time nor inference time.
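A timing study of the kind requested in the last point can be quite small. The sketch below measures median per-sample inference latency for any callable predictor; `time_inference`, `predict_fn`, and the dummy threshold model are hypothetical names used only for illustration, and the authors' actual model would be substituted for the placeholder.

```python
import statistics
import time


def time_inference(predict_fn, samples, repeats=5):
    """Return the median per-sample latency in seconds over `repeats` passes."""
    if not samples:
        raise ValueError("samples must not be empty")
    per_sample = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in samples:
            predict_fn(x)
        elapsed = time.perf_counter() - start
        per_sample.append(elapsed / len(samples))
    return statistics.median(per_sample)


def dummy_predict(features):
    """Placeholder model (an assumption): flags a sample when its feature sum exceeds 1."""
    return 1 if sum(features) > 1.0 else 0


latency = time_inference(dummy_predict, [[0.2, 0.9], [0.1, 0.3]] * 500)
print(f"median latency per sample: {latency * 1e3:.4f} ms")
```

Reporting this figure next to the sampling interval of the simulator would let readers judge whether the "real-time" claim holds on the hardware used.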
no comment
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.