Review History
Sequence-based undersampling: an algorithm for managing imbalanced datasets

All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.

Summary

The initial submission of this article was received on December 11th, 2024 and was peer-reviewed by 2 reviewers and the Academic Editor.
The Academic Editor made their initial decision on February 7th, 2025.
The first revision was submitted on March 10th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
A further revision was submitted on June 26th, 2025 and was reviewed by 1 reviewer and the Academic Editor.
The article was Accepted by the Academic Editor on July 4th, 2025.

Version 0.3 (accepted)

Martina Iammarino · Jul 4, 2025 · Academic Editor

Accept

The authors have thoroughly addressed the comments and questions raised during the review process. Their responses demonstrate a clear effort to improve the manuscript, and acceptance is therefore recommended. Nonetheless, the discussion in the PRECEDENTS section could be strengthened by examining the individual components more analytically, moving beyond descriptive summaries of existing methods.

[# PeerJ Staff Note - this decision was reviewed and approved by Claudio A. Ardagna, a PeerJ Section Editor covering this Section #]

Reviewer 1 · Jul 2, 2025

Basic reporting

The authors have answered my questions in detail, so I recommend acceptance. However, the discussion in the PRECEDENTS section would benefit from a more analytical examination of each component, rather than focusing primarily on descriptive summaries of existing methods.

Experimental design

N/A

Validity of the findings

N/A

Additional comments

N/A

Cite this review as

Anonymous Reviewer (2025) Peer Review #1 of "Sequence-based undersampling: an algorithm for managing imbalanced datasets (v0.3)". PeerJ Computer Science

Download Version 0.3 (PDF) Download author's response letter (v0.3) - submitted Jun 26, 2025

Version 0.2

Martina Iammarino · Mar 30, 2025 · Academic Editor

Major Revisions

After carefully reviewing the comments provided by both reviewers, I believe that the manuscript shows potential for publication, provided that substantial revisions are made to address the concerns raised.

Both reviewers identified critical weaknesses in the literature review and experimental justification. Specifically, the PRECEDENTS section lacks depth in the analysis of state-of-the-art methods and omits more recent and relevant approaches to imbalanced data, especially in sequential contexts. This affects the clarity of the motivation for the proposed method and its positioning in relation to existing techniques. Moreover, a more appropriate comparison with other data-level and sequence-sensitive undersampling techniques is necessary to validate the originality and impact of the proposed approach.

Further concerns were raised regarding the presentation and clarity of results. These include the explanation of negative R² values, unclear parameter selection in the SBU framework, the lack of sensitivity analysis, and missing or incorrect dataset references. Additionally, the rationale for comparing the proposed undersampling method primarily with oversampling techniques is currently weak, and more recent undersampling methods should be included to provide a balanced evaluation.

Although one reviewer suggested rejection, I believe that the issues raised—while significant—can be addressed through a comprehensive revision. Therefore, I am recommending a major revision to give the authors the opportunity to improve the manuscript in line with the reviewers’ feedback.

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

Reviewer 1 · Mar 26, 2025

Basic reporting

The authors provided a brief response to the reviewers' comments, but it still failed to meet the standards of the journal. Below are my detailed comments and suggestions for improvement:
1. The PRECEDENTS section should provide a more thorough analysis of both classical and recent state-of-the-art methods, highlighting their respective strengths and limitations. However, the current discussion lacks depth in analyzing the cited methods. Additionally, many references are outdated. Apart from the classical SMOTE approach, only one oversampling method from 2025 is included, while the undersampling methods cited are also based on relatively old literature.
2. Table 1 should be more concise, emphasizing key takeaways rather than providing lengthy descriptions of algorithms. A more structured and focused presentation would improve readability and clarity.
3. Given that the proposed approach operates at the data level, it is unclear why the PRECEDENTS section dedicates a significant portion to algorithm-level methods. A more relevant comparison to other data-level approaches would strengthen the motivation for the proposed method.
4. In the last row of Figure 2, the reported R² values appear to be negative. Since R² typically ranges from 0 to 1 (mentioned on page 6, line 184), could the authors clarify why negative values are observed? This discrepancy requires further explanation, as it may indicate potential issues with the evaluation process.
5. Regarding Sequence-Based Undersampling (SBU), how does the proposed method fundamentally differ from traditional undersampling techniques such as Tomek Links and Edited Nearest Neighbor? What are its key innovations? Additionally, are there any other sequence-sensitive undersampling techniques that could serve as a baseline for comparison?
6. In the SBU framework, how are the “average sequence length” and the proportion of instances to be removed determined? Have the authors conducted a sensitivity analysis to assess the impact of different parameter choices on model performance? Would different configurations lead to significant variations in the results? A discussion on this aspect would strengthen the validity of the approach.
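
A note on comment 4, assuming the manuscript uses the standard coefficient of determination, R² = 1 - SS_res / SS_tot (the manuscript's own definition is not reproduced on this page). Under that definition R² is bounded above by 1 but has no lower bound of 0: it becomes negative whenever a model predicts worse than a constant predictor fixed at the mean of the targets. The sketch below uses made-up toy values, and the function name `r_squared` is illustrative only; none of it is taken from the manuscript.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets (hypothetical, for illustration only).
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)             # predictions match exactly
mean_baseline = r_squared(y_true, [2.5] * 4)    # always predict the mean
reversed_trend = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, mean_baseline, reversed_trend)
```

Here the perfect predictor scores 1, the mean predictor scores 0, and the reversed-trend predictor scores below 0, so a negative value signals a poor fit on the evaluated data rather than a computational error in itself.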

Experimental design

N/A

Validity of the findings

N/A

Additional comments

N/A

Cite this review as

Anonymous Reviewer (2025) Peer Review #1 of "Sequence-based undersampling: an algorithm for managing imbalanced datasets (v0.2)". PeerJ Computer Science

Reviewer 2 · Mar 29, 2025

Basic reporting

The authors have not fully addressed the following comments from my previous review:
[R2.2]: The additional works included in the literature review lack detailed discussion, and some fail to provide any useful details. Moreover, the literature review does not include any references that address the issue of imbalanced data in sequential datasets and existing solutions to it.
[R2.4]: The manuscript does not currently explain the ranges for R² and PR of the RUS method, nor does it clarify why the obtained values are presented as ranges as shown in Table 6.
[R2.5]: The link to the dataset is missing, and the corresponding reference item that was previously available is also absent.
[R2.10]: To avoid any confusion among readers, the authors should provide a clear explanation of this issue directly in the manuscript.
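
Regarding [R2.4], one common reason for reporting a metric as a range under random undersampling (RUS) is that each run keeps a different random subset of the majority data, so the metric varies from run to run. Whether this is the reason behind the ranges in Table 6 is an assumption here; this page does not say. The sketch below only illustrates how such a range could be produced and reported. The helper `random_undersample`, the toy data, and the stand-in "metric" (the mean of the kept values) are all hypothetical.

```python
import random


def random_undersample(majority, n_keep, seed):
    """Keep n_keep randomly chosen majority items, reproducibly per seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(majority, n_keep))


# Toy majority-region values (hypothetical, for illustration only).
majority = list(range(100))

# Stand-in for a real evaluation metric such as R²: the mean of kept values.
scores = []
for seed in range(5):
    kept = random_undersample(majority, n_keep=10, seed=seed)
    scores.append(sum(kept) / len(kept))

# Report the spread across seeds instead of a single number.
low, high = min(scores), max(scores)
print(f"metric range over {len(scores)} seeds: [{low:.2f}, {high:.2f}]")
```

Stating the number of runs and the seeds next to such a range lets readers tell run-to-run variation apart from differences between methods.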

Experimental design

[R2.11]: The justification for comparing the proposed undersampling method primarily with oversampling techniques is insufficient. For a fair comparison, please consider adding additional undersampling methods—preferably more recent ones—to the evaluation.

Validity of the findings

[R2.14]: The third paragraph of the conclusion is not reasonable for two main reasons:
(1) The undersampling methods mentioned, namely ENN, Tomek Links, and CNN, are not advanced techniques.
(2) There is no evidence demonstrating the performance shortcomings of these techniques on the dataset used in this paper. A more robust, evidence-based discussion is needed to support such a claim.

Additional comments

The header of the first column in Table 6 is incorrect.

Cite this review as

Anonymous Reviewer (2025) Peer Review #2 of "Sequence-based undersampling: an algorithm for managing imbalanced datasets (v0.2)". PeerJ Computer Science

Download Version 0.2 (PDF) Download author's response letter (v0.2) - submitted Mar 10, 2025

Version 0.1 (original submission)

Martina Iammarino · Feb 7, 2025 · Academic Editor

Major Revisions

The manuscript requires major revisions before further consideration for publication. While the proposed sequential-based undersampling (SBU) method presents potential, several aspects need improvement.

The introduction should better emphasize the challenges of imbalanced data in regression tasks and remove redundant content to enhance clarity. A more detailed discussion of recent advancements in imbalanced regression techniques, particularly in sequential data, should be incorporated, with updated references from the past five years.

The literature review needs expansion to include recent works on imbalanced data classification. Suggested references may be considered if deemed relevant.

Figures and tables require refinement. Figure 2 should highlight key points more effectively, and figures should appear in the correct order. Tables should present results more clearly, with highlighted key findings and a thorough discussion of all metrics. Specific inconsistencies in Table 4 regarding R-squared values should be resolved.

Several methodological clarifications are necessary. The dataset description should include explanations of each feature, and the dataset link must be corrected. Selection criteria for data subsets should be justified, and Algorithm 1 should be checked for potential errors. The choice of existing techniques such as RUS, ROS, SMOTE, and Gaussian Noise Oversampling should be justified, or alternative modern approaches should be explored. Additionally, the handling of identical timestamps in sequential data should be explicitly described.

The results and discussion require further refinement. Improvements should be described consistently, avoiding contradictory statements. The relationship between the data and the results in Figures 5–7 should be better explained, and discussions on methodology and results should be made more precise.

The manuscript contains some grammatical errors and typos that need correction. Citation formats should be checked for consistency, and redundant phrases should be removed to enhance readability.

Given these necessary revisions, the authors should carefully address each point and submit a revised manuscript along with a detailed response document outlining the changes made.

[# PeerJ Staff Note: It is PeerJ policy that additional references suggested during the peer-review process should *only* be included if the authors are in agreement that they are relevant and useful #]

[# PeerJ Staff Note: The review process has identified that the English language must be improved. PeerJ can provide language editing services if you wish - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Your revision deadline is always extended while you undergo language editing. #]

Reviewer 1 · Jan 27, 2025

Basic reporting

This paper proposes a sequential-based undersampling method to deal with the imbalance problem in sequential datasets. It uses the natural order of the data to select and remove overrepresented samples. My comments and suggestions follow:

1. The authors address the imbalance issue in regression tasks, but the abstract does not highlight the main challenges imbalanced data poses in regression scenarios.

2. The introduction is somewhat redundant. It is recommended that the authors focus more on the imbalance issue and streamline the content.

3. It is suggested that the authors conduct a comprehensive review of recent literature on imbalanced data classification, particularly focusing on papers published in the past two years.
For example:
- Gao, Jintong, et al. "Enhancing minority classes by mixing: an adaptive optimal transport approach for long-tailed classification." Advances in Neural Information Processing Systems 36 (2024).
- Niu J, Zhang Z, Liu Z. "Class Activation Maps-based Feature Augmentation for long-tailed classification." Expert Systems with Applications, 2024, 249: 123588.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.

4. Figure 2 is unclear. It is suggested that the authors highlight the key points more effectively.

5. There are some typos. For example, Line 325 should read "Sequential-based undersampling". Please check the manuscript carefully.

6. The citation format of the references needs to be consistent. It is recommended that the authors conduct a thorough review.

Experimental design

N/A

Validity of the findings

N/A

Additional comments

N/A

Cite this review as

Anonymous Reviewer (2025) Peer Review #1 of "Sequence-based undersampling: an algorithm for managing imbalanced datasets (v0.1)". PeerJ Computer Science

Reviewer 2 · Jan 31, 2025

Basic reporting

- The manuscript is generally well-written in clear, professional English, and the structure adheres to the standards of a scientific paper, with a logical flow from introduction to methods, results, and conclusions. However, there are occasional grammatical issues, misspellings (e.g., “secuential” in lines 325–326), and redundant phrases (e.g., lines 202–208) that should be revised to enhance readability and precision.
- While the introduction provides a solid foundation for the study, it could be strengthened by incorporating a more detailed discussion of recent advancements in imbalanced regression techniques, particularly those relevant to sequential data. Many of the references and methods cited are outdated, and the manuscript would benefit from including more recent work (e.g., from the last 5 years) to better contextualize the novelty and relevance of the proposed Sequential-Based Undersampling (SBU) algorithm. This would also help highlight how the study advances the current state of the art.
- Some figures could be improved to better convey key information. For instance:
Figure 1: The legend should be described in the text to ensure readers understand the significance of the input variables and their distributions.
Figure 2: The current presentation of the target variable’s distribution makes it difficult to discern details for lower-frequency values. The authors should consider rescaling the y-axis or inserting a zoomed-in window to better visualize these less frequent but critical data points.
Figure 4: The legends and annotations could be made clearer to help readers interpret the density distributions after applying different oversampling techniques. Adding more descriptive labels or captions would enhance the figure’s utility.
- The presentation of results in tables could be improved to make it easier for readers to identify key findings. For example, highlighting the best-performing results in tables (e.g., Table 2 and Table 6). All metrics presented in tables should be thoroughly discussed in the text. For instance, in Table 2, the RMSE and Pearson correlation (PR) values are not mentioned at all. Similarly, in Table 6, the range of R² and Pearson correlation values for random undersampling is not explained.
- Figures should appear in the order they are mentioned in the text (Line 376).

Experimental design

- A description of the dataset, including a brief explanation of each feature, should be included in both the dataset link and the manuscript. Additionally, the hyperlink to the dataset appears to be incorrect and should be updated.
- Since only a subset of the data was used, a clear description of the subsets (including how many instances and imbalance ratio) included in the experiments should be provided. Additionally, the selection of these specific portions should be justified.
- The methods are described with sufficient detail to allow replication. However, there may be an issue in Line 26 of Algorithm 1, where the multiplier for each sequence should not be “p” if the total percentage to remove is “p”. Providing the code alongside the manuscript would greatly enhance reproducibility and allow for independent verification of the method.
- The discussion of results in Lines 214–218 appears inconsistent. Both methods show roughly the same level of improvement, yet the authors describe one as a “slight enhancement” and the other as a “significant improvement.”
- Revise the phrase in Line 245, as scaling does not address the issue of imbalance.
- Specify which dataset was used to generate the results in Table 4. Additionally, the R-squared value for "No preprocessing" in Table 4 is inconsistent with the results presented in Table 2.
- The selection of RUS, ROS, SMOTE, and Gaussian Noise Oversampling is not justified, especially since these methods are outdated and have been surpassed by more recent techniques. Please provide a rationale for choosing these methods or consider including newer approaches.

Validity of the findings

- There are pairs of data with identical timestamps. How do the authors handle the sequencing of these data points?
- The sources of the results in Figures 5–7 are not mentioned. The relationship between the data and the results should be discussed to provide clearer insight.
- The discussion in Lines 427-428 does not appear to be reasonable. Please clarify or revise.
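
On the first point above (identical timestamps): how the authors handle ties is not stated on this page. One generic way to make the ordering deterministic is to break ties on a secondary key such as the original row position. The sketch below is only an illustration of that idea; the field names `timestamp`, `value`, and `row` and the toy records are hypothetical, not taken from the manuscript's dataset.

```python
from collections import Counter
from operator import itemgetter

# Toy records with a repeated timestamp (hypothetical values).
records = [
    {"timestamp": 3, "value": 0.7},
    {"timestamp": 1, "value": 0.2},
    {"timestamp": 3, "value": 0.1},
    {"timestamp": 2, "value": 0.5},
]

# Report which timestamps occur more than once.
counts = Counter(r["timestamp"] for r in records)
duplicated = sorted(t for t, n in counts.items() if n > 1)

# Attach the original row position, then sort by (timestamp, row) so that
# tied timestamps keep a fixed, documented order.
indexed = [dict(r, row=i) for i, r in enumerate(records)]
ordered = sorted(indexed, key=itemgetter("timestamp", "row"))

print("duplicated timestamps:", duplicated)
print("row order after sorting:", [r["row"] for r in ordered])
```

Python's `sorted` is stable, so sorting on `timestamp` alone would also preserve input order among ties; the explicit `row` key simply makes the tie-breaking rule visible and independent of any earlier shuffling.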

Additional comments

- The manuscript presents a novel and potentially impactful algorithm (SBU) for handling imbalanced datasets in regression tasks, particularly for sequential data, which could contribute well to the field.
- Expand the discussion on the limitations of the SBU algorithm and its potential applicability to other types of datasets.

Cite this review as

Anonymous Reviewer (2025) Peer Review #2 of "Sequence-based undersampling: an algorithm for managing imbalanced datasets (v0.1)". PeerJ Computer Science

Download Original Submission (PDF) - submitted Dec 11, 2024

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
