Javascript is disabled in your browser. Please enable Javascript to view PeerJ.

Review History
Empirical copula-based data augmentation for mixed-type datasets: a robust approach for synthetic data generation

All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.

View examples of open peer review.

Summary

The initial submission of this article was received on April 23rd, 2025 and was peer-reviewed by 2 reviewers and the Academic Editor.
The Academic Editor made their initial decision on July 7th, 2025.
The first revision was submitted on August 7th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
The article was Accepted by the Academic Editor on August 29th, 2025.

Version 0.2 (accepted)

Shibiao Wan · Aug 29, 2025 · Academic Editor

Accept

Reviewers are satisfied with the revisions, and I concur to recommend accepting this manuscript.

**PeerJ Staff Note:** The Academic Editor has requested that all code associated with this submission be shared on GitHub.

[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]

Karim El Moutaouakil · Aug 8, 2025

Basic reporting

The manuscript is written in clear, unambiguous, and professional English throughout.

Sufficient background and context are provided, supported by appropriate literature references.

The article follows a professional structure, with well-organized figures and tables; raw data is also shared.

The study is self-contained, and the results are relevant and directly aligned with the stated hypotheses.

Experimental design

The study was conducted with rigorous investigation, adhering to high technical and ethical standards.

The methods are described in sufficient detail to allow for reproducibility.

Validity of the findings

All underlying data have been provided and are robust, statistically sound, and appropriately controlled.

The conclusions are clearly articulated, directly address the original research question, and are appropriately limited to the supporting results.

Cite this review as

El Moutaouakil K (2025) Peer Review #1 of "Empirical copula-based data augmentation for mixed-type datasets: a robust approach for synthetic data generation (v0.2)". PeerJ Computer Science

Reviewer 2 · Aug 11, 2025

Basic reporting

Most of my concerns have been addressed.

Experimental design

-

Validity of the findings

-

Cite this review as

Anonymous Reviewer (2025) Peer Review #2 of "Empirical copula-based data augmentation for mixed-type datasets: a robust approach for synthetic data generation (v0.2)". PeerJ Computer Science

Download Version 0.2 (PDF) Download author's response letter (v0.2) - submitted Aug 7, 2025

Version 0.1 (original submission)

PeerJ Staff · Jul 7, 2025 · Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.

Karim El Moutaouakil · May 9, 2025

Basic reporting

-English: clear.

-Literature references: not sufficient enough.

-Article structure, figures, and tables: need additional improvements.

-Include a graphical abstract that clearly outlines the key steps of the proposed solution.

-Conduct a more in-depth complexity analysis of the proposed method, presenting curves that illustrate the variation in complexity with respect to dataset size.

-Additionally, the authors should discuss the algorithms used in greater detail.

Experimental design

Research question: well-defined and relevant.
Method description: not sufficient.

Since the datasets used are well-known, dedicating separate subsections to each is unnecessary.

Validity of the findings

Numerous methods are available in the literature to handle missing values. Why was the mean method chosen exclusively? It would be more appropriate to compare its effectiveness against other well-known techniques in the context of the proposed method.

Similarly, the authors should provide justifications for each methodological choice at every step.

In the conclusion, clearly present the performance improvements achieved by the proposed method, expressed in percentage terms.

Cite this review as

El Moutaouakil K (2025) Peer Review #1 of "Empirical copula-based data augmentation for mixed-type datasets: a robust approach for synthetic data generation (v0.1)". PeerJ Computer Science

Reviewer 2 · Jun 24, 2025

Basic reporting

This paper introduces a novel empirical copula-based framework for generating synthetic data that preserves both marginal and joint probability distributions and dependencies of 10 mixed-type features, addressing the issue of missing values and heterogeneous data. Specifically, the paper proposes a data generation approach that transforms each feature of the original dataset into a uniform distribution over [0,1]^d, enabling sampling and noise injection in a mathematically tractable space. While the method is mathematically elegant and avoids model training, I have several concerns:

1. In high-dimensional or low-sample scenarios, the transformed "uniform dataset" may be extremely sparse in the [0,1]^d space. Simply adding small noise to such sparse samples is unlikely to ensure adequate coverage and may result in synthetic samples clustering in a few regions. Has authors considered this issue? It may be beneficial to explore more advanced strategies, such as using Vine Copulas to model dependencies between features or applying block-wise augmentation in lower-dimensional subspaces.

2. The manuscript (lines 35–45) claims substantial advantages over other approaches, particularly SDV-G. However, no theoretical justification or empirical evidence is provided to support this statement. It may strengthen the paper to include a comparative experiment or a theoretical analysis that clearly demonstrates the domains or conditions under which your method outperforms SDV-G.

3. While the introduction emphasizes that data augmentation is critical for improving model performance in low-resource or imbalanced settings, the experimental evaluation focuses solely on distributional similarity between real and generated data. To convincingly validate the utility of the proposed method, I encourage the authors to include experiments where the generated data is used to train models.

4. In line 175, the condition if F[:,j] == F[:,j] appears to be tautological and thus may be functionally meaningless. Should this be if I[:,j] == F[:,j] instead? Please verify and revise the pseudocode for clarity and correctness.

5. Insufficient discussion of related work. The paper currently lacks a sufficient discussion of related literature on data augmentation. Most of the references are from before 2023. For instance, [a] also estimates the distributional similarity between real and generated data. [c-e] are newly proposed data augmentation methods. The authors are strongly advised to read the related literature:
[a] "Investigating the effectiveness of data augmentation from similarity and diversity: An empirical study." Pattern Recognition 148 (2024): 110204.
[b] "A survey of synthetic data augmentation methods in machine vision." Machine Intelligence Research 21.5 (2024): 831-869.
[c] PostAugment
[d] EntAugment
[e] FreeAugment

**PeerJ Staff Note:** The PeerJ's policy is that any additional references suggested during peer review should only be included if the authors find them relevant and useful.

Experimental design

Why are all figures and tables placed at the end of the manuscript rather than embedded near their corresponding places in the text?

Validity of the findings

-

Cite this review as

Anonymous Reviewer (2025) Peer Review #2 of "Empirical copula-based data augmentation for mixed-type datasets: a robust approach for synthetic data generation (v0.1)". PeerJ Computer Science

Download Original Submission (PDF) - submitted Apr 23, 2025

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Review History Empirical copula-based data augmentation for mixed-type datasets: a robust approach for synthetic data generation

Summary

Version 0.2 (accepted)

Shibiao Wan · Aug 29, 2025 · Academic Editor

Karim El Moutaouakil · Aug 8, 2025

Basic reporting

Experimental design

Validity of the findings

Reviewer 2 · Aug 11, 2025

Basic reporting

Experimental design

Validity of the findings

Version 0.1 (original submission)

PeerJ Staff · Jul 7, 2025 · Academic Editor

Karim El Moutaouakil · May 9, 2025

Basic reporting

Experimental design

Validity of the findings

Reviewer 2 · Jun 24, 2025

Basic reporting

Experimental design

Validity of the findings

Review History
Empirical copula-based data augmentation for mixed-type datasets: a robust approach for synthetic data generation