All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
The reviewers are satisfied with the revisions, and I concur in recommending acceptance of this manuscript.
**PeerJ Staff Note:** The Academic Editor has requested that all code associated with this submission be shared on GitHub.
[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]
The manuscript is written in clear, unambiguous, and professional English throughout.
Sufficient background and context are provided, supported by appropriate literature references.
The article follows a professional structure, with well-organized figures and tables; raw data is also shared.
The study is self-contained, and the results are relevant and directly aligned with the stated hypotheses.
The study was conducted with rigorous investigation, adhering to high technical and ethical standards.
The methods are described in sufficient detail to allow for reproducibility.
All underlying data have been provided and are robust, statistically sound, and appropriately controlled.
The conclusions are clearly articulated, directly address the original research question, and are appropriately limited to the supporting results.
Most of my concerns have been addressed.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
-English: clear.
-Literature references: insufficient.
-Article structure, figures, and tables: need additional improvements.
-Include a graphical abstract that clearly outlines the key steps of the proposed solution.
-Conduct a more in-depth complexity analysis of the proposed method, presenting curves that illustrate the variation in complexity with respect to dataset size.
-Additionally, the authors should discuss the algorithms used in greater detail.
Research question: well-defined and relevant.
Method description: not sufficient.
Since the datasets used are well-known, dedicating separate subsections to each is unnecessary.
Numerous methods are available in the literature to handle missing values. Why was the mean method chosen exclusively? It would be more appropriate to compare its effectiveness against other well-known techniques in the context of the proposed method.
Similarly, the authors should provide justifications for each methodological choice at every step.
In the conclusion, clearly present the performance improvements achieved by the proposed method, expressed in percentage terms.
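The comparison requested in the missing-value comment above could start from something as small as the following sketch. It is only an illustration under stated assumptions: the `impute` helper and the toy `feature` values are hypothetical and are not taken from the manuscript, and median imputation stands in for "other well-known techniques" simply because it needs no extra libraries.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Toy skewed feature (assumed data): one outlier and two missing entries.
feature = [1.0, 2.0, None, 3.0, 100.0, None]

mean_filled = impute(feature, "mean")      # missing entries become 26.5
median_filled = impute(feature, "median")  # missing entries become 2.5

print("mean-imputed:  ", mean_filled)
print("median-imputed:", median_filled)
```

On a skewed feature the two strategies fill in very different values, which is why the choice of mean imputation deserves an explicit justification or an empirical comparison.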
This paper introduces a novel empirical copula-based framework for generating synthetic data that preserves both marginal and joint probability distributions and dependencies of 10 mixed-type features, addressing the issue of missing values and heterogeneous data. Specifically, the paper proposes a data generation approach that transforms each feature of the original dataset into a uniform distribution over [0,1]^d, enabling sampling and noise injection in a mathematically tractable space. While the method is mathematically elegant and avoids model training, I have several concerns:
1. In high-dimensional or low-sample scenarios, the transformed "uniform dataset" may be extremely sparse in the [0,1]^d space. Simply adding small noise to such sparse samples is unlikely to ensure adequate coverage and may result in synthetic samples clustering in a few regions. Have the authors considered this issue? It may be beneficial to explore more advanced strategies, such as using Vine Copulas to model dependencies between features or applying block-wise augmentation in lower-dimensional subspaces.
2. The manuscript (lines 35–45) claims substantial advantages over other approaches, particularly SDV-G. However, no theoretical justification or empirical evidence is provided to support this statement. It may strengthen the paper to include a comparative experiment or a theoretical analysis that clearly demonstrates the domains or conditions under which your method outperforms SDV-G.
3. While the introduction emphasizes that data augmentation is critical for improving model performance in low-resource or imbalanced settings, the experimental evaluation focuses solely on distributional similarity between real and generated data. To convincingly validate the utility of the proposed method, I encourage the authors to include experiments where the generated data is used to train models.
4. In line 175, the condition `if F[:,j] == F[:,j]` appears to be tautological and thus may be functionally meaningless. Should this be `if I[:,j] == F[:,j]` instead? Please verify and revise the pseudocode for clarity and correctness.
5. Insufficient discussion of related work. The paper lacks an adequate discussion of the related literature on data augmentation, and most of the references predate 2023. For instance, [a] also estimates the distributional similarity between real and generated data, and [c-e] are newly proposed data augmentation methods. The authors are strongly advised to read the related literature:
[a] "Investigating the effectiveness of data augmentation from similarity and diversity: An empirical study." Pattern Recognition 148 (2024): 110204.
[b] "A survey of synthetic data augmentation methods in machine vision." Machine Intelligence Research 21.5 (2024): 831-869.
[c] PostAugment
[d] EntAugment
[e] FreeAugment
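Regarding the self-comparison raised in point 4, one detail may be worth checking when the pseudocode is revised. Under IEEE 754 floating-point semantics (which Python floats and NumPy arrays follow), a value compared with itself is true everywhere except at NaN, so `x == x` is sometimes used as a missing-value test. Whether the manuscript intends this is not stated; the sketch below only illustrates the behaviour, and the `column` values are hypothetical.

```python
import math

# Hypothetical column with one missing value encoded as NaN.
column = [0.2, float("nan"), 0.7]

# Self-comparison: True for ordinary numbers, False for NaN.
self_equal = [x == x for x in column]

# The explicit equivalent, which states the intent directly.
not_missing = [not math.isnan(x) for x in column]

print(self_equal)   # [True, False, True]
print(not_missing)  # [True, False, True]
```

If a missing-value check is what the pseudocode means, writing it explicitly (for example, as an `isnan` test) would remove the ambiguity; if not, the condition needs correcting as the comment suggests.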
**PeerJ Staff Note:** PeerJ's policy is that any additional references suggested during peer review should only be included if the authors find them relevant and useful.
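The sparsity concern in point 1 can be made concrete with a small sketch. It assumes the common rank / (n + 1) convention for mapping each feature to (0, 1); the manuscript's exact transform may differ. The helper names, the toy Gaussian data, and the grid resolution are all hypothetical choices for illustration.

```python
import random


def pseudo_observations(column):
    """Map values to (0, 1) via rank / (n + 1); ties are broken by input order."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def occupied_fraction(points, bins):
    """Fraction of the bins**d grid cells of [0,1]^d that contain a point."""
    d = len(points[0])
    cells = {tuple(min(int(c * bins), bins - 1) for c in p) for p in points}
    return len(cells) / bins ** d


# Toy data (assumed): n = 50 samples, d = 6 independent Gaussian features.
rng = random.Random(0)
n, d, bins = 50, 6, 4
rows = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Transform each feature separately, then reassemble the rows.
columns = [list(col) for col in zip(*rows)]
uniform_columns = [pseudo_observations(col) for col in columns]
points = list(zip(*uniform_columns))

fraction = occupied_fraction(points, bins)
print(f"occupied cells: {fraction:.4f} of {bins ** d}")
```

With 50 samples and 4 bins per axis in 6 dimensions, at most 50 of the 4096 cells can be occupied, so small perturbations around the existing points leave most of the unit hypercube unvisited.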
Why are all figures and tables placed at the end of the manuscript rather than embedded near their corresponding places in the text?
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.