All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Authors have addressed all of the reviewers' comments.
This manuscript is ready for publication.
[# PeerJ Staff Note - this decision was reviewed and approved by Massimiliano Fasi, a PeerJ Section Editor covering this Section #]
Dear Authors,
Your work is very interesting, and your revisions have improved its quality.
Nevertheless, I believe that the points raised by Reviewer 2 are very important.
First and foremost, an evaluation for some K > 2 would give the work stronger validity and allow a better assessment of how well your methodology generalizes. If you are concerned about readability, you can place the extra tests in the supplementary material.
Second, rather than only real datasets, I believe you should focus on providing a comparison on benchmarks from the literature that contain convex-shaped clusters.
As a suggestion, you can start from the paper "Clustering benchmark datasets exploiting the fundamental clustering problems", available at the following link:
https://www.sciencedirect.com/science/article/pii/S2352340920303954
Please try to address the comments.
No comment.
No comment.
No comment.
The authors have addressed my concerns sufficiently. I have no additional comments.
The writing has been improved in the revised version.
Regarding the experimental design, the authors do not appear to have addressed the reviewer’s previous concerns. The current experiments remain largely descriptive, lacking rigorous quantitative metrics that would allow for meaningful comparison between the different indices.
The authors' justification for selecting k = 2 is also unconvincing. Experiments with k > 2 are necessary; discussing only k = 2 limits the generalizability of the findings generated in this work.
Furthermore, the decision to treat the labels estimated from ESC as a gold standard remains questionable. While the authors dismiss the use of datasets with ground truth labels, such datasets are in fact widely used for evaluating clustering methods. It is standard practice to perform clustering without using the true labels and then assess performance against the known ground truth using metrics such as ARI and NMI.
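To make this concrete, a minimal sketch of that standard workflow is shown below; the dataset, algorithm, and k = 3 are illustrative assumptions, not choices from the manuscript:

```python
# Sketch of the standard evaluation workflow: cluster without the labels,
# then score the result against the known ground truth with ARI and NMI.
# Assumes scikit-learn; Iris and k = 3 are placeholder choices.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y_true = load_iris(return_X_y=True)          # ground-truth labels kept aside
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```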
Same as noted in the Experimental design section. I also strongly suggest that the authors fully address the concerns about the experimental section in their revision.
Dear Authors,
Please go through all the reviewers' comments and carefully address each of them.
In particular, Reviewer #2 raised questions about the number of clusters taken into account and the need for quantitative measures to evaluate and compare consistency more rigorously.
Best,
M.P.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
The authors of this manuscript have compared six evaluation metrics for unsupervised machine learning (i.e., clustering) to assess their feasibility in real-world use cases. The authors compared the reliability of these metrics on six artificial datasets that had an obvious separation of clusters based on class types (e.g., matrices of zeros and ones) or intuition (e.g., brush strokes). They also propose a method called Euclidean similarity clustering (ESC), which (by default) assigns data points to one cluster and then assigns all other samples either to this cluster or to another (with k = 2) based on a threshold. Tests using this method, with the adjusted Rand index (ARI) for external validation on two real-world datasets based on electronic health records, indicate that the Silhouette coefficient and the Davies-Bouldin index are the most consistent with ARI.
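For readers unfamiliar with the idea, a rough sketch of such a threshold-based two-cluster assignment is given below; the seed choice, threshold value, and toy data are assumptions for illustration, not the authors' exact Algorithm 1:

```python
import numpy as np

def esc_like_assignment(X, threshold, seed_index=0):
    """Illustrative two-cluster (k = 2) assignment by Euclidean distance
    threshold -- a plausible reading of the ESC description, not the
    authors' exact Algorithm 1."""
    seed = X[seed_index]                          # assumed starting point
    dists = np.linalg.norm(X - seed, axis=1)      # Euclidean distance to the seed
    return np.where(dists <= threshold, 0, 1)     # within threshold -> cluster 0, else cluster 1

# Toy usage on two well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
labels = esc_like_assignment(X, threshold=2.5)
```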
The manuscript was easy to follow, had sufficient references, and tables/figures were legible and defined fairly well. The authors have conducted empirical experiments on multiple scenarios to validate their findings, and this is significant.
The research falls within the aims and scope of this journal and of computer science. The problem statement and research question were well defined. Adequate information has been provided in the manuscript and the code to attempt experimental reconstruction and validate the findings.
I have some questions to hopefully improve the standards of the manuscript.
1. Some mention should be made of imbalanced data points or outliers, especially when real-world datasets are used, as imbalanced class types are common across various fields (e.g., fraud detection, predictive maintenance). How would these metrics and the suggested method perform under such an edge case?
2. What are the authors' educated opinions on the effect of increasing dimensionality (as commonly seen in EHR data) and its tendency to affect the behavior and discriminatory power of the six internal clustering metrics, especially regarding the “curse of dimensionality” and sparsity issues? The maximum number of features in the real-world EHR data is 20 (i.e., the Diabetes Type 1 EHR dataset), and while the results are significant, some comment on more complex datasets should be included.
3. How do the authors define "sparse datasets" in Section IV? Is this based on the ratio of the dataset's dimensionality to its size, or on average inter-point distances?
4. While the Dunn index ranked among the most effective on the artificial convex datasets, it appears to have been outperformed by the Silhouette and Davies-Bouldin indices on the real EHR data. Can the authors elaborate on the specific characteristics of those real datasets (e.g., distribution skew, noise levels) that reduce Dunn’s reliability?
Minor comment:
Shouldn't Algorithm 1 (row 5) have y ∈ [2; M] instead of [2; N]?
The authors compare six internal clustering metrics: the Silhouette coefficient, Davies-Bouldin index, Dunn index, Calinski-Harabasz index, Shannon entropy, and Gap statistic, using both artificial and real datasets commonly encountered in bioinformatics and health informatics. The paper presents a systematic analysis with clear writing. Additionally, the literature references are appropriate. While the writing is clear, the authors could enhance the paper by extracting key results and providing a more concise summary of the overall conclusions from the different experiments. Furthermore, it would be beneficial if the authors could give more motivation for their design choices (e.g., the rationale behind developing ESC) and explain why they selected these particular real datasets for the study.
The paper presents extensive experiments showcasing examples of clustering problems, which I appreciate. However, the analysis of the experimental results remains questionable. Many of the experiments rely heavily on visual inspection or exploratory analysis. For instance, in several tables, the authors conclude that the trends are consistent with the ARI but fail to assess the degree of this consistency, which is crucial for comparing the metrics effectively. I strongly recommend that the authors incorporate quantitative measures to evaluate and compare consistency more rigorously. Moreover, the excessive reliance on descriptive analysis makes the results difficult for readers to digest and interpret. Without quantitative comparisons (e.g., some consistency metrics), the conclusions lack statistical rigor and robustness.
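As one concrete (and purely illustrative) option, the degree of agreement could be summarized with a rank correlation between each internal index and ARI across the experimental runs; the score lists below are placeholders, not values from the manuscript:

```python
# Hypothetical example: quantify "consistency with ARI" via Spearman rank
# correlation. The lists are placeholders; in practice each entry would be
# the score obtained on one dataset/clustering configuration.
from scipy.stats import spearmanr

ari        = [0.91, 0.74, 0.52, 0.33, 0.10]   # external validation (ARI) per run
silhouette = [0.65, 0.58, 0.41, 0.30, 0.12]   # internal index per run

rho, p_value = spearmanr(ari, silhouette)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

(For lower-is-better indices such as Davies-Bouldin, the sign of the correlation would simply be expected to flip.)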
Furthermore, some analyses are limited by the assumption of only two clusters, which is overly simplistic and impedes the generalizability of the findings. Clustering problems often involve a varying number of clusters, ranging from a few to hundreds. Restricting the analysis to such a small number of clusters is insufficient for capturing the complexity of real-world scenarios.
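Extending the evaluation beyond k = 2 should be straightforward with any off-the-shelf algorithm; a minimal sketch is shown below, where k-means and the synthetic data are stand-ins rather than the authors' setup:

```python
# Sketch of sweeping k and recording internal indices for each clustering.
# KMeans and make_blobs are stand-ins; X would be whichever dataset is under study.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # placeholder data

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"Davies-Bouldin={davies_bouldin_score(X, labels):.3f}")
```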
Additionally, since the primary goal is to benchmark these metrics, I believe the authors should include real datasets with ground truth labels. Using labels estimated from ESC as the gold standard to calculate ARI is questionable and lacks a solid foundation.
Given the points I raised in Section 2, although some findings in this paper may be valid, I do not believe they are sufficiently robust or statistically sound. The authors should focus on improving their comparison approaches.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.