Review History
Efficient processing of complex XSD using Hive and Spark

All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.

Summary

The initial submission of this article was received on April 23rd, 2021 and was peer-reviewed by 2 reviewers and the Academic Editor.
The Academic Editor made their initial decision on June 1st, 2021.
The first revision was submitted on June 29th, 2021 and was reviewed by 1 reviewer and the Academic Editor.
The article was Accepted by the Academic Editor on July 6th, 2021.

Version 0.2 (accepted)

Yilun Shang · Jul 6, 2021 · Academic Editor

Accept

The paper can be accepted. Congratulations.

Aarti Chugh · Jul 5, 2021

Basic reporting

I appreciate the authors' work in addressing all observations. The proofreading of the document further ensures clear and unambiguous English. Hence, the resubmission is acceptable for publication.

Experimental design

Since the authors have answered all observations in detail, I am satisfied with the improved version of the document. I have no further questions.

Validity of the findings

All findings are validated properly. No further queries.

Additional comments

All previously given observations have been considered and corrected by the authors in the resubmission. All the best for future endeavors.

Cite this review as

Chugh A (2021) Peer Review #2 of "Efficient processing of complex XSD using Hive and Spark (v0.2)". PeerJ Computer Science https://doi.org/10.7287/peerj-cs.652v0.2/reviews/2

Version 0.1 (original submission)

Yilun Shang · Jun 1, 2021 · Academic Editor

Major Revisions

We have received reports from two reviewers for the paper. Both reviewers found some merit in the paper, but they also pointed out some drawbacks. The experimental design and methodology need to be further clarified. Please provide detailed responses to the reviewers. Note that you should not cite references recommended by reviewers if they are not appropriate.

[# PeerJ Staff Note: Please ensure that all review comments are addressed in a response letter and any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate. It is a common mistake to address reviewer questions in the response letter but not in the revised manuscript. If a reviewer raised a question then your readers will probably have the same question so you should ensure that the manuscript can stand alone without the response letter. Directions on how to prepare a response letter can be found at: https://peerj.com/benefits/academic-rebuttal-letters/ #]

[# PeerJ Staff Note: The Academic Editor has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title) #]

[# PeerJ Staff Note: It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful #]

Reviewer 1 · May 1, 2021

Basic reporting

The paper is well structured; however, there are grammatical and sentence-construction errors in multiple places. A few examples follow:

"However, a more common approach in that works involves the simplest examples of XML documents, even though, the real data sets are composed of complex schemas that include nested arrays and structures."
"The reporting tool used is Spark SQL but no details about the implementation are presented."
"With the purpose of reducing the lack of methods for processing XML files with complex schemas, in this study we present our approach based on three main methods: (1) catalog, (2) deserialization, and (3) positional explode."

I recommend that the authors correct these.

Experimental design

The problem statement that the authors are trying to solve is not clearly stated.
Are they trying to compare big data frameworks on how fast they can process complex XSD, or are they trying to prove that their approach based on cataloging, deserialization and positional explode is superior to the other approaches in the related work?
If they are trying to do the former, they need to consider several performance characteristics. Big data frameworks usually run on clusters, not on a single node.
The Apache Spark version used (1.6.0) is outdated and retired. I would recommend that the authors use Apache Spark 3.0 since its performance is almost that in 1.6.0
The same goes for Apache Hive. For end customers, the reasons for choosing internal tables versus external tables are very different; however, that choice does affect performance. The authors have not disclosed whether, when using internal tables, their results are affected by caching in Hive.

Validity of the findings

While the authors show that their approach of catalog, deserialization and positional explode works in big data frameworks, they have not compared it with the other XML parsing approaches described in the related work.

The authors have not explored why Hive or Spark performs better. When they mention that Hive performs better for queries that extract individual values or attributes, what is the reason behind this? They need to explore the open-source code to understand the root cause. This would improve the validity of their findings.

Additional comments

Please compare with the latest versions of the big data frameworks, since they are more up to date and the scan processing time is shorter there.
Please explore the frameworks more deeply to find the reasons for your findings.
Also explore results on a big data cluster.

Cite this review as

Anonymous Reviewer (2021) Peer Review #1 of "Efficient processing of complex XSD using Hive and Spark (v0.1)". PeerJ Computer Science https://doi.org/10.7287/peerj-cs.652v0.1/reviews/1

Aarti Chugh · May 20, 2021

Basic reporting

I appreciate the authors' research contribution. The paper is well organized and contributes novel research work that falls within the Computer Science research domain of the journal.

1. A few sentences need to be reframed or re-examined, as their meaning is not clear. The line numbers are listed below:
46-48
193-194
273-274
280-281
307-308

2. I found some fundamental papers related to the work done. The authors can check these and include them in the related work or wherever suitable:

Dmitry Vasilenko, “An Empirical Study on XML Schema Idiosyncrasies in Big Data Processing”, International Journal on Computer Science and Engineering, October 2015.
Dmitry Vasilenko and Mahesh Kurapati, “Efficient Processing of XML Documents in Hadoop Map Reduce”, IJCSE, 2014, Vol. 6, No. 9, p. 329–333.
Song Kunfang and Hongwei Lu, “Efficient Querying Distributed Big-XML Data using MapReduce”, Int. J. Grid High Perform. Comput. 8, 3 (July 2016), 70–79. DOI: https://doi.org/10.4018/IJGHPC.2016070105

Experimental design

1. At lines 444-445, could you elaborate on, or provide a reference for, why it is necessary to create the raw table first?

2. Please discuss which types of queries were selected to evaluate the proposed algorithm. Also, mention whether the same types of queries are applicable to other application datasets.

Validity of the findings

No Comment

Additional comments

1. The paper is devoted to an important task in Big Data. The authors have presented query processing algorithms for complex XML files. The practical value of the article is good.

2. The proposed algorithms can be further tested on big datasets of different sizes to validate them for use on real big datasets.

3. For further research, the authors can use benchmark datasets and queries.

All the best.

Cite this review as

Chugh A (2021) Peer Review #2 of "Efficient processing of complex XSD using Hive and Spark (v0.1)". PeerJ Computer Science https://doi.org/10.7287/peerj-cs.652v0.1/reviews/2

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
