Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on April 30th, 2025 and was peer-reviewed by 2 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on July 5th, 2025.
  • The first revision was submitted on September 9th, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • A further revision was submitted on October 22nd, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • The article was Accepted by the Academic Editor on December 3rd, 2025.

Version 0.3 (accepted)

Academic Editor

Accept

Congratulations. I am happy to recommend accepting this paper. Minor edits to the English are required.

[# PeerJ Staff Note - this decision was reviewed and approved by Mehmet Cunkas, a PeerJ Section Editor covering this Section #]

**PeerJ Staff Note:** Although the Academic and Section Editors are happy to accept your article as being scientifically sound, a final check of the manuscript shows that it would benefit from further English editing. Therefore, please identify necessary edits and address these while in proof stage.

Reviewer 1

Basic reporting

The paper is now clear and well structured. The authors rebalanced sections, refined terminology and improved formatting as recommended.
I have only one minor remark: the authors need not list all authors when there are many, and should follow standard citation rules aligned with the PeerJ guidelines. For example, if APA 7th edition is required, list up to 20 authors; if there are 21 or more, include the first 19, then an ellipsis (…), and add the final author.
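For illustration, the APA 7th-edition author rule can be sketched as a small function (a sketch for clarity, not code from the manuscript):

```python
def apa_author_list(authors):
    """Format an author list per APA 7th edition.

    Up to 20 authors are listed in full; for 21 or more,
    the first 19 are followed by an ellipsis and the final author.
    """
    if len(authors) <= 20:
        return ", ".join(authors)
    return ", ".join(authors[:19]) + ", … " + authors[-1]
```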

Experimental design

The authors have clarified the methodology. I believe that the experiments are now well justified and the evaluation is more robust.

Validity of the findings

This version moderates its claims, adds structured analyses, and presents balanced conclusions.


Version 0.2

Academic Editor

Major Revisions

Thanks for the revised submission. Kindly address all the reviewer comments. Please note that I would prefer to avoid multiple further review rounds should the reviewer continue to recommend major revisions.

Reviewer 1

Basic reporting

The overall structure of the paper is clear, but it is overloaded with technical detail in the Methods section and relatively light in the Results and Discussion, which creates an imbalance. The reader is guided step by step through the construction of the system, but the evaluation findings and their implications are condensed and partly repetitive.

The manuscript claims this is the first successful application of GraphRAG in GIS, yet it cites contemporary GIS LLMs such as BB-GeoGPT and K2, and GeoGraphRAG also exists. To avoid overclaiming, it is better to reframe this as “among the first” or “to our knowledge, one of the earliest.”

There is also a formatting issue: the manuscript text is not aligned consistently and appears cluttered. I recommend that the authors use the “Justify” option in Word and adjust spacing as in the template.

Experimental design

I noticed that the abstract and methods still describe GraphRAG as if the model were trained or fine-tuned, when in reality the LLM was only queried with retrieval. This is misleading. I would suggest replacing “fine-tuning/training” with “integration” or “retrieval-enhanced generation”.

The paper says that three models were evaluated (plain Llama, RAG, G-Pro Bot), but later states that the first one was excluded from automated evaluation. This inconsistency makes it unclear which comparisons were actually made. Please clarify exactly which models were used at each evaluation stage.

You excluded ESRI and Google GIS assistants as baselines, arguing they are “task automation” tools. But in the Introduction, you compare G-Pro Bot directly to them. This is contradictory. I would suggest either including them in some way or softening the claims in the introduction.

The knowledge base is only 12 documents (about 948k words). This is small, but you call GIS highly interdisciplinary and complex. That feels inconsistent. I suggest either expanding the number/size or acknowledging that this version is a proof-of-concept with limited coverage.

You note that G-Pro Bot produces long answers, but you do not test any fixes. I think it would be useful to try prompt engineering and tuning generation parameters (e.g., max tokens, temperature) and report on the improvements.
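For example, a small, purely illustrative sweep of prompts and generation settings could look like the sketch below; the parameter names and values are assumptions for illustration, not the authors' actual configuration:

```python
# Hypothetical sketch of a verbosity-reduction experiment:
# constrain answer length in the prompt and sweep generation settings.
def build_prompt(question, max_sentences=3):
    """Build a length-constrained prompt (illustrative only)."""
    return (
        f"Answer the GIS question below in at most {max_sentences} "
        "sentences, without restating the question.\n\n"
        f"Question: {question}"
    )

# Candidate generation parameters to compare (assumed values).
param_grid = [
    {"temperature": t, "max_tokens": m}
    for t in (0.2, 0.7)
    for m in (256, 512)
]
```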

Validity of the findings

The human evaluation mentions errors like verbosity and two problem cases, but this is very shallow. A structured breakdown of error types (hallucinations, omissions, irrelevant details, etc.) with counts is more meaningful for the analysis.
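Such a breakdown could be as simple as tallying annotated error labels per answer; the labels and counts below are hypothetical examples, not data from the paper:

```python
from collections import Counter

# Hypothetical annotations: (question id, error label) pairs that
# evaluators might assign when reviewing model answers.
labelled_errors = [
    ("Q5", "misinterpretation"),
    ("Q6", "omission"),
    ("Q2", "verbosity"),
    ("Q5", "verbosity"),
]

# Structured breakdown: counts per error type.
error_counts = Counter(label for _, label in labelled_errors)
```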

The expanded human evaluation, dataset and added bias controls are clear improvements. For future work, I encourage using a larger question set and involving evaluators from multiple institutions to strengthen generalizability.

You state that SBERT and METEOR correlate well with human evaluation, but then admit they fail to capture “directness.” This contradiction is not reconciled. For which dimensions are these metrics useful, and for which are they inadequate?
The methodology section says all 22 questions were evaluated, but the results discussion often refers only to specific examples. This makes the reporting feel incomplete.

The paper frames the system as lightweight and resource-accessible, yet also acknowledges long response times and high computational costs of GraphRAG. This feels contradictory if the trade-off is not well explained.

The conclusion highlights G-Pro Bot’s superiority, but the results show clear weaknesses (verbosity, misinterpretation, missing formulas). The conclusion feels too positive compared to the evidence. I suggest balancing it by emphasizing both strengths and unresolved limitations equally.


Version 0.1 (original submission)

Academic Editor

Major Revisions

Thanks for the resubmission. Kindly address all reviewers' comments. Some concerns are raised about the language and the readability of the article. Kindly address each comment and resubmit the revision.

One more point of improvement: the article has many practical implications (including applications across multiple sectors). It would be better to present the contribution of the paper, both theoretical and practical, in a separate section. Further, elaborate on the Discussion section and make comparisons with the existing literature.

**Language Note:** The Academic Editor has also identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff

Reviewer 1

Basic reporting

The manuscript generally uses clear and professional English, though minor grammatical and typographical errors should be corrected. The introduction provides adequate context and motivation, and the literature cited is current and relevant. The structure aligns with PeerJ standards. However, some terminology should be explained, and some remarks about the methodology and evaluation should be made.

Experimental design

The article aligns with the journal’s aims and article type, and the overall approach is technically relevant; however, there are notable concerns regarding methodological clarity and replicability. Several key details, such as knowledge graph construction parameters, indexing strategy, retrieval settings, and prompt formatting, are either missing or insufficiently described. Although the code and dataset are shared, the lack of transparency in these areas limits full reproducibility. Evaluation methods and metrics are described, but their suitability for the GIS domain could be better justified. Citations are generally appropriate and relevant. These are the detailed remarks:

• “Training” is repeatedly used (for example in Lines 76, 111, 121), but the method involves retrieval-augmented generation with a pre-trained model (Llama3.1). There is no evidence that the base model was trained from scratch or fine-tuned: GraphRAG uses retrieval-augmented generation, not parameter updates of the base LLM. The model is queried, not trained; only the knowledge base, chunking, and retrieval pipeline are customized.
• “Removing sensitive information” is mentioned but not defined. What types of sensitive information were removed, and why and how?
• The description of KG construction is not complete. I suggest adding algorithmic detail on how entities/relations were extracted (rule-based, LLM prompting, spaCy, etc.). In addition, some details are needed, such as thresholds for similarity, or how missing instances were defined or resolved.
• In the evaluation section, the design and validity of the tests raise several concerns. The standard answers used for automated evaluation were reportedly derived from the source documents, but the authors do not explain who summarized them or whether any review process ensured their quality. The human evaluation process is described as involving 166 valid respondents, but the paper omits essential details about the questionnaire interface, whether respondents were blinded to model identity, and how inter-rater disagreements were handled.
• The choice of baseline models is somewhat limited. The paper compares G-Pro Bot to a vanilla Llama3.1 8B and a LangChain-RAG model. That is good. But it does not include any commercial GIS-specific AI tools, such as those from ESRI or Google, which were discussed in the introduction, or at least LLMs with a higher number of parameters (70B, for example). This would show how much the GraphRAG approach improves results compared with simply using a larger model. In any case, including such tools, or at least acknowledging their exclusion, would help situate the contribution more clearly within the broader landscape of GIS chatbot development.
• The study reports quantitative metrics (SBERT, METEOR, and vote counts), but there is no qualitative error analysis of generated responses. I recommend having a structured breakdown of error types (e.g., hallucination, omission, irrelevant detail, lexical redundancy) to provide a deeper understanding of model behavior and failure modes.
• The choice of Sentence-BERT and METEOR as automated metrics is suitable for general NLP tasks, but their reliability in domain-specific or technical contexts like GIS has not been established. I recommend validating whether these metrics correlate with human judgments in this specific domain.
• The code and corpus are mentioned to be shared on Zenodo, but I recommend including detailed configuration parameters, hyperparameters, exact prompts, and examples of entity/edge extraction rules. This will make the paper self-contained.
• The paper should include some examples of prompt templates or query formatting, as the prompt design is important to RAG system performance.
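As a purely illustrative example of what such a prompt template might look like (this is an assumption for discussion, not the paper's actual prompt):

```python
# Hypothetical RAG prompt template; the placeholders and wording are
# illustrative, not taken from the manuscript.
RAG_TEMPLATE = """You are a GIS assistant. Use only the context below.

Context:
{context}

Question: {question}
Answer:"""

def render(context, question):
    """Fill the template with retrieved context and the user question."""
    return RAG_TEMPLATE.format(context=context, question=question)
```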

Validity of the findings

The paper does not explicitly assess impact or novelty but presents a potentially valuable application of GraphRAG in the GIS domain. While the conclusion is well-aligned with the presented results and acknowledges key limitations, such as response latency and verbosity, it would benefit from a clearer discussion of unresolved challenges and more concrete future directions. The experiments are generally well-structured, but limitations in question diversity, evaluation depth, and methodological transparency weaken their strength. Further refinement and replication would enhance the study's reliability. The major remarks regarding this are:

• The choice of Sentence-BERT and METEOR as automated metrics is suitable for general NLP tasks, but their reliability in domain-specific or technical contexts like GIS has not been established. I recommend validating whether these metrics correlate with human judgments in this specific domain.
• The manual evaluation only includes 7 of the 22 questions, which limits the generalizability of the results. This is a major shortcoming of the paper. Without expanding the number of test examples, the paper's claims remain insufficiently supported, and in my opinion it is difficult to justify the robustness of the results or the suitability of the work for publication.
• The attention check design is described but not shown in full. I suggest including it.
• Only limited discussion is given to failure cases (Q5, Q6). You can take this opportunity to discuss model weaknesses.
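Validating the automated metrics could be as simple as computing a rank correlation between metric scores and mean human ratings per question. A minimal, self-contained Spearman sketch (plain Python, no external dependencies):

```python
def ranks(xs):
    """Return ranks (1-based), averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Feeding in per-question metric scores and mean human ratings would quantify, dimension by dimension, where SBERT and METEOR track human judgment and where they do not.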

Additional comments

Minor remarks:
• "Llama3.1 8B" is referred to variously as: Llama3.1 8B, llama3.1 8B, and Llama3-8b. These should be standardized throughout.
• METEOR, SBERT, RAG, and KG are introduced correctly, but should be briefly re-explained where reused, especially for readers skimming sections.
• The quality of the formatting, spacing, and alignment should be improved.
• PDF contains broken pagination in multiple sections (e.g., table titles on one page, tables on the next).
• Line 322: “training llama3.1 8B using Graph RAG” is again misleading. GraphRAG does not train the base model; it conditions generation on KG-based retrieval.


Reviewer 2

Basic reporting

The subject is interesting, but it is not an easy paper to read. The English is fine, but the nascency of the field suggests that a fuller description of the process and purpose of building the G-Pro bot would be beneficial. The background does not provide adequate justification for why this work was important.

Experimental design

Strengths: the use of both human and automated evaluation of the models is a good choice.

Concerns:
1. The main methodological issue is that, as described, the study is not replicable. The training process is not fully defined, and more details on data processing would be helpful. How were hyperparameters selected, for example? How many documents (in words, pages) are in the GIS knowledge base? For a RAG to make sense, there has to be a very large document base; this justifies vectorizing the documents ahead of the search. How old were these documents?

2. A second issue is the lack of rigor in human evaluation. More detail is required in data processing, as well as who the human evaluators are and why, and how they were chosen. How was bias minimized? Were they making blind evaluations?

3. I am not sure that comparing the cosine similarities is really appropriate for judging the quality of responses. It's a reasonable starting point, but the authors should show some examples so the reader can qualitatively judge this. This is especially important since, as the authors admit, the difference is small.

4. The model the authors use is small and would not compare well with a RAG built on the Gemini or OpenAI APIs.
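For reference, the cosine-similarity comparison in point 3 reduces to the computation below; when scores are this close, small numeric gaps say little about answer quality, which is why qualitative examples matter (a minimal sketch, not the authors' code):

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)
```

Two answers whose embeddings score, say, 0.95 and 0.96 against the reference are numerically near-indistinguishable, yet may differ substantially in factual quality.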

Validity of the findings

The findings seem reasonable, but it is hard to assess given the lack of detail in the methods. On the face of it, the findings do seem valid. The natural question is whether a small model like an 8B parameter Llama is the right benchmark model. There are other local models that are bigger and probably better, not to mention the big players that provide API access for using RAGs.


All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.