Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on December 18th, 2024 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on February 20th, 2025.
  • The first revision was submitted on August 11th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • The article was Accepted by the Academic Editor on August 18th, 2025.

Version 0.2 (accepted)

Academic Editor

Accept

Dear Authors,

Thank you for addressing the reviewers' comments. The reviewers think that your paper is sufficiently improved and ready for publication.

Best wishes,

[# PeerJ Staff Note - this decision was reviewed and approved by Shawn Gomez, a PeerJ Section Editor covering this Section #]

·

Basic reporting

This version looks better; thank you for addressing the review comments.

Experimental design

This version looks better; thank you for addressing the review comments.

Validity of the findings

This version looks better; thank you for addressing the review comments.

Additional comments

This version looks better; thank you for addressing the review comments.

Reviewer 2 ·

Basic reporting

Yes

Experimental design

Yes

Validity of the findings

Yes

Additional comments

Past comments have been addressed.

Version 0.1 (original submission)

Academic Editor

Major Revisions

Dear Authors,

Thank you for submitting your manuscript. Feedback from the reviewers is now available. Your article is not recommended for publication in its current form; however, we strongly encourage you to address the issues raised by the reviewers and resubmit your paper after making the necessary changes.

Warm regards,

·

Basic reporting

1. The manuscript is written in professional English, but certain sentences are overly complex. Simplifying them would enhance readability.
2. The references are relevant and sufficient. However, citations for the specific LLMs and their implementations are missing. Suggest adding primary references for Llama-3 and Mixtral.

Experimental design

1. The methodology is clear but lacks details on hyperparameters, computational resources, and training configurations, which are necessary for replication.
2. The research fits well within the scope of the journal and addresses a relevant gap in biodiversity information retrieval.

Validity of the findings

1. The datasets appear robust, but additional analysis (e.g., on data diversity or bias) would add more value.
2. The conclusions align with the findings, but stronger links to the research question and its broader implications would strengthen the study.

Additional comments

This is a great study and I enjoyed reading it. Please see the detailed comments below.


Abstract:
1. The abstract provides a clear summary, but it could be further refined for brevity. Suggestion: rephrase “critical information about how these models are designed, trained, and evaluated is challenging to access and comprehend” for better readability.
2. The abstract does not clearly highlight the novel contribution. Suggest explicitly mentioning how this work advances biodiversity studies or generalizes across fields.

Introduction:
The introduction provides sufficient context, but two points need attention:

1. Explain "Retrieval-Augmented Generation (RAG)" for readers unfamiliar with the term; a minimal illustrative sketch of a RAG-style flow is given after this list.
2. Explain why previous approaches to retrieving methodological details from biodiversity publications were insufficient.
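
To illustrate the first point, a minimal sketch of a RAG-style flow is given below. It is purely illustrative: the TF-IDF retriever and the stub ask_llm helper are stand-ins for whatever chunking, embedding model, and LLM the authors actually used.

```python
# Minimal RAG-style sketch: retrieve the passages most relevant to a question,
# then ask an LLM to answer from those passages only. Purely illustrative; the
# authors' pipeline may use different chunking, embeddings, and models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ask_llm(prompt: str) -> str:
    # Stub: in practice this would call an LLM such as Llama-3 or Mixtral.
    raise NotImplementedError("plug in an LLM client here")


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank text chunks by TF-IDF cosine similarity to the question.
    vectors = TfidfVectorizer().fit_transform(chunks + [question])
    scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
    return [chunks[i] for i in scores.argsort()[::-1][:k]]


def answer(question: str, chunks: list[str]) -> str:
    # Build a grounded prompt from the retrieved context and query the LLM.
    context = "\n\n".join(retrieve(question, chunks))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return ask_llm(prompt)
```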


Experimental Setup:
1. While the datasets are mentioned, further details on data preprocessing and selection criteria are needed. For example, include specifics about how the “364 publications from Ecological Informatics” were selected and curated.
2. The use of accuracy (69.5%) is described, but other evaluation metrics (e.g., precision, recall) should be included to give a fuller picture of model performance; a short illustrative sketch of these metrics follows this list.
3. The methods section briefly mentions the LLMs and their integration, but it should also explain why these specific LLMs (e.g., Llama-3, Mixtral) were chosen over others.
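
On point 2: assuming per-question correct/incorrect judgements from the human evaluation are available, precision, recall, and F1 could be reported alongside accuracy. The labels below are invented purely for illustration and are not taken from the paper.

```python
# Illustrative only: y_true = human judgements, y_pred = pipeline outputs,
# both mapped to binary "information correctly extracted" labels.
# The values are invented for the example, not results from the paper.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```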


Results:
1. The results are primarily text-based. Suggest including visual aids like bar graphs or tables comparing model outputs to human annotators’ evaluations for better clarity.
2. This study does not assess the statistical significance of the performance difference between the model outputs and human annotations; one possible paired test is sketched after this list.
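
One hedged possibility for such a test, assuming paired correct/incorrect judgements per question are available for an LLM and a human annotator, is McNemar's test on the agreement/disagreement counts. The counts below are placeholders, not results from the paper.

```python
# McNemar's test on paired correct/incorrect outcomes (LLM vs. human annotator).
# Rows: LLM correct / LLM incorrect; columns: human correct / human incorrect.
# Counts are placeholders for illustration, not results from the paper.
from statsmodels.stats.contingency_tables import mcnemar

table = [[52, 11],
         [6, 31]]
result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print("statistic:", result.statistic, "p-value:", result.pvalue)
```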


Discussion:
1. The discussion could elaborate more on how this approach can be extended beyond biodiversity publications. For example, could it be applied to health or engineering research?
2. While the conclusion mentions future directions, it does not address the limitations of the study. Suggest discussing potential challenges, such as bias in the LLM training datasets or scalability issues with larger datasets.


Conclusion:
1. What future work could follow from this study? Add a specific proposal, such as integrating non-textual data (e.g., images, graphs) into the information retrieval pipeline.

Reviewer 2 ·

Basic reporting

.

Experimental design

.

Validity of the findings

.

Additional comments

This study explores the use of multiple Large Language Models (LLMs) combined with a Retrieval-Augmented Generation (RAG) approach to automatically extract deep learning methodological details from biodiversity-related scientific publications. It demonstrates the pipeline's ability to enhance information retrieval and reproducibility across studies, achieving a notable accuracy of 69.5% based solely on textual content while addressing gaps in methodological transparency.

Strengths:
1) The use of multiple LLMs provides diverse perspectives and increases robustness in information retrieval.
2) The study effectively addresses reproducibility challenges in deep learning methodologies.
3) Detailed evaluation metrics, including inter-annotator agreement and cosine similarity, enhance the validity of findings.
4) The integration of environmental footprint analysis showcases an awareness of sustainability in computational research.

Weaknesses:

1) The paper does not fully address limitations related to the absence of multimodal data (e.g., code, figures, tables) in the pipeline.
2) Some LLMs (e.g., Mixtral models) exhibit low inter-annotator agreement, which affects overall reliability.
3) The complexity of certain competency questions results in low response rates, leaving significant gaps in retrieved information.
4) The methodology heavily relies on manual filtering and evaluation, limiting scalability and generalizability.

While the study provides valuable insights and a novel approach, critical gaps in multimodal data integration, low consistency in some LLM outputs, and reliance on manual evaluations require substantial revision to ensure methodological rigor and scalability.

Reviewer 3 ·

Basic reporting

This paper reports on the methodology used by the authors to extract information about deep learning methodologies adopted in research papers from unstructured PDF data.

The paper is well-written, but it lacks a clear motivation. Initially, the authors present the problem as a lack of transparency, but the focus then shifts to identifying papers and extracting information on biodiversity research. These two points feel disconnected. The authors need to clearly articulate their task and motivation, ensuring they form a cohesive narrative. Currently, several key goals are mentioned but remain unconnected:
1. Lack of transparency in the literature
2. Information retrieval in biodiversity research
3. Information extraction from unstructured data
4. Accuracy and diversity of information extraction
Clarifying these points and establishing a logical flow will better engage readers and strengthen the argument for their final goal.

The related works section is poorly structured, mirroring the introduction’s lack of coherence. Instead of covering the fundamental literature related to the task, it includes unrelated topics. A well-organized related works section should:

1. Cover existing research relevant to each aspect of the task.
2. Provide a structured discussion leading up to recent information extraction approaches.
Currently, the first parts of this section read more like an extended introduction rather than a review of prior work. Reorganizing this section to ensure relevance and logical progression will improve its clarity and effectiveness.

Experimental design

In this respect, I believe some parts of the experiments need more explanation. First, the concept of filtering mentioned throughout the experiments is unclear. The authors state that they filtered out papers that did not have a DL methodology, but based on the pipeline, the initial goal was already to identify papers containing DL methodologies, with steps intended to ensure this. If additional filtering was still necessary, it suggests some ambiguity in the detection process. This part is not explained properly; the authors should clarify why filtering was required and how it was implemented.

Additionally, I could not understand why the authors calculated the LLM-pair cosine similarity of the answers. How does this contribute to the task at hand? What insights does this similarity provide, and how is it useful in achieving the study’s objectives? This part of the analysis needs further justification, as its relevance to the research is not immediately clear.
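
For context on what the metric captures: pairwise cosine similarity between two LLMs' answers to the same question is typically computed over vector representations of the answer texts, giving a rough measure of how much the models agree. A minimal sketch follows; the TF-IDF vectors and the example answers are illustrative assumptions, not the authors' implementation.

```python
# Cosine similarity between two LLMs' answers to the same questions.
# TF-IDF vectors are used purely for illustration; the example answers are
# invented, and the authors' pipeline may embed the texts differently.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answers_llm_a = ["The model was trained for 50 epochs with the Adam optimizer.",
                 "A ResNet-50 pretrained on ImageNet was fine-tuned."]
answers_llm_b = ["Training ran for 50 epochs using Adam.",
                 "No architecture details are reported."]

vectorizer = TfidfVectorizer().fit(answers_llm_a + answers_llm_b)
pair_sims = [
    cosine_similarity(vectorizer.transform([a]), vectorizer.transform([b]))[0, 0]
    for a, b in zip(answers_llm_a, answers_llm_b)
]
print(pair_sims)  # higher values = the two models' answers overlap more
```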

Validity of the findings

The experiments seem to evaluate different aspects and compare against human judges' data, which is good. However, I would have liked to see some baselines, particularly comparisons with other information extraction approaches. Does this extensive pipeline justify its complexity? Would a simple LLM prompt be sufficient to extract the necessary information? Without such comparisons, it is difficult to determine whether the proposed approach is truly effective.
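
To make the suggested baseline concrete, a hedged sketch of a single-prompt baseline (no retrieval step) is shown below; the call_llm function is a hypothetical placeholder for any LLM client and is not the authors' code.

```python
# A deliberately simple baseline: one prompt over the full paper text,
# with no retrieval step. call_llm is a hypothetical stand-in for any
# LLM client (e.g. a Llama-3 or Mixtral endpoint).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


def extract_baseline(paper_text: str, question: str) -> str:
    prompt = (
        "You are given the full text of a scientific paper.\n\n"
        f"Paper:\n{paper_text}\n\n"
        f"Question: {question}\n"
        "Answer concisely, or reply 'not reported' if the paper does not say."
    )
    return call_llm(prompt)

# Example usage (hypothetical):
# extract_baseline(pdf_text, "Which deep learning architecture was used?")
```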

As of now, I cannot fully verify if the results are satisfactory. While multiple evaluation approaches are presented, they seem limited in scope. A broader evaluation, including baseline comparisons, would provide a clearer picture of the pipeline’s effectiveness and its advantages over simpler alternatives.
