All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Dear Author,
Your revised paper has been accepted for publication in PeerJ Computer Science. Thank you for your fine contribution.
[# PeerJ Staff Note - this decision was reviewed and approved by Mehmet Cunkas, a PeerJ Section Editor covering this Section #]
The authors have thoroughly addressed all of my comments.
All my comments are addressed.
The paper is ready for publication.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
I commend the authors for thoroughly addressing my concerns. I have no further comments, except for one minor one regarding citations.
The manuscript cites Kim, Muhn, and Nikolaev (2024) multiple times. However, the authors of that preprint have withdrawn it from public circulation. According to their statement at https://arxiv.org/abs/2407.17866:
"A co-author identified inconsistencies in the data and analyses while attempting to replicate past analyses from the working paper. Accordingly, we have temporarily withdrawn the working paper from circulation while we review the research findings."
Given this withdrawal, I recommend finding an alternative citation to support the relevant points in the manuscript.
Reference: Kim, A. G., Muhn, M., & Nikolaev, V. V. (2024). Financial Statement Analysis with Large Language Models. arXiv preprint arXiv:2407.17866. [Withdrawn]
The authors have addressed all my concerns.
The revision is improved.
Show some best and worst results and analyze these cases in detail.
The authors should conduct more experiments and compare the results with other research on the same topic.
Limitations and future direction of research should be expanded.
More figures should be used to visualize the method and the results.
**Language Note:** When you prepare your next revision, please either (i) have a colleague who is proficient in English and familiar with the subject matter review your manuscript, or (ii) contact a professional editing service to review your manuscript. PeerJ can provide language editing services - you can contact us at [email protected] for pricing (be sure to provide your manuscript number and title). – PeerJ Staff
• Writing quality – The prose is clear and professional.
• Structure & referencing – The introduction offers solid background and a concise summary. The manuscript follows PeerJ’s section order, with current, relevant citations and Zenodo links for code and data.
• Figures & tables – Figure 1 is fully legible, and all tables are well laid out and self-contained.
• Research question – The study asks an original, well-motivated question: Can large language models extract return-predictive sentiment from Japanese 10-K MD&A sections?
• Dataset & scope – The sample covers 11,135 firm-years (2014–2023) for March fiscal-year-end firms; the scope is appropriate and transparently justified. Clarify why the series starts in 2014 and explain the sharp drop in observations in 2023 versus 2022.
• Reproducibility – Code, data, and model checkpoints are provided.
• Evaluation choices – Long–short portfolios and factor alphas are suitable. A brief note on why value weighting, not equal weighting, was chosen would complete the picture.
• Economic magnitude – Annualized alphas of -6% to -10% are economically large; comparing them with well-known anomalies (e.g., size or momentum) would contextualize their importance.
• Robustness – Add (i) equal-weight portfolios and (ii) a simple transaction-cost adjustment to gauge implementability.
1. Sign interpretation – Explain briefly why a higher tone score predicts lower future returns (scoring scale vs. behavioral story).
2. Standard-error specification – State lag length and any clustering.
3. Hyperparameters – Report full API settings for each LLM.
4. Tone coding – Specify how each word is scored (e.g., +1 = positive, 0 = neutral, -1 = negative).
5. Replace “GPT-4-mini” with “GPT-4o-mini.”
6. In Table 2, explain why the means of Tone Ratio and Tone Score are negative, whereas those from the language models are all positive.
This study set out to examine whether advanced large language models (LLMs) can extract return-predictive information from Japanese 10-K reports. Below are some comments.
In sentiment analysis, sentiment is not always represented with a categorical model (e.g., positive vs. negative). Dimensional models have been proposed that represent sentiment as scores along multiple dimensions (e.g., valence and arousal). The authors should discuss dimensional sentiment analysis (e.g., the multi-dimensional relation model, refined word embeddings, the Chinese EmoBank).
This paper uses three LLMs (ChatGPT, Claude, and Gemini) to extract sentiment from Japanese 10-K reports for predicting future stock returns. The authors conducted experiments comparing these models with traditional dictionary-based methods and a DeBERTaV2-based model to evaluate the information extraction.
The paper does not explain in detail why these three LLMs (ChatGPT, Claude, and Gemini) were chosen for sentiment extraction.
The prompt is not presented in detail.
The two baseline methods, traditional dictionary-based approaches and the DeBERTaV2-based model, are not well explained.
The evaluation of the extracted sentiment is not well presented.
The results are not analyzed in detail.
The discussion of why GPT-4 and Claude achieved better results is insufficient.
The results are not compared with any related research.
The paper's technical contribution is weak; it lacks novelty, and the experiments are not convincing.
The authors should add more recent studies to the references.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.