Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on March 13th, 2024 and was peer-reviewed by 2 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on June 5th, 2024.
  • The first revision was submitted on August 4th, 2024 and was reviewed by 2 reviewers and the Academic Editor.
  • A further revision was submitted on August 29th, 2024 and was reviewed by 2 reviewers and the Academic Editor.
  • The article was Accepted by the Academic Editor on September 17th, 2024.

Version 0.3 (accepted)

· Sep 17, 2024 · Academic Editor

Accept

Based on the reviewer comments, the manuscript can be accepted for publication.

[# PeerJ Staff Note - this decision was reviewed and approved by Xiangjie Kong, a PeerJ Section Editor covering this Section #]

Reviewer 1 ·

Basic reporting

I was a reviewer of the earlier versions of this manuscript. The authors have now fully addressed the points I raised. I believe that this work will make a valuable contribution to the literature.

Experimental design

see above

Validity of the findings

see above

Reviewer 2 ·

Basic reporting

The authors made changes that made the text clearer, and some details were added that highlight the care taken in the method employed in the work. The inclusion of references to disorders in the ICD-10 was very good. The addition of the "Human Labeling" and "Comparison to Human Evaluation" sections reinforces what was previously described. The authors also mention potential future work in the final sections, highlighting opportunities for the scientific community.

Experimental design

No comment

Validity of the findings

No comment

Version 0.2

· Aug 21, 2024 · Academic Editor

Minor Revisions

Based on the reviewer comments, the manuscript must be further revised.

Reviewer 1 ·

Basic reporting

I was a reviewer of an earlier version of this manuscript and thus focused on the changes the authors made in this review cycle. The basic reporting improved from the initial to the current version of the manuscript. It is now written in a comprehensible way. Also, the structure and the terminology used are now much clearer.

Experimental design

As mentioned above, the goal of this work is much clearer now. Nonetheless, at several points in the manuscript, I was left puzzled as to whether I understood the goals. As one example, in lines 65-67, the authors write, "One potential application of such a system is to support users who have been diagnosed with a specific psychiatric disorder or who suspect they have one and are seeking to develop a coping strategy." This is problematic for several reasons. First, the authors neglected to include expert raters (which I suggested in my first review). Thus, how can they make sure the data is correctly labeled? If there are no datasets in Russian, the authors would need to find a way to include human expertise. Second, are laypersons good at recognizing psychiatric symptoms?

In lines 204-206, the authors write something different: "The research gap, thus, can be defined as the lack of cheap fully automated pre-filtering and pre-annotation approaches that could be loosely verified with the existing non-clinical labels and further supplied to professional psychiatrists for expert annotation."

So, my question remains: What is the goal of this paper? Where does human expertise come into play? Is this research meant to be used by patients (i.e., people with a psychiatric diagnosis) or laypeople (i.e., people who suspect having a psychiatric disorder)? Or is this research a basis for further research?

Validity of the findings

See my points above. The authors must be clearer regarding where to position their findings in the research process. Testing their data with real human experts would be extremely helpful and increase the validity of the findings.

Reviewer 2 ·

Basic reporting

The article presents a performance comparison between LLM models on various classification tasks within the mental health domain. The authors utilize data obtained from three different social media platforms whose users are native Russian speakers.

The text is clear and easily understandable. Improvements have been made on the points indicated in the first round of review, such as the inclusion of additional related works in the literature that address the topic or adjacent subjects to the article's title.

Some points in the text caught my attention as I reflected on the article. In the first paragraph of the introduction section, the authors claim that psychology and psychiatry are increasingly using LLMs. Even with all the popularity that LLMs currently have, is there any reference to support this statement, or do the authors have the experience to make such a claim?

Text in line 249 should be moved to line 225 for better reading.

Experimental design

Regarding the Materials and Methods section, what sources did the authors use to list the common mental disorders? DSM, ICD, or another? Given that Table 1 lists a set of mental disorders, why not use a source with predefined disorders that could guide data collection?

What did the authors mean in the sentence: "Choosing among multiple arguments for and against class balancing we prioritize ecological validity and, respectively, natural class proportions." I think it should be rewritten for better clarity.

In the Computer Experiments section, what was the justification for choosing only three classes, given that the dataset contains more disorders? (Line 336)

Validity of the findings

I'm satisfied with the results of this stage.

Additional comments

I believe the text has been greatly improved compared to the previous version, with more clarity and details about the research. I have asked more questions, but I think they are just suggestions for improvement that the authors can address or refute if necessary in future versions.

Version 0.1 (original submission)

· Jun 5, 2024 · Academic Editor

Major Revisions

Based on the review reports, the manuscript must be revised accordingly.

[# PeerJ Staff Note: It is PeerJ policy that additional references suggested during the peer-review process should *only* be included if the authors are in agreement that they are relevant and useful #]

Reviewer 1 ·

Basic reporting

The manuscript is - in principle - written comprehensibly. That said, I also believe that streamlining the argument (see below) and improving the English language would help. The most important points are the manuscript's focus and the missing link to relevant literature. Why is it important to have LLMs for psychology/psychiatry? I read the manuscript multiple times, but I'm still not sure. Several reasons might account for this.

First, while the title is relatively broad ("psychology"), the authors subsequently mention "psychiatric topics," "health care," and so on. That is far too broad and too unspecific. The manuscript would benefit from focusing on a specific problem the authors tackle.

Second, I was surprised that highly relevant papers in the field of psychology and LLMs were not cited, for example, Binz & Schulz (PNAS) or Demszky et al. (Nature Reviews Psychology). Consulting the relevant literature would help the authors streamline their argumentation.

Third, the writing is often difficult to follow. Apart from some typos (e.g., "LMM" instead of "LLM"), some concepts are simply not well introduced. For example, the authors mention the "LangChain" technology in the abstract, yet they never introduce it in detail. Since their work is highly related to this technology, I believe this manuscript would benefit enormously from better integration.

Experimental design

Four LLMs were employed:
1. mDeBERTa-v3-base-xnli-multilingual-nli-2mil7
2. multilingual-MiniLMv2-L6-mnli-xnli
3. distilbert-base-uncased-mnli
4. DeBERTa-v3-base-mnli-fever-anli

Data was collected from three platforms and processed in stages: initial classification, data filtering, and fine-tuning using standard and NLI methods. The performance of LLMs on lemmatized and non-lemmatized texts was compared.
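
(For context, a minimal sketch of this kind of zero-shot NLI classification setup, assuming the listed models refer to their public Hugging Face checkpoints, e.g., MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7, and the transformers zero-shot-classification pipeline; the example post and candidate labels are illustrative and not taken from the authors' data.)

# Hypothetical sketch of zero-shot NLI classification with one of the listed models.
# The checkpoint path, example post, and candidate labels are assumptions, not
# details taken from the manuscript under review.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7",
)

post = "Последнее время мне трудно вставать по утрам и ничего не радует."  # illustrative Russian post
labels = ["depression", "anxiety", "neutral"]  # illustrative candidate classes

result = classifier(post, labels, multi_label=False)
print(result["labels"][0], result["scores"][0])  # top predicted label and its score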

Yet, as the authors mention in their introduction, psychological data (e.g., counseling data) are not publicly and openly available; I therefore wondered why the authors did not follow the example of Fu et al. (2023) and involve ratings by human experts.

Validity of the findings

See my points above; the manuscript would substantially benefit from integrating human expert ratings. This would validate the authors' analyses and allow evaluation of the datasets the authors used as a basis for their research.
Yet, assessing the validity of the findings remains difficult, not only because of the missing human evaluations but also because of the lack of clarity of the manuscript's focus.
Also, linking the present results back to the LangChain technology (e.g., by describing a workflow) would be good.

Additional comments

The authors present a manuscript that would benefit from (1) focusing on a specific topic (including integration of the relevant literature and a description of a workflow, for example, using the LangChain technology) - as it is now, it is too broad and unspecific; (2) integrating human ratings, which allows for a better assessment of data validity; and (3) English language editing.

Reviewer 2 ·

Basic reporting

The paper is well structured, and it is clear the main objective of the paper is the comparison of LLMs in different NLP tasks in the context of mental health diseases. I describe below some aspects that can be directly improved or corrected or that require more details.

About the paper in general:
1. There are some typos in the text, where the authors have written "LMM" instead of "LLM" (abstract, Section 1, and the zero-shot classification section).

About the Related Work section:
1. Even though the prominence of LLMs has gained traditional media attention since 2022 and 2023, the basis of LLMs was discussed before 2023. Works such as "Esfahani et al. Transfer learning for depression: Early detection and severity prediction from social media postings" and "Wang X et al. Depression Risk Prediction for Chinese Microblogs via Deep-Learning Methods" use LLMs, such as BERT or RoBERTa, as a tool for text classification.
2. Also, the authors should search for works that explore the Russian vocabulary domain. Works like the two below already explore this aspect.
1. Romanovskyi O, Pidbutska N, Knysh A. Elomia chatbot: The effectiveness of artificial intelligence in the fight for mental health.
2. Panicheva P, Mararitsa L, Sorokin S, Koltsova O, Rosso P. Predicting subjective well-being in a high-risk sample of Russian mental health app users.
3. The reference [Peters et al., 2023] discussed the use of LLMs with a questionable dataset, since the MyPersonality app has been disabled due to ethical concerns about gathering users' data. How did this work influence the authors?

Experimental design

Regarding the experimental design, it would be really insightful and enriching for the paper if the authors compared the applied experimental method with related works in the literature. This would provide better validation and justification of the work.
Also, I believe the paper lacks an explanation of the scientific methodology used in this research; there is room for the authors to describe and justify the methodology.

Even though the authors clearly describe the objective, and the results are consistent with the proposal at the beginning of the text, it is not clear how the paper identifies the research gaps and contributes to addressing them.

Also, I would like to know whether the authors are confident that the experiments are sufficient to prove their point. Are there open challenges for this effort? What future work could be done to improve this work? Also, what is the justification for the different sizes of posts for each category?

Validity of the findings

No comment

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.