To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
Both reviewers and I are happy with the revisions made and recommend that the paper be accepted as it is.
The revised article conforms with PeerJ standards. The authors' revisions clarify the presentation, tie the figures better to the narrative, and improve the style overall.
The revisions clarify the experimental design and technical approach. The revised version is fully satisfactory.
The revised manuscript presents the findings and conclusions clearly.
It is clear from the revised manuscript that the authors have made a concerted effort to respond to the reviewers' comments on the earlier version of the manuscript.
Thanks to the authors for their efforts in addressing the comments from the three reviewers. My major concerns about this article have now been resolved. I feel that the paper can be published in PeerJ.
The article presents an interesting research problem and is well motivated and presented. The technique proposed for training role-specific language models is novel. Please note the recommendations of the three reviewers and make sure you address their questions and suggestions when preparing the final version of the paper.
The abstract and introduction of the article are quite well written. From the abstract it is clear what problem the authors addressed and what results they obtained. From the introduction it is clear how their work relates to broader research and what approach they took. Many manuscripts that I have reviewed fail to meet this standard.
I do have one request though. The authors mention that they propose three quantification models, and then propose to fuse the results of these models. It would be helpful to briefly summarize why the authors chose this approach. Do these models have characteristics that lead one to suggest a priori that this is the best approach to take? Have others employed them for similar tasks? The authors explain this a bit later, but only in the context of a detailed description of the methods. Also, the correspondence between the narrative and the diagram in Figure 1 could be made clearer, since the diagram says “N-gram language model” and the narrative says “Maximum likelihood model training with human-generated transcripts”. I gather these refer to the same thing.
Overall I felt that the body of the manuscript clearly describes the methods used. I found the technique of training role-specific language models to be clever and novel.
I would like to see some clarification of the following points:
- How does the diarization algorithm perform in the case where speakers overlap or interrupt each other? The authors note that this is an infrequent occurrence.
- Was it necessary to train an acoustic model specially for this purpose? How much performance improvement does this offer over off-the-shelf acoustic models?
The experimental design is clear, and evaluates the contributions of each component of the method.
The results appear to be valid; however, I did not examine the modeling algorithms in detail.
In the final manuscript, make sure that all acronyms (e.g., MFCCs, OOV, SRILM) are explained or are obvious in context.
p. 3: “clinical trails” -> “clinical trials”
p. 5: “experiment results” -> “experimental results”
p. 9: “highly biased result” -> “highly biased results”
p. 11: “while drops” -> “while it drops”
p. 13: “pathes” -> “paths”
p. 14: “a SP” -> “an SP”
The paper is readable but the English has to be improved. In particular, the first-person style (“We......”) should be removed. There are also frequent grammatical errors, too many to list; the whole style in which the paper is written should be revised.
From the engineering point of view, the paper does not introduce any novel technologies. Existing methods were put together for a specific application. However, this particular application (automatic rating) could be of interest to many readers.
Please improve the English.
Generally speaking, this article is well written. Most parts of the article reach professional English writing standards.
To further improve the clarity of the article, I feel that the authors need to enrich most of the captions of both figures and tables. For example, Figure 2 is quite complicated and its existing caption is too brief to guide readers to a full understanding of the figure.
For the most part, the experimental design presented in this article meets the research community's standards. However, I feel that the following issues need the authors' attention.
- Around line 60, the authors mention that they already “quantified prosodic features of the therapist and patient, and …”. I wonder why the authors limited their experiments to analyzing ASR outputs rather than also considering prosodic cues (possibly in a very simple form), given their previous research findings.
- Regarding Section 2.4 (speaker role matching), I feel that many potentially useful cues could be used beyond the current LM-only approach. For example, the therapist may always initiate the dialog and may tend to speak for less time than the patient over the entire dialog.
- Regarding Section 3, the motivation for using both the n-gram LM and the MaxEnt model was not clearly introduced. Readers need to know why these methods were considered useful.
- Regarding Table 4, at which levels were the VAD and diarization evaluated? This is not very clear from reading the paper.
- Around line 415, the session-level ASR WER varies widely (from about 20% to 90%). Since the article focuses on predicting empathy from lexical cues, ASR accuracy strongly affects prediction accuracy. In this sense, I wonder whether the authors should focus their study on the sessions with a good enough ASR WER. It is hard to convince readers that ASR output with a WER of about 0.4 or 0.5 can still be processed by the proposed empathy prediction method.
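To make the WER figures in the last point concrete, the following is a minimal sketch of how word error rate is conventionally computed (word-level edit distance divided by the number of reference words) and how sessions could be restricted to those below a chosen WER. The function names, the toy sessions, and the 0.3 threshold are illustrative assumptions and are not taken from the manuscript or its evaluation tooling.

```python
from typing import Dict, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def sessions_below_wer(
    sessions: Dict[str, Tuple[str, str]], max_wer: float
) -> List[str]:
    """Return IDs of sessions whose (reference, hypothesis) WER is <= max_wer."""
    return [
        session_id
        for session_id, (ref, hyp) in sessions.items()
        if word_error_rate(ref, hyp) <= max_wer
    ]


# Hypothetical toy data: session ID -> (human transcript, ASR output).
toy_sessions = {
    "s1": ("how are you feeling today", "how are you feeling today"),
    "s2": ("tell me more about that", "tell more about cat"),
}
kept = sessions_below_wer(toy_sessions, max_wer=0.3)  # assumed threshold
```

Because insertions are counted, WER can exceed 1.0, so a session-level value near 90% means that almost as many word errors as reference words were made.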
Overall, based on the research questions posed and the experiments conducted, the findings presented in this article are valid.
One issue I spotted is in Table 9. For RP, the Acc in the ORA-T case was worse than in the ORA-D case (79.0% < 86.8%). This seems counterintuitive. Why would the result from transcriptions (where the WER is close to 0.0) be worse than the result from ASR output with an average WER of about 0.4?
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.