All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
I am happy with the way the authors have addressed the comments from the reviewers.
Please thoroughly address the comments of the reviewers. A more rigorous statistical validation as suggested by reviewer 2 would make the paper much stronger. If this is not viable, explain why.
This article describes a computational model that enables computers to learn subjective information about emotions by playing a game called EMO20Q (emotion twenty questions).
The research work is well presented and motivated. The methodology is suitable. The English is very good and easy to read. A few English/style-related suggestions are indicated below:
- Line 143. Revise grammar: “a logical device to used to model”
- Line 215. The advice given to the reader about using “his or her episodic buffer” is not appropriate in scientific writing.
- Line 399. “there were unique 99 emotion words” -> “there were 99 unique emotion words”
The experiments have been carefully designed and have been reported in a clear and convincing way. I understand that carrying out experiments to evaluate this kind of technique is costly. However, to highlight the importance of the contribution, it would have been desirable (although not mandatory) to compare the proposed method to some baseline method.
The experimental results are well described. Both the discussion and conclusion sections are suitable. It would be useful to discuss some additional issues. For instance, have the authors analyzed whether the approach could be generalized to identify other kinds of words that may not be emotions? It is clear that for emotion-related words it is easier to keep a controlled vocabulary and standardized questions. But how could this be adapted to other domains? On the other hand, if the domain is limited to emotions, why not take advantage of other mechanisms, such as ad hoc ontologies describing emotions?
There are a few aspects that require further clarification:
1. Why have the authors decided to use a Bayesian learning model instead of other kinds of models? For this kind of problem, a decision tree may also be appropriate. A more detailed justification for their choice should be presented.
2. It is not clear how the feature engineering process is carried out. Examples of features should be included.
3. It is not clear how questions are derived. Are the standardized questions the only ones that are used or does the system learn additional questions through the human-human EMO20Q games?
4. How do the questions relate to the features?
5. Please, give further details on the process that is applied to learn new vocabulary.
6. By new vocabulary do you only mean new words describing emotions or do you also mean other terms related to emotions that might be used as part of the questions?
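To make point 1 concrete, the core of a Bayesian approach to twenty questions is a sequential posterior update over candidate emotion words given yes/no answers. The following is a minimal illustrative sketch of that idea; the vocabulary, questions, and likelihood values are invented for demonstration and are not taken from the paper under review.

```python
# Sequential Bayesian updating over candidate emotion words in a
# twenty-questions setting. All words, questions, and probabilities
# below are hypothetical, chosen only to illustrate the mechanism.

# Toy likelihood table: P(answer == "yes" | word, question).
likelihood = {
    ("happiness", "is it positive?"): 0.95,
    ("anger",     "is it positive?"): 0.05,
    ("fear",      "is it positive?"): 0.10,
    ("happiness", "is it high arousal?"): 0.60,
    ("anger",     "is it high arousal?"): 0.90,
    ("fear",      "is it high arousal?"): 0.85,
}

words = ["happiness", "anger", "fear"]
posterior = {w: 1.0 / len(words) for w in words}  # uniform prior

def update(posterior, question, answer):
    """One Bayes step: multiply each word's probability by
    P(answer | word, question), then renormalize."""
    new = {}
    for w, p in posterior.items():
        p_yes = likelihood[(w, question)]
        new[w] = p * (p_yes if answer == "yes" else 1.0 - p_yes)
    z = sum(new.values())
    return {w: p / z for w, p in new.items()}

posterior = update(posterior, "is it positive?", "no")
posterior = update(posterior, "is it high arousal?", "yes")
best = max(posterior, key=posterior.get)
print(best)  # "anger" has the highest posterior after these two answers
```

A decision tree, by contrast, would commit to a fixed question order; the Bayesian formulation keeps a full distribution over candidates at every turn, which is one possible reason to prefer it, and a point the authors could address explicitly.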
An interesting study of 20Q-games about emotion words.
The algorithm is very simple - perhaps a little too simple to be truly novel in any way - but I nonetheless found the paper an interesting read and, in general, a good engineering contribution on how to go about solving the proposed challenge.
The philosophical connections to the ancient Greeks are a matter of taste - I find this part a little overly lengthy, but judge it acceptable.
Clearly, the psychological studies could have been done in a better manner - while the authors emphasize that they wanted to keep every option open, I think it would have been a better idea to restrict the emotion words to, say, 100 or so.
In consequence, the results are hard to interpret. Because of the differing groups, statistical tests are hard to conduct, and thus the results remain qualitative.
The qualitative results are hard to interpret - did the system really make progress (is there a statistically significant difference between the results in Tables 5 and 6)?
I think one could at least compare the system's performance within the 100-subject experiment, for example by comparing performance with the first 50 subjects to that with the second 50 subjects (e.g. with respect to average turns in the easy and medium categories). A similar comparison could be made between the preliminary study and the Mechanical Turk study.
I have uploaded the annotated PDF - there were quite a few spelling and simple English mistakes. Please correct these. I also recommend doing a final spell-check before submitting in the future - these mistakes suggest that the manuscript has not received sufficient care in its preparation. Also, verify the equation and explain your choices in the algorithm (as annotated in the PDF).
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.