All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
I am completely satisfied with the authors' response.
I propose that you make all necessary changes and additions as soon as possible and resubmit a revised version.
The manuscript is very well written and clear. Even without a background in NLP, it is possible to understand the methods, simulations, and results. The classifier itself is based on plausible assumptions, allows an intuitive interpretation of (most) features, and performs comparably to more computationally intensive competitors. Overall, I think the present manuscript does not have any major flaws. There are still several points that should be improved in a revision:
Detection problem (p.2):
A fourth desirable goal is the applicability of the classifier to text written by people who are not native speakers or who are members of minority groups. This is a question of the ethics of using AI methods in a social context. It would be sufficient to briefly mention this issue somewhere in the main text (perhaps in the discussion or after l.112).
The authors should check whether all abbreviations are defined when first mentioned (e.g., SVM, GPT, BERT, tf-idf, POS-tag, ...).
The headings of the feature sets could mention which subsection in the main text they belong to (e.g., coherence, repetitiveness, ...).
Tables: The labels for rows and columns should clearly distinguish between "Test Data" and "Training Data" as opposed to "Classifier" - since the same type of classifier (neural network) is used for all cases (e.g., Table 3). Moreover, columns and rows should not be switched between tables (e.g., Tables 9 & 10).
It could be highlighted that the present work has another benefit: it shows directions for future improvements for machine-generated language.
### Minor Issues:
l. 34: Instead of "bad actors", it would be more appropriate to refer to "actors with questionable/immoral/unethical intentions" (or similar)
l. 56: methods of educating the public may also be difficult to implement effectively from a psychological perspective, e.g.: Lewandowsky, S., Ecker, U. K. H., Seifert, C. M., Schwarz, N., & Cook, J. (2012). Misinformation and its correction: Continued influence and successful debiasing. *Psychological Science in the Public Interest*, *13*(3), 106–131. https://doi.org/10.1177/1529100612451018
l.143: "not to overfit" -> is this a typo? I think this should be "not to underfit"
Table 2: Are the human data sets merged for all training and test sets or matched to the corresponding data set of the language-generation method?
l.383-386: it should be clarified that for each point of the grid, the neural network is trained with an optimization algorithm (backpropagation?). Currently, this reads as if the models were trained with a grid search - but this appears to be the case only for some of the tuning parameters.
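To make the requested distinction concrete, here is a minimal sketch of the two-level procedure I have in mind: the grid search only enumerates hyperparameter candidates, while a gradient-based optimizer trains a fresh network at each grid point. All names, the toy training stub, and the toy validation score are illustrative, not the authors' code:

```python
# Hypothetical sketch: outer grid search for model selection,
# inner gradient-based training (backpropagation) per grid point.
from itertools import product

def train_with_backprop(lr, hidden_units, data):
    """Stand-in for gradient-based training; returns a trained 'model'."""
    # In the real pipeline this would run backpropagation to convergence.
    return {"lr": lr, "hidden_units": hidden_units}

def validation_auc(model, data):
    """Stand-in validation score; here a toy function of the settings."""
    return 1.0 - abs(model["lr"] - 0.01) - abs(model["hidden_units"] - 64) / 1000

grid = {"lr": [0.001, 0.01, 0.1], "hidden_units": [32, 64, 128]}
best_score, best_model = float("-inf"), None
for lr, hidden in product(grid["lr"], grid["hidden_units"]):
    model = train_with_backprop(lr, hidden, data=None)  # inner optimization
    score = validation_auc(model, data=None)            # outer model selection
    if score > best_score:
        best_score, best_model = score, model

print(best_model)  # the grid point whose trained network scored best
```

The point is that the grid search never optimizes the network weights itself; it only compares networks that were each trained by the inner optimizer.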
Table 6: It was difficult to detect the pattern highlighted in italics. Underlined numbers are easier to spot. Moreover, the column order could be adjusted so that the truncated (s,xl) and full-distribution (s-k,xl-k) datasets are next to each other (this holds for all tables). This would in general facilitate recognizing the qualitative patterns discussed by the authors.
l.466: How was overfitting detected? If the model performs well on the test data, overfitting cannot be too severe, can it?
As the authors state correctly, a major benefit of the classifier is that the features have a direct intuitive interpretation and can be communicated to lay audiences and practitioners using these methods. To further highlight this fact, the authors could add a few concrete examples of sentences in which some of the less common features of the classifier are illustrated (either in the main text or in the tables in the appendix). For instance, the new "conjunction overlap" (l.252) seems to refer to matches of the form: "[x y z] and [x y z]". Similarly, specific examples would facilitate the discussion of named entities and coreference chains. I think this would further strengthen the argument that the feature-based classifier has an intuitive interpretation in contrast to competitors. However, this is merely an optional suggestion and not a mandatory requirement for a revision.
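To illustrate my reading of the "conjunction overlap" feature, a toy sketch of a match of the form "[x y z] and [x y z]", i.e. the same token span repeated on both sides of a conjunction. This is my interpretation, not the authors' implementation:

```python
# Illustrative sketch of "conjunction overlap" as the reviewer reads it:
# the same token span appears immediately before and after a conjunction.
def conjunction_overlap(tokens, conj="and"):
    """Count conjunctions whose left and right neighboring spans repeat."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok != conj:
            continue
        for span in range(1, min(i, len(tokens) - i - 1) + 1):
            if tokens[i - span:i] == tokens[i + 1:i + 1 + span]:
                hits += 1
                break  # count each conjunction at most once
    return hits

sent = "the quick brown fox and the quick brown fox ran".split()
print(conjunction_overlap(sent))  # the 4-token span repeats around "and"
```

An example of this kind, worked out on a real sentence from the data, would make the feature immediately transparent to lay readers.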
I think it is a good idea to use AUC instead of accuracy. The authors could discuss this in l.79-88. Moreover, it might help to discuss the concepts of sensitivity (the probability of detecting machine-generated language) and specificity (one minus the false-positive rate) for binary classifications. The AUC is a measure that takes both criteria into account in a single number.
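For concreteness, a from-scratch sketch of the three quantities on toy scores (1 = machine-generated, 0 = human); all data here are invented for illustration:

```python
# Sensitivity, specificity, and AUC computed from scratch on toy data.
def sensitivity_specificity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)  # specificity = 1 - FPR

def auc(y_true, scores):
    """AUC = probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y      = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
sens, spec = sensitivity_specificity(y, [int(s >= 0.5) for s in scores])
print(sens, spec, auc(y, scores))
```

Unlike sensitivity and specificity, the AUC does not depend on the choice of classification threshold, which is exactly why it summarizes both criteria at once.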
Why are the classifiers combined with the "tf-idf-baseline models"? According to the argument in l.409-413, it seems more appropriate to combine two feature-based classifiers - one trained with a truncated training set (s-k, xl-k) and another one with the full distribution (s, xl). This makes sense as both classifiers outperform the tf-idf models (l.453). Maybe this is what the authors did, but it is currently not clear.
The authors state several hypotheses about the expected direction of differences between machine- and human-generated text. For instance, "We expect a more diverse, human-written text to have a higher share of unique words" (l.248) or "We expect human text to contain more sentiment-related keywords" (l.310). The authors could pick up these hypotheses later in the results section and discuss whether their hypotheses hold in the trained classifiers (this would not require additional tables).
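To illustrate the first of these hypotheses, a toy computation of the share of unique words (types divided by tokens); the two example sentences are invented, not from the manuscript's data:

```python
# Toy illustration of the hypothesized feature "share of unique words".
def unique_word_share(text):
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

repetitive = "the cat sat and the cat sat and the cat sat"
diverse = "a small tabby cat dozed quietly near the warm stove"
print(unique_word_share(repetitive))  # low: many repeated tokens
print(unique_word_share(diverse))     # high: almost all tokens unique
```

Reporting, for each such hypothesis, whether the trained classifier's feature weight points in the expected direction would close the loop between the motivation and the results.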
The authors discuss transferability across training sets of different language-generation methods. At some place (e.g., Discussion or l.89-101), it is important to also discuss whether the training sets are representative of "real" text. Put differently, it is not clear whether the classifiers trained with these data would work on data from Twitter, Facebook, etc. Maybe the authors can specify boundary conditions that need to hold for their classifier to be applicable.
The GPT-2 samples are currently not available (https://storage.googleapis.com/gpt-2/output-dataset/v1/). The authors state correctly that: "These addresses would need to be updated in the code should they ever change."
I consider the text to be a very good entry point into the issue of detecting automatically generated text. The authors reflect on related topics (detection of fake news, authorship attribution, etc.), which gives readers broader context, since the methods involved are shared or similar.
As far as I can judge, I do not see any inconsistencies in the text, the analytical part is given clearly and is based on verifiable methodology and data.
The results are absolutely credible.
From my perspective, the article raises questions for further research about repetition (of word forms and grammatical forms) over small versus large spans of text; the cohesion of the text at these different scopes is probably governed by opposing tendencies. Even if this were not the case, the text proves to be a source of inspiring questions grounded in clearly analyzed data.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.