All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
The reviewer seems satisfied with the recent changes, and I can therefore recommend this article for acceptance.
[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]
The authors have addressed my concerns.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
The authors have improved the literature review and incorporated important references.
I do think the novelty of this kind of approach is a bit overstated (See "Additional comments"), which I mention here primarily in the context of a literature review: ideally, the literature review would not only cite relevant work but also *motivate* the current work as a function of that work.
The other concern is that I am not sure how the results connect to a broader theoretical question. It seems to me the paper addresses the very high-level positioning (can LLMs be used to simulate psychological data) and the very local hypotheses (do LLMs show a correlation between which items they say are memorable and which items they say relate to each other, etc.). But I am still not entirely convinced about the intermediate *theory* that is being addressed here. The authors have provided additional context about work on garden-path sentence processing, which I appreciate, but which unfortunately doesn't totally address my concern. I am aware that garden-path sentences have been used to great effect to address questions about human sentence processing. The question is what assessing ChatGPT's responses to garden-path sentences addresses theoretically.
I think the local question being assessed is good, and the design matches the hypothesis.
The data are provided; I have not replicated every statistical analysis myself, but they appear to be sound and follow best practices to the best of my knowledge.
As noted by both myself and R1 in the initial round of reviews, there's a pretty large extant field of "machine psychology" already; the authors have incorporated additional citations into their revisions, though I still feel the novelty of this contribution conceptually/methodologically is a bit overstated. E.g., Binz & Schulz (2023) demonstrated that ChatGPT could be used to reproduce a number of classic cognitive psychology studies; moreover, language models have been used to model psycholinguistic phenomena for many years (e.g., Linzen et al., 2016).
I am happy to assess the quality of the work itself (see above), but I really do think there's a mismatch between the tone of the introduction/discussion (which reads to me as though the paper is providing a proof of concept as a novel conceptual/methodological approach) and the work that was done. This could be considered a relatively minor revision (just rewriting) and I've marked it as such; but in another sense I think it is somewhat deep, because if the concept/method itself is not new, it makes it even more important that the empirical contribution addresses some sort of theoretical question—and I'm just still not sure what question is being answered here.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
Summary: This paper investigates the question of whether LLMs can be used to predict human performance in two memory tasks (that the authors call relatedness and memorability) when using garden-path sentences with fitting and non-fitting contexts as materials. Both ChatGPT and human subjects carry out these tasks and their performance is compared, finding some correlation. Based on these results, the authors argue for a new field of 'machine psychology', exploring the correlation between LLMs and humans on cognitive tasks.
Contributions:
1. a new paradigm using garden-path sentences in a fitting or non-fitting context and two 'memory tasks' to study memory formation.
2. results comparing ChatGPT and human performance on these tasks.
Professionalism of the writing: the paper is well written in professional English and follows a professional article structure.
Literature positioning: the number of literature references is adequate if not extensive.
The paper is fairly self-contained.
I am not completely convinced that the paper fits the scope of the journal, as in my mind its primary objective - to demonstrate sufficient correlation between human and LLM performance at cognitive tasks to (eventually) justify the use of LLMs as an alternative to humans - would fall either in the social sciences or in artificial intelligence, not in the Biological Sciences, Environmental Sciences, Medical Sciences, and Health Sciences.
This said, the research questions are sufficiently clear, and the execution of the design is sound. However, I am not completely convinced by the new paradigm, i.e., by the extent to which testing relatedness and memorability in fitting vs. non-fitting contexts provides clear evidence about memorization in general, let alone for comparing human vs. LLM performance.
The analysis of LLM behavior is also narrow, being focused on only one LLM (ChatGPT) and on one type of prompt per task.
I am happy to accept the accuracy of the results. I am not however convinced that the results support the main conclusion of the authors:
"Our research revealed ChatGPT’s ability to make accurate predictions for the performance of human memory despite not possessing it."
Such conclusions seem premature. What the results show is a certain degree of correlation between ChatGPT's performance on these two tasks and human performance. Further work is needed, first to establish what exactly these two 'memory tasks' tell us about human or machine cognition, and then to carry out similar comparisons for other tasks.
This work is interesting. I agree with the authors that 'machine psychology' is a promising area of research and that there is some evidence that LLMs do not just display near-human (or, in some cases, super-human) performance at language tasks, but also, more generally, at cognitive tasks. I am also in favor of finding new tasks to investigate the 'interplay' between machine and human cognition.
However, I do not think that the paper is publishable in a high quality journal in its present state.
First of all, the focus of the study is very narrow. In fact, while the reported study is well designed and properly conducted, and the results are useful, the size of the contribution is more like what I would expect from a workshop paper. There is a massive literature assessing LLM performance at cognitive tasks, and such papers tend to compare LLM performance to human performance on a variety of tasks, such as those in CogBench:
Julian Coda-Forno, Marcel Binz, Jane X Wang, Eric Schulz (2024). CogBench: a large language model walks into a psychology lab. Proceedings of the 41st International Conference on Machine Learning, PMLR 235:9076-9108.
This brings me to my second point: the 'machine psychology' area the authors advocate for already exists, and it is extremely lively, with tens of papers appearing monthly. See, e.g.:
Qian Niu, Junyu Liu, Ziqian Bi, Pohsun Feng, Benji Peng, Keyu Chen, and Ming Li (2024). Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges. https://arxiv.org/html/2409.02387v3
The authors, however, do not seem to be fully aware of this literature, which admittedly may be because much of this work appeared in AI conferences or journals. I would encourage them to invest some time researching what is already out there and to argue why their experimental approach and results are more convincing than the existing work.
Typos and small errors:
Line 46: missing reference in "the lack of transparency in how exactly LLMs function poses a challenge (Schwartz, 2022,?)"
Clarity: The article is relatively clear for the most part; an exception is the argumentative connection between the introduction and the main body of the work. That is, the theoretical or empirical motivation for the current work could have been explained in more detail, particularly with respect to the specific design choices. Why are LLMs (and ChatGPT in particular) a good candidate for this particular phenomenon (i.e., garden-path ambiguity and sentence memorability)?
Background: The background discussion and literature review addressed some of the related literature on “machine psychology”, but did not engage with a large body of work using LLMs within linguistics and psychology. Most related is probably the body of work using LLMs to norm experimental stimuli, which includes calculating relatedness estimates. Older work uses word embeddings (Thompson & Lupyan 2018; Utsumi, 2020), but even more relevant is more recent work using GPT-4 and related models to reproduce psycholinguistic norms (Martínez et al., 2024; Martínez et al., 2025; Rivière et al., 2025; Brysbaert et al., 2025; Trott, 2024a; Trott, 2024b).
Data: The raw data is available in the Zenodo repository.
Self-contained: Yes, the results are self-contained and relevant to the introduction and hypotheses.
Relevant references:
Thompson, B., & Lupyan, G. (2018). Automatic estimation of lexical concreteness in 77 languages. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 40).
Brysbaert, M., Martínez, G., & Reviriego, P. (2025). Moving beyond word frequency based on tally counting: AI-generated familiarity estimates of words and phrases are an interesting additional index of language knowledge. Behavior Research Methods, 57(1), 1-15.
Martínez, G., Molero, J. D., González, S., Conde, J., Brysbaert, M., & Reviriego, P. (2025). Using large language models to estimate features of multi-word expressions: Concreteness, valence, arousal. Behavior Research Methods, 57(1), 1-11.
Martínez, G., Conde, J., Merino-Gómez, E., Bermúdez-Margaretto, B., Hernández, J. A., Reviriego, P., & Brysbaert, M. (2024). Establishing vocabulary tests as a benchmark for evaluating large language models. PloS one, 19(12), e0308259.
Trott, S. (2024a). Can large language models help augment English psycholinguistic datasets? Behavior Research Methods, 56(6), 6082-6100.
Trott, S. (2024b). Large language models and the wisdom of small crowds. Open Mind, 8, 723-738.
Utsumi, A. (2020). Exploring what is encoded in distributional word vectors: A neurobiologically motivated analysis. Cognitive Science, 44(6), e12844.
Rivière, P. D., Beatty-Martínez, A. L., & Trott, S. (2024). Evaluating Contextualized Representations of (Spanish) Ambiguous Words: A New Lexical Resource and Empirical Analysis. arXiv preprint arXiv:2406.14678.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.
Research question: As noted above, I found the motivation and explanation of the research question hard to follow. Of course, any researcher must select a particular phenomenon to focus on; but it was unclear to me why this phenomenon was a particularly good candidate for exploring the use of LLMs in psychology research, or why LLMs would be particularly illuminating for understanding this phenomenon (especially given the differences in “memory” for LLMs vs. humans). Concretely, I ended up uncertain whether the primary goal was shedding light on this particular phenomenon (and if so, how LLMs would be useful for that) or demonstrating the utility of LLMs in mimicking psychological data (and if so, how robust, generalizable, and useful these results actually would be).
Investigation:
LLM: Only one closed-source LLM (GPT-4) was used. Is there a reason the authors did not explore the use of other LLMs? Using only a single model (and a closed-source one at that) raises further challenges for reproducibility and questions about generalizability. Given that the same API can be used for eliciting responses from other OpenAI models, it seems like it would be relatively straightforward to collect responses from at least other OpenAI models, and perhaps also from Claude or even an open-source model using the HuggingFace Endpoints API.
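For illustration, here is a minimal sketch (not the authors' code; the model names, prompt, and decoding settings are my own assumptions) of how the same prompt could be sent to several OpenAI chat models through the same API:

```python
# Sketch: sending one task prompt to several OpenAI chat models for comparison.
# Model names and the prompt below are illustrative placeholders, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-4", "gpt-4o", "gpt-3.5-turbo"]  # hypothetical model selection

def query_model(model: str, prompt: str) -> str:
    """Send a single prompt to the given model and return its text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # reduce sampling variability for comparability
        max_tokens=20,   # the rating tasks only require short answers
    )
    return response.choices[0].message.content

prompt = "On a scale from 1 to 7, how related are the following two items? ..."
ratings = {model: query_model(model, prompt) for model in MODELS}
```

Extending such a loop to Anthropic or open-source models would mainly require swapping the client call.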
Methods: How were multiple responses to the same trial by an LLM treated? Were they averaged? These are in theory not independent responses. Additionally, how many tokens did you allow the LLM to generate?
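If averaging is the intended treatment, a minimal sketch (assuming responses are stored in a long-format table; the file and column names are hypothetical) might look like the following, so that each trial contributes a single data point to any correlation with human data:

```python
# Sketch: collapse repeated LLM responses to one averaged rating per trial,
# since repeated responses to the same item are not independent observations.
# File and column names here are hypothetical.
import pandas as pd

responses = pd.read_csv("llm_responses.csv")   # one row per individual LLM response

per_trial = (
    responses.groupby(["item_id", "condition"], as_index=False)["rating"]
             .mean()
)
```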
Materials: more details on the materials would be useful—what kinds of features were controlled for across conditions? Were these materials sourced from existing studies?
Detail: In general I think the methods are described with enough clarity to reproduce them.
The data appear robust and I believe the internal validity is sound. A major open question is external validity, i.e., whether the results generalize to other LLMs or other psychological phenomena. A second question concerns the extent to which the conclusions mirror the actual research question (see also above).
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.