ChatGPT is transforming peer review — how can we use it responsibly?


Since the artificial intelligence (AI) chatbot ChatGPT was released in late 2022, computer scientists have noticed a troubling trend: chatbots are increasingly used to peer review research papers that end up in the proceedings of major conferences.

There are several telltale signs. Reviews penned by AI tools stand out because of their formal tone and verbosity, traits commonly associated with the writing style of large language models (LLMs). For example, words such as ‘commendable’ and ‘meticulous’ are now ten times more common in peer reviews than they were before 2022. AI-generated reviews also tend to be superficial and generalized, often don’t mention specific sections of the submitted paper and lack references.

That’s what my colleagues and I at Stanford University in California found when we examined some 50,000 peer reviews for computer-science articles published in conference proceedings in 2023 and 2024. On the basis of the writing style and the frequency at which certain words occur, we estimate that 7–17% of the sentences in the reviews were written by LLMs (W. Liang et al. Proc. 41st Int. Conf. Mach. Learn. 235, 29575–29620; 2024).
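To make the word-frequency signal concrete, here is a toy sketch of how one might compare how often a marker word appears in two sets of reviews. It is not the estimation method used in the cited study, which is more involved; the function names and the tiny example corpora are hypothetical and serve only to illustrate the idea.

```python
import re
from collections import Counter


def rate_per_million(reviews, word):
    """Occurrences of `word` per million tokens across a list of review texts."""
    tokens = [
        token
        for text in reviews
        for token in re.findall(r"[a-z']+", text.lower())
    ]
    if not tokens:
        return 0.0
    return 1_000_000 * Counter(tokens)[word.lower()] / len(tokens)


def frequency_ratio(reviews_before, reviews_after, word):
    """How many times more common `word` is in the later corpus.

    Returns None when the word never occurs in the earlier corpus,
    because the ratio is undefined in that case.
    """
    base = rate_per_million(reviews_before, word)
    if base == 0:
        return None
    return rate_per_million(reviews_after, word) / base


# Hypothetical miniature corpora, invented for illustration only.
before_2022 = ["commendable " + "word " * 9]
after_2022 = ["commendable commendable " + "word " * 8]

ratio = frequency_ratio(before_2022, after_2022, "commendable")
print(ratio)
```

A ratio well above 1 for words that LLMs favour is only a hint, not proof: it describes a shift across a whole corpus and says nothing reliable about who wrote any individual review.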

Lack of time might be one reason for using LLMs to write peer reviews. We found that the rate of LLM-generated text is higher in reviews that were submitted close to the deadline. This trend will only intensify. Already, editors struggle to secure timely reviews and reviewers are overwhelmed with requests.

Fortunately, AI systems can help to solve the problem that they have created. For that, LLM use must be restricted to specific tasks — to correct language and grammar, answer simple manuscript-related questions and identify relevant information, for instance. However, if used irresponsibly, LLMs risk undermining the integrity of the scientific process. It is therefore crucial and urgent that the scientific community establishes norms about how to use these models responsibly in the academic peer-review process.

First, it is essential to recognize that the current generation of LLMs cannot replace expert human reviewers. Despite their capabilities, LLMs cannot exhibit in-depth scientific reasoning. They also sometimes generate nonsensical responses, known as hallucinations. A common complaint from researchers who were given LLM-written reviews of their manuscripts was that the feedback lacked technical depth, particularly in terms of methodological critique (W. Liang et al. NEJM AI 1, AIoa2400196; 2024). LLMs can also easily overlook mistakes in a research paper.

Given those caveats, thoughtful design and guard rails are required when deploying LLMs. For reviewers, an AI chatbot assistant could provide feedback on how to make vague suggestions more actionable for authors before the peer review is submitted. It could also highlight sections of the paper, potentially missed by the reviewer, that already address questions raised in the review.

To assist editors, LLMs can retrieve and summarize related papers to help editors contextualize the work, and can check adherence to submission checklists (for instance, to ensure that statistics are properly reported). These are relatively low-risk LLM applications that could save reviewers and editors time if implemented well.

LLMs might, however, make mistakes even when performing low-risk information-retrieval and summarization tasks. Therefore, LLM outputs should be viewed as a starting point, not as the final answer. Users should still cross-check the LLM’s work.

Journals and conferences might be tempted to use AI algorithms to detect LLM use in peer reviews and papers, but the efficacy of such detectors is limited. Although they can highlight obvious instances of AI-generated text, they are prone to producing false positives: for example, they can wrongly flag text written by scientists whose first language is not English as AI-generated. Users can also avoid detection by strategically prompting the LLM. And detectors often struggle to distinguish reasonable uses of an LLM (to polish raw text, for instance) from inappropriate ones, such as using a chatbot to write the entire report.

Ultimately, the best way to prevent AI from dominating peer review might be to foster more human interactions during the process. Platforms such as OpenReview encourage reviewers and authors to have anonymized interactions, resolving questions through several rounds of discussion. OpenReview is now being used by several major computer-science conferences and journals.

The tidal wave of LLM use in academic writing and peer review cannot be stopped. To navigate this transformation, journals and conference venues should establish clear guidelines and put in place systems to enforce them. At the very least, journals should ask reviewers to transparently disclose whether and how they use LLMs during the review process. We also need innovative, interactive peer-review platforms adapted to the age of AI that can automatically constrain the use of LLMs to a limited set of tasks. In parallel, we need much more research on how AI can responsibly assist with certain peer-review tasks. Establishing community norms and resources will help to ensure that LLMs benefit reviewers, editors and authors without compromising the integrity of the scientific process.

