
AI's Peer Review: GPT-4 Matches Human Experts in Scientific Feedback



The article "Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis," published on July 17, 2024, in NEJM AI (volume 1, issue 8), examines whether large language models (LLMs), specifically GPT-4, can provide useful feedback on research papers. The researchers built an automated pipeline that uses GPT-4 to generate structured feedback on scientific papers, then evaluated its effectiveness in two large-scale studies.
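The pipeline prompts the model to organize its feedback into a fixed rubric of review sections. A minimal sketch of how such a structured-feedback prompt might be assembled (the section names, function name, and truncation strategy here are illustrative assumptions, not the authors' exact prompt):

```python
# Illustrative sketch of a structured-feedback prompt builder.
# The rubric below approximates the kind of fixed sections described
# in the study; it is not the authors' verbatim prompt.

FEEDBACK_SECTIONS = [
    "Significance and novelty",
    "Potential reasons for acceptance",
    "Potential reasons for rejection",
    "Suggestions for improvement",
]

def build_feedback_prompt(paper_text: str, max_chars: int = 6000) -> str:
    """Assemble a single prompt asking an LLM for structured review feedback."""
    truncated = paper_text[:max_chars]  # crude guard for the model's context window
    rubric = "\n".join(f"{i}. {name}:" for i, name in enumerate(FEEDBACK_SECTIONS, 1))
    return (
        "Your task is to draft peer-review feedback for the scientific paper below.\n"
        "Organize your feedback into the following numbered sections:\n"
        f"{rubric}\n\n"
        "Paper text:\n"
        f"{truncated}"
    )
```

The returned string would then be sent as the user message of a chat-completion request; the study used GPT-4, but any sufficiently capable model could be slotted in.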


Key Findings:

  1. Retrospective Analysis:

    • Compared GPT-4's feedback with human peer reviewer feedback on 3,096 papers from Nature family journals and 1,709 papers from the ICLR conference.

    • The overlap between GPT-4 and human reviewer feedback (30.85% for Nature journals, 39.23% for ICLR) was comparable to the overlap between two human reviewers (28.58% for Nature journals, 35.25% for ICLR).

    • GPT-4's feedback showed higher overlap with human reviewers for weaker papers (e.g., rejected ICLR papers).

  2. Prospective User Study:

    • Surveyed 308 researchers from 110 US institutions in AI and computational biology.

    • 57.4% of users found GPT-4-generated feedback helpful or very helpful.

    • 82.4% found it more beneficial than feedback from at least some human reviewers.

  3. Characteristics of GPT-4 Feedback:

    • GPT-4 was more likely to identify issues raised by multiple human reviewers.

    • The model tended to focus on certain aspects of feedback more than humans (e.g., research implications).

    • GPT-4 generated non-generic, paper-specific feedback.
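The overlap percentages reported above can be read as a hit rate: the fraction of comments from one source that are matched by some comment from the other. A toy sketch of such a metric, using naive shared-word matching as a crude stand-in for the semantic comment matching the study performed (the matching criterion and threshold here are illustrative assumptions):

```python
def comment_overlap(comments_a, comments_b, min_shared_words=3):
    """Fraction of comments in A that have a rough match in B.

    Two comments "match" here if they share at least `min_shared_words`
    content words -- a deliberately crude proxy for the semantic
    matching used in the study.
    """
    def words(text):
        # Keep words longer than 3 characters, lowercased, punctuation stripped.
        return {w.lower().strip(".,") for w in text.split() if len(w) > 3}

    matched = 0
    for a in comments_a:
        if any(len(words(a) & words(b)) >= min_shared_words for b in comments_b):
            matched += 1
    return matched / len(comments_a) if comments_a else 0.0

# Hypothetical example comments, for illustration only.
human = [
    "The evaluation lacks a baseline comparison with prior methods.",
    "Statistical significance of the main results is not reported.",
]
llm = [
    "The paper should compare against baseline methods from prior work.",
    "Figure 3 would benefit from clearer axis labels.",
]

overlap = comment_overlap(llm, human)  # fraction of LLM comments echoed by humans
```

With these toy inputs, the first LLM comment matches the first human comment (shared words: "baseline", "prior", "methods") while the second does not, giving an overlap of 0.5, analogous to the 30-40% figures reported in the study.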

Limitations:

  • GPT-4 sometimes struggled to provide in-depth critique of method design.

  • The model's emphasis was uneven across aspects of scientific feedback, overrepresenting some (e.g., research implications) relative to human reviewers.


Implications: The study suggests that while human expert review should remain the foundation of the scientific process, LLM-generated feedback could benefit researchers, especially when timely expert feedback is unavailable or during early stages of manuscript preparation. The findings indicate that LLM and human feedback can complement each other, potentially enhancing the overall quality of scientific review and feedback.


