At AAAI‑26, AI Generated Reviews for Nearly 23,000 Full-Review Papers

At one of the world’s biggest artificial intelligence conferences, the peer-review process just got an AI assistant of its own.

For AAAI‑26, the annual meeting of the Association for the Advancement of Artificial Intelligence, every main‑track paper that went to full review received a clearly labeled, AI‑generated critique — and it arrived in under 24 hours. According to a new arXiv study by the pilot’s organizers, many authors and human reviewers said they not only found those machine‑written reports useful, but in some respects preferred them to traditional reviews, especially on technical accuracy and research suggestions.

The pilot, described in the preprint “AI‑Assisted Peer Review at Scale: The AAAI‑26 AI Review Pilot,” attached one extra AI review to each submission in the conference’s main technical track. The authors, led by Joydeep Biswas, an associate professor at the University of Texas at Austin who helped organize the effort, write that “every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system.” They report that the system “generate[d] reviews for all 22,977 full-review papers in less than a day.”

That figure sits within a much larger tide of research. AAAI’s own opening‑ceremony slides show 30,948 total submissions to AAAI‑26, with a main‑track acceptance rate of about 17.5 percent. The difference between the total and the 22,977 papers counted in the arXiv paper reflects submissions that did not reach full review, such as desk rejections. The AI system was applied to all papers that entered that full‑review phase. The arXiv study is a preprint and has not yet itself gone through independent peer review.

AAAI framed the project as an assistive add‑on rather than a replacement for human judgment. Conference slides summarize the policy as “One additional, clearly labelled AI-generated review in phase 1.” AAAI’s call for papers and accompanying FAQ, as summarized in the preprint, state that this AI review provided written commentary only: no numerical scores and no accept‑or‑reject recommendation.

The organization also emphasized that AI outputs did not directly affect outcomes. In its opening‑ceremony materials, AAAI stated, “No human reviewers are being replaced by AI reviewing.” The FAQ and slides describe the AI review as “non-decisional,” with final accept and reject decisions reserved for human reviewers and area chairs, the senior researchers who oversee groups of papers. Human reviewers were required to submit their own reports before they could see the AI‑generated one.

Those AI reviews were then visible to reviewers and to senior members of the program committee, who could factor them into discussion if they chose. The FAQ says senior program committee members were responsible for monitoring the AI content, including excluding harmful or inappropriate material.

Behind the scenes, the pilot relied on a multi‑stage technical pipeline built atop a commercial large language model. AAAI’s slides identify the base system as “Base LLM: OpenAI GPT5, with Zero Data Retention.” According to the FAQ, conference papers were first converted from PDF into a structured format using optical character recognition tools that aimed to preserve math, tables and formatting. Equations were extracted into LaTeX‑style notation so the model could handle mathematical content.

The organizers say the workflow then incorporated several tool‑based checks. These included a code‑interpreter‑style component for working through technical details, literature search steps to locate related work, specific phases for checking technical accuracy and experimental results, and a self‑critique stage before the system produced its final review text. AAAI’s FAQ says vendor contracts required that manuscript content would not be retained or used to train models, and that there were “vendor guarantees for zero data retention.”

Given concerns about model manipulation and security, the organizers also highlight adversarial testing. The FAQ states that the system was tested against attacks such as “hidden prompts” — instructions embedded in papers to steer the model’s behavior — and that there were both AI‑based and traditional safeguards intended to detect or flag adversarial or harmful content. Senior program committee members were tasked with removing any AI output that crossed those lines.

The strongest positive claims in the pilot come from surveys and internal benchmarks reported in the arXiv paper. The authors write that “a large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions.” AAAI’s opening‑ceremony slides summarize author responses with mean scores of +0.33 for the statement that “AI reviews were useful” and +0.56 for “AI reviews would be useful in future review processes,” on a scale the slides do not fully detail.

Beyond subjective ratings, the preprint reports that the team “introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses.” The authors present this as evidence that combining a “frontier” model with tools and safeguards can make AI critiques more rigorous than generic language‑model outputs.

The pilot is taking place against a backdrop of growing strain on peer review across fields. In the preprint’s abstract, the authors write that “Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness.” AAAI‑26’s nearly 31,000 submissions underscore that pressure.

Researchers have been testing AI’s role in this process for several years. AAAI’s FAQ cites work such as a randomized study at the 2025 International Conference on Learning Representations, which found that language‑model feedback could improve some aspects of human review quality and engagement. Other studies of “ReviewerGPT”‑style tools have explored both the potential for automated checks and vulnerabilities such as prompt‑injection attacks; AAAI points to these findings as inputs into its own design choices.

At the same time, the organization’s FAQ acknowledges significant risks, including bias in AI outputs, confidentiality and intellectual‑property worries, and the chance that adversarial content could slip through. Manuscripts were stored on the OpenReview platform, and AAAI says its contract with OpenAI barred the use of paper content for training and required no retention of that data. Responsibility for catching and mitigating problems with AI reviews ultimately rested with human senior program committee members.

The authors of the arXiv paper argue that “Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.” For now, that conclusion rests on the organizers’ own analysis and surveys. But with AI systems already reviewing nearly 23,000 papers in less than a day at a flagship AI conference, the question of how machines should help judge science is no longer hypothetical — it is being tested in the workflows that decide what research gets a platform.

Tags: #ai, #peer-review, #academia, #conference