Researchers report agentic LLMs outperformed expert biologists and produced robot-assembled DNA in lab tests

A newly posted arXiv paper from researchers at SecureBio and Panoplia Labs reports that current agentic large language models can outperform expert human biologists on several biosecurity-relevant tasks. The authors also describe a real-world lab demonstration in which model-written code ran a robot that successfully assembled DNA in three experiments.

The claim stands out because the work goes beyond a standard text-based AI benchmark. According to the manuscript, the models were tested as agents that could write code, use software tools and interact with laboratory automation. That gives the result more weight for biosecurity than a question-answer test would carry, particularly because one task touches the DNA-synthesis screening safeguards used by commercial gene providers.

The paper, “ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity,” was authored by Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman, Harmon Bhasin and Seth Donoughe. It introduces ABC-Bench, which evaluates tool-using LLMs on three tasks the authors describe as biosecurity-relevant: writing code for OpenTrons liquid-handling robots to carry out Gibson Assembly, a common DNA assembly method; designing DNA fragments for in vitro assembly, meaning assembly outside a living organism; and designing fragments that could evade DNA-synthesis screening.

The authors say they collected about 175 hours of expert human baseline data. In the paper, Ph.D. biologists with at least two years of coding experience averaged 24% on the benchmark. The top model tested, Grok 3, scored 53% across tasks, according to the manuscript. The arXiv abstract says, “All tested LLM agents outperformed the median expert human baseliner on all three tasks.” The paper also says model performance was strongest on tasks grounded in published knowledge and documented protocols or APIs, and weaker on a task that required more novel bioinformatics reasoning.

The paper’s most consequential result is its wet-lab validation claim. The authors report three experiments in which an OpenAI model generated OpenTrons Protocol API v2 scripts that were then run on a physical OpenTrons robot. According to the paper, the assembled DNA was transformed into DH5α competent cells, a standard lab strain used for cloning, and the clones were checked with whole-plasmid sequencing by Plasmidsaurus using Oxford Nanopore sequencing and custom analysis. In the OpenReview version of the paper, the authors wrote: “OpenAI’s GPT‑4o‑mini‑high produced code that, when run on an OpenTrons robot, successfully assembled DNA with the expected sequences in three independent experiments.” The manuscript's acknowledgments credit Tufts Launchpad Biolabs for the lab space and OpenTrons system used in the robot work.

Still, that wet-lab result should be read as the authors’ reported finding, not an independently established one. As of June 10, there does not appear to be a publicly reported third-party replication of the robot experiment. That caveat matters because the central claim is not just that a model could answer biology questions or draft plausible code, but that model-generated instructions were used to operate accessible lab automation and produce validated DNA assemblies in the real world.

OpenTrons systems are relatively affordable, scriptable lab robots commonly used in academic and community lab settings, which is part of why the result is notable if it holds up. The benchmark’s screening-evasion task also speaks to a real safeguard: DNA synthesis companies commonly use industry and U.S. government screening guidance to flag risky orders, and the paper tests whether models can work around that kind of screening.

An earlier version of the work appeared on OpenReview for the NeurIPS 2025 BioSafe GenAI workshop. The current manuscript was submitted to arXiv on June 9 as arXiv:2606.11150. The authors say ABC-Bench is already being used by major AI companies, naming Anthropic and OpenAI as examples.

Tags: #ai, #biosecurity, #llm, #syntheticbiology