RAND paper warns AI-biology tests need clearer design, scoring and reporting for policy use

By ChatGPT — AI-generated · Published:

A newly posted paper tied to RAND argues that one of the biggest policy problems in AI biosecurity is not simply whether AI agents can perform biology-related tasks. It is whether the tests used to measure those abilities are designed, run and documented well enough for governments and institutions to draw meaningful conclusions.

The paper, “Measuring Biological Capabilities and Risks of AI Agents,” was posted to arXiv on June 18 as arXiv:2606.19899v1. The arXiv entry lists the authors as Patricia Paskov, Jeffrey Lee, Kyle Brady and Alyssa Worland, and gives the report number PEA4710-1, which points to a connection to RAND’s Perspective publication series. The paper is framed as a methods-and-policy contribution, not as a claim of a dramatic new biological capability.

In the abstract, the authors frame the work around a fast-moving governance question: “This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks.”

Its central argument is that so-called agentic evaluations in biology can be highly sensitive to design decisions that are often left implicit or poorly documented. According to the abstract, “Our central contribution is a set of practical, experience-grounded considerations -- drawing from our own evaluations -- that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk.”

That point matters because policymakers, funders and AI companies are increasingly looking to evaluation results when judging how risky advanced AI systems may be in scientific settings. The paper explicitly says its audience includes policymakers, funders, biosecurity practitioners, frontier AI labs, AI providers, scientific institutions and third-party evaluation organizations.

In other words, the warning is about interpretation. A test result showing that an AI system completed a biology-related task may mean very different things depending on how the task was framed, what tools the system was allowed to use, how performance was scored and how thoroughly the setup was reported. Without that context, benchmark results can look more definitive than they really are.
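One way to picture the reporting problem is as a record that travels with each result. The sketch below is only an illustration and is not taken from the paper: the `EvalReport` class, its field names and the `missing_context` helper are all hypothetical, chosen to mirror the four factors named above (task framing, tool access, scoring and setup documentation).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalReport:
    """Hypothetical record for one agentic evaluation run.

    Field names are illustrative assumptions, not a schema from the paper.
    None means "not documented".
    """
    task_framing: Optional[str] = None         # how the task was posed to the agent
    tools_allowed: Optional[List[str]] = None  # an empty list means "no tools", which is still documented
    scoring_method: Optional[str] = None       # e.g. expert rubric, automated check
    setup_documentation: Optional[str] = None  # pointer to a fuller description of the setup
    task_completed: Optional[bool] = None      # the headline result


# Context a reader would need before interpreting task_completed.
CONTEXT_FIELDS = ("task_framing", "tools_allowed", "scoring_method", "setup_documentation")


def missing_context(report: EvalReport) -> List[str]:
    """Return the names of context fields that were left undocumented."""
    return [name for name in CONTEXT_FIELDS if getattr(report, name) is None]


if __name__ == "__main__":
    bare_result = EvalReport(task_completed=True)
    print(missing_context(bare_result))
```

In this toy version, a result reported as “task completed” with nothing else attached would be flagged as missing all four pieces of context, which is the situation the paper warns can make results look more conclusive than they are.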

The paper builds on a line of RAND work that has already explored how frontier AI systems interact with real-world biological workflows. In February, RAND published “Bridging the Digital to Physical Divide: Evaluating LLM Agents on Benchtop DNA Acquisition.” RAND’s summary said the study evaluated eight frontier large language model agents on their ability to design DNA segments, interact with a benchtop DNA synthesizer and generate laboratory protocols.

RAND summarized that report this way: “Performance varied among the models, but all tested LLMs designed biologically coherent DNA segments in some attempts.” The organization said that study was conducted by its Center on AI, Security, and Technology within RAND Global and Emerging Risks, and that RAND research reports undergo peer review.

The new paper lands amid a broader policy debate over how AI could change the life sciences and the biosecurity measures surrounding them. The National Academies’ 2025 report, “The Age of AI in the Life Sciences: Benefits and Biosecurity Considerations,” concluded that AI is increasing capabilities in the life sciences and recommended monitoring and policy responses. In 2024, the White House Office of Science and Technology Policy also issued a Framework for Nucleic Acid Synthesis Screening, reflecting growing attention to how orders for synthetic nucleic acids are screened and to how AI could change the biosecurity threat landscape.

The RAND-linked arXiv paper does not present a single headline-grabbing breakthrough. Its contribution is more basic, and potentially more important for policy: a reminder that if governments and institutions are going to rely on AI-biology evaluations, they need clearer standards for what those tests actually measure and what their results do — and do not — say about risk.

Tags: #ai, #biosecurity, #rand, #benchmarks