Analemma Preprint: 166 AI Papers Generated by an Automated System, but Few Meet Conference Standards

By ChatGPT — AI-generated · Published:

A new arXiv preprint from startup Analemma offers a stark picture of the trade-off between volume and reliability in automated research. The company says its fully automated system, FARS, generated 166 AI and machine learning papers in a public deployment. But the paper’s own human review process found that most scored below a typical conference acceptance bar, and more than a quarter of reviewed papers carried integrity flags.

The paper, titled “FARS: A Fully Automated Research System Deployed at Scale,” was posted to arXiv on June 30, 2026. It lists Qiong Tang, Xiangkun Hu, Xiangyang Liu, Yiran Chen and Yunfan Shao as authors, with affiliation shown as Analemma. The authors describe FARS as a fully automated, multi-agent research system for AI and machine learning topics that handles ideation, planning, experimentation and manuscript writing, while saving proposals, code, logs, results and drafts in a shared workspace.

According to the preprint, FARS’ first public deployment produced 166 complete papers across 67 fine-grained AI and machine learning topics. Analemma had announced a livestream deployment beginning Feb. 12, 2026, and published the system’s outputs and intermediate artifacts through a public GitLab namespace and hosted paper PDFs. The paper says the run lasted 417 hours, consumed 21.6 billion model tokens and cost about $186,000 in token and GPU-cluster usage, or roughly 2.51 hours, 130 million tokens and $1,120 per paper. It used a 160-NVIDIA-GPU cluster, according to the authors. Sample PDFs hosted by Analemma were explicitly labeled: “WARNING: This paper was generated by an automated research system.”
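The per-paper figures are simple averages of the reported run totals over the 166 papers. A minimal Python sketch of that arithmetic, using only the numbers quoted above, might look like this. The variable and function names are illustrative and do not come from the preprint, and the cost total is itself an approximation ("about $186,000").

```python
# Run-level totals as quoted from the preprint; names are illustrative only.
PAPERS = 166
TOTAL_HOURS = 417
TOTAL_TOKENS = 21.6e9
TOTAL_COST_USD = 186_000  # reported as "about $186,000", so treat as approximate


def per_paper(total: float, papers: int = PAPERS) -> float:
    """Return the simple average of a run-level total per paper."""
    return total / papers


hours_each = per_paper(TOTAL_HOURS)
tokens_each = per_paper(TOTAL_TOKENS)
cost_each = per_paper(TOTAL_COST_USD)

print(f"{hours_each:.2f} hours per paper")                  # 2.51 hours per paper
print(f"{tokens_each / 1e6:.0f} million tokens per paper")  # 130 million tokens per paper
print(f"${cost_each:,.0f} per paper")                       # $1,120 per paper
```

These are averages across the whole run, so they say nothing about how much any individual paper consumed.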

The paper’s strongest evidence is also its most sobering. The authors say they organized 282 valid structured reviews covering 140 of the 166 papers during a review period from March 21 to April 12, 2026. At the paper level, the mean rating was 3.23 and the median was 3.0 on a discrete scale of 0, 2, 4, 6, 8 and 10. Only 16 papers, or 11.4% of the reviewed set, had a mean rating of 6 or higher, which the paper describes as an ICLR accept threshold. ICLR, the International Conference on Learning Representations, is a major AI conference. No individual review score was higher than 6.

The preprint also says integrity problems were common. Its “AI Integrity Audit” flagged issues in 47 of the 282 reviews, or 16.7%, affecting 39 of the 140 reviewed papers, or 27.9%. Examples listed in the paper include fabricated or unverifiable experimental results, hallucinated citations, hallucinated methods or baselines, mathematical or logical errors, and internal inconsistencies. The authors write in the abstract that “The reviews indicate that FARS can produce review-worthy and occasionally strong AI/ML research artifacts in a large-scale public deployment, while also exposing recurring failure modes in narrow experimental scope, methodological limitations, and integrity issues.” They also say only a small subset of FARS outputs was later submitted to arXiv, and only after a “minimal human integrity review” checking citation validity, factual consistency, artifact consistency and disclosure of AI generation.
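The percentages quoted for the review campaign can be reproduced from the raw counts. The sketch below is a hedged illustration rather than anything taken from the preprint: the `share` helper and the variable names are hypothetical, and the counts are the ones reported above.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts as reported in the preprint; names are illustrative only.
REVIEWED_PAPERS = 140
VALID_REVIEWS = 282

accept_share = share(16, REVIEWED_PAPERS)         # papers with mean rating >= 6
flagged_review_share = share(47, VALID_REVIEWS)   # reviews with an integrity flag
flagged_paper_share = share(39, REVIEWED_PAPERS)  # papers with at least one flag

print(accept_share, flagged_review_share, flagged_paper_share)  # 11.4 16.7 27.9
```

Note the two different denominators: the 16.7% figure is a share of reviews, while the 11.4% and 27.9% figures are shares of reviewed papers, not of all 166 papers generated.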

That makes the paper notable less as a claim of breakthrough quality than as a public record of what large-scale automated research currently looks like when the full output is exposed, not just selected successes. Earlier automated research systems were often presented through curated examples or narrow predefined tasks. FARS, as described by its authors, attempted something broader: a public, auditable deployment with preserved intermediate artifacts and a large review campaign. The result, by the paper’s own measures, is mixed. Scale is no longer hypothetical. But quality remains uneven, and integrity problems are frequent enough to stand out alongside the headline number of papers produced.

Tags: #ai, #machinelearning, #automation, #research