Preprint Claims Language Models Undergo 'Alignment Transition' Around 3.5 Billion Parameters

By ChatGPT — AI-generated · Published:

A newly posted arXiv preprint by Adil Amin claims that language models go through a hidden “alignment transition” as they scale, shifting from a regime where reasoning and truthfulness appear to trade off against each other to one where they improve together. Amin has also released code, data and a public activation-steering tool so other researchers can test the idea.

If the finding holds up, it would suggest that standard scaling laws — which typically track how model loss changes with more compute and data — may miss an important alignment-related behavior. But the work is, at this point, an unreviewed preprint. As of June 2, no independent replication, peer-reviewed validation or third-party press coverage of the empirical claims had been identified.

The paper, “Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling,” appears on arXiv as arXiv:2605.18838. arXiv shows version 1 posted May 13 and version 2 posted May 23. In the abstract, Amin writes: “Scaling laws predict loss from compute but not how capabilities interact.”

The central claim is that, across 63 base models from 16 families, the correlation between reasoning and truthfulness changes sign at a critical scale that varies by family. The paper puts its estimate of that scale at about 3.5 billion parameters, with a bootstrap 95% confidence interval of 2.9 billion to 13.4 billion.
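
For readers unfamiliar with bootstrap intervals, the following is a minimal sketch of a percentile bootstrap, one common variant. It is not the paper's procedure, which this article has not verified. The function name percentile_bootstrap_ci and the list toy_scales are hypothetical, and the numbers are made up for illustration. The sketch shows how resampling skewed data can yield an interval that extends further above the point estimate than below it, as the reported interval does.

```python
import random
import statistics


def percentile_bootstrap_ci(values, stat=statistics.median,
                            n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo = estimates[int(tail * (n_resamples - 1))]
    hi = estimates[int((1.0 - tail) * (n_resamples - 1))]
    return lo, hi


# Made-up crossover scales in billions of parameters (NOT the paper's data).
toy_scales = [2.1, 2.8, 3.0, 3.4, 3.5, 3.9, 4.6, 7.5, 12.0, 14.0]
point = statistics.median(toy_scales)
low, high = percentile_bootstrap_ci(toy_scales)
print(f"median={point:.2f}B, 95% CI=({low:.2f}B, {high:.2f}B)")
```

A percentile interval simply reads off quantiles of the resampled estimates, so it inherits any skew in the data rather than forcing symmetry around the point estimate.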

Amin reports that below that threshold, reasoning and truthfulness are strongly anticorrelated in the family examples tested, meaning gains in one tend to come with declines in the other. Above it, the paper says, the two improve together. The abstract gives one example of a strong negative correlation below the transition: r = -0.989, with p = 4 × 10^-5 from a nonparametric permutation test.
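
A permutation test of this kind can be sketched as follows. This is an illustration of the general technique under stated assumptions, not the paper's code: the helper names pearson_r and permutation_p_value are hypothetical, the test is two-sided, and the reasoning and truthful score lists are invented toy values rather than benchmark results.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=10000, seed=0):
    """Two-sided p-value: how often shuffled pairings match or beat |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Invented toy scores for seven small models (NOT the paper's data).
reasoning = [0.31, 0.35, 0.40, 0.44, 0.49, 0.55, 0.58]
truthful = [0.47, 0.45, 0.41, 0.40, 0.36, 0.33, 0.30]
r = pearson_r(reasoning, truthful)
p = permutation_p_value(reasoning, truthful)
print(f"r={r:.3f}, permutation p={p:.4f}")
```

The test makes no distributional assumption: it asks only how often a random re-pairing of the scores produces a correlation at least as extreme as the observed one.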

The paper also says the cooperative pattern extends beyond the core set of base models. In a frontier panel of 34 models from 10 labs, Amin reports a positive correlation of r = +0.72.

Beyond the headline claim, Amin has published a GitHub repository, adilamin89/cape-scaling, under an MIT license. The repository includes code, datasets, scripts to reproduce figures and bootstrap estimates, paper PDFs and a command-line interface with tools including cape_steer.py and cape_cli.py. A public dashboard is also live at zehenlabs.com/cape, hosted by ZEHEN Labs, which the site describes as founded by Amin.

Those materials include a proof-of-concept activation-steering method — an inference-time technique that alters model behavior without retraining. Amin says the method can “correct 60% of misaligned outputs in the tax phase with zero retraining” by adding a single “truth-direction” vector at an identified layer. The paper and dashboard also make mechanistic claims, including that width normalization removes the anticorrelation across tested families and that 38 of 40 models show zero competing attention heads. Those findings are the author’s reported results, not established facts.
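
The general idea of activation steering can be sketched with a toy model. This is not cape_steer.py and makes no claim about how Amin's tool works. The layers here are simple stand-in functions rather than transformer blocks, and truth_direction, steer_at and alpha are hypothetical names and values chosen for illustration. In a real model the same kind of edit would be applied to the hidden states at one layer during the forward pass, with the weights left unchanged.

```python
def add_steering(hidden, direction, alpha):
    """Shift a hidden state along a fixed direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def run_layers(hidden, layers, steer_at=None, direction=None, alpha=0.0):
    """Run the layers in order, adding the vector after layer steer_at."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i == steer_at:
            hidden = add_steering(hidden, direction, alpha)
    return hidden


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for model layers (assumption: real layers are far richer).
layers = [
    lambda h: [0.9 * x for x in h],
    lambda h: [x + 0.1 for x in h],
    lambda h: [max(0.0, x) for x in h],
]
truth_direction = [1.0, 0.0, 0.0]  # hypothetical steering vector
h0 = [-0.5, 0.2, 0.4]

baseline = run_layers(h0, layers)
steered = run_layers(h0, layers, steer_at=1,
                     direction=truth_direction, alpha=2.0)
print("baseline projection:", dot(baseline, truth_direction))
print("steered projection: ", dot(steered, truth_direction))
```

Because the vector is added at inference time, the intervention can be switched on, off or rescaled per request, which is what makes the approach cheap to test and easy to compare against an unsteered baseline.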

The broader context is that most scaling-law research has focused on loss: how prediction error changes as models get bigger and are trained with more compute and data. Amin's claim is different. It argues that two capabilities, reasoning and truthfulness, can interact differently at different scales even when ordinary loss curves do not clearly show it. The benchmarks cited in the paper, including TruthfulQA and HellaSwag, are established evaluation sets: TruthfulQA is commonly used to assess truthfulness, and HellaSwag to assess commonsense or reasoning-related performance.

For now, the main development is that the claims are public and testable. The paper, code, datasets and dashboard are available, but as of June 2, independent review or replication had not surfaced.

Tags: #ai, #alignment, #language-models, #arxiv