DeepSeek Paper Proposes ‘Manifold’ Fix to Stabilize Giant AI Models

A geometric tweak aimed at the biggest models

On Dec. 31, 2025, as much of the AI industry focused on chip export rules and GPU shortages, Chinese AI lab DeepSeek posted a research preprint proposing a different kind of constraint: a change to the internal “wiring” of large neural networks.

The 26-page paper does not introduce a new chatbot or a larger model. Instead, it describes Manifold-Constrained Hyper-Connections (mHC), a technique DeepSeek says can stabilize training at tens of billions of parameters and improve reasoning benchmark results at the cost of roughly a 6.7% increase in training time.

Fixing the “identity shortcut” problem at extreme scale

Most transformer-based language models rely on residual connections, a design popularized by the 2015 ResNet work in computer vision. In simplified form, each layer learns an adjustment and adds it back to the input:

y = x + F(x)

That “identity shortcut” helps gradients and information flow through very deep networks.
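
In code, such a block takes only a few lines. The sketch below is a generic PyTorch illustration of the shortcut, not an excerpt from any DeepSeek model:

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Generic residual block: the layer learns an adjustment F(x) and adds it to x."""
        def __init__(self, dim):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, x):
            return x + self.f(x)   # the identity shortcut: y = x + F(x)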

In 2024 and 2025, researchers including groups at ByteDance proposed Hyper-Connections, which use multiple residual streams in parallel and learn matrices that mix those streams between layers. The goal: richer information routing and performance gains.
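
Structurally, the idea can be sketched as follows. This is a loose PyTorch illustration of multi-stream mixing, assuming a single learned mixing matrix per block; the published Hyper-Connections parameterization is more elaborate:

    import torch
    import torch.nn as nn

    class HyperConnectionBlock(nn.Module):
        """Toy multi-stream residual block: n parallel streams mixed by a learned matrix."""
        def __init__(self, dim, n_streams=4):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            # Unconstrained mixing matrix -- this is the part mHC later restricts.
            self.mix = nn.Parameter(torch.eye(n_streams))

        def forward(self, streams):                       # streams: (n_streams, batch, dim)
            mixed = torch.einsum('ij,jbd->ibd', self.mix, streams)
            update = self.f(mixed[0])                     # run the layer on one mixed stream
            return mixed + update                         # broadcast the update onto all streams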

DeepSeek argues that at very large scales, unconstrained Hyper-Connections introduce a new failure mode. Because the mixing matrices are free to become arbitrary linear transforms, small multiplicative errors can compound across many layers.

In the preprint, which lists 19 DeepSeek researchers as authors, the team reports that in large training runs using standard Hyper-Connections, signal strength along some residual paths could grow by more than 3,000× at certain steps, leading to sharp loss spikes and, in some cases, training collapse for models in the tens of billions of parameters.
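
The mechanics of that failure are easy to see in a toy calculation (the numbers below are illustrative, not figures from the paper): even a modest per-layer amplification compounds multiplicatively with depth.

    # Hypothetical per-layer gain introduced by an unconstrained mixing matrix.
    per_layer_gain = 1.1
    depth = 84                      # arbitrary depth, chosen only for illustration
    print(per_layer_gain ** depth)  # roughly 3,000x amplification by the final layer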

Constraining the mixing matrices to a manifold

mHC attempts to preserve the benefits of multi-stream residual paths while restoring identity-like behavior.

DeepSeek’s approach is to constrain each stream-mixing matrix to a manifold of doubly stochastic matrices—nonnegative matrices where each row and column sums to 1. Intuitively, these act like soft permutations or weighted averages, helping bound how much a signal can be amplified.

To enforce this during training, the paper uses the Sinkhorn–Knopp algorithm inside the model’s forward pass, iteratively normalizing rows and columns until the matrix approximately satisfies the doubly stochastic conditions.
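
The algorithm itself is decades old and simple to state. A minimal NumPy sketch of the classical iteration (the paper's version runs inside fused GPU kernels, not plain Python) looks like this:

    import numpy as np

    def sinkhorn_knopp(scores, n_iters=20):
        """Map a square matrix of raw scores to an (approximately) doubly
        stochastic matrix by alternately normalizing rows and columns."""
        m = np.exp(scores)                        # make all entries positive
        for _ in range(n_iters):
            m /= m.sum(axis=1, keepdims=True)     # each row sums to 1
            m /= m.sum(axis=0, keepdims=True)     # each column sums to 1
        return m

    mix = sinkhorn_knopp(np.random.default_rng(0).normal(size=(4, 4)))
    print(mix.sum(axis=1), mix.sum(axis=0))       # both vectors are close to 1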

DeepSeek reports that this projection reduces worst-case residual-path amplification from thousands of times under vanilla Hyper-Connections to about 1.6× with mHC.

Benchmark gains with modest overhead

DeepSeek evaluated mHC on transformer-style models of roughly 3B, 9B, and 27B parameters. At 27B, the paper compares:

  • a standard transformer with plain residual connections,
  • a Hyper-Connections variant,
  • and an mHC version.

According to the results, the mHC model trained stably to at least 50,000 steps, with smoother loss curves and gradient norms than unconstrained Hyper-Connections, which showed spikes and occasional divergence.

Across language and reasoning benchmarks, the constrained approach outperformed both baselines. The paper reports, among other results:

  • BIG-Bench Hard: ~51 (mHC) vs ~43.8 (plain residual) and ~48.9 (Hyper-Connections)
  • DROP: ~53.9 (mHC) vs ~47.0 (plain residual) and ~51.6 (Hyper-Connections)

Smaller gains—typically 1 to 2.3 points—were also reported on GSM8K, MATH, MMLU, and TriviaQA, with the authors claiming benefits grow with model size.

DeepSeek says mHC adds about 6.7% to training time versus a plain residual transformer when using four parallel residual streams.

To keep overhead down, the team describes several systems optimizations, including:

  • fusing Sinkhorn iterations into custom GPU kernels to reduce memory traffic,
  • selectively recomputing intermediate activations during backpropagation to save memory,
  • scheduling mHC operations on separate high-priority GPU streams to overlap with network communication.

The paper argues that these measures “hide” much of the extra math under existing bottlenecks.
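
As a point of reference, the second item on that list, selective recomputation, has a close analogue in standard tooling. The sketch below uses PyTorch's built-in activation checkpointing as a generic stand-in, not DeepSeek's implementation:

    import torch
    from torch.utils.checkpoint import checkpoint

    block = torch.nn.Sequential(
        torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512))

    x = torch.randn(8, 512, requires_grad=True)
    # Intermediates inside `block` are not stored during the forward pass; they
    # are recomputed when backward() needs them, trading compute for memory.
    y = checkpoint(block, x, use_reentrant=False)
    y.sum().backward()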

Strategic context: efficiency in the AI race

The work arrives as DeepSeek plays an outsized role in China’s AI ecosystem. Founded in 2023 in Hangzhou and backed by High-Flyer Capital, DeepSeek drew international attention in early 2025 with DeepSeek-R1, a 671B-parameter reasoning model released with open weights.

The mHC paper is not a product announcement. Instead, it offers a reusable architectural component other labs could adopt—consistent with DeepSeek’s broader strategy of publishing techniques and, historically, releasing weights.

Observers have also framed mHC as evidence that the competition is shifting from raw access to advanced chips toward algorithmic and systems efficiency. With U.S. export controls tightened in 2023 and 2024, Chinese labs have increasingly pursued domestic accelerators such as Huawei’s Ascend and methods that extract more capability per unit of compute.

What remains uncertain

Key questions remain before mHC can be treated as a new standard:

  • The results have not yet been independently replicated at comparable scale.
  • It is unclear how the method behaves in even larger models or in multimodal systems.
  • Some of the reported efficiency depends on sophisticated kernel fusion and training-pipeline engineering that may be difficult for smaller teams to reproduce.

Even so, the paper’s core claim is straightforward: as models scale, the bottleneck is not only chips, but stability of the underlying mathematics. DeepSeek argues that carefully chosen constraints can mean the difference between a large model that collapses mid-training and one that trains smoothly and reasons better—at a cost increase still under 10%.

Tags: #ai, #deepseek, #machinelearning, #transformers, #china