NVIDIA open-sources Nemotron 3 Super, a 120B Mixture-of-Experts LLM tuned for its GPUs

NVIDIA is putting a new kind of giant language model into the open‑weight race, betting that efficiency and transparency will matter as much as raw size.

The company has released Nemotron 3 Super, a 120‑billion‑parameter model that uses a Mixture‑of‑Experts, or MoE, design and a hybrid architecture combining Mamba state‑space layers with traditional Transformer attention. NVIDIA describes it as a “12B active 120B total parameter Mixture‑of‑Experts hybrid Mamba‑Transformer model” in research materials published March 10 and a technical report dated April 3.

Unlike a conventional "dense" model, which activates all of its parameters for every request, Nemotron 3 Super activates only about 12 billion of its 120 billion parameters for each token it processes. According to NVIDIA’s report, that MoE setup is intended to deliver more accuracy per unit of computation by routing each token through a small subset of specialized “experts” rather than the full network.
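The routing idea can be sketched in a few lines. The snippet below is a toy illustration of generic top-k MoE gating, not NVIDIA's actual implementation; the expert count, top-k value, and random weights are all hypothetical stand-ins.

```python
# Toy sketch of top-k Mixture-of-Experts routing: a gating layer scores
# every expert for a token, but only the top-scoring few are evaluated.
# All weights here are random placeholders, not trained parameters.
import math
import random

random.seed(0)

NUM_EXPERTS = 8   # hypothetical number of experts
TOP_K = 2         # experts actually run per token
DIM = 4           # toy hidden dimension

gate_w = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
expert_w = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
            for _ in range(NUM_EXPERTS)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token_vec):
    # 1. Gate: score all experts, keep only the top-k for this token.
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in gate_w]
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    mix = softmax([scores[i] for i in top])
    # 2. Run only the chosen experts; blend their outputs by gate weight.
    out = [0.0] * DIM
    for weight, idx in zip(mix, top):
        y = [sum(m * x for m, x in zip(row, token_vec)) for row in expert_w[idx]]
        out = [o + weight * v for o, v in zip(out, y)]
    return out, top

vec, chosen = moe_forward([0.5, -0.2, 0.1, 0.9])
print(f"active experts: {chosen} ({TOP_K} of {NUM_EXPERTS})")
```

Only 2 of the 8 expert networks run for this token, which is the source of the "more accuracy per unit of computation" argument: total parameters grow with the expert count, but per-token compute grows only with k.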

The model’s architecture leans on several NVIDIA‑backed research efforts. The company calls Nemotron 3 Super a “hybrid Mamba‑Attention Mixture‑of‑Experts model,” reflecting its use of Mamba, a class of structured state‑space models designed to handle long sequences efficiently, alongside Transformer attention layers. NVIDIA’s earlier Nemotron work explored similar hybrids to improve performance on very long documents and streaming inputs.

Nemotron 3 Super is the first Nemotron‑3 model that NVIDIA says was pretrained in NVFP4, a low‑precision numeric format the company developed for its GPUs. The technical report also says it is the first to adopt LatentMoE, a separate Mixture‑of‑Experts architecture described in a January 2026 paper titled “LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts.” NVIDIA credits LatentMoE with optimizing “accuracy per FLOP and accuracy per parameter.”

The model adds Multi‑Token Prediction layers, a technique that lets it predict several upcoming tokens at once to speed up text generation. NVIDIA says those layers enable “native speculative decoding,” a method of accelerating inference in which the model drafts multiple candidate next tokens and then verifies them.

One of the headline numbers is Nemotron 3 Super’s context window: up to 1 million tokens, according to both the technical report and the Hugging Face model card. A token is a chunk of text, typically a word fragment or a short whole word, so a 1‑million‑token context can cover hundreds or thousands of pages. That allows the model, in principle, to work over very long documents or extended conversations without losing track of earlier content.
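A back-of-envelope calculation shows what that window holds, assuming common rules of thumb (roughly 0.75 English words per token and about 500 words per dense page); actual ratios vary by tokenizer and text.

```python
# Rough estimate of how much text a 1-million-token context window covers.
# The conversion factors are conventional rules of thumb, not exact values.
CONTEXT_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75   # typical for English with subword tokenizers
WORDS_PER_PAGE = 500     # a dense single-spaced page

words = CONTEXT_TOKENS * WORDS_PER_TOKEN
pages = words / WORDS_PER_PAGE
print(f"~{words:,.0f} words, ~{pages:,.0f} pages")  # ~750,000 words, ~1,500 pages
```

Looser assumptions (double-spaced pages, chattier tokenizers) push the figure toward the low thousands of pages, consistent with the "hundreds or thousands" framing above.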

Training was also large‑scale. NVIDIA reports pretraining Nemotron 3 Super on about 25 trillion tokens in total, with 20 trillion used in a broad first phase and another 5 trillion drawn from higher‑quality data. After pretraining, the company says it applied supervised fine‑tuning and reinforcement learning, including reinforcement learning from human feedback (RLHF) and an asynchronous method called GRPO. Those post‑training stages were aimed at so‑called “agentic” behavior, where the model can handle multi‑step tasks and interact across multiple environments.

On accuracy, NVIDIA’s paper says Nemotron 3 Super delivers “comparable accuracy on common benchmarks” to other large open‑weight models such as GPT‑OSS‑120B from OpenAI and Qwen3.5‑122B from Alibaba’s Qwen team. The report does not include independent third‑party benchmarks.

NVIDIA places more emphasis on throughput, or how many tokens per second the model can generate. Under its own test conditions, the company reports that Nemotron 3 Super achieves up to 2.2 times higher inference throughput than GPT‑OSS‑120B and up to 7.5 times higher throughput than Qwen3.5‑122B. Those comparisons were run with an 8,000‑token input and a 64,000‑token output on NVIDIA’s B200 data‑center GPUs, using the vLLM serving stack and TensorRT‑LLM, with each model quantized and configured according to NVIDIA’s choices. All of these are vendor‑reported measurements and rely on the company’s hardware and software.

The hardware dependence is explicit. The Nemotron 3 Super model card on Hugging Face lists a practical minimum of two H100 80‑gigabyte GPUs just to deploy the system. The larger throughput gains NVIDIA cites are tied to its latest B200 chips and proprietary formats like NVFP4, making the top‑line numbers achievable mainly for organizations already invested in the company’s data‑center stack.

Where Nemotron 3 Super stands out among recent large releases is how much NVIDIA is putting in the open. “Nemotron 3 Super datasets, along with the base, post‑trained, and quantized checkpoints, are open‑sourced on HuggingFace,” the technical report states. That includes pretrained base checkpoints, versions that have gone through supervised and reinforcement learning, and quantized variants aimed at more efficient deployment.

NVIDIA is also publishing key data bundles used in training and post‑training, with names such as “Nemotron‑Pretraining‑Specialized‑v1.1” and “Nemotron‑Super‑Post‑Training‑Data.” Some of these datasets, including Nemotron‑Pretraining‑Specialized‑v1.1, are licensed under Creative Commons Attribution 4.0, which allows reuse and modification with credit.

The model itself is released under the NVIDIA Nemotron Open Model License, dated December 2025. The license allows commercial use and the creation of derivative models. NVIDIA says it does not claim ownership of the outputs; instead, it states that users own the content generated by their use of the model. The license also includes a termination clause tied to certain patent or copyright litigation brought against NVIDIA, and it specifies that disputes are governed by the law of the U.S. state of Delaware.

Nemotron 3 Super joins a growing lineup of large, open‑weight systems released since 2024 by companies including OpenAI and Qwen’s developers. While those models have focused on making sophisticated language systems broadly downloadable, NVIDIA is positioning Nemotron 3 Super as part of an “open” family that also exposes training recipes and major data components — and that is tightly tuned for its own hardware.

An accompanying research paper, “Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning,” was posted to the scientific repository arXiv on April 14. In its abstract, NVIDIA’s team sums up the release: “We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model.”

Tags: #nvidia, #ai, #llm, #moe