NVIDIA releases Nemotron 3 Ultra: open-weight Mixture-of-Experts LLM with 1 million-token context

By ChatGPT (AI-generated)

NVIDIA has publicly released Nemotron 3 Ultra, a new open-weight large language model that the company says combines very large scale with a 1 million-token context window. The release includes model weights, quantized checkpoints, training recipes and some training data subsets, making it one of the more expansive open-model drops of the year.

Two things stand out. First, Nemotron 3 Ultra puts NVIDIA, best known as the dominant supplier of AI chips, directly into the competition for top-tier open models aimed at coding, reasoning and agentic AI systems. Second, NVIDIA’s headline performance claims, especially around speed, are company-reported and have not yet been independently verified.

NVIDIA published a research page for Nemotron 3 Ultra on June 4, linking to a technical report, Hugging Face model pages and GitHub recipes. The technical report is dated June 9. An arXiv entry for “Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning,” listed as arXiv:2606.15007, was submitted June 12. On Hugging Face, NVIDIA posted artifacts under nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, where the model card lists a June 4 release date.

NVIDIA describes Nemotron 3 Ultra as a “550 billion total and 55 billion active parameter” Mixture-of-Experts model. That design matters because Mixture-of-Experts systems are sparse: They contain a very large total number of parameters, but only a smaller subset is activated for each token. In practice, that can reduce inference cost compared with a dense model of similar total size, which is why NVIDIA emphasizes the 550 billion total versus 55 billion active framing.
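The total-versus-active distinction can be made concrete with a little arithmetic. The sketch below uses only the two parameter counts NVIDIA quotes. The 2-FLOPs-per-active-parameter figure is a common rule of thumb for a forward pass, not a number from NVIDIA's report, and the function names are illustrative. It also shows the flip side of sparsity: every expert still has to be stored, so the memory needed for weights follows the total count, not the active one.

```python
# Parameter counts as quoted by NVIDIA; everything else here is an assumption.
TOTAL_PARAMS = 550e9   # all experts combined
ACTIVE_PARAMS = 55e9   # parameters used for any single token


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that participate in one token."""
    return active / total


def approx_forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: about 2 FLOPs per active parameter.

    This is a rule-of-thumb estimate that ignores attention and
    state-space-layer costs, which depend on sequence length.
    """
    return 2 * active_params


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"Active share per token: {frac:.0%}")
    print(f"Approx. forward FLOPs per token: {approx_forward_flops_per_token(ACTIVE_PARAMS):.2e}")
    # BF16 stores each parameter in 2 bytes.
    print(f"BF16 weight storage: {weight_memory_gb(TOTAL_PARAMS, 2):.0f} GB")
```

By this estimate, per-token compute resembles that of a dense model roughly one-tenth the size, while the hardware still has to hold the full set of weights.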

The company also says the model supports up to 1 million tokens of context, allowing it to handle much longer prompts and reference material than typical large language models. NVIDIA says it pretrained Nemotron 3 Ultra on 20 trillion text tokens. Post-training, according to the company, included supervised fine-tuning, reinforcement learning and multi-teacher on-policy distillation.

NVIDIA is releasing the base model, a post-trained model and quantized checkpoints, along with training recipes and training data subsets for which it says it controls redistribution rights. That is a meaningful level of disclosure, though it does not amount to a release of all training data. The Hugging Face model card says the release uses the OpenMDW-1.1 license, a Linux Foundation-backed license for model distributions that could shape how companies and researchers reuse the model.

On performance, NVIDIA says Nemotron 3 Ultra achieves “up to ~6× higher inference throughput” than state-of-the-art publicly available large language models while maintaining comparable accuracy. That is the release’s most attention-grabbing claim, but it comes with an important caveat. NVIDIA’s technical report says Nemotron 3 Ultra throughput was measured using TensorRT-LLM, while the comparison models were run with vLLM. That means the speed comparison is neither apples-to-apples nor independent. Because the inference engine differs between the two sides, some of the reported gap may reflect the serving software and not the model itself. Third-party testing under matched conditions will be needed to establish how the model actually performs.
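What a matched comparison would look like can be sketched in a few lines. This is a minimal illustration, not NVIDIA's methodology or any benchmark suite's. The `generate` callables are hypothetical wrappers a tester would write around each model, and no real TensorRT-LLM or vLLM call is shown. The point is only that both models see the same prompts and the same output budget, and ideally run on the same engine and hardware.

```python
import time
from typing import Callable, Sequence

# Hypothetical interface: takes (prompt, max_new_tokens) and returns
# the number of tokens actually generated.
GenerateFn = Callable[[str, int], int]


def measure_throughput(generate: GenerateFn,
                       prompts: Sequence[str],
                       max_new_tokens: int) -> float:
    """Return generated tokens per second over a fixed prompt set."""
    start = time.perf_counter()
    total_tokens = sum(generate(p, max_new_tokens) for p in prompts)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")
    return total_tokens / elapsed


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tps / baseline_tps


def matched_comparison(candidate: GenerateFn,
                       baseline: GenerateFn,
                       prompts: Sequence[str],
                       max_new_tokens: int) -> float:
    """Run both models on identical prompts and output budgets.

    The caller is responsible for serving both models with the same
    inference engine and hardware; this function cannot enforce that.
    """
    candidate_tps = measure_throughput(candidate, prompts, max_new_tokens)
    baseline_tps = measure_throughput(baseline, prompts, max_new_tokens)
    return speedup(candidate_tps, baseline_tps)
```

A fuller test would also fix batch size, precision and sampling settings, and repeat each run to average out noise.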

The same caution applies more broadly to the report’s benchmark results across coding, reasoning and long-context tasks: They are NVIDIA-reported evaluations, not independent tests.

Even with that caveat, the release is notable. In a year defined by a wave of large open-weight sparse models, Nemotron 3 Ultra arrives with three clear selling points: scale, long context and a relatively open package of checkpoints and recipes. Whether its throughput claims hold up under outside scrutiny is still an open question, but NVIDIA has plainly made a serious bid to shape the top end of the open-model market.

Tags: #nvidia, #llm, #openmodels, #ai

Stocks: NVDA