i-am-ai

NVIDIA just dropped NeMo AutoModel, and it's the kind of optimization story that makes infrastructure nerds giddy: a single import swap that delivers 3.4-3.7× faster training throughput and 29-32% less GPU memory on Mixture-of-Experts models. No refactoring, no new APIs, just performance.

The library builds cleanly on top of HuggingFace Transformers v5's new first-class MoE support, adding Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels. The result is that you can fine-tune models like Qwen3-30B-A3B or Nemotron 3 Ultra 550B A55B substantially faster using the exact same from_pretrained() call you already know.

The one-line change

Here's the entire API difference. Before:

from transformers import AutoModelForCausalLM

After:

from nemo_automodel import NeMoAutoModelForCausalLM as AutoModelForCausalLM

That's it. NeMoAutoModelForCausalLM subclasses AutoModelForCausalLM, so everything downstream—data loaders, training loops, saving checkpoints—stays identical. And critically, save_pretrained() still emits standard HuggingFace checkpoints that vLLM and SGLang can load.
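Because the swap is a single import, it is easy to put behind a flag and run the same training script against both backends. The sketch below is one way to do that. The helper names `backend_target` and `load_auto_model_class` are hypothetical and not part of either library; only the module and class names come from the imports shown above.

```python
import importlib


def backend_target(use_nemo: bool) -> tuple[str, str]:
    """Return the (module name, class name) pair for the requested backend."""
    if use_nemo:
        return ("nemo_automodel", "NeMoAutoModelForCausalLM")
    return ("transformers", "AutoModelForCausalLM")


def load_auto_model_class(use_nemo: bool):
    """Import and return the model class.

    Raises ImportError if the chosen package is not installed.
    """
    module_name, class_name = backend_target(use_nemo)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Typical use inside a training script (left as a comment so the sketch
# itself has no third-party dependency):
#   AutoModelForCausalLM = load_auto_model_class(use_nemo=True)
#   model = AutoModelForCausalLM.from_pretrained(...)
```

Keeping the choice in one function means an A/B benchmark only needs to flip a boolean.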

For popular MoE architectures like Qwen3, NVIDIA Nemotron, GPT-OSS, and DeepSeek V3, NeMo AutoModel ships hand-tuned implementations with TransformerEngine attention, fused linear layers, and custom expert kernels. For everything else, it falls back to vanilla HF while still applying optimizations like Liger kernel patching.

The benchmarks

NVIDIA tested two regimes: full fine-tuning a 550B frontier model across 16 H100 nodes, and training two 30B MoE models on a single 8-GPU node.

Nemotron 3 Ultra 550B A55B: multi-node scale

This is a 550B-parameter hybrid model with Mamba2, LatentMoE, and Multi-Token Prediction. The benchmark was a full fine-tune across 16 H100 nodes (128 GPUs total), with every parameter updated and Adam optimizer states materialized.

NeMo AutoModel with Expert Parallelism set to 64 achieved 815 TFLOP/s per GPU and peaked at 58.2 GiB memory per GPU. The kicker: Transformers v5 runs out of memory at this scale, so there's no baseline number to report. Expert Parallelism shards the experts across GPUs to bring the footprint within budget, which is what makes the full fine-tune possible at all.

Qwen3-30B-A3B: single-node speedup

On a single node with 8× H100 80GB GPUs, NeMo AutoModel with EP=8 hit 11,340 TPS/GPU average versus 3,075 for Transformers v5 with FlashAttention2 and grouped_mm enabled. That's 3.69× faster.

Peak memory dropped from 68.2 GiB to 48.1 GiB—a 29% reduction. Forward pass went from 582ms to 194ms (3× faster), and backward from 758ms to 178ms (4.26× faster).

Transformers v4 with hub code deadlocked entirely on this model. The issue: v4 stored Qwen3 MoE experts as a ModuleList of 128 individual MLP modules, each separately FSDP-wrapped. The forward pass looped only over experts that received tokens, which meant different ranks skipped different experts, causing mismatched FSDP collectives and indefinite hangs. Transformers v5 fixed this by storing experts as fused 3D parameter tensors.
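The hang is easier to see with a toy model. The sketch below does not use FSDP at all. It only lists, per rank, which experts would trigger a collective under each looping strategy, using made-up token counts for 4 experts and 2 ranks. The function name and the data are illustrative assumptions.

```python
def collective_schedule(tokens_per_expert, skip_empty):
    """List the expert indices for which this rank would issue a collective.

    skip_empty=True models a loop that visits only experts with tokens.
    skip_empty=False models a loop that visits every expert.
    """
    return [
        expert
        for expert, count in enumerate(tokens_per_expert)
        if count > 0 or not skip_empty
    ]


# Hypothetical routing result: tokens each rank sends to experts 0..3.
rank0_counts = [3, 0, 2, 1]
rank1_counts = [0, 4, 2, 0]

skip_style = [collective_schedule(c, skip_empty=True) for c in (rank0_counts, rank1_counts)]
visit_all = [collective_schedule(c, skip_empty=False) for c in (rank0_counts, rank1_counts)]

print("skip empty experts:", skip_style)  # the two ranks disagree
print("visit all experts: ", visit_all)   # the two ranks agree
```

When the two schedules differ, one rank waits on a collective its peer never joins, which is the indefinite hang described above.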

Nemotron 3 Nano 30B A3B

Same single-node setup, same story. NeMo AutoModel delivered 15,421 TPS/GPU versus 4,583 for Transformers v5—a 3.36× speedup. Peak memory dropped from 62.1 GiB to 42.5 GiB (32% reduction). Forward pass: 283ms → 109ms (2.6× faster). Backward: 611ms → 157ms (3.89× faster).

Transformers v4 with trust_remote_code=True hub modeling code clocked in at 1,807 TPS/GPU and 61.9 GiB peak memory. It didn't deadlock like Qwen3 because NVIDIA's hub code iterates all experts regardless of token assignment.
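The headline ratios follow directly from the reported numbers. As a quick check, this sketch recomputes the throughput speedup and peak-memory reduction for both single-node runs. The function and dictionary names are made up for the example; the figures are the ones quoted above.

```python
def throughput_speedup(new_tps, old_tps):
    """Ratio of optimized to baseline tokens per second per GPU."""
    return new_tps / old_tps


def memory_reduction_pct(old_gib, new_gib):
    """Percentage drop in peak memory relative to the baseline."""
    return 100.0 * (old_gib - new_gib) / old_gib


# (Transformers v5, NeMo AutoModel) pairs as reported in the benchmarks.
results = {
    "Qwen3-30B-A3B": {"tps": (3075, 11340), "mem": (68.2, 48.1)},
    "Nemotron 3 Nano 30B A3B": {"tps": (4583, 15421), "mem": (62.1, 42.5)},
}

for name, r in results.items():
    old_tps, new_tps = r["tps"]
    old_mem, new_mem = r["mem"]
    print(
        f"{name}: {throughput_speedup(new_tps, old_tps):.2f}x throughput, "
        f"{memory_reduction_pct(old_mem, new_mem):.0f}% less peak memory"
    )
# Qwen3-30B-A3B: 3.69x throughput, 29% less peak memory
# Nemotron 3 Nano 30B A3B: 3.36x throughput, 32% less peak memory
```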

Where the speedup comes from

Three sources:

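Expert Parallelism (EP) distributes expert weights across GPUs. With EP=8, each GPU holds one eighth of the experts, so the per-GPU expert-weight footprint shrinks by roughly 8×, freeing headroom for larger batch sizes or longer sequences. Expert weights are only part of the total, which is why overall peak memory fell by a smaller margin: from 68.2 GiB to 48.1 GiB on Qwen3 and from 62.1 GiB to 42.5 GiB on Nemotron Nano.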
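To make the sharding concrete, here is a minimal sketch of how 128 experts could be split across 8 EP ranks. It assumes a contiguous block layout, which is a common choice but not something confirmed for NeMo AutoModel. The function name `experts_for_rank` is hypothetical.

```python
def experts_for_rank(num_experts, ep_size, ep_rank):
    """Contiguous block of expert indices owned by one EP rank (assumed layout)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return range(start, start + per_rank)


num_experts, ep_size = 128, 8
for rank in range(ep_size):
    owned = experts_for_rank(num_experts, ep_size, rank)
    print(f"rank {rank}: experts {owned.start}..{owned.stop - 1} ({len(owned)} total)")
```

Each rank ends up with 16 of the 128 experts, so it stores and updates only that slice of the expert weights.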

DeepEP fused all-to-all dispatch overlaps communication with computation. Instead of separate AllGather/ReduceScatter collectives for expert routing, DeepEP fuses token dispatch into optimized GPU kernels that run concurrently with expert compute. This is the piece Transformers v5 doesn't have yet.
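The data movement behind that dispatch can be sketched in plain Python. This toy version models only where tokens go, with no fusion, overlap, or GPU kernels, and it does not use the DeepEP API. The names `owner_rank` and `dispatch`, and the contiguous expert layout, are assumptions made for illustration.

```python
def owner_rank(expert_id, num_experts, ep_size):
    """Rank that holds a given expert, assuming contiguous blocks per rank."""
    return expert_id // (num_experts // ep_size)


def dispatch(routed_tokens, num_experts, ep_size):
    """Simulate an all-to-all token dispatch.

    routed_tokens[src_rank] is a list of (token, expert_id) pairs chosen by
    the router on that rank. Returns inbox[dst_rank], a list of
    (src_rank, token, expert_id) tuples delivered to the expert's owner.
    """
    inbox = [[] for _ in range(ep_size)]
    for src_rank, pairs in enumerate(routed_tokens):
        for token, expert_id in pairs:
            dst = owner_rank(expert_id, num_experts, ep_size)
            inbox[dst].append((src_rank, token, expert_id))
    return inbox


# 4 experts over 2 ranks: rank 0 owns experts 0-1, rank 1 owns experts 2-3.
routed = [
    [("a", 0), ("b", 3)],  # tokens produced on rank 0
    [("c", 1), ("d", 2)],  # tokens produced on rank 1
]
inbox = dispatch(routed, num_experts=4, ep_size=2)
print(inbox)
```

Keeping the source rank with each token is what lets a later combine step send expert outputs back to where the tokens came from.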

TransformerEngine kernels accelerate core operations—fused attention, linear layers, RMSNorm—across all layer types, not just MoE layers. These provide consistent speedups over PyTorch and FlashAttention equivalents.

Building on Transformers v5

Transformers v5 shipped the MoE foundations that make NeMo AutoModel possible. The big three:

Expert backends

v5 introduced the experts_implementation parameter with three backends:

eager: for-loop over selected experts (debugging, compatibility)
batched_mm: duplicates expert params, single batched GEMM via torch.bmm (good for small inputs, fast with torch.compile)
grouped_mm: orders tokens by expert, single grouped GEMM via torch.nn.functional.grouped_mm (memory-efficient, no param duplication—best for training)

The grouped_mm backend is the key training optimization. NeMo AutoModel takes it further by combining grouped GEMM with DeepEP dispatch and TransformerEngine linear layers.
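The difference between the eager and grouped strategies is mostly about token ordering. The pure-Python sketch below is a toy, not the Transformers implementation: each "expert" is a small weight matrix, and the grouped version sorts tokens so every expert sees one contiguous segment. A real grouped GEMM kernel would then process all segments in a single launch. All function names here are hypothetical.

```python
def matvec(weight, x):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def eager_experts(tokens, expert_ids, weights):
    """Loop over the experts that received tokens, scanning for their tokens."""
    out = [None] * len(tokens)
    for e in sorted(set(expert_ids)):
        for i, x in enumerate(tokens):
            if expert_ids[i] == e:
                out[i] = matvec(weights[e], x)
    return out


def grouped_experts(tokens, expert_ids, weights):
    """Sort tokens by expert so each expert handles one contiguous segment."""
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    out = [None] * len(tokens)
    start = 0
    while start < len(order):
        e = expert_ids[order[start]]
        end = start
        while end < len(order) and expert_ids[order[end]] == e:
            end += 1
        # order[start:end] is the contiguous segment for expert e.
        for i in order[start:end]:
            out[i] = matvec(weights[e], tokens[i])
        start = end
    return out


weights = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]  # expert 0: identity, expert 1: doubles
tokens = [[1, 2], [3, 4], [5, 6]]
expert_ids = [1, 0, 1]

assert eager_experts(tokens, expert_ids, weights) == grouped_experts(tokens, expert_ids, weights)
print(grouped_experts(tokens, expert_ids, weights))  # [[2, 4], [3, 4], [10, 12]]
```

Both paths produce the same outputs in the original token order. The grouped layout just replaces many small scattered matmuls with a few contiguous ones.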

Dynamic weight loading

v5's reversible weight conversion lets NeMo AutoModel load each model family without per-model checkpoint plumbing. It can focus engineering on reusable core ops while save_pretrained() still emits standard HF checkpoints.

First-class distributed training

v5 integrated PyTorch's DeviceMesh directly into from_pretrained(), making multi-GPU training a native concept rather than an afterthought. NeMo AutoModel rides this to expose Expert Parallelism with minimal API surface.
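A device mesh is essentially a named, multi-dimensional grid of ranks. As a rough illustration, the sketch below lays out 16 ranks as a 2 × 8 grid of (data-parallel, expert-parallel) coordinates in row-major order and finds the peers that would share an EP group. The 2 × 8 shape and the helper names are assumptions for the example, not NeMo AutoModel configuration.

```python
def mesh_coords(rank, mesh_shape):
    """Row-major coordinates of a rank in an N-dimensional mesh."""
    coords = []
    for dim in reversed(mesh_shape):
        coords.append(rank % dim)
        rank //= dim
    return tuple(reversed(coords))


def ep_group(rank, dp_size, ep_size):
    """Ranks that share this rank's data-parallel index, i.e. its EP peers."""
    dp_idx, _ = mesh_coords(rank, (dp_size, ep_size))
    return [dp_idx * ep_size + j for j in range(ep_size)]


dp_size, ep_size = 2, 8  # hypothetical 16-GPU layout
for rank in (0, 7, 8, 11):
    print(rank, mesh_coords(rank, (dp_size, ep_size)), ep_group(rank, dp_size, ep_size))
```

With this layout, ranks 0 to 7 exchange tokens among themselves, and ranks 8 to 15 form a second, independent EP group.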

Using it

For single-GPU or basic multi-GPU, just swap the import. For Expert Parallelism across 8 GPUs:

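import torch  # needed for torch.bfloat16 below
import torch.distributed as dist
from nemo_automodel import NeMoAutoModelForCausalLM
from nemo_automodel.recipes._dist_utils import create_distributed_setup_from_config

# Launch with torchrun (or similar) so rank and world-size variables are set.
dist.init_process_group(backend="nccl")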

dist_setup = create_distributed_setup_from_config({
    "strategy": "fsdp2",
    "ep_size": 8,
})

model = NeMoAutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    dtype=torch.bfloat16,
    distributed_setup=dist_setup,
)

You get FSDP2, Expert Parallelism, TransformerEngine kernels, and DeepEP dispatch from a single from_pretrained() call.

The bigger picture

This is what good infrastructure looks like: optimization without disruption. The Transformers ecosystem benefits when libraries can extend it cleanly rather than forking it.

NeMo AutoModel demonstrates that expert parallelism and fused dispatch can ride on top of v5's MoE foundations without requiring API rewrites. The AutoModel subclassing pattern means the entire HF training ecosystem—Trainer, Accelerate, dataset utilities—continues to work. And the checkpoint compatibility means inference engines like vLLM don't need NeMo-specific code paths.

For practitioners, the takeaway is simple: if you're fine-tuning MoE models at scale, the one-line import swap is worth benchmarking. In the single-node runs above, the 3.4-3.7× throughput gain and 29-32% peak-memory reduction translate directly to lower cloud bills and faster iteration cycles.

And for the ecosystem, it's a proof point that Transformers v5's MoE architecture is extensible enough to support specialized backends without fragmenting the API surface. That's the kind of design that lets the ecosystem scale.

NVIDIA NeMo AutoModel: 3.7× faster MoE fine-tuning with a one-line import
