The Robot Data Problem Meets Parameter-Efficient Fine-Tuning
Robot learning has a grinding data problem. Collecting real-world demonstrations is slow, expensive, and doesn't scale. You need diverse trajectories to train robust policies, but every hour of teleoperation costs real human time and real robot wear. The alternative—synthetic data from video world models—has been theoretically appealing but practically elusive. Until now.
NVIDIA just dropped a complete fine-tuning recipe for Cosmos Predict 2.5 that makes this concrete. The guide walks through parameter-efficient fine-tuning of their 2B-parameter video world model using LoRA and DoRA adapters, with full code in diffusers and accelerate. This isn't a research sketch—it's a production-ready pipeline for generating synthetic robot trajectories at scale.
The key insight: full fine-tuning of a 2B-parameter model is both expensive and risky (catastrophic forgetting of general knowledge), but injecting small trainable adapter modules into the frozen base model reduces memory requirements while keeping adapter files portable. You can fine-tune on a single 80GB GPU and swap adapters for different domains at inference. That's the unlock.
LoRA vs DoRA: Architectures for Efficient Adaptation
The recipe supports both LoRA (Low-Rank Adaptation) and DoRA (weight-Decomposed low-Rank Adaptation). LoRA injects trainable low-rank matrices into attention projections and feedforward layers, leaving the base weights frozen. DoRA extends this by decomposing each weight into magnitude and direction before applying the low-rank update.
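To make the difference concrete, here is a minimal pure-Python sketch of how the two methods build an effective weight from a frozen base matrix. It is an illustration under assumptions, not the recipe's code: the helper names and toy values are hypothetical, and the per-column norm follows the DoRA paper's convention (library implementations may normalize along a different axis).

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w0, b, a, scale=1.0):
    """LoRA: effective weight is W0 + scale * (B @ A); W0 itself stays frozen."""
    delta = matmul(b, a)
    return [[w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
            for i in range(len(w0))]

def column_norms(w):
    """L2 norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def dora_weight(w0, b, a, magnitude, scale=1.0):
    """DoRA: normalize each column of (W0 + BA) to unit length (the direction),
    then rescale it by a separately trained magnitude."""
    v = lora_weight(w0, b, a, scale)
    norms = column_norms(v)
    return [[magnitude[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
            for i in range(len(v))]

# Toy values, chosen only for illustration.
w0 = [[3.0, 0.0], [4.0, 2.0]]   # frozen base weight
a = [[0.5, -0.5]]               # rank-1 "down" matrix, shape (1, 2)
b_zero = [[0.0], [0.0]]         # B starts at zero, so the update starts at zero
b_trained = [[1.0], [2.0]]      # stand-in for B after some training
m = column_norms(w0)            # magnitude initialized from the base weight
```

With B at zero, both variants reproduce the base weight exactly. Once B moves, LoRA changes magnitude and direction together, while DoRA keeps each column's length pinned to the learned magnitude vector.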
In practice, switching between them is a single boolean flag: use_dora=True. No other training loop changes required. The adapters target six module types in the DiT transformer: to_q, to_k, to_v, to_out.0, ff.net.0.proj, and ff.net.2. All VAE and text encoder weights stay frozen.
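As a rough sketch of what that configuration amounts to, the settings can be written as a plain dictionary. The key names mirror the keyword arguments that peft.LoraConfig accepts (r, lora_alpha, target_modules, use_dora); whether the recipe builds its config exactly this way is an assumption, and the with_dora helper is hypothetical.

```python
LORA_RANK = 32  # rank used in the recipe's example

# Keyword arguments in the shape peft.LoraConfig accepts.
# Assumption: the recipe passes something equivalent to its adapter setup.
adapter_config = {
    "r": LORA_RANK,
    "lora_alpha": LORA_RANK,  # alpha == rank keeps the LoRA scale at 1.0
    "target_modules": [
        "to_q", "to_k", "to_v", "to_out.0",  # attention projections
        "ff.net.0.proj", "ff.net.2",         # feedforward layers
    ],
    "use_dora": False,
}

def with_dora(config):
    """Return a copy of the config with DoRA switched on (hypothetical helper)."""
    return {**config, "use_dora": True}

dora_config = with_dora(adapter_config)
```

Everything except the flag is shared, which is why switching methods does not touch the training loop.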
With rank 32 adapters, you get roughly 50M trainable parameters. That's 2.5% of the base model. The recipe sets lora_alpha = lora_rank, which keeps the scaling factor at 1.0—the LoRA update is applied at full strength without dampening.
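The arithmetic behind these numbers is easy to check. The sketch below counts LoRA parameters for a single linear layer and computes the scaling factor; the 2048-wide layer is a hypothetical example, not a dimension taken from the Cosmos architecture.

```python
def lora_param_count(d_in, d_out, rank):
    """A rank-r adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_scale(lora_alpha, rank):
    """Standard LoRA multiplies the low-rank update by alpha / rank."""
    return lora_alpha / rank

def trainable_fraction(trainable, total):
    """Share of parameters that receive gradients."""
    return trainable / total

# Hypothetical 2048 x 2048 projection with a rank-32 adapter.
adapter_params = lora_param_count(2048, 2048, 32)   # 32 * 4096
frozen_params = 2048 * 2048
print(adapter_params, adapter_params / frozen_params)

# Figures quoted in the text: about 50M trainable out of 2B.
print(trainable_fraction(50_000_000, 2_000_000_000))
print(lora_scale(lora_alpha=32, rank=32))
```

Because the count grows linearly with rank, doubling the rank to 64 would roughly double the trainable parameters and the adapter file size.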
One critical detail: LoRA parameters are upcast to float32 even under bf16 mixed precision training. This is for numerical stability. The base model runs in bfloat16, but the adapter gradients accumulate in full precision.
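A small experiment shows why this matters. The sketch below emulates bfloat16 rounding in pure Python and applies the same small update repeatedly to a weight stored in bf16 versus one kept in higher precision. The helper names are hypothetical, and Python's float64 stands in for float32 here.

```python
import struct

def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)
    and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for the dropped 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def accumulate(weight, update, steps, in_bf16):
    """Apply the same update repeatedly, optionally storing the weight in bf16."""
    for _ in range(steps):
        weight = weight + update
        if in_bf16:
            weight = to_bf16(weight)
    return weight

# Near 1.0, adjacent bf16 values are 2**-7 apart, so a 1e-3 step rounds away.
stuck = accumulate(1.0, 1e-3, steps=100, in_bf16=True)
moved = accumulate(1.0, 1e-3, steps=100, in_bf16=False)
print(stuck, moved)
```

The bf16 weight never leaves its starting value because every update is smaller than half the spacing between representable numbers. Adapter weights receive exactly this kind of small update, which is the usual motivation for holding them in float32.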
Rectified Flow Training: Predicting Velocity, Not Noise
Cosmos Predict 2.5 uses rectified flow rather than DDPM-style noise prediction. The model learns to predict the velocity that linearly transports a noise sample toward clean data. At timestep t, a noisy interpolation xt = σt·noise + (1−σt)·clean is constructed, and the model predicts the target velocity noise − clean via MSE loss.
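The construction described above fits in a few lines. This is a minimal sketch on flat lists of floats rather than latent tensors, with hypothetical function names; it shows the interpolation, the velocity target, and the MSE loss.

```python
def make_training_pair(clean, noise, sigma):
    """Build the noisy input x_t and the velocity target for one sample.

    x_t    = sigma * noise + (1 - sigma) * clean
    target = noise - clean
    """
    x_t = [sigma * n + (1.0 - sigma) * c for c, n in zip(clean, noise)]
    target = [n - c for c, n in zip(clean, noise)]
    return x_t, target

def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

clean = [1.0, 2.0]
noise = [0.0, 4.0]
sigma = 0.25
x_t, velocity = make_training_pair(clean, noise, sigma)

# A perfect velocity prediction lets you step straight back to the clean data.
recovered = [x - sigma * v for x, v in zip(x_t, velocity)]
print(x_t, velocity, recovered)
```

The last line is the property that makes the formulation convenient: subtracting sigma times the true velocity from x_t returns the clean sample exactly, because the path between noise and data is a straight line.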
The first two frames of each video are used as conditioning—no noise is added to their latents. The model conditions on these frames, the timestep, and the text prompt embeddings. Loss is computed only on the non-conditioned frames. This is standard for video generation models, but the rectified flow formulation is cleaner than epsilon-prediction schedules.
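Frame conditioning and loss masking can be sketched the same way. Here each frame is a flat list of latent values, and the function names are hypothetical; the recipe's real implementation works on tensors and may organize this differently.

```python
def apply_conditioning(clean_frames, noisy_frames, num_cond_frames=2):
    """Keep the first frames clean and use the noisy latents for the rest."""
    return clean_frames[:num_cond_frames] + noisy_frames[num_cond_frames:]

def masked_mse(pred_frames, target_frames, num_cond_frames=2):
    """MSE averaged over the non-conditioning frames only."""
    total, count = 0.0, 0
    pairs = list(zip(pred_frames, target_frames))[num_cond_frames:]
    for pred, target in pairs:
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no frames left after removing conditioning frames")
    return total / count

clean = [[1.0], [2.0], [3.0]]
noisy = [[7.0], [8.0], [9.0]]
model_input = apply_conditioning(clean, noisy)

pred = [[9.0], [9.0], [1.0], [3.0]]
target = [[0.0], [0.0], [0.0], [0.0]]
loss = masked_mse(pred, target)   # errors on the first two frames are ignored
print(model_input, loss)
```

Masking matters because the conditioning frames are handed to the model clean; including them in the loss would reward the model for something it was given rather than something it predicted.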
Timesteps are sampled from a logit-normal distribution during training. This gives more weight to mid-range noise levels where the model has to do actual denoising work, rather than uniform sampling that wastes capacity on trivial near-clean or near-noise states.
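For intuition, a logit-normal sample is just a Gaussian draw passed through a sigmoid. The sketch below uses a mean of 0 and standard deviation of 1, which are assumptions chosen for illustration rather than the recipe's settings.

```python
import math
import random

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw z ~ Normal(mean, std) and squash it into (0, 1) with a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
samples = [sample_logit_normal(rng) for _ in range(10_000)]

# Share of samples in the middle half of the range.
# Uniform sampling would put about 50% here.
mid_fraction = sum(0.25 <= s <= 0.75 for s in samples) / len(samples)
print(f"fraction in [0.25, 0.75]: {mid_fraction:.3f}")
```

With these parameters roughly 70% of draws are expected to land in the middle half of the interval, compared with 50% under uniform sampling. Shifting the mean moves the emphasis toward noisier or cleaner timesteps.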
Dataset and Preprocessing: 92 Robot Manipulation Videos
The training dataset is nvidia/GR1-100 on Hugging Face: 92 robot manipulation videos with text prompts describing pick-and-place tasks. The eval set is 50 prompt-image pairs from nvidia/PhysicalAI-Robotics-GR00T-Eval. The model generates a video conditioned on the initial frame and text prompt.
The VideoDataset class handles temporal augmentation. For videos longer than the target frame count, it samples a random contiguous window each epoch. This is crucial—without temporal jitter, the model would overfit to specific frame ranges in each video. The random windowing acts as free data augmentation.
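The windowing logic itself is simple. This is a sketch of the idea rather than the recipe's VideoDataset code; the function name is hypothetical, and raising an error for clips shorter than the target is an assumption about how short videos are handled.

```python
import random

def sample_window(num_frames, target_frames, rng):
    """Pick a random contiguous window of target_frames frames.

    Returns (start, end) with end exclusive, so frames[start:end] is the clip.
    """
    if num_frames < target_frames:
        raise ValueError("video is shorter than the target frame count")
    start = rng.randint(0, num_frames - target_frames)  # inclusive on both ends
    return start, start + target_frames

rng = random.Random(0)
frames = list(range(200))            # stand-in for 200 decoded frames
start, end = sample_window(len(frames), 93, rng)
clip = frames[start:end]
print(start, end, len(clip))
```

Calling this once per item per epoch means a 200-frame video can contribute up to 108 distinct 93-frame clips instead of always the same one. The 93-frame target here is only an example value.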
Internally, VideoProcessor from diffusers.video_processor resizes and normalizes raw frames into tensors of shape (channels, frames, height, width). The recipe defaults to 432×768 resolution, which is a reasonable trade-off between visual fidelity and memory.
Training Configuration: Linear Warmup and Decay
The optimizer is vanilla AdamW with configurable learning rate and weight decay. The scheduler is get_linear_schedule_with_warmup from diffusers.optimization, which linearly warms up the learning rate over a set number of steps, peaks at scheduler_f_max × learning_rate, then linearly decays to scheduler_f_min × learning_rate over the remaining training steps.
Compared with a constant learning rate, this gives you a warmup phase that stabilizes early gradients, followed by a steady linear decay that smooths convergence. There is no plateau: the rate starts falling as soon as warmup ends. The scheduler_f_min and scheduler_f_max factors set the floor and ceiling of the learning rate envelope.
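The shape of that schedule can be written as a multiplier on the base learning rate. This is a sketch of the behavior described above, not the library function: the name lr_factor is hypothetical, and starting the warmup from f_start = 0 is an assumption, since the starting value is not stated.

```python
def lr_factor(step, warmup_steps, total_steps, f_max=1.0, f_min=0.0, f_start=0.0):
    """Multiplier applied to the base learning rate at a given step.

    Ramps linearly from f_start to f_max during warmup, then decays
    linearly from f_max to f_min over the remaining steps.
    """
    if step < warmup_steps:
        return f_start + (f_max - f_start) * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return f_max + (f_min - f_max) * progress

base_lr = 1e-4  # example value only
for step in (0, 50, 100, 550, 1000):
    factor = lr_factor(step, warmup_steps=100, total_steps=1000, f_max=1.0, f_min=0.1)
    print(step, base_lr * factor)
```

A function like this can be wrapped in torch.optim.lr_scheduler.LambdaLR, which multiplies the optimizer's base learning rate by whatever the function returns for the current step.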
Checkpointing happens every checkpointing_epochs epochs. accelerator.save_state() writes a pytorch_lora_weights.safetensors file—the adapter file you'll load at inference. This is the whole point of LoRA: the adapter is small (under 200MB for rank 32) and portable. You can version-control it, share it, and swap it without touching the base model.
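A back-of-the-envelope estimate shows where that size comes from. The sketch below ignores file metadata and assumes the roughly 50M-parameter count quoted earlier; the helper name is hypothetical.

```python
def adapter_size_mb(num_params, bytes_per_param):
    """Approximate tensor payload in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1_000_000

approx_params = 50_000_000  # rough rank-32 figure quoted in the text
print("float32:", adapter_size_mb(approx_params, 4), "MB")
print("bfloat16:", adapter_size_mb(approx_params, 2), "MB")
```

Stored in float32 the adapter lands near the 200MB mark, and saving in bfloat16 would roughly halve that. Either way it is a small fraction of the base model's checkpoint.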
Multi-GPU Training with Accelerate
The recipe uses accelerate for single- and multi-GPU training. Minimum requirement is one 80GB GPU, but 8× H100s are recommended for faster iteration. accelerate launch --mixed_precision="bf16" handles distributed data parallelism automatically.
Gradient checkpointing is enabled by default (--gradient_checkpointing). This trades compute for memory by recomputing activations during the backward pass instead of storing them. That trade helps the 2B-parameter model and its video activations fit on a single 80GB GPU.
TF32 is allowed (--allow_tf32) for faster matmuls on Ampere and newer GPUs. TF32 keeps FP32's 8-bit exponent but rounds matmul inputs to a 10-bit mantissa instead of FP32's 23 bits, so it has FP32 range with roughly FP16 precision, while accumulation still happens in FP32. It is safe for most deep learning workloads and can substantially speed up FP32 matmuls on tensor cores.
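To see what TF32 precision means in practice, the sketch below rounds a float32 value to a chosen number of mantissa bits. TF32 uses 1 sign bit, 8 exponent bits, and 10 mantissa bits, so it is compared here against bfloat16's 7 mantissa bits. The helper is hypothetical and only emulates the rounding, not tensor-core behavior.

```python
import struct

def round_mantissa(x, keep_bits):
    """Round a finite float32 value to keep_bits explicit mantissa bits
    (round-to-nearest-even) and return it as a Python float."""
    drop = 23 - keep_bits                   # float32 has 23 explicit mantissa bits
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if drop > 0:
        bits += (1 << (drop - 1)) - 1 + ((bits >> drop) & 1)
        bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.001
as_tf32 = round_mantissa(x, 10)   # TF32 (same mantissa width as FP16)
as_bf16 = round_mantissa(x, 7)    # bfloat16
print(as_tf32, as_bf16)
```

TF32 still resolves the small offset from 1.0, while bfloat16 rounds it away entirely. In PyTorch, TF32 matmuls are controlled by torch.backends.cuda.matmul.allow_tf32, which is presumably what the flag toggles.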
Inference: Loading Adapters and Generating Videos
At inference, you load the base pipeline and attach the trained adapter:
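import torch
from diffusers import Cosmos2_5_PredictBasePipeline

# Load the frozen base pipeline in bfloat16.
pipe = Cosmos2_5_PredictBasePipeline.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B",
    revision="diffusers/base/post-trained",
    torch_dtype=torch.bfloat16,
)
# Attach the trained LoRA/DoRA adapter on top of the base weights.
pipe.load_lora_weights("path/to/checkpoint/pytorch_lora_weights.safetensors")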
Then generate videos conditioned on text and initial frames. The pipeline handles frame conditioning, rectified flow sampling, and VAE decoding automatically. You get physically plausible robot trajectories that respect the scene geometry and task semantics.
This is the intended use case: generate diverse synthetic demonstrations for downstream policy learning. The fine-tuned model captures domain-specific priors (robot morphology, camera viewpoint, object types) while retaining the base model's physical reasoning.
Why This Matters for Robot Learning
Parameter-efficient fine-tuning bridges the gap between general-purpose video models and domain-specific robot data. Cosmos Predict 2.5 out-of-the-box is trained on internet video—it understands physics and common objects, but not your specific robot or task distribution.
Full fine-tuning would overwrite that general knowledge. LoRA/DoRA adapters inject task-specific patterns while preserving the base model's capabilities. You get the best of both worlds: broad physical reasoning plus narrow domain adaptation.
The synthetic trajectories generated by the fine-tuned model can augment real demonstrations for policy training. This is the core value proposition of world models in robotics: data efficiency through simulation. Real demos provide ground truth, synthetic demos provide coverage.
Of course, there's a fidelity gap. Synthetic trajectories won't perfectly match real-world dynamics—no model is that good yet. But for pre-training, behavior cloning warm-starts, or curriculum learning, they're already useful. And the gap is closing fast.
Open Questions and Next Steps
This recipe is a solid starting point, but several questions remain:
- How does adapter rank trade off against sample efficiency? Rank 32 is reasonable, but maybe rank 64 or 128 captures finer details. Or maybe rank 16 is enough and saves memory.
- How well do LoRA adapters generalize to held-out camera viewpoints or object geometries? Fine-tuning on a narrow distribution is easy; interpolating and extrapolating is hard.
- Can you train multi-task adapters that handle multiple manipulation primitives, or do you need one adapter per task?
The eval dataset includes 50 test prompts, but the blog post doesn't report quantitative metrics like FVD (Fréchet Video Distance) or user studies. Those would help calibrate expectations.
Still, this is the most complete public recipe for fine-tuning a production video world model for robotics. The code is in diffusers, the datasets are on Hugging Face, and the hardware requirements are within reach of academic labs. That's a big deal. The robot learning stack just got a lot more accessible.