Forecasting beats perception when you need to act
Perception is retrospective. Modern vision models can track points through video with stunning accuracy, but they're always explaining motion that already happened. If you're building a robot that needs to pick up a cup, or a video generator that has to render physically plausible motion, you need to look forward.
That's the premise behind MolmoMotion, a new open motion forecasting model from AllenAI. Give it a video frame, mark some 3D points on an object, add a language instruction like "Move and rotate the wooden bowl with fruit on the table," and it predicts where those points will travel over the next few seconds in 3D space. It's paired with MolmoMotion-1M, which AllenAI describes as the largest dataset of action-described 3D point trajectories assembled to date (1.16 million videos), and PointMotionBench, a 2.7K-clip human-validated benchmark for measuring object-centric forecasting accuracy.
Everything is open: model weights, dataset, benchmark, code.
Why 3D points beat pixels for motion
Most video generation models represent motion implicitly through pixels. MolmoMotion takes a different approach: sparse 3D points in world space that stick to object surfaces. It's less glamorous than rendering full frames, but it solves three problems at once.
First, it's class-agnostic. You don't need separate templates for hands, rigid objects, or articulated bodies; points can describe all of them. Second, it's view-stable. The same physical motion gets the same representation across cameras and viewpoints, because the coordinates live in a shared world frame. Third, it's directly usable. Robot policies and video generators can consume these trajectories without first converting them into another representation.
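To make the view-stability point concrete, here is a minimal sketch in plain Python. The camera poses, the helper names `world_to_cam` and `cam_to_world`, and the sample track are all illustrative assumptions, not part of MolmoMotion. The same world-frame track looks different in each camera's own coordinates, but mapping back through each camera's pose recovers one shared representation.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


def world_to_cam(p_world: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Express a world-frame point in a camera frame with pose (R, t).

    (R, t) is the camera-to-world transform, p_world = R @ p_cam + t,
    so the inverse is p_cam = R^T @ (p_world - t).
    """
    d = [p_world[i] - t[i] for i in range(3)]
    return tuple(sum(R[j][i] * d[j] for j in range(3)) for i in range(3))


def cam_to_world(p_cam: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Map a camera-frame point back into the shared world frame."""
    return tuple(sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))


# Two hypothetical cameras: A sits at the world origin, B is rotated
# 90 degrees about the z-axis and shifted 2 m along x.
IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
ROT_Z_90: Mat3 = ((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
CAM_A = (IDENTITY, (0.0, 0.0, 0.0))
CAM_B = (ROT_Z_90, (2.0, 0.0, 0.0))

# One point sliding 1 m along the world x-axis.
world_track: List[Vec3] = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0), (1.0, 0.0, 1.0)]

track_a = [world_to_cam(p, *CAM_A) for p in world_track]
track_b = [world_to_cam(p, *CAM_B) for p in world_track]

# Camera-frame coordinates disagree between the two views...
print(track_a[0], track_b[0])
# ...but both map back to the same world-frame track.
print([cam_to_world(p, *CAM_B) for p in track_b])
```

A forecaster trained on camera-frame coordinates would have to learn that `track_a` and `track_b` describe the same motion; in a world frame they are identical by construction.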
Among the representations AllenAI considered, 3D points were the only option that satisfied all three constraints. It's a deliberate trade: you lose pixel-level detail but gain a motion primitive that generalizes across tasks and domains.
Two architectures: autoregressive and flow-matching
MolmoMotion uses Molmo 2 as its backbone, which lets it connect language instructions to objects and points in an image. Given a short video history, an action description, and query points with their initial 3D positions, it identifies the object, the points, and the intended motion, then forecasts the future trajectory.
AllenAI trained two variants:
- MolmoMotion-AR (autoregressive) predicts coordinates step-by-step as structured text, writing out the future trajectory in temporal order. Each new coordinate is conditioned on what came before, which encourages smooth rollouts and gives the strongest accuracy when the future is well-defined.
- MolmoMotion-FM (flow-matching) predicts trajectories in continuous 3D space by transforming noise into motion. It's better suited for representing uncertainty when an instruction admits multiple plausible futures.
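The "transforming noise into motion" step can be sketched as integrating a velocity field from t = 0 (noise) to t = 1 (trajectory). The sketch below is a toy under loud assumptions: the `velocity` function is an analytic stand-in for a learned network, it pulls every sample toward a single fixed target, and the flattened-coordinate layout is invented for illustration. It shows only the sampling loop, not MolmoMotion-FM's actual model.

```python
import random
from typing import List


def velocity(x: List[float], t: float, target: List[float]) -> List[float]:
    """Stand-in for a learned velocity network v(x, t | context).

    For straight-line paths x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to the target is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_trajectory(target: List[float], steps: int = 10, seed: int = 0) -> List[float]:
    """Euler-integrate dx/dt = v(x, t) from Gaussian noise at t=0 to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical flattened trajectory: two timesteps of one (x, y, z) point.
future = [0.10, 0.00, 0.50, 0.20, 0.00, 0.50]
print(sample_trajectory(future))
```

Because this toy field has a single target, every noise sample lands on the same trajectory. A learned field conditioned on video and language could send different noise samples to different plausible futures, which is where the uncertainty modeling comes from.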
The autoregressive variant treats 3D coordinates like tokens—a design choice borrowed from vision-language models that encode spatial data as text. The flow-matching variant works directly in coordinate space, which makes it more expressive when forecasting ambiguous or stochastic motion.
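As a rough illustration of coordinates-as-text, the sketch below serializes a trajectory into lines a language model could emit token by token, then parses it back. The text format, the centimeter quantization, and the names `encode` and `decode` are assumptions made for this example; the post does not specify MolmoMotion-AR's actual output format.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Trajectory = List[List[Point]]  # trajectory[t][k] = (x, y, z) in meters


def encode(traj: Trajectory, scale: int = 100) -> str:
    """Write one line per timestep, with coordinates quantized to integers.

    With scale=100, values are rounded to the nearest centimeter.
    """
    lines = []
    for t, points in enumerate(traj):
        coords = " ".join(
            f"({round(x * scale)},{round(y * scale)},{round(z * scale)})"
            for x, y, z in points
        )
        lines.append(f"t{t}: {coords}")
    return "\n".join(lines)


def decode(text: str, scale: int = 100) -> Trajectory:
    """Parse the text produced by encode() back into float coordinates."""
    traj: Trajectory = []
    for line in text.splitlines():
        _, body = line.split(": ", 1)
        points = []
        for token in body.split(" "):
            x, y, z = (int(v) for v in token.strip("()").split(","))
            points.append((x / scale, y / scale, z / scale))
        traj.append(points)
    return traj


# Two query points over two timesteps; only the first point moves.
example: Trajectory = [
    [(0.10, 0.20, 0.50), (0.00, 0.00, 1.00)],
    [(0.12, 0.20, 0.50), (0.00, 0.00, 1.00)],
]
text = encode(example)
print(text)
print(decode(text))
```

Writing timesteps in temporal order is what lets each coordinate be conditioned on the ones before it. The cost is quantization: anything finer than the chosen scale is lost in the round trip.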
Building MolmoMotion-1M: 3D trajectories from internet video
The core challenge was data. Large-scale video is everywhere, but videos with 3D point trajectories grounded to specific objects and paired with action descriptions don't exist at scale. Existing 3D-track datasets are small and domain-limited.
AllenAI built an automatic annotation pipeline that extracts object-grounded 3D trajectories from unconstrained video. Given a video and its action description, the pipeline:
- Grounds the moving object and samples query points on it
- Tracks dense 2D points on the object
- Lifts those tracks into a shared metric 3D frame
- Filters unreliable trajectories using object-level spatial and temporal consistency priors
- Clips the video to intervals where the object actually moves
Raw tracks from internet video are noisy—depth errors, tracking jitter, points drifting off the object. The filtering and smoothing steps are critical. Objects also spend much of a video sitting still, so temporal segmentation isolates the motion events that matter.
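Two of these steps are easy to sketch in isolation: lifting a 2D track into 3D, and clipping to the interval where the object moves. The code below is a simplified illustration, not AllenAI's pipeline. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a per-frame depth value for the tracked pixel, and a hypothetical displacement threshold `min_step`. It also stays in the camera frame; a camera pose would be needed to reach a shared world frame.

```python
import math
from typing import List, Optional, Tuple

Point3 = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with metric depth into camera-frame 3D (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def moving_interval(track: List[Point3], min_step: float = 0.01) -> Optional[Tuple[int, int]]:
    """Return (first, last) frame indices spanning every step of at least
    min_step meters, or None if the point never moves that much."""
    moving = [
        i for i in range(1, len(track))
        if math.dist(track[i - 1], track[i]) >= min_step
    ]
    if not moving:
        return None
    return moving[0] - 1, moving[-1]


# A tracked pixel with depth: still for two frames, moves, then stops.
track_2d = [
    (320.0, 240.0, 1.0),
    (320.0, 240.0, 1.0),
    (370.0, 240.0, 1.0),
    (420.0, 240.0, 1.0),
    (420.0, 240.0, 1.0),
]
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)  # made-up values

track_3d = [unproject(u, v, d, **intrinsics) for u, v, d in track_2d]
print(track_3d)
print(moving_interval(track_3d))
```

In practice the depth values would come from an estimator and carry the errors described above, which is why consistency filtering has to run before clipping.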
The result: MolmoMotion-1M, spanning 736 motion types and 5.6K distinct objects. To AllenAI's knowledge, it's the largest corpus of action-described, object-grounded 3D point trajectories assembled to date.
PointMotionBench: a real forecasting test
For evaluation, AllenAI built PointMotionBench: 2.7K clips covering 111 object categories and 61 motion types, from indoor manipulation to egocentric hand-object interaction to outdoor dynamic scenes. Each clip comes with human validation.
Models get the current observation, object query points, and an action description. They're evaluated on how accurately their predicted 3D trajectories match the object's actual future motion. It's a direct quantitative test of forecasting, not just whether a generated track looks plausible.
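The post doesn't spell out the scoring formula here, so the sketch below uses average displacement error, a common choice in trajectory forecasting, purely as an assumed example: the mean Euclidean distance between predicted and ground-truth positions over all points and timesteps. The function name and data layout are illustrative.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]
Trajectory = List[List[Point3]]  # trajectory[t][k] = (x, y, z) in meters


def average_displacement_error(pred: Trajectory, gt: Trajectory) -> float:
    """Mean 3D distance between matching points across all timesteps."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("pred and gt must have the same shape")
    dists = [
        math.dist(p, g)
        for pred_pts, gt_pts in zip(pred, gt)
        for p, g in zip(pred_pts, gt_pts)
    ]
    if not dists:
        raise ValueError("trajectories are empty")
    return sum(dists) / len(dists)


# One point, two timesteps: exact at t=0, off by 0.5 m in z at t=1.
pred = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
gt = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.5)]]
print(average_displacement_error(pred, gt))  # 0.25
```

A metric like this rewards being right about where the object goes. A perfectly smooth trajectory that heads the wrong way still scores badly.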
This matters. Generative models can produce smooth, realistic-looking motion that's still physically wrong. PointMotionBench measures whether you're predicting what will actually happen.
Forecasting performance: MolmoMotion vs. baselines
On PointMotionBench, MolmoMotion outperforms every method AllenAI compared it against: pixel-space video generators, parametric 3D methods, and a constant-velocity baseline.
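For a sense of the floor being cleared, a constant-velocity baseline can be written in a few lines. This is a generic sketch of the idea, assuming velocity is taken from the last two observed frames; the exact baseline configuration AllenAI used isn't described here.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]
Trajectory = List[List[Point3]]  # trajectory[t][k] = (x, y, z) in meters


def constant_velocity_forecast(history: Trajectory, horizon: int) -> Trajectory:
    """Extrapolate each point along its most recent per-frame velocity."""
    if len(history) < 2:
        raise ValueError("need at least two observed frames to estimate velocity")
    prev, last = history[-2], history[-1]
    forecast: Trajectory = []
    for step in range(1, horizon + 1):
        frame = [
            tuple(l[i] + step * (l[i] - p[i]) for i in range(3))
            for p, l in zip(prev, last)
        ]
        forecast.append(frame)
    return forecast


# One point moving 0.5 m per frame along x.
observed = [[(0.0, 0.0, 1.0)], [(0.5, 0.0, 1.0)]]
print(constant_velocity_forecast(observed, horizon=2))
# [[(1.0, 0.0, 1.0)], [(1.5, 0.0, 1.0)]]
```

This baseline ignores the language instruction entirely and cannot predict a stationary object starting to move, which is why it serves as a sanity check rather than a competitor.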
The model forecasts diverse object and scene motions. Examples from the blog:
- A lint roller moving back and forth on cloth
- A bowl sliding and rotating on a table
- A flamingo walking right while dipping its beak in water
- A car following a road as it turns
In each case, the predicted path follows the language instruction and stays close to ground truth. The forecasts aren't just smooth—they're accurate.
Downstream tasks: robotics and video generation
MolmoMotion's 3D trajectories aren't just for benchmarking. AllenAI tested them on two downstream applications.
Robotics planning
For robot manipulation, they used predicted trajectories as goals for a motion planner. The setup: give the robot a language instruction, forecast where the object will move, then plan a trajectory to achieve that motion.
The results show that MolmoMotion's forecasts help robots execute manipulation tasks more reliably. The 3D trajectories are in world coordinates, so they can be fed directly to planners without an intermediate conversion step.
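One plausible way to turn predicted points into a planner goal is to fit a rigid transform between the initial query points and their forecast final positions. The sketch below is an assumption-heavy illustration, not the interface AllenAI used: it supposes the object moves rigidly on a tabletop, so rotation is only about the vertical z-axis, and the function name `planar_goal_pose` is made up. A general 3D fit would use an SVD-based method such as the Kabsch algorithm instead.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def planar_goal_pose(start: List[Point3], goal: List[Point3]) -> Tuple[Point3, float]:
    """Least-squares centroid shift and yaw (radians) taking start to goal.

    Returns (translation of the centroid, rotation about the z-axis).
    """
    if len(start) != len(goal) or not start:
        raise ValueError("start and goal need the same nonzero number of points")
    n = len(start)
    cs = [sum(p[i] for p in start) / n for i in range(3)]
    cg = [sum(p[i] for p in goal) / n for i in range(3)]
    s_cross = 0.0
    s_dot = 0.0
    for a, b in zip(start, goal):
        ax, ay = a[0] - cs[0], a[1] - cs[1]
        bx, by = b[0] - cg[0], b[1] - cg[1]
        s_cross += ax * by - ay * bx  # accumulates sin(yaw)
        s_dot += ax * bx + ay * by    # accumulates cos(yaw)
    yaw = math.atan2(s_cross, s_dot)
    translation = tuple(cg[i] - cs[i] for i in range(3))
    return translation, yaw


# Four points on a bowl rim, forecast to rotate 90 degrees and slide.
start_pts = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
final_pts = [(0.5, 1.2, 0.0), (0.5, -0.8, 0.0), (-0.5, 0.2, 0.0), (1.5, 0.2, 0.0)]
shift, yaw = planar_goal_pose(start_pts, final_pts)
print(shift, math.degrees(yaw))
```

The output is a target pose in the same world frame the planner already uses, which is the practical payoff of forecasting in world coordinates.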
Video generation
For controllable video synthesis, they conditioned a video generator on MolmoMotion's predicted trajectories. The motion forecast acts as a sparse control signal that guides pixel-level rendering.
This is the inverse of using video models for forecasting. Instead of asking a pixel generator to predict motion, you predict motion explicitly with MolmoMotion, then use that to steer video synthesis. The division of labor is clean: 3D motion forecasting handles physical plausibility, and the video model handles rendering.
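How the trajectories reach the video generator isn't detailed here. One plausible route, sketched below purely as an assumption, is to project the forecast 3D points into the image plane so they become 2D tracks that a trajectory-conditioned generator could follow. The pinhole intrinsics and the name `project_track` are illustrative, and the points are assumed to already be in the rendering camera's frame.

```python
from typing import List, Optional, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_track(track: List[Point3], fx: float, fy: float,
                  cx: float, cy: float) -> List[Optional[Pixel]]:
    """Project camera-frame 3D points to pixels with a pinhole model.

    Points at or behind the camera plane (z <= 0) map to None.
    """
    pixels: List[Optional[Pixel]] = []
    for x, y, z in track:
        if z <= 0.0:
            pixels.append(None)
        else:
            pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


# A forecast point sliding right at 1 m depth, with made-up intrinsics.
forecast = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.2, 0.0, 1.0)]
print(project_track(forecast, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

Only a handful of points are needed per object, which is what makes the signal sparse: the generator is told where surfaces go and is left to fill in appearance.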
Limitations and what this enables
AllenAI calls out several limitations. The model works best on objects that maintain coherent surface structure—it struggles with extreme deformations or fluids. It's trained on internet video, which skews toward common objects and everyday actions. And like all forecasters, it has to deal with the fact that many instructions admit multiple plausible futures.
But the bigger story is what this unlocks. Motion forecasting as a first-class task with open models, datasets, and benchmarks means researchers can build on this without starting from scratch. The 3D point representation is simple enough to integrate into existing pipelines: robot planners already work in 3D, and video generators can consume trajectories as control signals.
The architecture choices are also telling. Using a vision-language model as the backbone means MolmoMotion benefits from all the grounding and reasoning capabilities Molmo 2 brings. It's not a specialized motion model bolted onto a frozen vision encoder—it's a VLM that learned to forecast.
The broader pattern: learning to look forward
MolmoMotion fits a broader shift in how we think about world models. For a long time, the dominant paradigm was generative video: learn to predict the next frame. But video prediction is a hard problem that bundles physics, appearance, and viewpoint together. You have to get all three right to produce a plausible frame.
Forecasting 3D motion is a narrower, more tractable problem. You don't have to render pixels—you just predict where points will move. And because the representation is explicit and interpretable, you can measure accuracy directly against ground truth.
This is the same argument that drove neural radiance fields and 3D Gaussian splatting: decouple geometry from appearance, predict structure explicitly, and let rendering be a separate step. MolmoMotion applies that logic to motion itself.
The fact that it works—and that it beats video generators at forecasting—suggests that world models don't have to be monolithic pixel predictors. You can carve off motion as its own problem, solve it well, and integrate the solution into downstream systems.
That's the real contribution here. Not just a better forecasting model, but a demonstration that motion can be learned, represented, and used as a standalone primitive. AllenAI is betting that this decomposition will unlock progress faster than trying to do everything in pixel space. The open release means we'll find out quickly whether they're right.