
Fei-Fei Li Lays Out Vision for Spatial Intelligence and Hybrid World Models

By Xinyue | Nov 25, 2025, 11:06 p.m. ET

To build truly general artificial intelligence, AI must move beyond the confines of text.

In a wide-ranging interview released on Monday, Stanford professor and World Labs founder Fei-Fei Li outlined an expansive vision for "spatial intelligence" and defended the role of explicit 3D world generation in future AI systems.

The discussion spanned foundational debates over world models, the technical architecture behind her startup's first product, and the limits of today's physics-aware AI.

Li, a leading figure in computer vision, argued that the next phase of artificial intelligence will be shaped less by language and more by machines' ability to perceive and reason about the physical world. Human cognition, she said, is fundamentally embodied and multimodal—a process that depends on vision, action, and interaction rather than text alone.

"Language captures only a subset of human knowledge," she said. "Much of what we know comes from interacting with the world, often without using language at all."

The remarks come as major AI labs push to build world models—systems that internalize 3D structure, physical dynamics and causal relationships. Li's approach diverges from that of deep-learning pioneer Yann LeCun, who has emphasized implicit, abstract representations that do not require models to explicitly generate scenes. She rejected the idea of a rivalry, saying the field ultimately needs both.

"We're intellectually on the same continuum," she said. "For a universal world model, implicit and explicit representations will both be indispensable."

World Labs' First Model Targets Explicit, Navigable 3D Worlds

Li's comments centered on Marble, World Labs' inaugural model, built on what her team calls a Real-Time Frame Model (RTFM). Unlike video-generation systems that output sequences of frames, Marble generates persistent, navigable 3D environments with object permanence and consistent geometry across viewpoints. It can take in text, images, video or rough spatial layouts and run in real time on a single Nvidia H100 GPU.

Maintaining internal coherence, Li said, required extensive engineering. "In early frame-based generation models, when you moved the camera, object consistency would collapse," she said. Marble's behavior remains largely statistical, not physics-driven: modern generative models, she noted, still imitate patterns in training data rather than compute formal forces.

"I don't think AI today is yet capable of abstracting the laws of physics," she said. "For Einstein-style abstraction, we haven't seen evidence that Transformers can do that." She nonetheless expects progress in physical reasoning within five years.

The Search for a 'Universal Task Function' in Vision

Li identified the absence of a unifying objective for spatial AI as a major research bottleneck. The success of language models was driven by next-token prediction, where training and inference are perfectly aligned. No equivalent exists for vision.
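The alignment Li points to can be made concrete with a toy sketch (illustrative only — the vocabulary, distributions, and function names below are invented for this article, not drawn from any lab's code): the training loss asks the model to predict each token from its prefix, which is exactly the computation performed at generation time.

```python
import math

def cross_entropy_next_token(probs, tokens):
    """Average negative log-likelihood of each token given its prefix.

    probs[i] is a hypothetical model's distribution over the vocabulary
    after seeing tokens[:i]; tokens[i] is the target next token. This is
    the next-token objective: train-time prediction and inference-time
    generation are the same step.
    """
    nll = 0.0
    for dist, target in zip(probs, tokens):
        nll -= math.log(dist[target])
    return nll / len(tokens)

# A 3-token vocabulary and a toy model that always favors token 0.
model_output = [
    {0: 0.8, 1: 0.1, 2: 0.1},
    {0: 0.8, 1: 0.1, 2: 0.1},
]
sequence = [0, 1]  # the actual next tokens at each position
loss = cross_entropy_next_token(model_output, sequence)
print(round(loss, 4))  # lower loss means better next-token predictions
```

No comparably clean objective exists for vision, which is the bottleneck Li describes: there is no single "next thing" for a spatial model to predict that captures 3D structure the way the next token captures language.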

"Next-frame prediction is powerful, because the world has continuity," she said. "But it collapses a 3D world into 2D frames. And animals don't do perfect 3D reconstruction—yet they navigate extremely well."

A universal objective for spatial learning, she said, remains an open question.

Li's long-term vision is a "Neural Spatial Engine" that merges generative models with traditional physics engines used in game development. Physics engines compute collisions and rigid-body dynamics; generative models excel at producing rich content. She expects the two to converge.
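The division of labor Li describes can be sketched in a few lines (a minimal toy example, not World Labs' or any real engine's code): a classical physics engine advances state each tick by integrating forces into velocity and position, then resolving collisions — here, a ball bouncing on a ground plane.

```python
# Assumed, simplified constants for illustration.
GRAVITY = -9.8     # m/s^2
RESTITUTION = 0.5  # fraction of speed retained after a bounce

def step(pos, vel, dt=0.01):
    """One semi-implicit Euler tick with a ground plane at y = 0."""
    vel += GRAVITY * dt   # integrate force into velocity
    pos += vel * dt       # integrate velocity into position
    if pos < 0.0:         # resolve collision with the ground
        pos = 0.0
        vel = -vel * RESTITUTION
    return pos, vel

pos, vel = 1.0, 0.0       # drop from 1 m at rest
for _ in range(1000):     # simulate 10 seconds
    pos, vel = step(pos, vel)
print(pos >= 0.0)         # the ball never tunnels through the floor
```

This kind of engine computes dynamics exactly but generates no content; generative models do the reverse. The "Neural Spatial Engine" Li envisions would combine the two.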

"Ultimately, physics engines and world-generation models will merge," she said. "We're still at the beginning."

Such systems could make the creation of interactive 3D worlds inexpensive and accessible, enabling what she described as a "multiverse" of low-cost digital environments for education, entertainment, simulation, and scientific research.

Li said world models operating in robotics and other embodied settings must move beyond static training regimes. "Continuous learning is essential," she said, pointing to a future mix of context-based memory, online learning and algorithmic advances.

She emphasized that spatial intelligence is central to the broader quest for more general AI. "You can't put out a fire with language alone," she said. "A lot of human intelligence goes beyond symbols."

Li closed on a broadly optimistic note, predicting meaningful advances within the next half-decade, despite persistent uncertainty. "Some advances have surprised me by happening faster, and others slower," she said. "But five years is a reasonable timeframe."
