4D-LRM:
Large Space-Time Reconstruction Model From and To Any View at Any Time

Adobe Research     University of Michigan
UNC Chapel Hill     University of Virginia     Oregon State University

TL;DR

We introduce Large Space-Time Reconstruction Model (4D-LRM), a data-driven 4D reconstruction model that takes sparse input views at any time and renders arbitrary novel view-time combinations.

Abstract

Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches (e.g., optimization-based, geometry-based, or generative methods) that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs a 24-frame sequence in one forward pass in under 1.5 seconds on a single A100 GPU.

Key Insights

  1. 4D-LRM adopts a clean and minimal Transformer design to reconstruct dynamic objects from sparse, posed views across arbitrary times and viewpoints.
  2. 4D-LRM unifies space and time by predicting 4D Gaussian primitives directly from multi-view tokens (see the time-slicing sketch after this list).
  3. 4D-LRM scales effectively with data and model size, showing strong generalization and efficient inference.
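To make the unified space-time representation in insight (2) concrete, the sketch below shows how an anisotropic 4D Gaussian over (x, y, z, t) can be queried at a single timestamp: conditioning on t yields a 3D Gaussian whose mean shifts along the space-time correlation, while the temporal marginal modulates opacity so primitives fade in and out smoothly. This follows the standard unified 4DGS slicing and is only a sketch; the exact renderer used by 4D-LRM may differ in implementation details.

```python
import torch

def slice_4d_gaussian(mu, cov, opacity, t):
    """Condition a 4D Gaussian on a query time t (sketch of the standard 4DGS slicing).

    mu:      (4,) space-time mean (x, y, z, t)
    cov:     (4, 4) space-time covariance
    opacity: scalar base opacity
    t:       query timestamp (tensor scalar)
    Returns the 3D mean, 3D covariance, and time-modulated opacity to splat.
    """
    mu_x, mu_t = mu[:3], mu[3]
    cov_xx, cov_xt, cov_tt = cov[:3, :3], cov[:3, 3], cov[3, 3]
    # Conditional 3D Gaussian: the mean drifts along the space-time correlation as t moves.
    mean_3d = mu_x + cov_xt * (t - mu_t) / cov_tt
    cov_3d = cov_xx - torch.outer(cov_xt, cov_xt) / cov_tt
    # Temporal marginal: opacity decays as t moves away from the Gaussian's temporal center.
    alpha_t = opacity * torch.exp(-0.5 * (t - mu_t) ** 2 / cov_tt)
    return mean_3d, cov_3d, alpha_t
```

Because t is continuous, renderings can be queried between training frames, which is what "in principle, infinite frame rate" refers to.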

Method

4D-LRM adopts a unified treatment of space and time, representing a dynamic object as a cloud of anisotropic 4D Gaussians. We train a simple Transformer to regress 4D Gaussian primitives from a set of images with camera poses and timestamps. Each input image is tokenized by patchifying the temporally posed frames. The resulting multi-view image tokens are concatenated in temporal order and passed through a series of Transformer blocks. Optionally, a set of N learnable free Gaussian tokens is appended to the image tokens for greater generative flexibility.
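A minimal sketch of this pipeline is given below, assuming ray-plus-timestamp conditioning concatenated to the RGB channels, a torch.nn.TransformerEncoder backbone, and 20 parameters per Gaussian; the class and argument names are illustrative assumptions, not the released implementation. The optional free Gaussian tokens would simply be learnable embeddings concatenated to the token sequence before the backbone.

```python
import torch
import torch.nn as nn

class FourDLRMSketch(nn.Module):
    """Sketch: patchify posed, timestamped frames, attend over all views and times jointly,
    and regress per-pixel 4D Gaussian parameters."""

    def __init__(self, patch=16, dim=768, layers=12, heads=12, gs_dim=20):
        super().__init__()
        in_ch = 3 + 6 + 1                                            # RGB + per-pixel ray (origin, direction) + timestamp
        self.patch, self.gs_dim = patch, gs_dim
        self.tokenize = nn.Linear(in_ch * patch * patch, dim)        # patch embedding
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.to_gaussians = nn.Linear(dim, gs_dim * patch * patch)   # per-pixel 4D Gaussian parameters

    def forward(self, images, rays, times):
        # images: (B, V, 3, H, W); rays: (B, V, 6, H, W); times: (B, V, 1, H, W)
        B, V, _, H, W = images.shape
        p = self.patch
        x = torch.cat([images, rays, times], dim=2)                  # (B, V, 10, H, W)
        x = x.reshape(B, V, -1, H // p, p, W // p, p)
        x = x.permute(0, 1, 3, 5, 2, 4, 6).reshape(B, V * (H // p) * (W // p), -1)
        tokens = self.backbone(self.tokenize(x))                     # one sequence: space and time attended jointly
        out = self.to_gaussians(tokens)                              # (B, num_tokens, gs_dim * p * p)
        return out.reshape(B, -1, self.gs_dim)                       # one 4D Gaussian per input pixel

# Usage: four posed input frames at arbitrary timestamps -> (1, 4*256*256, 20) Gaussian parameters.
model = FourDLRMSketch()
gaussians = model(torch.randn(1, 4, 3, 256, 256),
                  torch.randn(1, 4, 6, 256, 256),
                  torch.rand(1, 4, 1, 256, 256))
```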

4D Point Cloud Visualization

Use the viser-based interactive viewer below to explore the 4D point clouds. Click and drag to navigate within the scene.

4D-LRM Results (Consistent4D)

256 x 256 resolution, 24 frames x 2, alternating canonical views as input.
[Four examples. For each: input images, an interactive Gaussian viewer (beta), and rendered front, back, left, right, and turntable views.]

4D-LRM Results (Objaverse Test)

256 x 256 resolution, 24 frames, alternating canonical views as input.

[Six examples, each with rendered front, back, left, right, and turntable views.]

Scaling Behaviors

Training-Time Scaling

We observe that increasing the number of target views slightly improves convergence speed, though at the cost of longer iteration time. Introducing free Gaussians from scratch does not significantly impact reconstruction quality but substantially slows down training. We also find that a 4DGS representation built from 3DGS with a HexPlane-style deformation encoding is less expressive than the unified space-time formulation, which informed our final design choice. Finally, enforcing strict temporal alignment degrades performance, whereas pixel alignment improves reconstruction quality; this supports our earlier observation that 4D-LRM redistributes Gaussians to unseen time intervals to handle sparse temporal supervision. The compared configurations are:
  1. 4D-LRM-Base: Transformer with a hidden dimension of 768, 12 layers, and 12 attention heads, trained with 12 random input views and 12 random target views. No free Gaussians.
  2. #Target x 2: Trained with 12 random input views and 24 random target views.
  3. w/ HexPlane: Replaces the unified space-time representation with an alternative 4DGS representation that uses a decomposed neural voxel encoding inspired by HexPlane.
  4. w/ Temp Align: Similar to the idea of pixel-aligned Gaussians, we fix μ_t to the input frame time, reducing the parameterization to dim_4DGS = 19 (see the parameter-layout sketch after this list).
  5. w/ Free GS: Trained with N = 1024 free Gaussian tokens from scratch.
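As a concrete reading of the "w/ Temp Align" variant, the sketch below spells out one possible per-Gaussian parameter layout: fixing μ_t to the input frame's timestamp removes one predicted dimension, giving 19 instead of 20 under this counting. The exact attribute split (RGB color, two unit quaternions parameterizing the 4D rotation) is an assumption for illustration, not the paper's official breakdown.

```python
from dataclasses import dataclass
import torch

@dataclass
class Gaussian4DLayout:
    """Hypothetical per-Gaussian attribute widths; the split is illustrative."""
    xyz: int = 3        # spatial mean
    t: int = 1          # temporal mean (dropped when mu_t is fixed to the frame time)
    scale: int = 4      # 3 spatial scales + 1 temporal scale
    rot_left: int = 4   # left quaternion of the 4D rotation
    rot_right: int = 4  # right quaternion of the 4D rotation
    opacity: int = 1
    rgb: int = 3

    def dim(self, temp_align: bool = False) -> int:
        d = (self.xyz + self.t + self.scale + self.rot_left
             + self.rot_right + self.opacity + self.rgb)
        return d - self.t if temp_align else d

layout = Gaussian4DLayout()
assert layout.dim(temp_align=True) == 19          # matches dim_4DGS = 19 above
raw = torch.randn(1024, layout.dim())             # e.g., 20-dim outputs, one per pixel token
```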

Inference-Time Scaling

We analyze inference-time scaling as the number of input views varies. In terms of PSNR and SSIM, performance improves with more input views, peaks at 48 views, and then declines slightly. We attribute this to two factors: (1) excessive Gaussians may overcrowd the 4D representation, reducing its quality, and (2) the Transformer struggles with very long input sequences. This observation suggests a promising future direction: designing 4D-LRM variants that handle longer contexts with hybrid models (e.g., Long-LRM) and incorporate test-time training (e.g., LaCT).

BibTeX

@article{ma20254dlrm,
  title={4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time}, 
  author={Ziqiao Ma and Xuweiyi Chen and Shoubin Yu and Sai Bi and Kai Zhang and Ziwen Chen and Sihan Xu and Jianing Yang and Zexiang Xu and Kalyan Sunkavalli and Mohit Bansal and Joyce Chai and Hao Tan},
  year={2025},
  journal={arXiv:2506.18890},
}