Spatial Intelligence Stack

Pixels in. World state out.

We capture egocentric and overhead video in real environments and lift every frame into structured world state: dense depth, panoptic segmentation, and hand pose, all in real time on commodity hardware.

Focus Areas

Spatial Intelligence

Real-time 3D scene understanding, spatial reasoning, and world modeling for embodied agents.

Vision-Language Models

Multimodal architectures connecting visual perception with language understanding and generation.

Embodied AI

Training agents that perceive and act in the physical world — from robotics to AR/VR systems.

World Index — Live Samples

Below are raw outputs from the stack, captured from egocentric and overhead viewpoints in unstructured everyday environments.
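As a concrete sketch of what "structured world state" per frame might look like, here is a minimal Python record with one field per output stream. The names, shapes, and the `lift_frame` placeholder are illustrative assumptions, not the stack's actual schema or API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class WorldState:
    """Hypothetical per-frame world state; shapes are assumptions."""
    depth: np.ndarray           # (H, W) float32, metric depth in meters
    segmentation: np.ndarray    # (H, W) int32, panoptic region ids
    hand_keypoints: np.ndarray  # (num_hands, 21, 3) float32: x, y, confidence


def lift_frame(rgb: np.ndarray) -> WorldState:
    """Stand-in for the real perception models: returns empty outputs
    with the expected shapes for an (H, W, 3) uint8 frame."""
    h, w = rgb.shape[:2]
    return WorldState(
        depth=np.zeros((h, w), dtype=np.float32),
        segmentation=np.zeros((h, w), dtype=np.int32),
        hand_keypoints=np.zeros((0, 21, 3), dtype=np.float32),
    )


frame = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy RGB frame
state = lift_frame(frame)
```

Keeping the three streams in one record means downstream consumers (agents, AR renderers) can subscribe to a single per-frame object rather than synchronizing separate depth, segmentation, and hand-pose feeds.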

[Image: depth map of an egocentric scene with objects and a hand]
Depth · Egocentric view · monocular depth estimation

[Image: depth map of the same scene from a different viewpoint]
Depth · Novel viewpoint · consistent geometry

[Image: segmented kitchen scene with hand keypoints overlaid]
Segmented · Open-vocabulary regions + hand keypoints

[Image: segmented overhead view of a cluttered cooking workspace]
Segmented · Cluttered scene · 28 detected regions

Latest Research
