Pixels in. World state out.
We capture egocentric and overhead video in real environments and lift every frame into structured world state — dense depth, panoptic segmentation, hand pose — in real time, on commodity hardware.
Focus Areas
Spatial Intelligence
Real-time 3D scene understanding, spatial reasoning, and world modeling for embodied agents.
Vision-Language Models
Multimodal architectures connecting visual perception with language understanding and generation.
Embodied AI
Training agents that perceive and act in the physical world — from robotics to AR/VR systems.
World Index — Live Samples
The same stack described above lifts ordinary RGB video into dense depth, panoptic segmentation, and hand pose in real time on commodity hardware. Below: raw outputs captured from egocentric and overhead viewpoints in unstructured everyday environments.
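To make "structured world state" concrete, here is a minimal sketch of what a per-frame record could look like. The class and field names (FrameWorldState, HandPose, lift_frame) and the array shapes are illustrative assumptions, not our actual schema or API.

```python
# Hypothetical per-frame world-state record; names, shapes, and units are assumptions.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class HandPose:
    """Single detected hand, assumed as a 21-keypoint layout in camera coordinates."""
    keypoints_xyz: np.ndarray   # (21, 3) float32, metres in the camera frame
    handedness: str             # "left" or "right"
    confidence: float


@dataclass
class FrameWorldState:
    """Structured world state lifted from one RGB frame."""
    timestamp_s: float
    camera: str                          # "egocentric" or "overhead"
    depth_m: np.ndarray                  # (H, W) float32 dense depth in metres
    panoptic_ids: np.ndarray             # (H, W) int32 per-pixel segment id
    panoptic_labels: dict[int, str]      # segment id -> class name
    hands: list[HandPose] = field(default_factory=list)


def lift_frame(rgb: np.ndarray, timestamp_s: float, camera: str) -> FrameWorldState:
    """Placeholder for the real-time lifting step; the model calls are stubbed out."""
    h, w, _ = rgb.shape
    return FrameWorldState(
        timestamp_s=timestamp_s,
        camera=camera,
        depth_m=np.zeros((h, w), dtype=np.float32),     # would come from a depth model
        panoptic_ids=np.zeros((h, w), dtype=np.int32),  # would come from a panoptic model
        panoptic_labels={0: "unlabeled"},
        hands=[],                                       # would come from a hand-pose model
    )
```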
Latest Research
Spatial Reasoning in Vision Transformers: A Survey and New Directions
We survey recent advances in spatial reasoning capabilities of vision transformers and propose a new benchmark for evaluating 3D scene understanding from 2D inputs.
Building a Real-Time World Index: From Pixels to Semantic Maps
How we approach the problem of building structured, queryable representations of physical environments from streaming video — our architecture and lessons learned.
Open-Vocabulary Object Detection on the Edge: Practical Tricks That Work
Notes from our experience deploying open-vocabulary detection models on edge devices — what works, what doesn't, and the trade-offs we've made.