Building a Real-Time World Index: From Pixels to Semantic Maps
How we build structured, queryable representations of physical environments from streaming video — our architecture and the lessons we learned along the way.
Imagine pointing a camera at a room and instantly getting a structured, searchable representation of everything in it — objects, their relationships, materials, affordances. That’s the vision behind our World Index project.
The Problem
Traditional approaches to scene understanding operate on single frames: detect objects, segment regions, classify materials. But the real world is persistent and continuous. When you walk through a building, you’re not processing independent snapshots — you’re building a mental model that accumulates information over time.
We want our systems to do the same.
Architecture Overview
Our pipeline has three stages:
1. Perception Layer
The perception layer processes each frame through a lightweight vision backbone (we use a modified EfficientViT) to extract:
- Object detections with 6-DoF pose estimates
- Semantic segments with material and surface properties
- Depth estimates from monocular cues
The key constraint is latency: everything must run at 30 fps or better on edge hardware. We achieve this through aggressive INT8 quantization and a multi-scale feature pyramid that shares backbone computation across all three tasks.
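A minimal sketch of the share-the-backbone idea: run the (expensive) feature extractor once per frame, then feed the same features to each task head. The names `FrameOutput`, `run_perception`, and the head keys are illustrative, not our actual API.

```python
from dataclasses import dataclass

@dataclass
class FrameOutput:
    """What the perception layer emits for one frame."""
    detections: list  # (label, confidence, 6-DoF pose) tuples
    segments: list    # (mask_id, material, surface) tuples
    depth: object     # per-pixel depth estimate

def run_perception(frame, backbone, heads):
    """Run the shared backbone once, then each lightweight task head
    on the same feature pyramid — the cost of the backbone is paid
    once instead of three times."""
    features = backbone(frame)
    return FrameOutput(
        detections=heads["detect"](features),
        segments=heads["segment"](features),
        depth=heads["depth"](features),
    )
```

At 30 fps the whole pipeline has roughly 33 ms per frame; sharing the backbone is what keeps the three heads inside that budget.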
2. Fusion Engine
Raw per-frame detections are noisy and incomplete. The fusion engine maintains a persistent 3D representation by:
- Tracking objects across frames using appearance + geometric matching
- Accumulating observations — each new view refines the estimate of object shape, pose, and semantics
- Resolving conflicts — when two frames disagree, we use a confidence-weighted update rule
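The post doesn't spell out the exact update rule, but a minimal confidence-weighted blend looks like this — each observation pulls the stored estimate toward itself in proportion to its confidence:

```python
def fuse(prev_value, prev_conf, obs_value, obs_conf):
    """Confidence-weighted update: blend the stored estimate with a
    new observation, weighting each by its confidence. A sketch of
    the general idea, not the production rule."""
    total = prev_conf + obs_conf
    value = (prev_conf * prev_value + obs_conf * obs_value) / total
    # Returning the summed confidence means repeated consistent
    # observations make the estimate progressively harder to move.
    return value, total
```

With equal confidences the result is a plain average; a high-confidence observation against a low-confidence prior mostly overwrites it, which is the conflict-resolution behaviour described above.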
We use an H3-based spatial index (resolution 9) for efficient spatial queries. This lets us answer “what’s near coordinate X?” with a constant-time cell lookup, regardless of scene size.
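The core trick is bucketing objects by spatial cell so a proximity query becomes a hash lookup. Here is the idea with a uniform square grid as a stand-in for H3’s hexagonal cells (the real system uses H3; the cell size and class names here are illustrative):

```python
from collections import defaultdict

CELL = 0.5  # cell edge length in metres (stand-in for an H3 resolution)

class SpatialIndex:
    """Hash-grid spatial index: objects are bucketed by cell key, so
    "what's near X?" is a dictionary lookup over the query cell and
    its 8 neighbours — cost independent of total scene size."""

    def __init__(self):
        self.cells = defaultdict(set)

    def _key(self, x, y):
        return (int(x // CELL), int(y // CELL))

    def insert(self, obj_id, x, y):
        self.cells[self._key(x, y)].add(obj_id)

    def near(self, x, y):
        """All objects in the query cell and its 8 neighbours."""
        cx, cy = self._key(x, y)
        found = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                found |= self.cells.get((cx + dx, cy + dy), set())
        return found
```

H3 gives the same constant-time property with hexagonal cells and a `grid_disk`-style neighbour query instead of the 3×3 loop.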
3. Semantic Layer
The raw geometric map gets enriched with semantic relationships:
```yaml
kitchen_counter_01:
  type: surface
  material: granite
  objects_on: [toaster_01, cutting_board_02, mug_07]
  adjacent_to: [sink_01, stove_01]
  affordance: [place_object, prepare_food]
```
These relationships are inferred by a small graph neural network that operates on the spatial graph structure, not the raw pixels. This makes it fast (< 1ms per update) and robust to visual noise.
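Operating on the graph rather than the pixels means each update touches only a node and its neighbours. A toy version of one message-passing round (mean aggregation, no learned weights — the real network learns its aggregation, this just shows the graph-structured data flow):

```python
def message_pass(features, edges):
    """One round of mean-aggregation message passing: each node's new
    feature vector is the average of its own features and those of
    its spatial neighbours. Node names and features are illustrative."""
    neighbours = {node: [] for node in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbours[node]]
        # Average each feature dimension across the node and its neighbours.
        updated[node] = [sum(vals) / len(group) for vals in zip(*group)]
    return updated
```

Because the graph for a room has tens of nodes rather than millions of pixels, a few such rounds comfortably fit in a sub-millisecond update.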
Lessons Learned
1. Latency budgets are everything. We spent two months optimizing a beautiful but slow SLAM-based fusion engine before realizing it would never hit 30fps on our target hardware. Starting with a latency budget and working backward would have saved significant time.
2. Don’t fight the noise — model it. Early versions tried to produce clean, definitive scene representations. The system works much better when we explicitly model uncertainty at every level and let downstream consumers decide their confidence threshold.
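Concretely, “let downstream consumers decide” means the index stores per-object confidence and the query interface takes a threshold. A minimal sketch (the `query` helper and record shape are hypothetical, not our actual API):

```python
def query(index, predicate, min_conf=0.7):
    """Return objects matching `predicate` whose stored confidence
    meets the caller's threshold — the index never decides what is
    'real', the consumer does."""
    return [obj for obj in index
            if obj["conf"] >= min_conf and predicate(obj)]
```

A navigation planner might query with `min_conf=0.9` to avoid phantom obstacles, while a “where are my keys?” search might happily accept `min_conf=0.3`.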
3. Evaluation is the bottleneck. We can train and iterate on models quickly. What slows us down is evaluating whether the world index is actually useful — which requires task-specific benchmarks that don’t exist yet.
Current Status
The system runs in real time on an NVIDIA Jetson Orin and produces reasonable scene graphs for indoor environments. We’re currently working on:
- Outdoor scene support (different depth ranges, lighting conditions)
- Multi-agent fusion (combining views from multiple cameras)
- Natural language queries over the index (“where did I last see my keys?”)
We plan to open-source the perception layer and fusion engine once they’re stable enough for external use.
Questions or want to test it? Email research@clawvision.org.