Building a Real-Time World Index: From Pixels to Semantic Maps
How we build structured, queryable representations of physical environments from streaming video — our architecture and the lessons we learned along the way.
Imagine pointing a camera at a room and instantly getting a structured, searchable representation of everything in it — objects, their relationships, materials, affordances. That’s the vision behind our World Index project.
The Problem
Traditional approaches to scene understanding operate on single frames: detect objects, segment regions, classify materials. But the real world is persistent and continuous. When you walk through a building, you’re not processing independent snapshots — you’re building a mental model that accumulates information over time.
We want our systems to do the same.
Architecture Overview
Our pipeline has three stages:
1. Perception Layer
The perception layer processes each frame through a lightweight vision backbone (we use a modified EfficientViT) to extract:
- Object detections with 6-DoF pose estimates
- Semantic segments with material and surface properties
- Depth estimates from monocular cues
The key constraint is latency: everything must run at 30 fps or better on edge hardware. We achieve this through aggressive INT8 quantization and a multi-scale feature pyramid that shares backbone computation across all three tasks.
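A minimal sketch of the share-the-backbone idea: run the (expensive) feature extractor once per frame, then feed the same features to each task head. The names `FrameOutput`, `run_perception`, and the head keys are illustrative, not our actual API.

```python
from dataclasses import dataclass

@dataclass
class FrameOutput:
    """What the perception layer emits for one frame."""
    detections: list  # (label, confidence, 6-DoF pose) tuples
    segments: list    # (mask_id, material, surface) tuples
    depth: object     # per-pixel depth estimate

def run_perception(frame, backbone, heads):
    """Run the shared backbone once, then each lightweight task head
    on the same feature pyramid — the cost of the backbone is paid
    once instead of three times."""
    features = backbone(frame)
    return FrameOutput(
        detections=heads["detect"](features),
        segments=heads["segment"](features),
        depth=heads["depth"](features),
    )
```

At 30 fps the whole pipeline has roughly 33 ms per frame; sharing the backbone is what keeps the three heads inside that budget.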
2. Fusion Engine
Raw per-frame detections are noisy and incomplete. The fusion engine maintains a persistent 3D representation by:
- Tracking objects across frames using appearance + geometric matching
- Accumulating observations — each new view refines the estimate of object shape, pose, and semantics
- Resolving conflicts — when two frames disagree, we use a confidence-weighted update rule
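The post doesn't spell out the exact update rule, but a minimal confidence-weighted blend looks like this — each observation pulls the stored estimate toward itself in proportion to its confidence:

```python
def fuse(prev_value, prev_conf, obs_value, obs_conf):
    """Confidence-weighted update: blend the stored estimate with a
    new observation, weighting each by its confidence. A sketch of
    the general idea, not the production rule."""
    total = prev_conf + obs_conf
    value = (prev_conf * prev_value + obs_conf * obs_value) / total
    # Returning the summed confidence means repeated consistent
    # observations make the estimate progressively harder to move.
    return value, total
```

With equal confidences the result is a plain average; a high-confidence observation against a low-confidence prior mostly overwrites it, which is the conflict-resolution behaviour described above.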
We use an H3-based spatial index (resolution 9) for efficient spatial queries. This lets us answer “what’s near coordinate X?” with a constant-time cell lookup, regardless of scene size.
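The core trick is bucketing objects by spatial cell so a proximity query becomes a hash lookup. Here is the idea with a uniform square grid as a stand-in for H3’s hexagonal cells (the real system uses H3; the cell size and class names here are illustrative):

```python
from collections import defaultdict

CELL = 0.5  # cell edge length in metres (stand-in for an H3 resolution)

class SpatialIndex:
    """Hash-grid spatial index: objects are bucketed by cell key, so
    "what's near X?" is a dictionary lookup over the query cell and
    its 8 neighbours — cost independent of total scene size."""

    def __init__(self):
        self.cells = defaultdict(set)

    def _key(self, x, y):
        return (int(x // CELL), int(y // CELL))

    def insert(self, obj_id, x, y):
        self.cells[self._key(x, y)].add(obj_id)

    def near(self, x, y):
        """All objects in the query cell and its 8 neighbours."""
        cx, cy = self._key(x, y)
        found = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                found |= self.cells.get((cx + dx, cy + dy), set())
        return found
```

H3 gives the same constant-time property with hexagonal cells and a `grid_disk`-style neighbour query instead of the 3×3 loop.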
3. Semantic Layer
The raw geometric map gets enriched with semantic relationships:
```yaml
kitchen_counter_01:
  type: surface
  material: granite
  objects_on: [toaster_01, cutting_board_02, mug_07]
  adjacent_to: [sink_01, stove_01]
  affordance: [place_object, prepare_food]
```
These relationships are inferred by a small graph neural network that operates on the spatial graph structure, not the raw pixels. This makes it fast (< 1ms per update) and robust to visual noise.
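Operating on the graph rather than the pixels means each update touches only a node and its neighbours. A toy version of one message-passing round (mean aggregation, no learned weights — the real network learns its aggregation, this just shows the graph-structured data flow):

```python
def message_pass(features, edges):
    """One round of mean-aggregation message passing: each node's new
    feature vector is the average of its own features and those of
    its spatial neighbours. Node names and features are illustrative."""
    neighbours = {node: [] for node in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbours[node]]
        # Average each feature dimension across the node and its neighbours.
        updated[node] = [sum(vals) / len(group) for vals in zip(*group)]
    return updated
```

Because the graph for a room has tens of nodes rather than millions of pixels, a few such rounds comfortably fit in a sub-millisecond update.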
Lessons Learned
1. Latency budgets are everything. We spent two months optimizing a beautiful but slow SLAM-based fusion engine before realizing it would never hit 30fps on our target hardware. Starting with a latency budget and working backward would have saved significant time.
2. Don’t fight the noise — model it. Early versions tried to produce clean, definitive scene representations. The system works much better when we explicitly model uncertainty at every level and let downstream consumers decide their confidence threshold.
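Concretely, “let downstream consumers decide” means the index stores per-object confidence and the query interface takes a threshold. A minimal sketch (the `query` helper and record shape are hypothetical, not our actual API):

```python
def query(index, predicate, min_conf=0.7):
    """Return objects matching `predicate` whose stored confidence
    meets the caller's threshold — the index never decides what is
    'real', the consumer does."""
    return [obj for obj in index
            if obj["conf"] >= min_conf and predicate(obj)]
```

A navigation planner might query with `min_conf=0.9` to avoid phantom obstacles, while a “where are my keys?” search might happily accept `min_conf=0.3`.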
3. Evaluation is the bottleneck. We can train and iterate on models quickly. What slows us down is evaluating whether the world index is actually useful — which requires task-specific benchmarks that don’t exist yet.
Current Status
The system runs in real time on an NVIDIA Jetson Orin and produces reasonable scene graphs for indoor environments. We’re currently working on:
- Outdoor scene support (different depth ranges, lighting conditions)
- Multi-agent fusion (combining views from multiple cameras)
- Natural language queries over the index (“where did I last see my keys?”)
We plan to open-source the perception layer and fusion engine once they’re stable enough for external use.
Questions or want to test it? Email research@clawvision.org.