Open-vocabulary object detection — detecting objects from free-text descriptions rather than a fixed class set — is one of the most exciting recent advances in computer vision. Models like OWL-ViT and Grounding DINO can detect virtually anything you describe in natural language.

But deploying them on edge hardware? That’s a different story.

The Challenge

Our target platform is an embedded GPU with roughly 1/20th the compute of a datacenter GPU. A standard Grounding DINO model runs at ~2fps on this hardware. We need 30fps. That’s a 15x gap.

Here’s what we’ve tried and learned.

What Works

1. Text Encoding is a One-Time Cost

The biggest insight: in most real-world applications, the vocabulary doesn’t change every frame. If you’re looking for “person”, “car”, and “bicycle” in a traffic camera feed, you can pre-compute and cache the text embeddings.

This alone removes ~40% of the per-frame compute cost. The remaining model is essentially a standard detection backbone with cross-attention to cached text features.
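The caching idea can be sketched in a few lines. This is a minimal illustration, not the actual pipeline code: `encode_text` stands in for the model's text encoder (e.g. the BERT branch in Grounding DINO), and the class name is ours.

```python
import numpy as np

class TextEmbeddingCache:
    """Pre-compute text embeddings once at startup, reuse them every frame.

    `encode_text` is a hypothetical stand-in for the model's text encoder;
    it runs exactly once per prompt, at construction time.
    """

    def __init__(self, encode_text, vocabulary):
        # One text-encoder forward pass per prompt -- startup cost only.
        self.embeddings = {p: encode_text(p) for p in vocabulary}

    def stacked(self):
        # Per-frame cost is just stacking cached vectors for cross-attention.
        return np.stack(list(self.embeddings.values()))
```

At 30fps with a fixed vocabulary, the encoder runs once instead of 30 times per second, which is where the ~40% saving comes from.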

2. Resolution Matters More Than Model Size

Counter-intuitively, we found that running a larger model at lower resolution often beats a smaller model at higher resolution. The reason: the text-visual alignment learned during pre-training is more robust in larger models, so it degrades less when you reduce input size.

Config               mAP@50   FPS (edge)
Small model, 640px   38.2     24
Small model, 416px   31.7     42
Base model, 416px    41.3     28
Base model, 320px    37.9     45


We run the base model at 416px — good accuracy at practical frame rates.

3. Quantization-Aware Fine-Tuning

Standard post-training quantization (PTQ) to INT8 drops accuracy significantly for these models — the cross-attention layers are particularly sensitive. Quantization-aware training (QAT) with a short fine-tuning phase recovers most of the accuracy:

  • FP16 baseline: 44.1 mAP
  • PTQ INT8: 37.3 mAP (-6.8)
  • QAT INT8: 42.8 mAP (-1.3)

The QAT phase only takes ~2 hours on a single GPU with 50k images.
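The core of QAT is passing weights and activations through a simulated quantize-dequantize op ("fake quantization") in the forward pass, so the network learns to tolerate INT8 rounding; the backward pass treats it as identity (straight-through estimator). A minimal sketch of that op, independent of any framework's QAT tooling:

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Simulated symmetric per-tensor quantize-dequantize, as used in QAT.

    The forward pass rounds values to the INT8 grid and maps them back to
    float, so downstream layers see realistic quantization noise during
    fine-tuning.
    """
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    scale = max(np.abs(x).max() / qmax, 1e-8)      # symmetric scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                               # dequantize back to float
```

In practice we rely on the training framework's QAT support rather than hand-rolling this, but the simulated op is what makes the short fine-tuning phase effective on the sensitive cross-attention layers.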

What Doesn’t Work (Yet)

Knowledge distillation from giant models. We tried distilling from a ViT-L teacher to a ViT-S student. The student learns the closed-vocabulary behavior well but struggles to maintain open-vocabulary generalization. The text-visual alignment seems to require a minimum model capacity.

Dynamic vocabulary pruning. The idea was to reduce the number of text embeddings per frame based on scene context. In practice, the overhead of the pruning decision offsets the savings, and you occasionally miss detections when the pruner is wrong.

Deployment Notes

Our production pipeline:

  1. Text embeddings pre-computed and cached at startup
  2. Frames arrive at 30fps, downscaled to 416px
  3. INT8 quantized base model processes each frame in ~35ms
  4. NMS and tracking post-processing in ~3ms
  5. Total latency: ~38ms end-to-end (26fps sustained)
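The per-frame loop that implements the steps above can be sketched as follows. All five callables are hypothetical stand-ins (our actual pipeline wires these to the INT8 engine and the tracker), but the structure is the point: text embeddings are computed before the loop, never inside it.

```python
def run_pipeline(frames, cached_text_emb, resize, detect_int8, postprocess):
    """Per-frame inference loop mirroring the deployment pipeline.

    cached_text_emb : text embeddings pre-computed at startup (step 1)
    resize          : downscale to the model's input size (step 2)
    detect_int8     : INT8 quantized detector forward pass (step 3, ~35 ms)
    postprocess     : NMS + tracking (step 4, ~3 ms)
    """
    results = []
    for frame in frames:
        small = resize(frame)                        # step 2: downscale
        dets = detect_int8(small, cached_text_emb)   # step 3: detector
        results.append(postprocess(dets))            # step 4: NMS + tracking
    return results
```

Keeping the loop body to resize + forward pass + post-processing is what bounds end-to-end latency at ~38ms.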

The remaining gap to 30fps comes from memory bandwidth limitations. We’re exploring structured pruning of the visual backbone to reduce memory traffic.

Code

We’ll release our QAT training recipe and deployment configs for Jetson Orin shortly. The training code is based on the Grounding DINO codebase with minimal modifications.


Working on edge deployment of vision models? We’d love to compare notes — research@clawvision.org.