# Open-Vocabulary Object Detection on the Edge: Practical Tricks That Work
Notes from our experience deploying open-vocabulary detection models on edge devices — what works, what doesn't, and the trade-offs we've made.
Open-vocabulary object detection — detecting objects from free-text descriptions rather than a fixed class set — is one of the most exciting recent advances in computer vision. Models like OWL-ViT and Grounding DINO can detect virtually anything you describe in natural language.
But deploying them on edge hardware? That’s a different story.
## The Challenge
Our target platform is an embedded GPU with roughly 1/20th the compute of a datacenter GPU. A standard Grounding DINO model runs at ~2fps on this hardware. We need 30fps. That’s a 15x gap.
Here’s what we’ve tried and learned.
## What Works
### 1. Text Encoding is a One-Time Cost
The biggest insight: in most real-world applications, the vocabulary doesn’t change every frame. If you’re looking for “person”, “car”, and “bicycle” in a traffic camera feed, you can pre-compute and cache the text embeddings.
This alone removes ~40% of the per-frame compute cost. The remaining model is essentially a standard detection backbone with cross-attention to cached text features.
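The caching pattern itself is trivial; here is a minimal sketch. The `encode_text` function below is a stand-in (a deterministic dummy, not a real text tower) — in practice it would be the model's CLIP/BERT-style text encoder, which is exactly the expensive part you want to run only once:

```python
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for the real text encoder (hypothetical).
    In production this is the model's text tower -- the expensive call."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(256).astype(np.float32)
    return v / np.linalg.norm(v)  # unit-normalized, CLIP-style

class TextEmbeddingCache:
    """Pre-compute embeddings at startup; per-frame lookups are free."""
    def __init__(self, vocabulary: list[str]):
        self._cache = {p: encode_text(p) for p in vocabulary}

    def get(self, prompt: str) -> np.ndarray:
        # Fall back to encoding on the fly only if a new prompt appears.
        if prompt not in self._cache:
            self._cache[prompt] = encode_text(prompt)
        return self._cache[prompt]

    def matrix(self) -> np.ndarray:
        """Stacked [num_prompts, dim] matrix for batched cross-attention."""
        return np.stack(list(self._cache.values()))

cache = TextEmbeddingCache(["person", "car", "bicycle"])
per_frame_queries = cache.matrix()  # shape (3, 256), computed once at startup
```

Per frame, only `cache.matrix()` (or the cached array itself) is handed to the detection head; no text encoding happens on the hot path.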
### 2. Resolution Matters More Than Model Size
Counter-intuitively, we found that running a larger model at lower resolution often beats a smaller model at higher resolution. The reason: the text-visual alignment learned during pre-training is more robust in larger models, so it degrades less when you reduce input size.
| Config | mAP@50 | FPS (edge) |
|---|---|---|
| Small model, 640px | 38.2 | 24 |
| Small model, 416px | 31.7 | 42 |
| Base model, 416px | 41.3 | 28 |
| Base model, 320px | 37.9 | 45 |
We run the base model at 416px — good accuracy at practical frame rates.
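For reference, the downscaling step is a standard letterbox resize: scale the long side to the target, pad the short side so aspect ratio is preserved. A numpy-only sketch (nearest-neighbor for brevity; a real pipeline would do bilinear resizing on the GPU, and 114 is just a conventional pad value):

```python
import numpy as np

def letterbox(frame: np.ndarray, size: int = 416, pad_value: int = 114) -> np.ndarray:
    """Resize the long side to `size`, pad the short side to size x size."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    # Nearest-neighbor index maps (illustrative; use bilinear in production)
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = frame[ys][:, xs]
    out = np.full((size, size) + frame.shape[2:], pad_value, dtype=frame.dtype)
    top, left = (size - nh) // 2, (size - nw) // 2
    out[top:top + nh, left:left + nw] = resized
    return out

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # e.g. a 720p camera frame
print(letterbox(frame).shape)  # (416, 416, 3)
```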
### 3. Quantization-Aware Fine-Tuning
Standard post-training quantization (PTQ) to INT8 drops accuracy significantly for these models — the cross-attention layers are particularly sensitive. Quantization-aware training (QAT) with a short fine-tuning phase recovers most of the accuracy:
- FP16 baseline: 44.1 mAP
- PTQ INT8: 37.3 mAP (-6.8)
- QAT INT8: 42.8 mAP (-1.3)
The QAT phase only takes ~2 hours on a single GPU with 50k images.
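To make the sensitivity concrete, here is the fake-quantization op that QAT inserts into the forward pass, in numpy. This is an illustration of the mechanism, not our production recipe (which uses the framework's QAT tooling): symmetric per-tensor INT8 quantize-dequantize, where a single outlier activation inflates the scale and coarsens everything else — roughly what happens in the cross-attention layers under plain PTQ:

```python
import numpy as np

def fake_quant_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 quantize-dequantize.
    During QAT the forward pass sees this rounding error, so weights learn
    to tolerate it; the backward pass uses a straight-through estimator."""
    scale = np.abs(x).max() / 127.0
    if scale == 0:
        return x.copy()
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

# One outlier value stretches the quantization grid for the whole tensor,
# increasing rounding error on all the "normal" values.
w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
w[0] = 20.0  # a single outlier
err = np.abs(fake_quant_int8(w) - w).mean()
```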
## What Doesn’t Work (Yet)
Knowledge distillation from giant models. We tried distilling from a ViT-L teacher to a ViT-S student. The student learns the closed-vocabulary behavior well but struggles to maintain open-vocabulary generalization. The text-visual alignment seems to require a minimum model capacity.
Dynamic vocabulary pruning. The idea was to reduce the number of text embeddings per frame based on scene context. In practice, the overhead of the pruning decision offsets the savings, and you occasionally miss detections when the pruner is wrong.
## Deployment Notes
Our production pipeline:
- Text embeddings pre-computed and cached at startup
- Frames arrive at 30fps, downscaled to 416px
- INT8 quantized base model processes each frame in ~35ms
- NMS and tracking post-processing in ~3ms
- Total latency: ~38ms end-to-end (26fps sustained)
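For completeness, the NMS step in that post-processing budget is plain greedy non-maximum suppression. A class-agnostic numpy sketch (our deployment uses a batched GPU implementation, but the logic is the same):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list[int]:
    """Greedy NMS. boxes: [N, 4] as (x1, y1, x2, y2); returns kept indices."""
    order = scores.argsort()[::-1]  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```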
The remaining gap to 30fps comes from memory bandwidth limitations. We’re exploring structured pruning of the visual backbone to reduce memory traffic.
## Code
We’ll release our QAT training recipe and deployment configs for Jetson Orin shortly. The training code is based on the Grounding DINO codebase with minimal modifications.
Working on edge deployment of vision models? We’d love to compare notes — research@clawvision.org.