Open-vocabulary object detection — detecting objects from free-text descriptions rather than a fixed class set — is one of the most exciting recent advances in computer vision. Models like OWL-ViT and Grounding DINO can detect virtually anything you describe in natural language.

But deploying them on edge hardware? That’s a different story.

The Challenge

Our target platform is an embedded GPU with roughly 1/20th the compute of a datacenter GPU. A standard Grounding DINO model runs at ~2fps on this hardware. We need 30fps. That’s a 15x gap.

Here’s what we’ve tried and learned.

What Works

1. Text Encoding is a One-Time Cost

The biggest insight: in most real-world applications, the vocabulary doesn’t change every frame. If you’re looking for “person”, “car”, and “bicycle” in a traffic camera feed, you can pre-compute and cache the text embeddings.

This alone removes ~40% of the per-frame compute cost. The remaining model is essentially a standard detection backbone with cross-attention to cached text features.
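The caching idea can be sketched in a few lines. This is a minimal illustration, not the actual pipeline code: `encode_text` stands in for the model's text encoder (e.g. the BERT branch in Grounding DINO), and the class name is ours.

```python
import numpy as np

class TextEmbeddingCache:
    """Pre-compute text embeddings once at startup, reuse them every frame.

    `encode_text` is a hypothetical stand-in for the model's text encoder;
    it runs exactly once per prompt, at construction time.
    """

    def __init__(self, encode_text, vocabulary):
        # One text-encoder forward pass per prompt -- startup cost only.
        self.embeddings = {p: encode_text(p) for p in vocabulary}

    def stacked(self):
        # Per-frame cost is just stacking cached vectors for cross-attention.
        return np.stack(list(self.embeddings.values()))
```

At 30fps with a fixed vocabulary, the encoder runs once instead of 30 times per second, which is where the ~40% saving comes from.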

2. Resolution Matters More Than Model Size

Counter-intuitively, we found that running a larger model at lower resolution often beats a smaller model at higher resolution. The reason: the text-visual alignment learned during pre-training is more robust in larger models, so it degrades less when you reduce input size.

Config               mAP@50   FPS (edge)
Small model, 640px   38.2     24
Small model, 416px   31.7     42
Base model, 416px    41.3     28
Base model, 320px    37.9     45


We run the base model at 416px — good accuracy at practical frame rates.

3. Quantization-Aware Fine-Tuning

Standard post-training quantization (PTQ) to INT8 drops accuracy significantly for these models — the cross-attention layers are particularly sensitive. Quantization-aware training (QAT) with a short fine-tuning phase recovers most of the accuracy:

  • FP16 baseline: 44.1 mAP
  • PTQ INT8: 37.3 mAP (-6.8)
  • QAT INT8: 42.8 mAP (-1.3)

The QAT phase only takes ~2 hours on a single GPU with 50k images.
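The core of QAT is passing weights and activations through a simulated quantize-dequantize op ("fake quantization") in the forward pass, so the network learns to tolerate INT8 rounding; the backward pass treats it as identity (straight-through estimator). A minimal sketch of that op, independent of any framework's QAT tooling:

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Simulated symmetric per-tensor quantize-dequantize, as used in QAT.

    The forward pass rounds values to the INT8 grid and maps them back to
    float, so downstream layers see realistic quantization noise during
    fine-tuning.
    """
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    scale = max(np.abs(x).max() / qmax, 1e-8)      # symmetric scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                               # dequantize back to float
```

In practice we rely on the training framework's QAT support rather than hand-rolling this, but the simulated op is what makes the short fine-tuning phase effective on the sensitive cross-attention layers.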

What Doesn’t Work (Yet)

Knowledge distillation from giant models. We tried distilling from a ViT-L teacher to a ViT-S student. The student learns the closed-vocabulary behavior well but struggles to maintain open-vocabulary generalization. The text-visual alignment seems to require a minimum model capacity.

Dynamic vocabulary pruning. The idea was to reduce the number of text embeddings per frame based on scene context. In practice, the overhead of the pruning decision offsets the savings, and you occasionally miss detections when the pruner is wrong.

Deployment Notes

Our production pipeline:

  1. Text embeddings pre-computed and cached at startup
  2. Frames arrive at 30fps, downscaled to 416px
  3. INT8 quantized base model processes each frame in ~35ms
  4. NMS and tracking post-processing in ~3ms
  5. Total latency: ~38ms end-to-end (26fps sustained)
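The per-frame loop that implements the steps above can be sketched as follows. All five callables are hypothetical stand-ins (our actual pipeline wires these to the INT8 engine and the tracker), but the structure is the point: text embeddings are computed before the loop, never inside it.

```python
def run_pipeline(frames, cached_text_emb, resize, detect_int8, postprocess):
    """Per-frame inference loop mirroring the deployment pipeline.

    cached_text_emb : text embeddings pre-computed at startup (step 1)
    resize          : downscale to the model's input size (step 2)
    detect_int8     : INT8 quantized detector forward pass (step 3, ~35 ms)
    postprocess     : NMS + tracking (step 4, ~3 ms)
    """
    results = []
    for frame in frames:
        small = resize(frame)                        # step 2: downscale
        dets = detect_int8(small, cached_text_emb)   # step 3: detector
        results.append(postprocess(dets))            # step 4: NMS + tracking
    return results
```

Keeping the loop body to resize + forward pass + post-processing is what bounds end-to-end latency at ~38ms.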

The remaining gap to 30fps comes from memory bandwidth limitations. We’re exploring structured pruning of the visual backbone to reduce memory traffic.

Code

We’ll release our QAT training recipe and deployment configs for Jetson Orin shortly. The training code is based on the Grounding DINO codebase with minimal modifications.


Working on edge deployment of vision models? We’d love to compare notes — research@clawvision.org.