Model briefing
Model: LocateAnything-3B
ID: images-production.s3.amazonaws.com/uploads

Find Objects in Images and Videos with Natural Language

This is a practical object-detection pick because the workflow starts with a simple question: can the model find the thing you describe? LocateAnything turns natural language into boxes or points on an image or video, which makes it easier to test grounding before building a heavier vision system around it.

Published: June 1, 2026
Read time: 3 min
Tested by: Neural Expedition
Video generation, Image generation

Field notes

What it does

LocateAnything-3B is a vision-language model for visual grounding. Instead of only answering a question about an image, it returns structured locations for the objects, text, regions, or interface elements you ask for.
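To make "structured locations" concrete, here is a minimal sketch of how a grounding result could be represented on the client side. The `Grounding` schema, its field names, and the normalized 0..1 coordinate convention are assumptions for illustration, not the model's documented output format; check the model card for the real one.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Grounding:
    """One located target (hypothetical schema, not the model's documented output)."""
    label: str
    # Assumed convention: (x1, y1, x2, y2) with values normalized to 0..1.
    box: Optional[Tuple[float, float, float, float]] = None
    # Assumed convention: (x, y) normalized to 0..1, used for pointing results.
    point: Optional[Tuple[float, float]] = None


def to_pixels(g: Grounding, width: int, height: int) -> Grounding:
    """Scale normalized coordinates to pixel coordinates for a given image size."""
    box = None
    if g.box is not None:
        x1, y1, x2, y2 = g.box
        box = (x1 * width, y1 * height, x2 * width, y2 * height)
    point = None
    if g.point is not None:
        px, py = g.point
        point = (px * width, py * height)
    return Grounding(g.label, box, point)
```

Keeping boxes and points in one record makes it easier to handle detection-style and pointing-style results with the same downstream code.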

That makes it useful when the location is the real output. You can ask it to find all red shirts in a crowded photo, point to a button in a screenshot, locate text regions in a document, or track a described object across sampled video frames.
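For the video case, one simple way to pick the frames is to sample at a fixed time interval. The sketch below only computes frame indices; the function name and the fixed-interval strategy are assumptions for illustration, and decoding the frames and sending them to the model are left out.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, every_seconds: float) -> List[int]:
    """Return indices of frames spaced roughly `every_seconds` apart."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    # At least one frame per step so the range always advances.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))
```

For example, a 10-second clip at 30 fps sampled every 2 seconds yields five frames to run the same text prompt against.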

The model supports several related workflows: object detection, phrase grounding, GUI grounding, OCR-style text localization, document layout detection, and pointing. The public Space packages those modes into a browser demo, while the model card includes a local worker example for running the same kind of grounding workflow yourself.

How to try it

Start with the LocateAnything Space. Upload an image with several objects or a screenshot with visible interface elements, then ask for one concrete target such as "the search button", "people wearing red shirts", or "all text boxes".

For the first test, choose something you can verify by eye. Check whether the returned boxes are tight, whether it misses small objects, and whether it confuses similar items in a busy scene. After that, try a document image or GUI screenshot, because those cases show whether the model is useful beyond natural photos.
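If you want a number to go with the eyeball check, intersection over union (IoU) against a box you draw by hand is a common measure of tightness. This is a minimal sketch that assumes both boxes are in the same pixel space as `(x1, y1, x2, y2)`; that box format is an assumption, not something the demo is documented to return.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A value near 1.0 means the returned box closely matches your reference; loose or shifted boxes score noticeably lower, which is most visible on small objects.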

If the demo looks useful, move to the model page and local worker example. The weights and code are public, but local use expects a CUDA setup and a capable NVIDIA GPU. Treat the hosted demo as the fastest way to decide whether the grounding quality is worth that setup.
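Before committing to a local setup, a quick check of which NVIDIA GPUs the machine exposes can save time. The sketch below shells out to `nvidia-smi` if it is on the PATH; the helper names are hypothetical, and it deliberately applies no memory threshold because this briefing does not state one.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_lines(text: str) -> List[Tuple[str, int]]:
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, mem = line.rpartition(",")
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            continue  # Skip lines that do not match the expected shape.
    return gpus


def list_nvidia_gpus() -> List[Tuple[str, int]]:
    """Return visible NVIDIA GPUs, or an empty list if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10, check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_gpu_lines(result.stdout) if result.returncode == 0 else []
```

An empty list is a strong hint to stay with the hosted demo; a non-empty one still needs to be compared against whatever requirements the model page lists.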

Caveat

This is not a lightweight laptop tool. The public demo removes most setup friction, but local reproduction is GPU-heavy and the NVIDIA license is research-only/non-commercial. Also inspect the boxes closely on crowded scenes, small text, and similar objects; those are the cases where a grounding model can look confident while being slightly off.

What you can do with it

  • Find objects or regions from plain-language descriptions.
  • Test visual grounding for screenshots, documents, and GUI agents.
  • Build annotation helpers for images where labels need coordinates.
  • Compare object detection, pointing, OCR localization, and phrase grounding in one workflow.
  • Prototype perception steps before wiring them into a robotics or automation system.
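As one example of the annotation-helper idea, returned boxes can be converted into COCO-style annotation records, which store `bbox` as `[x, y, width, height]`. This is a sketch that assumes the input box is `(x1, y1, x2, y2)` in pixels; the function name and the ID arguments are placeholders you would supply from your own dataset.

```python
def to_coco_annotation(box_xyxy, annotation_id, image_id, category_id):
    """Convert an (x1, y1, x2, y2) pixel box into a COCO-style annotation dict."""
    x1, y1, x2, y2 = box_xyxy
    # Clamp to zero so a malformed box never yields a negative size.
    width = max(0.0, x2 - x1)
    height = max(0.0, y2 - y1)
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x1, y1, width, height],
        "area": width * height,
        "iscrowd": 0,
    }
```

Records like this should still go through human review before they are treated as labels, especially for the crowded and small-object cases noted in the caveat.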
