LocateAnything-3B is a vision-language model for visual grounding. Instead of only answering a question about an image, it returns structured locations for the objects, text, regions, or interface elements you ask for.
That makes it useful when the location is the real output. You can ask it to find all red shirts in a crowded photo, point to a button in a screenshot, locate text regions in a document, or track a described object across sampled video frames.
The model supports several related workflows: object detection, phrase grounding, GUI grounding, OCR-style text localization, document layout detection, and pointing. The public Space packages those modes into a browser demo, and the model card includes a local worker example for running a similar grounding workflow yourself.
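To make "structured locations" concrete, here is a minimal sketch of how a grounding result could be represented and converted to pixel coordinates on the client side. The `Box` class, its field names, and the normalized [0, 1] coordinate convention are assumptions made for illustration; they are not taken from the model card, and the model's actual output format may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical grounding result: a labeled box in normalized [0, 1] coordinates.

    This schema is an assumption for illustration, not the model's real output format.
    """

    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def to_pixels(self, width: int, height: int) -> tuple[int, int, int, int]:
        """Scale the normalized corners to integer pixel coordinates."""
        return (
            round(self.x_min * width),
            round(self.y_min * height),
            round(self.x_max * width),
            round(self.y_max * height),
        )

    def center(self) -> tuple[float, float]:
        """Return the normalized center, usable as a single 'pointing' location."""
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


if __name__ == "__main__":
    # Made-up example values, e.g. two matches for the query "red shirt".
    boxes = [
        Box("red shirt", 0.25, 0.5, 0.75, 1.0),
        Box("red shirt", 0.0, 0.0, 0.5, 0.25),
    ]
    for box in boxes:
        print(box.label, box.to_pixels(1000, 800), box.center())
```

Keeping coordinates normalized until the last step means the same result can be drawn on the original image, a resized preview, or a video frame without re-running the model. That is a design choice of this sketch, not a documented property of LocateAnything-3B.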