Model briefing
Model: Marlin 2B
ID: images-production.s3.amazonaws.com/uploads

Find moments in videos with natural language

This is a practical video-understanding pick because it turns a video into something you can inspect and search. Marlin 2B gives you a browser demo first, then a local model path if you want to build the same captioning and timestamp search workflow yourself.

Published: June 1, 2026
Read time: 3 min
Tested by: Neural Expedition
Video generation

Field notes

What it does

Marlin 2B is a small video-language model built for two useful jobs: describing what happens in a clip and finding when a described event occurs.

In caption mode, it returns a scene summary plus timestamped events. In find mode, you give it a query like "a person enters the room" or "the skateboarder lands the trick", and it returns a start and end time when that event appears.

That makes the workflow different from a generic video chatbot. The useful output is not only a sentence about the video. It is a structured view of the clip that you can scan, verify, and connect to a timeline. The public Space packages this into a browser demo with example videos, while the backing model page shows a local Transformers path for running the same model on a capable GPU.
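One way to picture that structured view is as plain records you can sort and print next to a player timeline. The sketch below is illustrative only: the class and field names (`Event`, `CaptionResult`, `summary`, `start_s`, `end_s`) are assumptions made for this example, not the model's actual output schema, and the sample data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One timestamped event; field names are hypothetical."""
    start_s: float
    end_s: float
    description: str


@dataclass
class CaptionResult:
    """Assumed shape of caption-mode output: a summary plus events."""
    summary: str
    events: list[Event] = field(default_factory=list)


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS (fractions truncated) for quick scanning."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def timeline(result: CaptionResult) -> list[str]:
    """Return one readable line per event, sorted by start time."""
    ordered = sorted(result.events, key=lambda e: e.start_s)
    return [
        f"{format_timestamp(e.start_s)}-{format_timestamp(e.end_s)}  {e.description}"
        for e in ordered
    ]


# Made-up example data, not real model output.
example = CaptionResult(
    summary="A skateboarder attempts a trick in a parking lot.",
    events=[
        Event(12.0, 15.5, "the skateboarder lands the trick"),
        Event(3.0, 6.0, "the skateboarder pushes off"),
    ],
)
for line in timeline(example):
    print(line)
```

Keeping the times as numbers and formatting them only for display makes it easier to verify each event against the video and to reuse the same records later.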

How to try it

Start with the Marlin 2B Video Understanding Space. Choose one of the example clips or upload a short video, then run caption mode first. Look at whether the scene summary matches the video and whether the event timestamps line up with what you see.

For a second test, switch to find mode and ask for one concrete event. Use something visually obvious, such as a person entering, a vehicle turning, or a trick landing. The important thing to inspect is not just whether it found the event, but whether the returned start and end times are tight enough to be useful.
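To put a number on "tight enough", you could note the start and end you see by eye and compare them with the returned span. The sketch below uses temporal intersection-over-union, a common measure for this kind of check; it is not something the demo reports, and the example spans are hypothetical.

```python
def temporal_iou(pred: tuple[float, float], truth: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) spans in seconds."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    if pred_end < pred_start or true_end < true_start:
        raise ValueError("each span must have start <= end")
    # Overlap is zero when the spans do not intersect.
    overlap = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - overlap
    return overlap / union if union > 0 else 0.0


# Hand-labeled span vs. a hypothetical returned span, in seconds.
labeled = (10.0, 14.0)
returned = (9.0, 15.0)
score = temporal_iou(returned, labeled)
print(f"temporal IoU: {score:.2f}")
```

A score near 1.0 means the returned window hugs the event; a low score with some overlap usually means the event was found but the window is too loose to jump to directly.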

If the demo fits your use case, move to the NemoStation model page. The model is public and can be run locally, but it expects a CUDA setup and video dependencies such as torchcodec, so the hosted Space is the fastest first check.

Caveat

This is still a GPU video workflow, not a lightweight browser-only model. Treat the Space as the quick test path, then expect local reproduction to need CUDA, video decoding dependencies, and some patience with longer clips.
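Before attempting a local run, a quick preflight check can tell you whether the pieces named above appear to be installed. This is a minimal sketch that only looks for the `torch`, `transformers`, and `torchcodec` packages and asks PyTorch whether CUDA is usable; the exact versions and any extra requirements are on the model page and are not checked here.

```python
import importlib.util


def preflight() -> dict[str, bool]:
    """Report whether local prerequisites appear to be present."""
    report = {
        "torch": importlib.util.find_spec("torch") is not None,
        "transformers": importlib.util.find_spec("transformers") is not None,
        "torchcodec": importlib.util.find_spec("torchcodec") is not None,
        "cuda": False,
    }
    if report["torch"]:
        try:
            import torch  # imported lazily so the check runs without PyTorch

            report["cuda"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install counts as "not ready" rather than crashing.
            report["cuda"] = False
    return report


for name, ok in preflight().items():
    print(f"{name:<12} {'ok' if ok else 'missing'}")
```

If any line prints `missing`, the hosted Space is still the quicker way to evaluate the model while you sort out the local setup.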

What you can do with it

  • Turn raw video clips into scene summaries with timestamped events.
  • Search a video for a described action without scrubbing manually.
  • Build rough indexes for product footage, sports clips, tutorials, or surveillance review.
  • Compare event grounding quality before adding video search to an internal tool.
  • Prototype workflows where the timestamp matters as much as the caption.
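The rough-index idea can be prototyped before any model integration. The sketch below assumes you have already stored timestamped events from several clips in a list; the `IndexedEvent` record, the file names, and the data are all hypothetical, and the lookup is plain keyword matching over stored descriptions, not the model's find mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedEvent:
    """One stored event; the fields are assumptions for this example."""
    clip: str
    start_s: float
    end_s: float
    description: str


def search(index: list[IndexedEvent], query: str) -> list[IndexedEvent]:
    """Return events whose description contains every query word."""
    words = query.lower().split()
    hits = [e for e in index if all(w in e.description.lower() for w in words)]
    # Group results by clip, then order them along each clip's timeline.
    return sorted(hits, key=lambda e: (e.clip, e.start_s))


# Made-up index entries, not real model output.
index = [
    IndexedEvent("tutorial_01.mp4", 42.0, 47.5, "presenter opens the settings panel"),
    IndexedEvent("match_03.mp4", 130.0, 134.0, "player scores a goal"),
    IndexedEvent("match_03.mp4", 15.0, 18.0, "player enters the field"),
]
for hit in search(index, "player"):
    print(f"{hit.clip} {hit.start_s:.1f}-{hit.end_s:.1f}s  {hit.description}")
```

Because each hit carries its clip and span, a result can link straight to the right moment, which is where the timestamp starts to matter as much as the caption.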
