All terms
Multimodal
Video Understanding
Extracting meaning from video, including what happens and when.
Definition
Video understanding covers tasks like recognizing actions, pinpointing when something happens, writing captions, answering questions about a clip, and finding highlights. Models must capture both what appears—objects and scenes—and how it changes over time. Approaches include specialized neural networks for video and large multimodal models that take in sampled frames alongside text or audio.