Exploring contemporary VLMs' capability in long-form video understanding
Project Description
Recent leaps in Vision-Language Models (VLMs) have unlocked impressive short-clip understanding, yet their capacity to reason across extended, continuous video—where meaning emerges slowly through evolving scenes, dialogue arcs, and subtle visual cues—remains largely unexplored. This project asks: what does “long-form multimodal understanding” truly entail, and how far do current VLMs really go? We will probe state-of-the-art models’ temporal memory, cross-shot cohesion, and narrative comprehension by systematically analyzing their behavior on hours-long documentaries, multi-episode series, and unedited real-world footage. We focus mainly on uncovering conceptual strengths and points of brittleness—identifying where today’s architectures succeed, where they falter, and which inductive biases or prompting strategies might extend their reach. The work is open-ended and research-intensive; it demands curiosity about video cognition, solid coding skills, and the persistence to dissect complex, ambiguous results.
Supervisor
SONG Yangqiu
Quota
5
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
Working together with a PhD student on formulating tasks, designing experiments, analyzing results, and writing research papers.
Applicant's Learning Objectives
Gain hands-on experience working with VLMs and learn how to conduct research with them across diverse reasoning scenarios.
Complexity of the project
Challenging