Exploring contemporary VLMs' capability in long-form video understanding
Project Description
Recent leaps in Vision-Language Models (VLMs) have unlocked impressive short-clip understanding, yet their capacity to reason across extended, continuous video—where meaning emerges slowly through evolving scenes, dialogue arcs, and subtle visual cues—remains largely unexplored. This project asks: what does “long-form multimodal understanding” truly entail, and how far do current VLMs really go? We will probe state-of-the-art models’ temporal memory, cross-shot cohesion, and narrative comprehension by systematically analyzing their behavior on hours-long documentaries, multi-episode series, and unedited real-world footage. We focus mainly on uncovering conceptual strengths and points of brittleness—identifying where today’s architectures succeed, where they falter, and which inductive biases or prompting strategies might extend their reach. The work is open-ended and research-intensive; it demands curiosity about video cognition, solid coding skills, and the persistence to dissect complex, ambiguous results.
Supervisor
SONG Yangqiu
Quota
5
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
Working together with a PhD student on formulating tasks, designing experiments, analyzing results, and writing research papers.
Applicant's Learning Objectives
Gain hands-on experience working with VLMs and learn how to conduct research with them across diverse reasoning scenarios.
Complexity of the project
Challenging