Long-Context Multimodal Benchmark for MLLMs
Project Description
While today’s multimodal large language models (MLLMs) can chat fluently about images, their reliability under truly long, multi-turn, image-grounded conversations—where important details are scattered across many exchanges—remains unclear. This project asks a simple question: as conversations grow longer and more realistic, can current MLLMs stay grounded, consistent, and trustworthy? We will create a realistic long-form conversational setting and run systematic stress tests across representative state-of-the-art models, varying the length and complexity of interaction to reveal where performance holds up and where it breaks down. The goal is not to chase a single headline number, but to produce a clear, practical picture of long-context multimodal behavior—when these systems remain coherent, when they drift, and what kinds of failures matter most in real use.
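The stress tests described above amount to a simple harness: hold a long, image-grounded dialogue fixed, vary how many turns the model actually sees, and probe it for details grounded earlier in the conversation. The sketch below illustrates one possible form such a harness could take; the Turn structure, the containment-based scoring, and the model wrapper are illustrative assumptions rather than project code.

    # Illustrative sketch of a length-varying stress test for long-context
    # multimodal dialogue. The model wrapper, dialogue, and probe questions
    # are hypothetical placeholders; a real run would plug in an MLLM API client.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Turn:
        role: str                         # "user" or "assistant"
        text: str
        image_path: Optional[str] = None  # image attached to this turn, if any

    def stress_test(
        dialogue: list[Turn],
        probes: list[tuple[str, str]],               # (question, gold answer) pairs
        lengths: list[int],
        model_fn: Callable[[list[Turn], str], str],  # (history, question) -> answer
    ) -> dict[int, float]:
        """For each target length, keep only the first n turns, ask every probe
        whose gold answer is grounded in that prefix, and score by containment."""
        results: dict[int, float] = {}
        for n in lengths:
            prefix = dialogue[:n]
            prefix_text = " ".join(t.text for t in prefix)
            # keep only probes whose gold answer actually appears in the prefix
            scoped = [(q, a) for q, a in probes if a in prefix_text]
            if not scoped:
                continue
            hits = sum(a.lower() in model_fn(prefix, q).lower() for q, a in scoped)
            results[n] = hits / len(scoped)
        return results

    if __name__ == "__main__":
        # Toy usage with a dummy "model" that simply echoes the last turn.
        dialogue = [Turn("user", f"Detail {i}: the cup in photo {i} is blue.",
                         image_path=f"img_{i}.jpg") for i in range(1, 31)]
        probes = [("What colour is the cup in photo 3?", "blue")]
        dummy = lambda history, q: history[-1].text
        print(stress_test(dialogue, probes, lengths=[5, 10, 20, 30], model_fn=dummy))

A real evaluation would replace the dummy model with calls to the MLLMs under study and the toy dialogue with the constructed long-form conversations, but the same loop structure lets conversation length be varied while everything else is held constant.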
Supervisor
SONG Yangqiu
Quota
5
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
Working together with a PhD student on formulating tasks, designing experiments, analyzing results, and writing research papers.
Applicant's Learning Objectives
1. Develop hands-on experience studying how MLLMs behave in long-context, multi-turn multimodal conversations, with a focus on context retention, retrieval, and cross-turn coherence as interactions scale.
2. Learn to design and run reproducible evaluations for long-context settings—covering data construction, stress-test design, and careful control of confounders.
3. Build skills in failure analysis and robustness testing, identifying where models stay reliable, where they drift, and how performance changes with longer and noisier context.
Complexity of the project
Challenging