Mechanisms and Reinforcement Learning Approaches for Long-Context Understanding in LVLMs
Project Description
With the rapid development of large vision-language models (LVLMs), the ability to process and reason over long multimodal contexts, such as hundreds of interleaved images and text passages, has become an increasingly critical challenge in AI research. Although recent LVLMs have substantially extended their context windows, their actual performance on realistic long-context tasks remains unsatisfactory, as our recent benchmark MMLongBench has verified.

This UROP project aims to dig deeper into the mechanisms behind, and the improvement of, long-context capability in LVLMs. First, inspired by recent interpretability studies, we plan to analyze the retrieval behavior of attention heads to understand how these heads operate on and interact with long-context input. Second, following QwenLong-L1, we intend to explore reinforcement learning (RL) approaches in the style of DeepSeek-R1 for enhancing long-context reasoning in multimodal models.
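
As a concrete starting point for the first direction, the sketch below probes the retrieval behavior of attention heads with a toy needle-in-a-haystack prompt. It assumes a HuggingFace causal LM with eager attention; the model name, the needle string, the short filler context, and the attention-mass scoring rule are illustrative assumptions, not the project's prescribed setup.

```python
# Minimal sketch: score attention heads by how much mass the answer-generating
# position places on a "needle" fact buried in filler text (retrieval-head probing).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder: any causal LM that can return attentions
tokenizer = AutoTokenizer.from_pretrained(model_name)
# "eager" attention so that output_attentions=True actually returns per-head weights
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

# Toy long context with a needle fact hidden in the middle (kept short here for memory).
needle = "The secret code is 7312."
filler = "Lorem ipsum dolor sit amet. " * 40
prompt = filler + needle + " " + filler + "\nQuestion: What is the secret code? Answer:"

# Use character offsets (requires a fast tokenizer) to find the needle's token span.
inputs = tokenizer(prompt, return_tensors="pt", return_offsets_mapping=True)
offsets = inputs.pop("offset_mapping")[0].tolist()
char_start = prompt.index(needle)
char_end = char_start + len(needle)
needle_pos = [i for i, (s, e) in enumerate(offsets) if s < char_end and e > char_start]

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Retrieval score per head: attention mass the final position places on the needle
# tokens. Heads that score high consistently across prompts are candidate retrieval heads.
for layer_idx, attn in enumerate(out.attentions):   # attn: (batch, heads, q_len, k_len)
    last_query = attn[0, :, -1, :]                  # attention from the last position
    score = last_query[:, needle_pos].sum(dim=-1)   # mass on the needle span, per head
    top = torch.topk(score, k=3)
    print(f"layer {layer_idx:02d}: heads {top.indices.tolist()} scores {top.values.tolist()}")
```

For the second direction, DeepSeek-R1-style RL training typically relies on simple rule-based rewards. The sketch below shows such a reward for long-context QA; the <answer> tag convention, the scoring values, and the exact-match criterion are illustrative assumptions rather than the QwenLong-L1 recipe.

```python
# Minimal sketch of a rule-based reward for R1-style RL on long-context QA.
import re

def extract_answer(completion: str) -> str:
    """Return the text inside the last <answer>...</answer> block, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else ""

def reward(completion: str, gold: str) -> float:
    """1.0 for a normalized exact match, 0.1 for correct format only, 0.0 otherwise."""
    pred = extract_answer(completion)
    if not pred:
        return 0.0
    if pred.lower() == gold.strip().lower():
        return 1.0
    return 0.1  # small format reward: the model followed the answer template

# Example: score a group of sampled completions for one prompt, as a GRPO-style
# trainer would before turning the scores into group-normalized advantages.
completions = [
    "Reasoning ... <answer>7312</answer>",
    "I think the code is 1234. <answer>1234</answer>",
]
print([reward(c, "7312") for c in completions])  # [1.0, 0.1]
```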
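In practice, both sketches would be scaled up: the retrieval-head probe to genuinely long, multimodal inputs, and the reward function to the task mix used for RL fine-tuning.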
Supervisor
SONG Yangqiu
Quota
5
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
Develop baseline models and experimental pipelines for multimodal long-context reasoning tasks.
Applicant's Learning Objectives
Gain hands-on experience with model interpretability, reinforcement learning, and the evaluation of state-of-the-art LVLMs on long-context tasks. In addition, acquire practical skills in research project definition, experimental design, and academic writing in the field of multimodal AI.
Complexity of the project
Challenging