Mechanisms and Reinforcement Learning Approaches for Long-Context Understanding in LVLMs
Project Description
With the rapid development of large vision-language models (LVLMs), the ability to process and reason over long multimodal contexts, such as hundreds of interleaved images and text passages, has become an increasingly critical challenge in AI research. Although recent LVLMs have substantially extended their context windows, their actual performance on realistic long-context tasks remains unsatisfactory, as our recent benchmark MMLongBench has verified.

This UROP project aims to dig deeper into the mechanisms behind, and the improvement of, long-context capability in LVLMs. First, inspired by recent interpretability studies, we plan to analyze the retrieval behavior of attention heads to understand how these heads operate on and interact with long-context input. Second, following QwenLong-L1, we intend to explore reinforcement learning (RL) approaches in the style of DeepSeek-R1 for enhancing long-context reasoning in multimodal models.
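
As a concrete starting point for the first direction, the sketch below probes the retrieval behavior of attention heads with a toy needle-in-a-haystack prompt. It assumes a HuggingFace causal LM with eager attention; the model name, the needle string, the short filler context, and the attention-mass scoring rule are illustrative assumptions, not the project's prescribed setup.

```python
# Minimal sketch: score attention heads by how much mass the answer-generating
# position places on a "needle" fact buried in filler text (retrieval-head probing).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder: any causal LM that can return attentions
tokenizer = AutoTokenizer.from_pretrained(model_name)
# "eager" attention so that output_attentions=True actually returns per-head weights
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

# Toy long context with a needle fact hidden in the middle (kept short here for memory).
needle = "The secret code is 7312."
filler = "Lorem ipsum dolor sit amet. " * 40
prompt = filler + needle + " " + filler + "\nQuestion: What is the secret code? Answer:"

# Use character offsets (requires a fast tokenizer) to find the needle's token span.
inputs = tokenizer(prompt, return_tensors="pt", return_offsets_mapping=True)
offsets = inputs.pop("offset_mapping")[0].tolist()
char_start = prompt.index(needle)
char_end = char_start + len(needle)
needle_pos = [i for i, (s, e) in enumerate(offsets) if s < char_end and e > char_start]

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Retrieval score per head: attention mass the final position places on the needle
# tokens. Heads that score high consistently across prompts are candidate retrieval heads.
for layer_idx, attn in enumerate(out.attentions):   # attn: (batch, heads, q_len, k_len)
    last_query = attn[0, :, -1, :]                  # attention from the last position
    score = last_query[:, needle_pos].sum(dim=-1)   # mass on the needle span, per head
    top = torch.topk(score, k=3)
    print(f"layer {layer_idx:02d}: heads {top.indices.tolist()} scores {top.values.tolist()}")
```

For the second direction, DeepSeek-R1-style RL training typically relies on simple rule-based rewards. The sketch below shows such a reward for long-context QA; the <answer> tag convention, the scoring values, and the exact-match criterion are illustrative assumptions rather than the QwenLong-L1 recipe.

```python
# Minimal sketch of a rule-based reward for R1-style RL on long-context QA.
import re

def extract_answer(completion: str) -> str:
    """Return the text inside the last <answer>...</answer> block, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else ""

def reward(completion: str, gold: str) -> float:
    """1.0 for a normalized exact match, 0.1 for correct format only, 0.0 otherwise."""
    pred = extract_answer(completion)
    if not pred:
        return 0.0
    if pred.lower() == gold.strip().lower():
        return 1.0
    return 0.1  # small format reward: the model followed the answer template

# Example: score a group of sampled completions for one prompt, as a GRPO-style
# trainer would before turning the scores into group-normalized advantages.
completions = [
    "Reasoning ... <answer>7312</answer>",
    "I think the code is 1234. <answer>1234</answer>",
]
print([reward(c, "7312") for c in completions])  # [1.0, 0.1]
```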
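In practice, both sketches would be scaled up: the retrieval-head probe to genuinely long, multimodal inputs, and the reward function to the task mix used for RL fine-tuning.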
Supervisor
SONG Yangqiu
Quota
5
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
Develop baseline models and experimental pipelines for multimodal long-context reasoning tasks.
Applicant's Learning Objectives
Gain hands-on experience with model interpretability, reinforcement learning, and the evaluation of state-of-the-art LVLMs on long-context tasks. In addition, acquire practical skills in research project definition, experimental design, and academic writing in the field of multimodal AI.
Complexity of the project
Challenging