Empower Multimodal Language Models with Long Video Understanding
Project Description
Open-weight multimodal models such as LLaVA are trained to understand visual inputs. While there have been some efforts to adapt them to understand videos, these efforts are often limited to short videos of around one minute. In this project, we aim to train open-weight multimodal language models to empower their ability to understand LONG videos that last 20 or 30 minutes or more.
Supervisor
HE, Junxian
Quota
1
Course type
UROP1100
UROP2100
UROP3100
UROP4100
Applicant's Roles
The applicant is expected to collect long-video training data, train the model, and run evaluations, in collaboration with PhD students in the group.
Applicant's Learning Objectives
The applicant is expected to: 1. become familiar with foundation models, including basic concepts and recent research progress on multimodal foundation models; 2. learn to crawl and preprocess data for training foundation models; 3. learn how to train/fine-tune large models efficiently; 4. learn to evaluate models; 5. learn to write research papers.
Complexity of the project
Moderate