Visual-Audio World Models | Undergraduate Research Opportunities Program

Project Description

Recent advances in video world models enable high-quality visual synthesis, yet most systems remain limited to the visual modality alone and lack the synchronized sensory feedback crucial for realistic simulation. This project introduces a new paradigm of immersive world generation that integrates visual and auditory modalities to construct an interactive, infinite-horizon world model. Instead of generating isolated video clips, the proposed system aims to create streaming audio-visual worlds that support continuous, real-time user engagement and human-like sensory experiences.

To achieve this, the project will collect interactive audio-visual datasets, fine-tune pretrained joint audio-visual diffusion models, and explore distillation techniques that keep streaming generation efficient over infinite horizons. The resulting prototype will enable new forms of human–AI interaction and serve as a foundational step toward larger-scale research in immersive media creation and embodied interactive intelligence. The project offers undergraduate students hands-on research experience at the cutting edge of generative AI, multimodal perception, and world simulation.

Supervisor

ZHAN, Fangneng

Quota

1

Course type

UROP1000

UROP1100

UROP2100

UROP3100

UROP4100

Applicant's Roles

* Conduct literature reviews on video world models, generative AI, and audio-visual synthesis.
* Assist in collecting, preprocessing, and organizing interactive audio-visual datasets.
* Participate in coding, training, and fine-tuning multimodal diffusion models using deep learning frameworks (e.g., PyTorch).
* Help deploy and test the streaming prototype to evaluate real-time user interaction and performance.

Applicant's Learning Objectives

* Master core concepts in generative AI, specifically multimodal diffusion models and predictive world models.
* Gain hands-on experience in handling and preprocessing large-scale video and audio datasets.
* Develop advanced programming and engineering skills in deep learning implementation, model training, and optimization.

Complexity of the project

Moderate