Visual-Audio World Models
Project Description
Recent advances in video world models enable high-quality visual synthesis, yet most systems remain limited to the visual modality alone, lacking the synchronized sensory feedback crucial for realistic simulation. This project introduces a new paradigm of immersive world generation that integrates both visual and auditory modalities to construct an interactive, infinite-horizon world model. Instead of generating isolated video clips, the proposed system aims to create streaming audio-visual worlds that support continuous, real-time user engagement and human-like sensory experiences.
To achieve this, the project will involve collecting interactive audiovisual datasets, fine-tuning pretrained joint audio-visual diffusion models, and exploring distillation techniques to ensure streaming efficiency over infinite horizons. The resulting prototype will enable new forms of human–AI interaction, serving as a foundational step toward larger-scale research in immersive media creation and embodied interactive intelligence. This project offers undergraduate students hands-on research experience at the cutting edge of generative AI, multimodal perception, and world simulation.
Supervisor
ZHAN, Fangneng
Quota
1
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP4100
Applicant's Roles
* Conduct literature reviews on video world models, generative AI, and audio-visual synthesis.
* Assist in collecting, preprocessing, and organizing interactive audio-visual datasets.
* Participate in coding, training, and fine-tuning multimodal diffusion models using deep learning frameworks (e.g., PyTorch).
* Help deploy and test the streaming prototype to evaluate real-time user interaction and performance.
Applicant's Learning Objectives
* Master core concepts in generative AI, specifically multimodal diffusion models and predictive world models.
* Gain practical, hands-on experience in handling and preprocessing large-scale video and audio datasets.
* Develop advanced programming and engineering skills in deep learning implementation, model training, and optimization.
Complexity of the project
Moderate