Feedforward 3D Reconstruction of Dynamic Scenes via Multiview Vision Transformers
Project Description
3D computer vision has recently undergone a paradigm shift from per-scene optimization (e.g., NeRF, 3DGS) to generalizable, feedforward reconstruction models (Large Reconstruction Models). This project aims to explore and extend the capabilities of Multiview Vision Transformers (e.g., VGGT, PI3, DepthAnything3) to address the challenges of reconstructing dynamic 3D scenes from monocular video.

Building upon our group's recent successes in 4D alignment and tracking (e.g., Align3R and Tracking World), this project will focus on three key research directions:
1. Dynamic Scene Reconstruction: Developing feedforward mechanisms that lift monocular videos into consistent 4D representations without heavy per-scene optimization (a simplified sketch of this idea follows this list).
2. Human-Object Interaction (HOI): Investigating how Multiview Transformers can capture and reconstruct the geometry and spatial relationships of humans interacting with objects, leveraging the semantic priors learned by ViTs.
3. Downstream Applications: Extending the learned 3D representations to support advanced tasks such as 3D controllable video generation and spatial perception/understanding for embodied AI agents.
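To make the first direction concrete, below is a minimal sketch of the feedforward idea: a transformer attends jointly over patch tokens from all frames of a short clip and regresses a per-pixel 3D point map for every frame in a single forward pass, with no test-time optimization. All class and parameter names are illustrative placeholders, not the actual VGGT or PI3 interfaces; real models additionally predict cameras and confidences, use positional/frame embeddings (omitted here for brevity), and employ more elaborate attention schemes.

    # Minimal sketch of feedforward "video -> per-frame point maps" lifting (illustrative only).
    import torch
    import torch.nn as nn

    class MultiviewLiftingModel(nn.Module):
        def __init__(self, image_size=224, patch=16, dim=384, depth=6, heads=6):
            super().__init__()
            self.patch = patch
            # patchify each frame into tokens
            self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            # one encoder attends over the tokens of all frames jointly
            self.encoder = nn.TransformerEncoder(layer, depth)
            # regress an XYZ value for every pixel inside each patch
            self.head = nn.Linear(dim, 3 * patch * patch)

        def forward(self, video):                      # video: (B, T, 3, H, W)
            B, T, _, H, W = video.shape
            x = self.embed(video.flatten(0, 1))        # (B*T, dim, H/p, W/p)
            x = x.flatten(2).transpose(1, 2)           # (B*T, N, dim) patch tokens
            x = x.reshape(B, T * x.shape[1], -1)       # concatenate tokens across frames
            x = self.encoder(x)                        # joint attention over the whole clip
            xyz = self.head(x)                         # (B, T*N, 3*p*p)
            # unfold patch tokens back into dense per-frame point maps: (B, T, 3, H, W)
            p = self.patch
            xyz = xyz.reshape(B * T, H // p, W // p, 3, p, p)
            xyz = xyz.permute(0, 3, 1, 4, 2, 5).reshape(B, T, 3, H, W)
            return xyz

    if __name__ == "__main__":
        model = MultiviewLiftingModel()
        clip = torch.randn(1, 4, 3, 224, 224)          # a short monocular clip
        points = model(clip)                            # single forward pass, no per-scene fitting
        print(points.shape)                             # torch.Size([1, 4, 3, 224, 224])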

Students will have the opportunity to work with state-of-the-art frameworks and contribute to the next generation of spatial intelligence systems.

Related works from our group as references:
1. Align3R: https://igl-hkust.github.io/Align3R.github.io/ (accepted by CVPR 2025)
2. Tracking World: https://igl-hkust.github.io/TrackingWorld.github.io/ (accepted by NeurIPS 2025)
3. MVInverse: https://maddog241.github.io/mvinverse-page/
4. UniSH: https://murphylmf.github.io/UniSH/
Supervisor
LIU, Yuan
Co-Supervisor
YEUNG, Sai Kit
Quota
2
Course type
UROP1100
UROP2100
UROP3100
UROP4100
Applicant's Roles
This project spans a broad scope. Each student will focus on a specific subset of the following tasks, tailored to their background, interests, and the project's current phase. Responsibilities include:
1. Foundational Learning: Gaining essential knowledge in 3D computer vision, specifically focusing on 3D reconstruction techniques (multiview geometry) and Transformer-based architectures.
2. Literature Review: Conducting in-depth surveys of state-of-the-art papers, particularly those related to Multiview Vision Transformers (e.g., VGGT, PI3) and dynamic scene analysis.
3. Architecture Design: Participating in the design and modification of Multiview Transformer modules to better handle temporal video data or human-object interactions.
4. Model Implementation & Training: Implementing algorithms in PyTorch, preparing datasets, and conducting model training experiments on high-performance computing clusters.
5. Result Analysis: Evaluating model performance through quantitative metrics and qualitative visualization, identifying failure cases, and proposing improvements (see the metric example after this list).
6. Academic Writing: Contributing to the preparation of research papers, including drafting sections, creating technical figures, and summarizing findings for potential conference submissions.
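As one concrete example of the quantitative evaluation mentioned in item 5, the snippet below computes two standard monocular-depth metrics, absolute relative error and the delta < 1.25 inlier ratio, with median scaling to handle the scale ambiguity of monocular predictions. The function name and the exact protocol (masking, alignment) are illustrative assumptions; the project's actual benchmarks may differ.

    # Illustrative depth-evaluation metrics (absolute relative error and delta < 1.25).
    import torch

    def depth_metrics(pred, gt, eps=1e-6):
        """pred, gt: (H, W) depth maps; gt <= 0 marks invalid pixels."""
        valid = gt > eps
        pred, gt = pred[valid], gt[valid]
        # median alignment: monocular predictions are only defined up to scale
        pred = pred * (gt.median() / pred.median().clamp(min=eps))
        abs_rel = ((pred - gt).abs() / gt).mean()
        ratio = torch.maximum(pred / gt, gt / pred)
        delta1 = (ratio < 1.25).float().mean()
        return abs_rel.item(), delta1.item()

    # usage with synthetic data
    gt = torch.rand(240, 320) * 10 + 0.5
    pred = gt * 1.1 + 0.05 * torch.randn_like(gt)
    print(depth_metrics(pred, gt))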
Applicant's Learning Objectives
This project offers a deep dive into the intersection of 3D Vision and Generative AI. In this project, you will:
1. Master 3D Vision Fundamentals: Go beyond the basics of deep learning to understand the geometry behind the pixels. You will learn camera models, projection mathematics, and the theory behind reconstructing 3D worlds from 2D images (a short projection example follows this list).
2. Build Advanced AI Models: Get hands-on experience designing and training Vision Transformers. You will learn how to adapt architectures like VGGT and PI3 to dynamic tasks, debug complex training runs, and analyze model performance with standard benchmark metrics.
3. Become a Researcher: Learn the end-to-end research lifecycle. We will guide you on how to efficiently read papers, synthesize ideas, write professional academic content, and present your findings to the scientific community.
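As a small taste of the projection mathematics mentioned in objective 1, the snippet below projects 3D world points into an image with a standard pinhole camera model; the intrinsics, pose, and points are made-up numbers purely for illustration.

    # Pinhole projection: x = K (R X + t), followed by division by depth.
    import numpy as np

    K = np.array([[500.0,   0.0, 320.0],    # [fx, 0, cx]
                  [  0.0, 500.0, 240.0],    # [0, fy, cy]
                  [  0.0,   0.0,   1.0]])
    R = np.eye(3)                            # world-to-camera rotation
    t = np.array([0.0, 0.0, 2.0])            # world-to-camera translation

    X = np.array([[0.5, -0.2, 1.0],          # 3D points in world coordinates
                  [0.0,  0.0, 3.0]])
    X_cam = X @ R.T + t                       # transform points into the camera frame
    uv_h = X_cam @ K.T                        # apply intrinsics (homogeneous pixel coords)
    uv = uv_h[:, :2] / uv_h[:, 2:3]           # perspective division by depth
    print(uv)                                 # pixel coordinates (u, v) of each point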
Complexity of the project
Moderate