Trainable End-to-End VLM Agent for 3D Scene Manipulation
Project Description
In this project, we aim to develop and evaluate a multi-modal AI agent capable of interacting with and manipulating 3D environments, using Blender as the simulation platform. The agent will be built upon large vision-language models (e.g., Qwen3-VL) and trained through a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) so that it can interpret multi-modal instructions, perceive complex 3D scenes, and perform tasks such as object manipulation, scene editing, and dynamic camera control. By integrating perception, vision-language understanding, and action execution within a unified framework, this work advances the frontier of embodied AI in creative 3D domains, with potential applications in virtual production, educational tools, and generative 3D content creation.
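As a rough illustration of the intended perception-action loop, the sketch below shows how such an agent might render the current Blender view, pass the image and an instruction to a vision-language model, and execute the returned Python snippet against the scene. This is a minimal sketch under stated assumptions: `query_vlm` is a hypothetical placeholder for the actual model call, and sandboxing of generated code is omitted.

```python
import bpy

def render_view(filepath="/tmp/agent_view.png"):
    """Render the active camera view so the VLM can observe the scene."""
    bpy.context.scene.render.filepath = filepath
    bpy.ops.render.render(write_still=True)
    return filepath

def query_vlm(image_path, instruction):
    """Hypothetical placeholder: send the rendered image and the instruction
    to a vision-language model and return a Blender Python snippet."""
    raise NotImplementedError("Replace with the actual VLM inference call.")

def agent_step(instruction):
    """One perception-action step: observe the scene, decide, act."""
    image_path = render_view()
    action_code = query_vlm(image_path, instruction)
    exec(action_code, {"bpy": bpy})  # no sandboxing in this sketch

# Example instruction the trained agent might receive:
# agent_step("Move the cube two units along the x-axis and pull the camera back.")
```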
Supervisor
XU, Dan
Quota
2
Course type
UROP3200
Applicant's Roles
- Design and implement a modular framework for integrating AI models with Blender’s Python API.
- Collect or generate synthetic datasets for training and evaluating agent behavior in 3D environments (a minimal example of such a generated training pair is sketched after this list).
- Experiment with different prompting strategies and training methods to improve task success rates.
- Document results and prepare findings for publication in top-tier conferences.
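As an example of the synthetic data mentioned above, the sketch below produces template-based (instruction, target Blender code) pairs of the kind that could seed supervised fine-tuning. The object names, instruction templates, and output file are illustrative assumptions; the paired scene renders would be produced separately inside Blender.

```python
import json
import random

# Hypothetical template-based generator for SFT pairs: a natural-language
# instruction and the ground-truth Blender Python action that fulfils it.
OBJECTS = ["Cube", "Sphere", "Cone"]
AXES = {"x": 0, "y": 1, "z": 2}

def make_record(seed):
    rng = random.Random(seed)
    obj = rng.choice(OBJECTS)
    axis = rng.choice(list(AXES))
    dist = rng.randint(1, 5)
    instruction = f"Move the {obj.lower()} {dist} units along the {axis}-axis."
    target = f'bpy.data.objects["{obj}"].location[{AXES[axis]}] += {dist}'
    return {"instruction": instruction, "target_code": target}

if __name__ == "__main__":
    # Write 1,000 example pairs as JSON Lines for a fine-tuning pipeline.
    with open("sft_pairs.jsonl", "w") as f:
        for i in range(1000):
            f.write(json.dumps(make_record(i)) + "\n")
```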
Applicant's Learning Objectives
- Gain practical experience in building and training embodied agents that operate in 3D simulators.
- Master the integration of foundation models with low-level 3D software infrastructure via Python scripting.
- Deepen understanding of multimodal agent training pipelines, spanning data collection, SFT, RL, and evaluation design.
- Strengthen skills in reproducible research, systematic experimentation, and scientific communication for interdisciplinary AI communities.
Complexity of the project
Moderate