Vision-Language-Action (VLA) Model to Assemble LEGO Using Robots
Project Description
This project aims to develop a robust Vision-Language-Action (VLA) framework that empowers a robotic manipulator to assemble complex LEGO structures based on open-ended natural language instructions. While Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated impressive capabilities in reasoning and multimodal understanding, translating these high-level cognitive abilities into low-level robotic control remains a significant challenge, particularly in precision tasks like LEGO assembly.
The research will focus on bridging the gap between semantic understanding and physical actuation. The proposed system will consist of three core modules:
Instruction Parsing: Utilizing a VLM (e.g., GPT-4V or LLaVA) to interpret user instructions (e.g., "Build a red pyramid") and visual feedback from the workspace to generate a step-by-step assembly plan.
Spatial Reasoning: Converting the assembly plan into specific geometric coordinates and grasp poses relative to the robot's frame. This involves identifying LEGO brick types, orientations, and target locations on the baseplate.
Action Execution: Translating spatial coordinates into motor primitives for a robotic arm (e.g., a Franka Emika Panda or UR5) to perform pick-and-place operations with high precision.
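As a rough illustration of how these three modules might interface, the sketch below wires them into a single assembly loop. Every class and function name here is a hypothetical placeholder for this project, not an existing codebase or API.

```python
# Hypothetical interfaces for the three proposed modules; every name below is
# an illustrative placeholder, not an existing library or finished design.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AssemblyStep:
    brick_type: str               # e.g. "2x4"
    color: str                    # e.g. "red"
    target_cell: Tuple[int, int]  # stud coordinates on the baseplate
    layer: int                    # which layer of the structure this brick belongs to

@dataclass
class GraspPlan:
    pick_pose: list               # SE(3) pose of the brick in the robot base frame
    place_pose: list              # SE(3) pose of the target slot in the base frame

def parse_instruction(instruction: str, workspace_image) -> List[AssemblyStep]:
    """Module 1: query a VLM (e.g. GPT-4V / LLaVA) for a step-by-step plan."""
    raise NotImplementedError

def plan_placement(step: AssemblyStep, workspace_image) -> GraspPlan:
    """Module 2: detect the brick, estimate its pose, compute grasp/place poses."""
    raise NotImplementedError

def execute_pick_place(plan: GraspPlan) -> bool:
    """Module 3: run motion planning and low-level control; return success."""
    raise NotImplementedError

def assemble(instruction: str, get_image) -> None:
    """End-to-end loop with one simple retry as a stand-in for error recovery."""
    for step in parse_instruction(instruction, get_image()):
        plan = plan_placement(step, get_image())
        if not execute_pick_place(plan):
            # Re-observe the scene and retry the step once (e.g. misaligned brick).
            execute_pick_place(plan_placement(step, get_image()))
```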
The project will investigate state-of-the-art techniques such as RT-2 (Robotic Transformer 2) or similar end-to-end VLA architectures, fine-tuning them on a custom dataset of LEGO assembly tasks. We aim to address the limitations of existing models in handling fine-grained manipulation and error recovery (e.g., correcting a misaligned brick).
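For orientation, RT-2 emits robot actions as discrete tokens by binning each continuous action dimension (256 bins per dimension in the paper). The snippet below only sketches that discretization idea; the action bounds are assumed placeholder values, not taken from RT-2 or from this project.

```python
# Sketch of RT-2-style action discretization; the bounds below are assumed
# placeholders for a (dx, dy, dz, droll, dpitch, dyaw, gripper) action.
import numpy as np

NUM_BINS = 256
ACTION_LOW  = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
ACTION_HIGH = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous action to integer bin indices in [0, NUM_BINS - 1]."""
    normed = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.clip(np.round(normed * (NUM_BINS - 1)), 0, NUM_BINS - 1).astype(int)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the binning, recovering the quantized continuous action."""
    return ACTION_LOW + tokens / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW)

example = np.array([0.01, -0.02, 0.03, 0.0, 0.1, -0.1, 1.0])
tokens = action_to_tokens(example)
print(tokens, tokens_to_action(tokens))
```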
Supervisor
WANG, Ziqi
Quota
2
Course type
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
The student will function as a core member of the research team, contributing to both software implementation and experimental validation. Specific responsibilities include:
1. Algorithm Implementation & Training:
Assist in setting up the VLA pipeline using PyTorch.
Fine-tune pre-trained Vision-Language Models on project-specific datasets (a training-loop sketch follows this item).
Implement motion planning algorithms (using libraries like MoveIt or OMPL) to execute the high-level commands generated by the VLA.
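A minimal sketch of the supervised fine-tuning step in PyTorch, assuming a dataset of (image, instruction tokens, action tokens) triples and a VLA model that returns per-dimension action-token logits; both are placeholders rather than the project's actual components.

```python
# Sketch of a supervised fine-tuning loop; the dataset contents and the model's
# forward signature are assumed placeholders, not an existing project API.
import torch
from torch.utils.data import DataLoader

def finetune(vla_model, dataset, epochs=5, lr=1e-5, device="cuda"):
    """Fine-tune a VLA that maps (image, instruction tokens) -> action-token logits."""
    vla_model.to(device).train()
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.AdamW(vla_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        for images, instruction_tokens, action_tokens in loader:
            images = images.to(device)
            instruction_tokens = instruction_tokens.to(device)
            action_tokens = action_tokens.to(device)

            # Assumed output shape: (batch, action_dims, token_vocab_size).
            logits = vla_model(images, instruction_tokens)
            loss = loss_fn(logits.flatten(0, 1), action_tokens.flatten())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: last-batch loss {loss.item():.4f}")
```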
2. Simulation Environment Development:
Design and maintain a simulation environment in NVIDIA Isaac Sim or PyBullet that mirrors the real-world LEGO setup.
Create a pipeline for generating synthetic training data (images and segmentation masks of LEGO bricks).
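A minimal sketch of the synthetic-data pipeline in PyBullet (one of the two simulators named above). `cube_small.urdf` from `pybullet_data` is only a stand-in for a LEGO brick, which would need its own URDF/mesh in the real pipeline.

```python
# Render RGB images and segmentation masks of randomly placed "bricks" in PyBullet.
# cube_small.urdf is a stand-in; a real setup would load LEGO brick meshes/URDFs.
import numpy as np
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                    # headless rendering
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.loadURDF("plane.urdf")

rng = np.random.default_rng(0)
brick_ids = [
    p.loadURDF("cube_small.urdf",
               basePosition=[rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), 0.03])
    for _ in range(5)
]

view = p.computeViewMatrix(cameraEyePosition=[0.4, 0.0, 0.5],
                           cameraTargetPosition=[0.0, 0.0, 0.0],
                           cameraUpVector=[0, 0, 1])
proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=2.0)

width, height, rgb, depth, seg = p.getCameraImage(256, 256,
                                                  viewMatrix=view,
                                                  projectionMatrix=proj)
rgb = np.reshape(rgb, (height, width, 4))[:, :, :3]    # RGB image
seg = np.reshape(seg, (height, width))                 # per-pixel body IDs
masks = {bid: (seg == bid) for bid in brick_ids}       # one binary mask per brick
p.disconnect()
```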
3. Hardware Integration & Testing:
Interface the software stack with the physical robotic arm using ROS2 (Robot Operating System).
Calibrate the camera and robot coordinate systems (hand-eye calibration; sketched below).
Conduct experiments to evaluate success rates of assembly tasks and troubleshoot failure cases (e.g., grasp failures, occlusion issues).
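A self-contained sketch of the hand-eye (eye-in-hand) calibration step using OpenCV's cv2.calibrateHandEye. In practice the pose pairs come from the robot's forward kinematics and a calibration board detected by the camera (e.g. via cv2.solvePnP); here they are synthesized from a known ground-truth transform purely so the example runs.

```python
# Eye-in-hand calibration sketch: recover the camera pose in the gripper frame.
# Pose pairs are synthesized from a known ground truth so the script is runnable.
import numpy as np
import cv2
from scipy.spatial.transform import Rotation as R

def to_T(rot, t):
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = rot, t
    return T

rng = np.random.default_rng(0)

# Ground-truth camera-to-gripper transform (what calibration should recover).
T_cam2gripper = to_T(R.from_euler("xyz", [5, -3, 10], degrees=True).as_matrix(),
                     [0.03, 0.00, 0.06])
# Fixed calibration target pose in the robot base frame.
T_target2base = to_T(R.from_euler("z", 45, degrees=True).as_matrix(), [0.5, 0.1, 0.0])

R_g2b, t_g2b, R_t2c, t_t2c = [], [], [], []
for _ in range(10):
    # Random gripper pose in the base frame (stand-in for a measured robot pose).
    T_gripper2base = to_T(R.random().as_matrix(),
                          rng.uniform(-0.2, 0.2, 3) + np.array([0.4, 0.0, 0.3]))
    # Target pose in the camera frame implied by the chain base -> gripper -> camera.
    T_target2cam = np.linalg.inv(T_gripper2base @ T_cam2gripper) @ T_target2base
    R_g2b.append(T_gripper2base[:3, :3]); t_g2b.append(T_gripper2base[:3, 3].reshape(3, 1))
    R_t2c.append(T_target2cam[:3, :3]);   t_t2c.append(T_target2cam[:3, 3].reshape(3, 1))

R_est, t_est = cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c)
print(R_est, t_est.ravel())   # should match T_cam2gripper up to numerical error
```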
4. Documentation:
Maintain detailed logs of experiments and model configurations.
Assist in writing the final technical report or conference paper draft.
Applicant's Learning Objectives
1. Technical Proficiency in VLA Models:
Gain a deep understanding of multimodal deep learning architectures, specifically how Vision Transformers (ViT) and LLMs are fused to control robotic agents.
Learn the mathematical foundations of SE(3) transformation matrices and robot kinematics/dynamics.
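As a small worked example of the SE(3) machinery involved, the snippet below maps a brick position measured in the camera frame into the robot base frame with a 4x4 homogeneous transform; all numbers are invented for illustration.

```python
import numpy as np

# Camera frame expressed in the robot base frame (illustrative numbers only):
# rotated 90 degrees about z, offset by (0.5, 0.1, 0.6) metres.
T_base_cam = np.array([[0, -1, 0, 0.5],
                       [1,  0, 0, 0.1],
                       [0,  0, 1, 0.6],
                       [0,  0, 0, 1.0]])

p_cam = np.array([0.10, 0.05, 0.40, 1.0])   # brick position in the camera frame (homogeneous)
p_base = T_base_cam @ p_cam                 # same point in the robot base frame
print(p_base[:3])                           # -> [0.45 0.2  1.  ]
```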
2. Hands-on Robotics Experience:
Acquire practical skills in ROS2 (Robot Operating System), camera calibration, and feedback control.
Understand the challenges of Sim-to-Real transfer, including the "reality gap" and techniques to mitigate it (e.g., Domain Randomization).
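A minimal sketch of domain randomization in PyBullet, assuming a connected client and a list of loaded brick body IDs (e.g. from the data-generation sketch above); the randomization ranges are arbitrary placeholders.

```python
# Randomize visual and physical properties between simulated episodes to narrow
# the reality gap; the ranges below are arbitrary placeholders.
import numpy as np
import pybullet as p

def randomize_domain(brick_ids, rng=np.random.default_rng()):
    for bid in brick_ids:
        # Random brick color (visual randomization).
        p.changeVisualShape(bid, -1, rgbaColor=[*rng.uniform(0, 1, 3), 1.0])
        # Random friction and mass scaling (physics randomization).
        p.changeDynamics(bid, -1,
                         lateralFriction=rng.uniform(0.3, 1.2),
                         mass=rng.uniform(0.8, 1.2) * 0.01)
    # Randomized camera height for the next rendered episode.
    return p.computeViewMatrix(cameraEyePosition=[0.4, 0.0, rng.uniform(0.4, 0.7)],
                               cameraTargetPosition=[0.0, 0.0, 0.0],
                               cameraUpVector=[0, 0, 1])
```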
3. Research Methodology:
Learn how to formulate a research hypothesis, design meaningful ablation studies, and quantitatively evaluate model performance using metrics like "Success Rate" and "Task Completion Time."
Develop skills in reading and synthesizing state-of-the-art literature in computer vision and robotics.
4. Problem Solving & Debugging:
Enhance debugging skills in complex systems where software errors (code bugs) must be distinguished from hardware limitations (sensor noise).
Complexity of the project
Moderate