Efficient Vision-Language-Action (VLA) Model for Robot Learning
Project Description
Embodied AI represents a pivotal frontier in artificial intelligence, combining egocentric computer vision, machine learning, and robotics to enable agents to learn, perceive, and act in dynamic environments. This project focuses on reproducing the OpenVLA (Open Vision-Language-Action) framework on the Open X-Embodiment (OpenX) dataset to advance research in multimodal learning for robot actions. By leveraging the publicly available codebase and dataset, we aim to streamline initial implementation efforts while exploring novel mechanisms to enhance robot action learning through richer integration of multimodal information. The use of a smaller language model with LoRA (Low-Rank Adaptation) ensures computational efficiency, making this an accessible and impactful project for undergraduate researchers.
Supervisor
XU Dan
Quota
3
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
Reproduction of OpenVLA Framework:
Implement the OpenVLA pipeline using the Open X-Embodiment Dataset to validate and understand its approach to vision-language-action tasks.
Ensure the reproduction is faithful and that performance benchmarks align with those reported in the original OpenVLA study.
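As a concrete starting point, the sketch below shows the inference side of such a reproduction, assuming the publicly released openvla/openvla-7b checkpoint on Hugging Face. The prompt format and the predict_action / unnorm_key interface follow the public OpenVLA release and should be verified against the current README; the input image here is only a placeholder for a real camera frame or an Open X-Embodiment episode frame.

```python
# Minimal sketch: query the released OpenVLA checkpoint for a robot action
# from a single RGB observation. Checkpoint name, prompt format, and the
# predict_action / unnorm_key interface follow the public OpenVLA release;
# check them against the current README before relying on this.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # released checkpoint on Hugging Face

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")  # a single GPU is assumed here

# Placeholder observation; in practice this comes from the robot camera or an
# Open X-Embodiment episode frame.
image = Image.new("RGB", (224, 224))
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# unnorm_key selects the dataset statistics used to un-normalize the action.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # end-effector delta + gripper command
```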
Multimodal Integration for Robot Action Learning:
Design and experiment with novel mechanisms to incorporate additional modalities (e.g., haptics, audio, and environmental metadata) into the learning framework.
Investigate how multimodal fusion improves task performance and robustness.
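One illustrative fusion mechanism is sketched below, under the assumption that an extra sensor stream (here, a 6-D force/torque reading) can be projected into the backbone's token embedding space and prepended to the visual and instruction tokens. The HapticTokenizer module and all dimensions are hypothetical and are not part of the OpenVLA codebase.

```python
# Hypothetical sketch: map a low-dimensional haptic reading into the VLA
# backbone's token embedding space and prepend it as an extra "sensor token".
import torch
import torch.nn as nn

class HapticTokenizer(nn.Module):
    def __init__(self, haptic_dim: int = 6, embed_dim: int = 4096, num_tokens: int = 1):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(haptic_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim * num_tokens),
        )

    def forward(self, haptics: torch.Tensor) -> torch.Tensor:
        # haptics: (batch, haptic_dim) -> (batch, num_tokens, embed_dim)
        b = haptics.shape[0]
        return self.proj(haptics).view(b, self.num_tokens, -1)

# Usage: concatenate with existing vision + text token embeddings before the LLM.
batch = 2
vision_tokens = torch.randn(batch, 256, 4096)   # visual patch embeddings
text_tokens = torch.randn(batch, 32, 4096)      # instruction token embeddings
haptics = torch.randn(batch, 6)                 # force/torque reading

sensor_tokens = HapticTokenizer()(haptics)
fused = torch.cat([sensor_tokens, vision_tokens, text_tokens], dim=1)
print(fused.shape)  # (2, 289, 4096)
```

Whether such sensor tokens actually help can then be measured by ablating them during evaluation.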
Optimization and Efficiency:
Utilize smaller language models enhanced with LoRA to minimize computational cost without sacrificing performance.
Evaluate trade-offs between model size, efficiency, and task effectiveness.
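A minimal sketch of the LoRA setup with the Hugging Face peft library follows; the backbone (Qwen/Qwen2.5-0.5B is used only as an example of a smaller language model), the rank, and the target_modules are illustrative choices, not a prescription.

```python
# Minimal sketch: attach LoRA adapters to a smaller language backbone so that
# only the low-rank adapter weights are trained. Backbone name and
# target_modules are illustrative and must match the chosen model.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

backbone = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", torch_dtype=torch.bfloat16
)

lora_cfg = LoraConfig(
    r=16,                      # rank of the low-rank update
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Varying the rank r and the set of target_modules gives a direct handle on the model size / efficiency / task effectiveness trade-off described above.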
Applicant's Learning Objectives
Learn and implement the OpenVLA framework;
Learn multimodal integration for robot action learning;
Learn optimization and efficiency techniques for multimodal models (e.g., LoRA fine-tuning).
Complexity of the project
Moderate