Advancing On-Device Deployment of Vision–Language–Action / World-Action Models for Embodied AI Systems
Project Description
This project aims to develop a real‑time embodied AI system on mobile manipulation platforms (e.g., dual‑arm robots or mobile bases equipped with manipulators) to support tasks such as object search, pick‑and‑place, and human–robot interaction in dynamic indoor environments. The key focus is on optimizing the end‑to‑end perception–reasoning–action pipeline of Vision–Language–Action (VLA) and World‑Action Models (WAMs) for efficient on‑device deployment. We will investigate techniques including pipelined sensing, cross‑modal early exit, and adaptive model configuration to achieve low‑latency and resource‑efficient execution on edge platforms (e.g., NVIDIA Jetson). The developed system will be validated on real robotic hardware in dynamic settings, enabling continuous robot navigation, manipulation, and interactive behaviors under real‑world constraints.
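One of the techniques named above, cross-modal early exit, can be sketched in miniature: intermediate classifiers decide whether the model is already confident enough to act, skipping the remaining (more expensive) stages. The staged interface, the toy stage/classifier functions, and the 0.9 threshold below are illustrative assumptions, not the actual VLA/WAM pipeline.

```python
def run_with_early_exit(x, stages, classifiers, threshold=0.9):
    """Run a staged model, exiting as soon as an intermediate
    classifier is confident enough to skip the remaining stages."""
    for stage, clf in zip(stages, classifiers):
        x = stage(x)                     # run the next model stage
        label, confidence = clf(x)       # cheap intermediate prediction head
        if confidence >= threshold:
            return label, confidence     # early exit: skip later stages
    return label, confidence             # fell through: full-depth prediction


# Toy demo: confidence rises with depth, so we exit at the second stage.
stages = [lambda x: x + 1, lambda x: x + 1, lambda x: x + 1]
classifiers = [
    lambda x: ("low", 0.50),
    lambda x: ("mid", 0.95),
    lambda x: ("high", 0.99),
]
label, conf = run_with_early_exit(0, stages, classifiers)
print(label, conf)
```

In a real deployment the "stages" would be transformer blocks (or modality encoders) and the exit heads would be trained, but the control flow is the same.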
Supervisor
OUYANG, Xiaomin
Quota
2
Course type
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
1. Develop lightweight frameworks for deploying LLMs on mobile devices.
2. Implement and benchmark baseline acceleration methods to evaluate latency, throughput, and energy efficiency for LLM inference on mobile platforms.
3. Design and prototype intelligent mobile GUI agents that autonomously operate device interfaces, leveraging LLM capabilities for efficient task automation.
4. Evaluate and optimize trade-offs among accuracy, latency, and resource consumption in mobile applications.
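The benchmarking work in role 2 can be sketched as a minimal harness that measures average latency and token throughput. The `generate` callable below is a hypothetical stand-in for a real on-device LLM runtime; the stub tokenizer and run count are assumptions for illustration.

```python
import time

def benchmark(generate, prompt, n_runs=5):
    """Time a text-generation callable over several runs and report
    average latency (seconds) and throughput (tokens per second)."""
    latencies, tokens_out = [], 0
    for _ in range(n_runs):
        start = time.perf_counter()
        out = generate(prompt)
        latencies.append(time.perf_counter() - start)
        tokens_out += len(out)
    avg_latency = sum(latencies) / n_runs
    throughput = tokens_out / sum(latencies)
    return avg_latency, throughput


# Stub "model": splits the prompt into whitespace tokens.
def stub_generate(prompt):
    return prompt.split()

avg, tps = benchmark(stub_generate, "pick up the red cup", n_runs=3)
print(f"avg latency: {avg * 1000:.3f} ms, throughput: {tps:.1f} tok/s")
```

Energy efficiency would additionally require platform counters (e.g., board-level power telemetry on an edge device), which this sketch omits.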
Applicant's Learning Objectives
1. Gain a solid foundation in efficient inference techniques for both large language models and mobile GUI agents.
2. Develop hands-on skills with model compression and acceleration techniques, specifically for mobile deployment.
3. Learn to balance trade-offs among accuracy, latency, and resource consumption in resource-constrained environments.
4. Gain experience in prototyping intelligent mobile applications and integrating multimodal systems for enhanced real-time interaction.
Complexity of the project
Moderate