Multi-modal Emotion Recognition via Speech and Facial Expression Project
Project Description
This project aims to develop a multimodal emotion recognition system that integrates audio (speech) and visual (facial expression) cues to detect and classify human emotional states. The research will leverage deep learning techniques, specifically convolutional neural networks (CNNs) for facial analysis and recurrent neural networks (RNNs) or transformers for speech prosody analysis. By fusing features from both modalities, the system is expected to achieve higher accuracy and robustness than single-modality approaches.
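A minimal sketch, in PyTorch, of what such a late-fusion architecture could look like; the module layout, feature dimensions, and the choice of a GRU for the audio branch are illustrative assumptions rather than the project's fixed design:

```python
import torch
import torch.nn as nn


class MultimodalEmotionNet(nn.Module):
    """Illustrative late-fusion model: CNN over a face crop + GRU over speech features."""

    def __init__(self, n_emotions=8, audio_feat_dim=40, hidden_dim=128):
        super().__init__()
        # Visual branch: small CNN over a single face crop (3 x 64 x 64 assumed).
        self.face_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden_dim),
        )
        # Audio branch: GRU over a sequence of frame-level features (e.g., MFCCs).
        self.audio_rnn = nn.GRU(audio_feat_dim, hidden_dim, batch_first=True)
        # Fusion: concatenate the two modality embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_emotions),
        )

    def forward(self, face, audio):
        # face: (batch, 3, 64, 64); audio: (batch, time, audio_feat_dim)
        face_emb = self.face_cnn(face)
        _, audio_last = self.audio_rnn(audio)      # final hidden state: (1, batch, hidden_dim)
        audio_emb = audio_last.squeeze(0)
        fused = torch.cat([face_emb, audio_emb], dim=1)
        return self.classifier(fused)              # emotion logits


# Quick shape check with random tensors (batch of 4, 100 audio frames).
model = MultimodalEmotionNet()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 100, 40))
print(logits.shape)  # torch.Size([4, 8])
```

Concatenation is only one fusion strategy; attention-based or intermediate fusion would slot into the same skeleton by replacing the classifier's input construction.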

Potential applications include human-robot interaction, mental health monitoring, and intelligent virtual assistants. Students will work with pre-collected datasets (e.g., RAVDESS, AFEW) and may have the opportunity to collect custom data using lab equipment.
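As an illustration of the preprocessing such datasets involve, the sketch below extracts frame-level MFCC features from a speech clip with librosa and crops a face from a video frame with OpenCV's bundled Haar cascade; the helper names, sampling rate, and parameter values are assumptions for illustration, not the project's prescribed pipeline.

```python
import cv2
import librosa
import numpy as np


def extract_audio_features(wav_path, n_mfcc=40):
    """Load a speech clip and return a (time, n_mfcc) MFCC matrix."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # transpose so time is the first axis


def extract_face_crop(frame_bgr, size=64):
    """Detect the largest face in a video frame and return a normalized crop, or None."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    crop = cv2.resize(frame_bgr[y:y + h, x:x + w], (size, size))
    return crop.astype(np.float32) / 255.0
```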
Supervisor
SHI, Ling
Quota
1
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
The applicant will participate in the development and training of deep learning models for emotion recognition. Programming skills in Python and basic knowledge of machine learning frameworks (PyTorch/TensorFlow) are required. Familiarity with computer vision or signal processing is preferred. Experience with multimodal data fusion is a plus.
Applicant's Learning Objectives
1. Gain hands-on experience in multimodal deep learning and feature fusion techniques.
2. Understand the challenges and state-of-the-art methods in emotion recognition research.
3. Develop skills in data preprocessing, model training, and evaluation for real-world affective computing applications.
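To make the third objective concrete, here is a minimal sketch of a single training step and a batch-accuracy check, assuming a model with the (face, audio) forward signature sketched above; the helper names and workflow are hypothetical, not the project's fixed procedure.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, face, audio, labels):
    """One gradient step on a batch of (face crop, audio features, emotion label)."""
    model.train()
    optimizer.zero_grad()
    logits = model(face, audio)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def accuracy(model, face, audio, labels):
    """Fraction of correctly classified samples on a held-out batch."""
    model.eval()
    preds = model(face, audio).argmax(dim=1)
    return (preds == labels).float().mean().item()
```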

Complexity of the project
Challenging