Multi-modal Emotion Recognition via Speech and Facial Expression Project
Project Description
This project aims to develop a multimodal emotion recognition system that integrates audio (speech) and visual (facial expression) cues to detect and classify human emotional states. The research will leverage deep learning techniques, specifically convolutional neural networks (CNNs) for facial analysis and recurrent neural networks (RNNs) or transformers for speech prosody analysis. By fusing features from both modalities, the system is expected to achieve higher accuracy and robustness than single-modality approaches.
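A minimal sketch, in PyTorch, of what such a late-fusion architecture could look like; the module layout, feature dimensions, and the choice of a GRU for the audio branch are illustrative assumptions rather than the project's fixed design:

```python
import torch
import torch.nn as nn


class MultimodalEmotionNet(nn.Module):
    """Illustrative late-fusion model: CNN over a face crop + GRU over speech features."""

    def __init__(self, n_emotions=8, audio_feat_dim=40, hidden_dim=128):
        super().__init__()
        # Visual branch: small CNN over a single face crop (3 x 64 x 64 assumed).
        self.face_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden_dim),
        )
        # Audio branch: GRU over a sequence of frame-level features (e.g., MFCCs).
        self.audio_rnn = nn.GRU(audio_feat_dim, hidden_dim, batch_first=True)
        # Fusion: concatenate the two modality embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_emotions),
        )

    def forward(self, face, audio):
        # face: (batch, 3, 64, 64); audio: (batch, time, audio_feat_dim)
        face_emb = self.face_cnn(face)
        _, audio_last = self.audio_rnn(audio)      # final hidden state: (1, batch, hidden_dim)
        audio_emb = audio_last.squeeze(0)
        fused = torch.cat([face_emb, audio_emb], dim=1)
        return self.classifier(fused)              # emotion logits


# Quick shape check with random tensors (batch of 4, 100 audio frames).
model = MultimodalEmotionNet()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 100, 40))
print(logits.shape)  # torch.Size([4, 8])
```

Concatenation is only one fusion strategy; attention-based or intermediate fusion would slot into the same skeleton by replacing the classifier's input construction.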

Potential applications include human-robot interaction, mental health monitoring, and intelligent virtual assistants. Students will work with pre-collected datasets (e.g., RAVDESS, AFEW) and may have the opportunity to collect custom data using lab equipment.
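As an illustration of the preprocessing such datasets involve, the sketch below extracts frame-level MFCC features from a speech clip with librosa and crops a face from a video frame with OpenCV's bundled Haar cascade; the helper names, sampling rate, and parameter values are assumptions for illustration, not the project's prescribed pipeline.

```python
import cv2
import librosa
import numpy as np


def extract_audio_features(wav_path, n_mfcc=40):
    """Load a speech clip and return a (time, n_mfcc) MFCC matrix."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # transpose so time is the first axis


def extract_face_crop(frame_bgr, size=64):
    """Detect the largest face in a video frame and return a normalized crop, or None."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    crop = cv2.resize(frame_bgr[y:y + h, x:x + w], (size, size))
    return crop.astype(np.float32) / 255.0
```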
Supervisor
SHI, Ling
Quota
1
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
The applicant will participate in the development and training of deep learning models for emotion recognition. Programming skills in Python and basic knowledge of machine learning frameworks (PyTorch/TensorFlow) are required. Familiarity with computer vision or signal processing is preferred. Experience with multimodal data fusion is a plus.
Applicant's Learning Objectives
1. Gain hands-on experience in multimodal deep learning and feature fusion techniques.
2. Understand the challenges and state-of-the-art methods in emotion recognition research.
3. Develop skills in data preprocessing, model training, and evaluation for real-world affective computing applications.
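To make the third objective concrete, here is a minimal sketch of a single training step and a batch-accuracy check, assuming a model with the (face, audio) forward signature sketched above; the helper names and workflow are hypothetical, not the project's fixed procedure.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, face, audio, labels):
    """One gradient step on a batch of (face crop, audio features, emotion label)."""
    model.train()
    optimizer.zero_grad()
    logits = model(face, audio)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def accuracy(model, face, audio, labels):
    """Fraction of correctly classified samples on a held-out batch."""
    model.eval()
    preds = model(face, audio).argmax(dim=1)
    return (preds == labels).float().mean().item()
```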

Complexity of the project
Challenging