AI Safety and Agent Defense Mechanisms | Undergraduate Research Opportunities Program

Project Description

This project advances AI Safety by building and evaluating safety-aware AI agents that are robust to real-world social engineering (SE) attacks, such as pretexting, credential phishing, emotional manipulation, and targeted deception. We move beyond simplistic "jailbreaking" to simulate high-fidelity, human-like SE scenarios in which AI agents operate in dynamic, adversarial social contexts. The work includes: (1) curating and annotating realistic SE attack datasets; (2) designing safety alignment mechanisms that help agents detect, resist, and report manipulation; (3) rigorous red-teaming to measure failure modes; and (4) validating defenses in interactive, scenario-based environments. The outcomes are intended to improve AI trustworthiness in high-stakes human-AI interaction systems.

Supervisor

FUNG, May

Quota

2

Course type

- UROP1000
- UROP1100
- UROP2100
- UROP3100
- UROP3200
- UROP4100

Applicant's Roles

- Assist in designing and annotating realistic social engineering attack scenarios and datasets.
- Support development, testing, and prompt engineering of safety-aware AI agents.
- Conduct red-teaming experiments to probe agent vulnerabilities to social manipulation.
- Collect, clean, and analyze experimental results; document findings and safety failures.
- Participate in team meetings and contribute to literature reviews of AI safety, alignment, and SE research.

Applicant's Learning Objectives

- Gain foundational knowledge in AI Safety, alignment, robustness, and human-AI security.
- Understand social engineering tactics and how they exploit AI systems.
- Learn to build, test, and red-team AI agents in realistic adversarial environments.
- Develop skills in data curation, experimental design, and ethical AI evaluation.
- Contribute to publishable research and understand the UROP research workflow.

Complexity of the project

Challenging