AI Safety and Agent Defense Mechanisms
Project Description
This project advances AI Safety by building and evaluating safety-aware AI agents that are robust to real-world social engineering (SE) attacks (e.g., pretexting, credential phishing, emotional manipulation, and targeted deception). We move beyond simplistic "jailbreaking" to simulate high-fidelity, human-like SE scenarios in which AI agents interact with dynamic, adversarial social contexts. The work includes: (1) curating and annotating realistic SE attack datasets; (2) designing safety alignment mechanisms that help agents detect, resist, and report manipulation; (3) rigorous red-teaming to measure failure modes; and (4) validating defenses in interactive, scenario-based environments. The outcomes are intended to improve AI trustworthiness in high-stakes human-AI interaction systems.
Supervisor
FUNG, May
Quota
2
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
- Assist in designing and annotating realistic social engineering attack scenarios and datasets.
- Support development, testing, and prompt engineering of safety-aware AI agents.
- Conduct red-teaming experiments to probe agent vulnerabilities to social manipulation.
- Collect, clean, and analyze experimental results; document findings and safety failures.
- Participate in team meetings and contribute to literature reviews of AI safety, alignment, and SE research.
Applicant's Learning Objectives
- Gain foundational knowledge in AI Safety, alignment, robustness, and human-AI security.
- Understand social engineering tactics and how they exploit AI systems.
- Learn to build, test, and red-team AI agents in realistic adversarial environments.
- Develop skills in data curation, experimental design, and ethical AI evaluation.
- Contribute to publishable research and understand the UROP research workflow.
Complexity of the project
Challenging