Mechanistic Interpretability of Machine Learning models in the physical sciences
Project Description
Neural networks and other data-driven methods have made substantial progress in modelling complex physical systems. However, these models are often “black boxes”, and this opacity limits both scientific insight and trust in AI-assisted discovery, particularly in fields where understanding the underlying principles is important.
Mechanistic interpretability (MI), developed largely through recent work on large language models, focuses on understanding the internal workings of neural networks. By analysing learned representations and “circuits” (Olah et al. 2022), we can potentially peek into how physical information is encoded and processed in a model's latent space.
This project aims to be a first exposure to MI and how it might apply to deep learning in physical science problems. The work has flexibility to expand into more complex physical systems depending on student interest and progress.
Supervisor
MAK Julian
Quota
1
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP3200
UROP4100
Applicant's Roles
* Experiment with neural networks and toy models in physics problems using Python
* Implement and apply MI techniques (e.g. sparse autoencoders, activation patching, feature visualisation, probing)
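As a flavour of the kind of work involved, the sketch below illustrates one of the listed techniques, linear probing, on a hypothetical toy setup (the random-weight "network", the synthetic data, and the probed quantity are all illustrative assumptions, not part of the project specification): it asks whether a simple quantity is linearly decodable from a model's hidden activations.

```python
import numpy as np

# Hypothetical toy setup for a linear probe: a fixed random "network"
# maps 2-D inputs to 16-D hidden activations, and we test whether the
# sign of the first input coordinate is linearly decodable from them.
rng = np.random.default_rng(0)

W = rng.normal(size=(16, 2))  # fixed random weights of the toy network

def hidden(x):
    """Hidden-layer activations of the toy network, shape (n, 16)."""
    return np.tanh(x @ W.T)

# Synthetic data: the "physical" label is the sign of the first input.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(float)
H = hidden(X)

# Linear probe: logistic regression on the activations, trained by
# plain gradient descent on the cross-entropy loss.
w, b = np.zeros(H.shape[1]), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))  # predicted probabilities
    w -= 0.5 * (H.T @ (p - y)) / len(y)     # gradient step on weights
    b -= 0.5 * (p - y).mean()               # gradient step on bias

# High accuracy suggests the quantity is represented (near-)linearly
# in the hidden layer; a deeper analysis would compare against probes
# on other layers and on shuffled labels.
acc = (((1.0 / (1.0 + np.exp(-(H @ w + b)))) > 0.5) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

In practice the same idea is applied to activations extracted from a trained physics emulator rather than random weights, with control probes (e.g. shuffled labels) to guard against the probe itself doing the work.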
Applicant's Learning Objectives
* Understand the latest MI research literature and methods
* Gain practical experience with modern deep learning techniques applied to physical science problems
* Develop intuition for how neural networks process and represent scientific information
* Understand the challenges and opportunities in explainable AI for science
* Experience interdisciplinary research combining AI/ML and physical sciences
Complexity of the project
Challenging