Automatic and Scalable Data Collection and Pruning for LLM Training

Project Description

This project aims to develop an automated system for efficiently collecting and pruning the large datasets required to train Large Language Models (LLMs). It will focus on scalable algorithms that ensure data quality and relevance, both of which are crucial for training robust and accurate LLMs. The system will draw on techniques from data mining, natural language processing, and machine learning.
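As a rough illustration of what "pruning" can mean in practice, the sketch below applies two simple filters to a list of text documents: a minimum length check and exact-duplicate removal after light normalization. It is only a minimal example under stated assumptions, not the project's actual method; the function names `normalize` and `prune` and the `min_words` threshold are hypothetical choices made for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def prune(documents, min_words=5):
    """Drop very short documents and exact duplicates (after normalization).

    `min_words` is an illustrative threshold, not a recommended value.
    """
    seen = set()   # hashes of normalized documents already kept
    kept = []
    for doc in documents:
        norm = normalize(doc)
        # Length filter: skip documents with too few words.
        if len(norm.split()) < min_words:
            continue
        # Exact-duplicate filter: hash the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size. A realistic pipeline would likely add near-duplicate detection and model-based quality scoring, which this sketch does not attempt.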

Supervisor

ZHOU, Xiaofang

Quota

Course type

UROP1000

UROP1100

UROP2100

UROP3100

UROP4100

Applicant's Roles

The applicant will be responsible for designing and implementing data collection and pruning algorithms that ensure the diversity and quality of the dataset, and for ensuring the reliability and efficiency of the resulting system. Proficiency in machine learning, natural language processing, and big data technologies is required.

Applicant's Learning Objectives

Gain expertise in creating sophisticated algorithms for data collection and pruning. Develop skills in managing large datasets, including data cleaning, transformation, and storage, essential for LLM training.

Complexity of the project

Moderate
