Automatic and Scalable Data Collection and Pruning for LLM Training
Project Description
This project aims to develop an automated system for efficiently collecting and pruning the vast datasets required to train Large Language Models (LLMs). It will focus on scalable algorithms that ensure data quality and relevance. The system will incorporate advanced techniques from data mining, natural language processing, and machine learning, all of which are crucial for training robust and accurate LLMs.
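As a rough illustration of what "pruning" can mean in practice, the sketch below drops very short documents and exact duplicates (after whitespace and case normalization). It is not the project's method: the names `normalize`, `prune`, and `min_words`, and the threshold value, are hypothetical choices made only for this example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def prune(documents, min_words=5):
    """Keep the first copy of each document that has at least min_words words.

    min_words is a hypothetical quality threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for doc in documents:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to be useful under the assumed threshold
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept
```

Storing fixed-size hashes rather than full texts keeps memory use bounded per document, which matters once a collection grows large. A real system would likely add near-duplicate detection and learned quality filters on top of this.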
Supervisor
ZHOU, Xiaofang
Quota
2
Course type
UROP1000
UROP1100
UROP2100
UROP3100
UROP4100
Applicant's Roles
Design and implement data collection and pruning algorithms that ensure the diversity and quality of the dataset, and ensure the reliability and efficiency of the data collection and pruning system. Applicants must be proficient in machine learning, natural language processing, and big data technologies.
Applicant's Learning Objectives
Gain expertise in creating sophisticated algorithms for data collection and pruning. Develop skills in managing large datasets, including data cleaning, transformation, and storage, all of which are essential for LLM training.
Complexity of the project
Moderate