Automatic and Scalable Data Collection and Pruning for LLM Training
Project Description
This project aims to develop an automated system for efficiently collecting and pruning the large datasets required to train Large Language Models (LLMs). It will focus on scalable algorithms that identify, gather, and filter relevant data from diverse sources while maintaining data quality. The system will draw on techniques from data mining, natural language processing, and machine learning to optimize data preparation, a phase that is crucial for training robust and accurate LLMs.
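As a rough illustration of what a "pruning" step might look like, the sketch below removes exact duplicates (after whitespace and case normalization) and drops documents that fail two simple quality heuristics. It is not the project's method: the function names (`normalize`, `passes_quality`, `prune`) and the thresholds (`min_words`, `min_alpha_ratio`) are hypothetical choices made for this example only.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_quality(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristics: enough words, and mostly alphabetic characters."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def prune(documents):
    """Return documents that pass the quality check, keeping first occurrences only."""
    seen = set()   # SHA-256 digests of normalized documents already kept
    kept = []
    for doc in documents:
        norm = normalize(doc)
        if not norm or not passes_quality(norm):
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing normalized text only catches exact duplicates; a scalable system would likely need near-duplicate detection and learned quality signals as well, which this sketch does not attempt.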
Supervisor
ZHOU, Xiaofang
Quota
4
Course type
UROP1100
UROP2100
UROP3100
UROP3200
Applicant's Roles
The applicant will design and implement data collection algorithms and ensure the diversity and quality of the resulting dataset. Proficiency in machine learning, natural language processing, and big data technologies is required. The applicant will also be responsible for the reliability and efficiency of the data collection and pruning system.
Applicant's Learning Objectives
Gain expertise in designing algorithms for data collection and pruning, with a focus on efficiency, scalability, and accuracy. Develop skills in managing large datasets, including the data cleaning, transformation, and storage that LLM training depends on.
Complexity of the project
Moderate