Finetuning Large Language Models for Low-Resource Languages
Project Description
This project focuses on building an OCR and translation system for Manchu, a critically endangered East Asian language with a unique script and limited existing digital resources. By finetuning both vision-language models (for text recognition) and large language models (for translation), the project aims to create practical tools that support historical and linguistic research and cultural preservation. The research will involve collecting, cleaning, and organizing datasets, followed by systematic model training to improve the accuracy and usability of the OCR and translation outputs.
Supervisor
CHUNG, Yan Hon Michael
Quota
2
Course type
UROP1100
Applicant's Roles
Applicants will prepare the training data that forms the backbone of the project. This includes labeling Manchu word images for OCR tasks, cleaning and aligning Manchu–Chinese translation pairs, and organizing the curated data into a usable format for model training. Applicants with some familiarity with Python programming may contribute to automating parts of the workflow, though prior coding experience is not mandatory. Applicants will also be given opportunities to try local model finetuning.
Applicant's Learning Objectives
By the end of the internship, applicants will:
1) Gain hands-on experience in curating datasets for AI training.
2) Learn practical skills in finetuning both LLMs and VLMs on local machines.
3) Understand the workflow of preparing, training, and testing machine learning models.
4) Be introduced to the Manchu language, including its script and historical context.
5) Develop transferable skills applicable to digital humanities and AI research.
Complexity of the project
Challenging