KV Cache Compression for Efficient and Truthful Generative Inference in Large Language Models
Project Description
Large Language Models (LLMs), while achieving remarkable success, are costly to deploy, particularly for tasks such as dialogue systems and story writing that require long-form content generation. During inference, a large amount of transient state, known as the key-value (KV) cache, is stored in GPU memory alongside the model parameters, and its size scales linearly with both sequence length and batch size. This project aims to develop a KV cache implementation that significantly reduces this memory footprint while preserving the "truthfulness" of LLMs. The research will begin with a comprehensive evaluation of the hallucination risks introduced by KV cache compression. By identifying how compression affects the factuality of generated tokens, the project will introduce techniques such as eviction, editing, or merging to mitigate hallucinations. The algorithms will be implemented on multiple open-source LLMs and compared against traditional compression methods, demonstrating that the models can maintain performance while producing factually accurate responses. This project seeks to advance safe and reliable KV cache compression, offering undergraduate students an opportunity to engage with cutting-edge research at the intersection of AI safety and efficient LLM deployment.
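To make the linear scaling concrete, the KV cache footprint can be estimated from the model configuration: each layer stores one key and one value tensor per token. The sketch below uses illustrative numbers resembling a 7B-class decoder (32 layers, 32 KV heads, head dimension 128, fp16); these figures are assumptions for illustration, not a specification of the models this project will use.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The factor of 2 accounts for storing both keys and values;
    bytes_per_elem=2 assumes fp16/bf16 activations.
    """
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16.
size = kv_cache_bytes(batch=1, seq_len=4096, n_layers=32,
                      n_kv_heads=32, head_dim=128)
print(f"{size / 1024**3:.1f} GiB")  # 2.0 GiB at batch 1, 4K context
```

Doubling either the batch size or the context length doubles this footprint, which is why long-form generation tasks make the cache, rather than the weights, the dominant memory cost.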
Supervisor
GUO, Song
Quota
2
Course type
UROP1100
Applicant's Roles
1. Research Analyst: The applicant will perform in-depth literature reviews on KV cache compression techniques and their impact on large language models (LLMs). This includes identifying potential safety risks, such as hallucination and factuality issues, associated with KV cache compression. This role involves collecting and synthesizing information on state-of-the-art techniques and contributing to the development of safe KV cache compression methods.
2. Data Scientist: The applicant will curate and manage datasets for testing and evaluating the performance of KV cache compression strategies. This involves preparing datasets to assess the factuality and "truthfulness" of model-generated responses and ensuring alignment with safety specifications. The applicant will also conduct quantitative and qualitative evaluations of LLM outputs, comparing proposed compression techniques against traditional approaches to ensure factual accuracy and efficiency.
3. Algorithm Developer: The applicant will be responsible for designing and implementing novel KV cache compression algorithms tailored to the pre-filling and decoding stages of LLM inference. This includes developing techniques such as eviction, editing, or merging to mitigate hallucinations while maintaining model performance.
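As a minimal illustration of the eviction technique named above, one common heuristic is to keep only the cache entries for tokens that have received the most cumulative attention, discarding the rest when a memory budget is exceeded. The sketch below is an illustrative assumption in the spirit of heavy-hitter eviction, not the specific algorithm this project will develop; `scores` and `kv` are hypothetical per-token attention scores and cache entries.

```python
def evict_kv(scores: list[float], kv: list, budget: int) -> list:
    """Keep the `budget` cache entries with the highest attention
    scores, preserving their original token order.

    scores[i] is the cumulative attention mass token i has received;
    kv[i] is that token's cached key/value entry.
    """
    # Rank token indices by score (descending), take the top `budget`,
    # then re-sort the survivors into positional order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [kv[i] for i in keep]


# Four cached tokens, budget of two: the two heaviest hitters survive.
kept = evict_kv([0.1, 0.9, 0.3, 0.7], ["a", "b", "c", "d"], budget=2)
print(kept)  # ['b', 'd']
```

Editing and merging variants would instead modify or combine the evicted entries rather than drop them; evaluating how each choice affects factuality is exactly the comparison the Data Scientist role supports.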
Applicant's Learning Objectives
1. Understanding LLM Compression Concepts: The applicant will gain a deep understanding of safety challenges specific to LLMs, particularly those arising from KV cache compression. This includes learning about hallucination risks, the impact of compression techniques on model truthfulness, and strategies to mitigate these risks while maintaining performance.
2. Developing Technical Skills: The applicant will enhance their technical skills in areas such as machine learning, algorithm development, and memory optimization. They will gain hands-on experience in designing and implementing advanced KV cache compression techniques (e.g., eviction, editing, or merging) and using tools and frameworks for evaluating generative model performance.
3. Gaining Research Experience: The applicant will gain experience in conducting academic research, including literature review, problem formulation, and experimental design. They will learn how to develop a research hypothesis, design experiments to test it, and analyze the results.
4. Enhancing Communication Skills: The applicant will improve their ability to communicate complex ideas clearly and effectively, both in written and oral forms. This includes learning how to document technical processes, write research reports, and present findings to audiences.
5. Understanding Safety Implications: The applicant will gain insights into the safety considerations involved in LLM compression. They will learn to balance the need for computational efficiency with the importance of generating truthful, reliable, and safe responses to minimize risks such as misinformation or hallucinations in real-world applications.
Complexity of the project
Moderate