A Provenance-Tracking Extension for DuckDB

Project Description

This project aims to design and implement a lightweight DuckDB extension that automatically captures data provenance for SQL queries, enabling users to ask not only "what is the result?" but also "which input rows produced this result, and how?". It is ideal for final-year students who enjoy systems-level programming and want practical exposure to modern analytical database internals, query processing, and developer tooling.

Background — What is Data Provenance?
Data provenance (also called data lineage) is metadata that describes the origin and derivation of a piece of data: given an output value or row, provenance tells you which input records it came from and how it was produced. In a database setting, the research literature distinguishes three classical flavours. Where-provenance answers "which input cell(s) does this output value literally come from?"; useful for tracing a column value back to its source. Why-provenance answers "which set of input rows witness, or justify, this output row?"; i.e. the rows that, together, made the row appear in the result. How-provenance is the most informative: it records the way input rows were combined (joined, unioned, aggregated) using an algebraic structure known as a provenance semiring, so you can reconstruct not just which rows contributed but how they were used.
Provenance matters in practice for debugging ("why did this dashboard suddenly show a spike?"), auditing and compliance ("which source records influenced this regulated report?"), data quality ("if this input row is wrong, which outputs are affected?"), reproducibility, and explainability of data pipelines and ML features. Most analytical engines, including DuckDB, do not expose provenance natively. The goal of this project is to add it as a first-class, queryable feature.

Ideal Candidate:
This project is ideal for students who are curious about how databases work under the hood and are comfortable reading and writing C++ code. You should enjoy debugging, working with a real open-source codebase, and reasoning carefully about correctness. Prior exposure to SQL is required; experience with compilers, query optimisers, or systems programming is a plus but not required. The project will be scoped to your background and can lean either more toward engine internals or more toward the developer-facing tooling and visualisation.

Skills Required:
Programming: C++ (for the extension) and Python (for the demo and tests)
Solid understanding of SQL (joins, GROUP BY, aggregates)
Comfort with a Unix shell, Git, and CMake-based build systems
Willingness to read an existing C++ codebase (DuckDB) and learn its extension API

Supervisor

ZHOU, Xiaofang

Quota

Course type

UROP1000

UROP1100

UROP2100

UROP3100

UROP3200

UROP4100

Applicant's Roles

You will build an extension for DuckDB (an in-process analytical database written in C++) that transparently tracks where-, why-, and how-provenance for a useful subset of SQL (projection, selection, joins, and simple aggregations). DuckDB exposes a well-documented extension API and supports loadable extensions written in C++. You will use this to add new SQL functions and table functions such as PROVENANCE() that return, for each output row, a compact lineage annotation referencing the contributing input row identifiers. Internally, the extension will rewrite or instrument query plans so that base tables are augmented with stable row IDs that are propagated through operators (e.g., concatenated on joins, grouped on aggregations) following well-known semiring-style annotation semantics. You will start from a small, well-scoped subset (single-table SELECT with filters and projections), then progressively add support for joins and GROUP BY. On top of the engine work, you will build a small Python or web interface (e.g., a Streamlit app or a Jupyter helper) that lets a user run a query, inspect any output row, and drill down into the input tuples that produced it. You will evaluate correctness against hand-crafted ground-truth lineage on small datasets and measure the runtime/storage overhead of provenance tracking on standard benchmarks such as a scaled-down TPC-H. The deliverables are: the DuckDB extension (source + build), a short demo application, a test suite, and a written report describing the design decisions and limitations.

Applicant's Learning Objectives

Review the basics of data provenance (where-, why-, and how-provenance) and how they relate to relational operators.
Learn the DuckDB extension API and build a minimal "hello world" extension as a starting point.
Implement row-ID propagation through SELECT, WHERE and PROJECT operators, exposed via a SQL-level PROVENANCE(...) function.
Extend the implementation to support equi-joins and simple aggregations (COUNT, SUM, AVG) with correct lineage semantics.
Build a small interactive demo (Streamlit or a Jupyter notebook) that visualises the lineage of a selected output row.
Evaluate correctness on small curated datasets and measure overhead (time and storage) on a scaled-down TPC-H workload.
Document the design, limitations, and possible extensions (e.g., updates, window functions) in a final report.

Complexity of the project

Challenging

Apply Return home