Research, January 2009

Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a small subset of SQL, a severe limitation in practice, since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation, and/or user-defined functions. Without support for these constructs, a provenance management system is of limited use.

In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.

@inproceedings{abc,
	abstract = {
        Data provenance is essential in applications such as scientific
        computing, curated databases, and data warehouses. Several
        systems have been developed that provide provenance
        functionality for the relational data model. These systems
        support only a small subset of SQL, a severe limitation in
        practice since most of the application domains that benefit from
        provenance information use complex queries. Such queries
        typically involve nested subqueries, aggregation, and/or
        user-defined functions. Without support for these constructs, a
        provenance management system is of limited use.

        In this paper we address this limitation by exploring the
        problem of provenance derivation when complex queries are
        involved. More precisely, we demonstrate that the widely used
        definition of Why-provenance fails in the presence of nested
        subqueries, and show how the definition can be modified to
        produce meaningful results for nested subqueries. We further
        present query rewrite rules to transform an SQL query into a
        query propagating provenance. The solution introduced in this
        paper allows us to track provenance information for a far wider
        subset of SQL than any of the existing approaches. We have
        incorporated these ideas into the Perm provenance management
        system engine and used it to evaluate the feasibility and
        performance of our approach.
      },
	author = {Boris Glavic and Gustavo Alonso},
	booktitle = {Research},
	title = {Provenance for nested subqueries},
	url = {http://doi.acm.org/10.1145/1516360.1516472},
	year = {2009}
}