Data Quality

When accessing various data sources, it is important to know both the meaning of the data [14], [15] and the quality of the data retrieved [11]. Despite the elegance of relational theory [4], [5] and mechanisms such as integrity constraints [6], [7] that help ensure the database state reflects the real-world state [9], many databases contain deficient data [16]. If the data in the underlying base relations are deficient, then query results obtained from these base relations may also contain deficient data, even if the query processing mechanism is flawless. These deficient data, in turn, may lead to erroneous decisions with significant social and economic impact.
In this chapter, we present a mechanism based on relational algebra for estimating the accuracy of data derived from base relations. We introduce a data quality algebra [6] that estimates the quality of query results from the quality characteristics of the underlying base relations, in the context of a single relational database environment.
We focus on the accuracy dimension. In this chapter, therefore, the term data quality refers to accuracy, although, as we saw in Chapter 1, other dimensions such as interpretability, completeness, consistency, and timeliness have also been identified as typical data quality dimensions [1], [2].
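To make the idea of propagating accuracy through a query concrete, the following is a minimal illustrative sketch, not the algebra developed in this chapter. It assumes that each base relation carries a single accuracy estimate (the fraction of its tuples that are accurate), that inaccurate tuples are spread uniformly across a relation, and that errors in different relations are independent. The names `RelationQuality`, `select`, and `product`, as well as the sample relations and numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationQuality:
    """Hypothetical quality record: relation size plus an accuracy estimate."""
    name: str
    tuples: int
    accuracy: float  # estimated fraction of accurate tuples, in [0, 1]


def select(r: RelationQuality, selectivity: float) -> RelationQuality:
    # Assumption: inaccurate tuples are uniformly distributed, so a
    # selection shrinks the relation but keeps the same accuracy rate.
    return RelationQuality(f"select({r.name})",
                           round(r.tuples * selectivity),
                           r.accuracy)


def product(r: RelationQuality, s: RelationQuality) -> RelationQuality:
    # Assumption: errors in r and s are independent, so a combined tuple
    # is accurate only when both of its component tuples are accurate.
    return RelationQuality(f"({r.name} x {s.name})",
                           r.tuples * s.tuples,
                           r.accuracy * s.accuracy)


# Made-up base relations, used only for illustration.
customers = RelationQuality("customers", 1000, 0.95)
orders = RelationQuality("orders", 5000, 0.90)

# A join modelled as a Cartesian product followed by a selection.
result = select(product(customers, orders), selectivity=0.001)
print(result)
```

Under these simplifying assumptions, the estimated accuracy of the result is lower than that of either base relation, which illustrates how deficiencies in base data can compound in derived data even when query processing itself is flawless.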
We make the following assumptions:
| Assumption A1: | Query processing is... |