Data Quality

Current commercial relational database management systems (e.g., IBM's DB2, Oracle's Oracle DBMS, and Microsoft's SQL Server) and their underlying relational model are based on the assumption that data stored in the databases are correct. This assumption, however, has some nontrivial ramifications. Consider a join operation in a SQL query. If the data used in the join are incorrect, the query results will most likely be incorrect as well. To what extent the query results would be incorrect, and what the impact of those errors would be, remain open questions, although researchers have started to investigate such issues [1], [22].
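The effect of incorrect data on a join can be illustrated with a minimal sketch. The example below uses Python's standard sqlite3 module; the tables (customer, orders), their columns, and the sample rows are hypothetical and chosen only for illustration, not taken from the studies cited above. A single mistyped join key in one order row is enough to shift revenue from one customer to another, and the DBMS returns the wrong totals without any error or warning.

```python
import sqlite3


def total_by_customer(customer_rows, order_rows):
    """Join customers to orders and return the total order amount per customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (cust_id TEXT, name TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", customer_rows)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", order_rows)
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM customer AS c
        JOIN orders AS o ON c.cust_id = o.cust_id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical sample data.
customers = [("C1", "Acme"), ("C2", "Globex")]
clean_orders = [(1, "C1", 100.0), (2, "C2", 250.0), (3, "C2", 50.0)]
# Same orders, but order 3 carries a mistyped customer key (C1 instead of C2).
dirty_orders = [(1, "C1", 100.0), (2, "C2", 250.0), (3, "C1", 50.0)]

clean_result = total_by_customer(customers, clean_orders)
dirty_result = total_by_customer(customers, dirty_orders)

print(clean_result)  # [('Acme', 100.0), ('Globex', 300.0)]
print(dirty_result)  # [('Acme', 150.0), ('Globex', 250.0)]
```

Both queries are valid and both satisfy the schema; nothing in the result set distinguishes the correct totals from the incorrect ones, which is the sense in which the correctness assumption is nontrivial.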
The fundamental assumption of the relational model, that "data stored in the underlying databases are correct," is not without merit. To ensure data integrity, the relational model has facilities such as data dictionaries, integrity rules, and edit checks. In practice, however, dirty data pervade databases for various reasons [14], [26]. Furthermore, the scope of data quality goes beyond accuracy and integrity as conceived by many in the database community. It is well established that other aspects of data quality, such as believability and timeliness, are equally, if not more, important from the end-user's perspective [2], [3], [15], [24], [27].
In this chapter, we present two early attempts at extending the relational model to capture data quality attributes: the Polygen Model [31] [1] and the Attribute-based Model [25] [2].