Definition and Analysis of Population-based Data Completeness Measurement

Nurul A., Emran (2011) Definition and Analysis of Population-based Data Completeness Measurement. PhD thesis, The University of Manchester.




Poor-quality data, such as data with errors or missing values, has negative consequences in many application domains. An important aspect of data quality is completeness. One problem in data completeness is that of individuals missing from data sets, where the individuals are the real-world entities whose information is recorded. So far, however, completeness studies have paid little attention to how missing individuals are assessed. In this thesis, we propose the notion of population-based completeness (PBC), which addresses the missing-individuals problem, with the aim of investigating what is required to measure PBC and identifying what is needed to support PBC measurements in practice. To achieve these aims, we analyse the elements of PBC and the requirements for PBC measurement, resulting in a definition of the PBC elements and a PBC measurement formula. We propose an architecture for PBC measurement systems and determine the technical requirements of PBC systems in terms of software and hardware components. An analysis of the technical issues that arise in implementing PBC contributes to an understanding of whether PBC measurements can feasibly provide accurate results. Further exploration of one issue discovered in the analysis showed that, when measuring PBC across multiple databases, data from those databases need to be integrated and materialised. Unfortunately, this requirement may lead to a large internal store for the PBC system that is impractical to maintain. We propose an approach to test the hypothesis that the available storage space can be optimised by materialising only partial information from the contributing databases, while retaining the accuracy of the PBC measurements.
Our approach involves substituting some of the attributes from the contributing databases with smaller alternatives (proxies), by exploiting the approximate functional dependencies (AFDs) that can be discovered within each local database. An analysis of the space-accuracy trade-offs of the approach leads to the development of an algorithm to assess candidate proxy attributes in terms of space saving and accuracy of PBC measurement. The results of several case studies conducted for proxy assessment contribute to an understanding of the space-accuracy trade-offs offered by the proxies. Through the proposal and investigation of PBC, a better understanding of how to deal with the completeness problem has been achieved, in terms of the requirements to measure and to support PBC in practice.
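
The abstract does not state the PBC measurement formula or the proxy assessment algorithm, so the following Python sketch is only an illustration under stated assumptions, not the thesis's method. It assumes that PBC can be read as the fraction of a reference population's individuals that appear in a data set, and that the strength of an approximate functional dependency can be scored with the g3 error measure commonly used for AFDs (the smallest fraction of rows that must be removed for the dependency to hold exactly). The function names `pbc_ratio` and `afd_error`, and the sample identifiers and attribute names, are hypothetical.

```python
from collections import Counter, defaultdict


def pbc_ratio(dataset_ids, reference_population_ids):
    """Assumed PBC reading: share of the reference population present in the data set."""
    reference = set(reference_population_ids)
    if not reference:
        raise ValueError("reference population must not be empty")
    # Individuals in the data set that are outside the reference population are ignored.
    return len(set(dataset_ids) & reference) / len(reference)


def afd_error(rows, lhs, rhs):
    """g3-style error of the approximate dependency lhs -> rhs.

    Returns the smallest fraction of rows that would have to be removed
    for lhs -> rhs to hold exactly (0.0 means an exact dependency).
    """
    if not rows:
        return 0.0
    # For each lhs value, count how often each rhs value occurs with it.
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Keep, per lhs value, only the rows that carry its most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return 1 - kept / len(rows)


# Hypothetical data: two of the four reference individuals are recorded.
print(pbc_ratio(["p1", "p2", "p9"], ["p1", "p2", "p3", "p4"]))  # 0.5

# Hypothetical rows: one of four rows violates short_code -> long_name.
rows = [
    {"short_code": "a", "long_name": "x"},
    {"short_code": "a", "long_name": "x"},
    {"short_code": "a", "long_name": "y"},
    {"short_code": "b", "long_name": "z"},
]
print(afd_error(rows, "short_code", "long_name"))  # 0.25
```

Under these assumptions, a candidate proxy with a low AFD error and a smaller storage footprint than the attribute it stands in for would be the more attractive substitute. How the thesis actually weighs space saving against measurement accuracy is not described in the abstract.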

Item Type: Thesis (PhD)
Subjects: Z Bibliography. Library Science. Information Resources > ZA Information resources
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4450 Databases
Divisions: Faculty of Information and Communication Technology > Department of Software Engineering
Depositing User: Nurul A. Emran
Date Deposited: 29 Nov 2011 03:31
Last Modified: 21 Nov 2016 08:27