Duplicates detection approach within incomplete data sets using dynamic sorting key and hot deck compensation method

Abdulrahim, Abdulrazzak Ali Mohamed (2022) Duplicates detection approach within incomplete data sets using dynamic sorting key and hot deck compensation method. Doctoral thesis, Universiti Teknikal Malaysia Melaka.

Abstract

Duplicate records are a common problem in data sets, especially in high-volume databases. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. However, duplicate detection becomes more challenging when records contain missing values: during clustering and matching, missing values can cause similar records to be placed in the wrong group, leaving duplicates undetected. Keeping a database free of duplicates is crucial for most use cases, because duplicates cause false negatives and false positives when queries are matched against it. These two data quality issues have serious consequences, for example in the medical field, where a patient may receive a drug overdose that could cost a life, or in parcel delivery, where a parcel may be delivered to the wrong address. Research in duplicate detection is well established and covers different aspects of both efficiency and effectiveness; the work in this thesis addresses both. We propose a novel method that improves the preprocessing task so that the effect of missing values on the efficiency of duplicate detection is addressed before detection takes place, and that can be applied to data sets even when prior labelling is not available. The proposed method, Duplicate Detection within the Incomplete Data set (DDID), handles missing values through a set of procedures: it adopts a generic approach based on high-rank attributes (high uniqueness, few missing values), then compensates for the missing values in those attributes using the Hot Deck compensation method. Dynamic sort keys and matching strings of certain lengths are then created from the high-rank attributes.
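The preprocessing steps named above (ranking attributes, hot deck compensation, building a sort key) can be illustrated with a minimal sketch. It is not the thesis implementation: the ranking score (uniqueness multiplied by the share of non-missing values), the donor selection rule (the record that agrees on the most other attributes), the fixed prefix length, and all function names are assumptions made for illustration only.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def rank_attributes(records: list[Record], attributes: list[str]) -> list[str]:
    """Order attributes from high to low rank.

    Assumed score: uniqueness * (1 - missing rate). The thesis may use a
    different formula; this one only reflects "high uniqueness, few missing".
    """
    n = len(records)
    scores = {}
    for attr in attributes:
        present = [r.get(attr) for r in records if r.get(attr)]
        missing_rate = 1 - len(present) / n
        uniqueness = len(set(present)) / n
        scores[attr] = uniqueness * (1 - missing_rate)
    return sorted(attributes, key=lambda a: scores[a], reverse=True)


def hot_deck_fill(records: list[Record], target: str, match_on: list[str]) -> list[Record]:
    """Fill missing `target` values from a donor record in the same data set.

    Assumed donor rule: the complete record sharing the most `match_on` values.
    """
    donors = [r for r in records if r.get(target)]
    filled = []
    for rec in records:
        rec = dict(rec)  # copy, so the input list is left unchanged
        if not rec.get(target) and donors:
            donor = max(
                donors,
                key=lambda d: sum(
                    1 for a in match_on if rec.get(a) and rec.get(a) == d.get(a)
                ),
            )
            rec[target] = donor[target]
        filled.append(rec)
    return filled


def sort_key(record: Record, ranked: list[str], prefix_len: int = 3) -> str:
    """Concatenate fixed-length prefixes of the ranked attributes (assumed scheme)."""
    return "".join((record.get(a) or "")[:prefix_len].upper() for a in ranked)
```

In this sketch the ranking runs first, missing values in the top-ranked attributes are filled next, and the sort key is built from the filled records, so a record with a missing value no longer sorts far away from its likely duplicates.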
These procedures were adopted in DDID to validate the expected results in successive stages of detection and to achieve a high matching rate of duplicate records despite the presence of missing values. The experiments used four benchmark data sets (Restaurant, CDDB, MusicBrainz (A), and MusicBrainz (B)). Missing values were artificially added to the key attributes at a rate of 4% for the Restaurant data set and 1.5% for the CDDB data set, using an arbitrary pattern, to simulate both complete and incomplete data sets. The DuDe toolkit was used as a benchmark for relative comparison. DDID was evaluated for accuracy using duplicate detection measures, and for elapsed time using performance improvement (PI) and statistical analysis. The results showed that DDID achieved a significant improvement in duplicate detection accuracy over DuDe: in the first implementation stage, the improvement reached 18% on the Restaurant data set, 16% on CDDB, 19% on MusicBrainz (A), and 4% on MusicBrainz (B). In the second implementation stage, the improvement reached 24%, 18%, 30%, and 3% for the Restaurant, CDDB, MusicBrainz (A), and MusicBrainz (B) data sets respectively. The analysis showed that even though the data sets were incomplete, DDID offered better accuracy and faster duplicate detection than DuDe. The adopted procedures also limited the window-size defect of the sorted neighbourhood method, keeping detection accuracy stable, and improved the performance of the blocking methods tested in this study.
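For readers unfamiliar with the sorted neighbourhood method mentioned above, the following is a generic sketch of the technique, not the DDID or DuDe implementation. The function name, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable


def sorted_neighbourhood(
    records: list[str],
    key_func: Callable[[str], str],
    match_func: Callable[[str, str], bool],
    window: int = 3,
) -> set[tuple[int, int]]:
    """Return index pairs judged to be duplicates.

    Records are sorted by key, then each record is compared only with the
    next `window - 1` records in sorted order.
    """
    order = sorted(range(len(records)), key=lambda i: key_func(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if match_func(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Assumed matcher: character-level similarity ratio at or above a threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

The window size is the sensitive parameter here: true duplicates whose sort keys place them more than `window - 1` positions apart are never compared, which is why the quality of the sort key matters for detection accuracy.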
The results of this thesis contribute to the body of knowledge in data management, specifically in the area of data quality, where the focus is on detecting duplicates within incomplete data sets. They can also contribute to the problem of industry-scale duplicate detection.

Item Type: Thesis (Doctoral)
Uncontrolled Keywords: Database management, Information retrieval, Data mining
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics
Divisions: Library > Disertasi > FTMK
Depositing User: Unnamed user with email nuraina0324@gmail.com
Date Deposited: 19 Sep 2024 16:37
Last Modified: 19 Sep 2024 16:37
URI: http://eprints.utem.edu.my/id/eprint/27720
