Improving the efficiency of clustering algorithm for duplicates detection

Emran, Nurul Akmar and Abdul Rahim, Abdulrazzak Ali Mohamed and Kamal Baharin, Safiza Suhana and Othman, Zahriah and Salem, Awsan Thabet and Abd Aziz, Maslita and Md. Bohari, Nor Mas Aina and Abdullah, Noraswaliza (2023) Improving the efficiency of clustering algorithm for duplicates detection. Indonesian Journal Of Electrical Engineering And Computer Science, 30 (3). pp. 1586-1595. ISSN 2502-4752

0028217072023.pdf
Available under License Creative Commons Attribution Share Alike.

Abstract

Clustering is a technique used to reduce the number of comparisons between candidate records in the duplicate detection process. The process of clustering records is affected by the quality of the data: the more error-free the data, the more efficient the clustering algorithm, because data errors cause records to be placed in incorrect groups. Window algorithms are sensitive to the window size. A larger window increases the number of unnecessary comparisons, while a smaller window may prevent the detection of duplicates that should fall within the window. In this paper, we propose a data pre-processing method that increases the efficiency of window algorithms in grouping similar records together. The proposed method also deals with the window size problem. In the proposed method, high-rank attributes are selected and then preparators are applied to the selected attributes. A compensation algorithm is implemented to reduce the problem of missing and distorted sort keys. Two datasets (compact disc database (CDDB) and MusicBrainz) were used to test duplicate detection algorithms. The duplicate detection toolkit (DuDe) was used as a benchmark for the proposed method. Experiments showed that the proposed method achieved a high rate of accuracy in detecting duplicates. In addition, the proposed method.
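
The window-size trade-off mentioned in the abstract can be illustrated with a minimal sketch of a generic sorted-neighborhood pass: records are sorted by a key, and each record is compared only with the records that follow it inside a sliding window. This is not the authors' proposed method and not the DuDe API; the function names, the sort key (first three characters of artist and title), and the sample records are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def make_key(record: Record) -> str:
    # Hypothetical sort key: first three characters of artist and title, lowercased.
    return (record["artist"][:3] + record["title"][:3]).lower()


def sorted_neighborhood_pairs(
    records: List[Record],
    sort_key: Callable[[Record], str],
    window: int,
) -> List[Tuple[int, int]]:
    """Return the candidate pairs (as record indices) produced by a sliding window."""
    if window < 2:
        raise ValueError("window must be at least 2")
    # Sort record indices by key so that similar records end up adjacent.
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            pairs.append((min(i, j), max(i, j)))
    return pairs


# Illustrative records only; not taken from CDDB or MusicBrainz.
records = [
    {"artist": "Miles Davis", "title": "Kind of Blue"},
    {"artist": "Nirvana", "title": "Nevermind"},
    {"artist": "miles davis", "title": "Kind Of Blue"},
    {"artist": "Radiohead", "title": "OK Computer"},
]

small = sorted_neighborhood_pairs(records, make_key, window=2)
large = sorted_neighborhood_pairs(records, make_key, window=4)
print(len(small), len(large))
```

With a window of 2 the sketch produces fewer candidate pairs than with a window of 4, which compares every record with every other one. A record whose sort key is missing or distorted (for example, a typo in the first characters of the artist name) would be sorted far from its duplicate and could fall outside a small window, which is the kind of problem the abstract says the compensation algorithm is meant to reduce.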

Item Type: Article
Uncontrolled Keywords: Attribute selection, Clustering data, Duplicate detection, Missing values, Sort key
Divisions: Faculty of Information and Communication Technology
Depositing User: Norfaradilla Idayu Ab. Ghafar
Date Deposited: 06 Jan 2025 09:50
Last Modified: 06 Jan 2025 09:50
URI: http://eprints.utem.edu.my/id/eprint/28105