An enhanced robust association rules method for missing values imputation in Arabic language data set

Salem, Awsan Thabet (2023) An enhanced robust association rules method for missing values imputation in Arabic language data set. Masters thesis, Universiti Teknikal Malaysia Melaka.

Abstract

In data quality, missing values are one form of the data completeness problem faced by people who work with data. Failure to handle missing values usually has unwanted consequences, such as misleading analysis and decision-making. To deal with missing values, data imputation methods have been proposed with the aim of improving the completeness of the data sets concerned. Imputation accuracy is a common indicator of a data imputation method's efficiency. However, the efficiency of data imputation in nominal data sets can be affected by the nature of the language in which the data set is written. There is therefore a pressing need to address this problem, especially in non-Latin languages such as Arabic. In this thesis, the Enhanced Robust Association Rules (ERAR) method for missing values imputation is proposed. ERAR improves the handling of the Arabic language's complexity in terms of morphology and misspellings by adding an Arabic preparation step, which consists of Normalization, Error Detection, and Error Correction processes. ERAR is an extension of the Iterative method that adds filtering of frequent items. The method deals with high missing value (MV) rates by adjusting the support threshold in every iteration of the algorithm. This research aims to test the hypothesis that the Arabic preparation and filtering steps improve the imputation process in terms of accuracy, speed, and memory used. The findings show that, across different MV rates, ERAR offered higher accuracy, reaching 99% in the Arabic poetry data set, and higher speed than the Iterative method in the English and Arabic data sets at most MV rates, but it did not outperform the DT method. Moreover, ERAR had the highest memory usage of the compared methods during the imputation process. Both the ERAR and Iterative methods are affected by the support threshold value: accuracy decreases as the support value is reduced, and the same holds for elapsed time, while memory usage shows no clear effect. In the future, the research can be extended to cover numerical data and other Arabic language issues. There is also room to improve ERAR in terms of memory use and speed.
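
The abstract's idea of adjusting the support threshold in every iteration can be illustrated with a minimal Python sketch. This is not the thesis's ERAR implementation: the single-antecedent rules, the `support_schedule` values, the `min_confidence` cutoff, the function names, and the toy poet/era rows are all assumptions made for illustration, and the Arabic preparation and frequent-item filtering steps are not shown.

```python
from collections import Counter

MISSING = None  # marker for a missing cell (assumption for this sketch)


def mine_rules(rows, min_support, min_confidence):
    """Mine single-antecedent rules (col_a = val_a) -> (col_b = val_b)."""
    n = len(rows)
    pair_counts = Counter()        # rows containing both items
    antecedent_counts = Counter()  # rows containing the antecedent with the target column observed
    for row in rows:
        items = [(col, val) for col, val in enumerate(row) if val is not MISSING]
        for a in items:
            for b in items:
                if a[0] != b[0]:
                    pair_counts[(a, b)] += 1
                    antecedent_counts[(a, b[0])] += 1
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        confidence = count / antecedent_counts[(a, b[0])]
        if support >= min_support and confidence >= min_confidence:
            key = (a, b[0])  # antecedent item and target column
            # Keep only the most confident consequent per antecedent and column.
            if key not in rules or confidence > rules[key][1]:
                rules[key] = (b[1], confidence)
    return rules


def impute_iteratively(rows, support_schedule=(0.6, 0.3, 0.1), min_confidence=0.6):
    """Fill missing cells, relaxing the support threshold on each pass."""
    filled = [list(row) for row in rows]  # work on a copy of the input
    for min_support in support_schedule:
        rules = mine_rules(filled, min_support, min_confidence)
        for row in filled:
            for col, val in enumerate(row):
                if val is not MISSING:
                    continue
                candidates = [
                    rules[((c, v), col)]
                    for c, v in enumerate(row)
                    if v is not MISSING and ((c, v), col) in rules
                ]
                if candidates:
                    # Use the consequent of the most confident matching rule.
                    row[col] = max(candidates, key=lambda rule: rule[1])[0]
        if all(val is not MISSING for row in filled for val in row):
            break  # nothing left to impute
    return filled


# Hypothetical toy data: (poet, era) pairs with two missing eras.
rows = [
    ("al-Mutanabbi", "Abbasid"),
    ("al-Mutanabbi", "Abbasid"),
    ("al-Mutanabbi", MISSING),
    ("Shawqi", "Modern"),
    ("Shawqi", MISSING),
]
filled = impute_iteratively(rows)
for row in filled:
    print(row)
```

In this sketch, rules are re-mined after each pass, so values imputed under a stricter threshold contribute to the counts used in later, more permissive passes. Rarer patterns are only used once the threshold has been lowered far enough to admit them.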

Item Type: Thesis (Masters)
Uncontrolled Keywords: Mathematical statistics, Missing observations (Statistics), Multiple imputation (Statistics)
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics
Divisions: Library > Tesis > FTMK
Depositing User: Unnamed user with email nuraina0324@gmail.com
Date Deposited: 19 Sep 2024 16:42
Last Modified: 19 Sep 2024 16:42
URI: http://eprints.utem.edu.my/id/eprint/27718