An Integrated Principal Component Analysis And Weighted Apriori-T Algorithm For Imbalanced Data Root Cause Analysis

Ong, Phaik Ling (2016) An Integrated Principal Component Analysis And Weighted Apriori-T Algorithm For Imbalanced Data Root Cause Analysis. Masters thesis, Universiti Teknikal Malaysia Melaka.


Root Cause Analysis (RCA) is often used in manufacturing analysis to prevent the recurrence of undesired events. Association rule mining (ARM) was introduced in RCA to extract frequently occurring patterns, interesting correlations, associations or causal structures among items in the database. However, frequent pattern mining (FPM) using Apriori-like algorithms and the support-confidence framework inherently suffers from the rare item problem. This greatly reduces the performance of RCA, especially in the manufacturing domain, where imbalanced data is the norm in a production plant. In addition, exponential growth of data causes high computational costs in Apriori-like algorithms. Hence, this research proposes a two-stage FPM that integrates Principal Component Analysis (PCA) and the Weighted Apriori-T algorithm (PCA-WAT) to address these problems. PCA is used to generate item weights by considering maximally distributed covariance to normalise the effect of rare items. Using PCA, a significant rare item will have a higher weight while a less significant high-occurrence item will have a lower weight. Apriori-T, with its indexing enumeration tree, is used for low-cost FPM. A semiconductor manufacturing case study with Work In Progress data and true alarm data is used to validate the proposed algorithm. The proposed PCA-WAT algorithm is benchmarked against the Apriori and Apriori-T algorithms. A comparative analysis of weighted support was performed to evaluate the capability of PCA in normalising items' support values. The experimental results show that PCA is able to normalise the item support values and reduce the influence of imbalanced data in FPM. Both quality and performance measures are used for evaluation.
The quality measures compare the frequent itemsets and interesting rules generated across different support and confidence thresholds, ranging from 5% to 20% and from 10% to 90% respectively. Rule validation involved a business analyst from the related field. The domain expert verified that the generated rules are able to explain the contributing factors in failure analysis. However, significant rare rules are not easily discovered because the normalised weighted support values are generally lower than the original support values. The performance measures compare the execution time in seconds (s) and the Random Access Memory (RAM) used during execution in megabytes (MB). The experimental results show that the implementation of Apriori-T lowered the computational cost by at least 90% in computation time and 35.33% in RAM usage compared to Apriori. The primary contribution of this study is a two-stage FPM for performing RCA in the manufacturing domain in the presence of imbalanced datasets. In conclusion, the proposed algorithm is able to overcome the rare item issue by implementing covariance-based support value normalisation, and the high computational cost issue by implementing an indexing enumeration tree structure. Future work should focus on rule interpretation, to generate rules that are more understandable to novices in data mining. In addition, suitable support and confidence thresholds are needed after the normalisation process to better discover the significant rare itemsets.
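
The general idea of covariance-derived item weights feeding a weighted support value can be sketched as follows. This is only an illustrative sketch and not the thesis's PCA-WAT implementation: the abstract does not give the weighting formula, so the choices here are assumptions, namely taking the absolute loadings of the first principal component (found by power iteration) as item weights, and weighting each transaction by the mean weight of its items. The function names and the toy transactions are hypothetical.

```python
def item_weights(transactions, iters=500):
    """Assumed scheme: weight = normalised |loading| on the first principal component."""
    items = sorted(set().union(*transactions))
    n, m = len(transactions), len(items)
    # Binary transaction-by-item matrix.
    rows = [[1.0 if it in t else 0.0 for it in items] for t in transactions]
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    # Sample covariance matrix of the item columns.
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(m)] for i in range(m)]
    # Power iteration approximates the dominant eigenvector of the covariance matrix.
    v = [1.0 / m ** 0.5] * m
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    total = sum(abs(x) for x in v)
    return {it: abs(x) / total for it, x in zip(items, v)}


def support(itemset, transactions):
    """Classical support: fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def weighted_support(itemset, transactions, weights):
    """Assumed scheme: share of total transaction weight held by matching transactions."""
    def transaction_weight(t):
        return sum(weights[i] for i in t) / len(t) if t else 0.0
    total = sum(transaction_weight(t) for t in transactions)
    covered = sum(transaction_weight(t) for t in transactions if itemset <= t)
    return covered / total if total else 0.0


# Hypothetical toy data: A and B are frequent, C and D are rare.
transactions = [
    {"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"A"},
    {"A", "B"}, {"C", "D"}, {"A", "B"}, {"A", "D"},
]
weights = item_weights(transactions)
for candidate in ({"A", "B"}, {"C", "D"}):
    print(sorted(candidate),
          support(candidate, transactions),
          weighted_support(candidate, transactions, weights))
```

Under this assumed scheme the weighted support stays between 0 and 1 and remains anti-monotone (a superset never scores higher than its subset), so it could in principle be used with level-wise pruning. Whether it raises or lowers the score of a rare itemset depends on the data, and the thesis's own normalisation may behave differently.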

Item Type: Thesis (Masters)
Uncontrolled Keywords: Data structures (Computer science), Computer algorithms, Imbalanced Data Root Cause Analysis
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics
Divisions: Library > Tesis > FTMK
Depositing User: Muhammad Afiz Ahmad
Date Deposited: 31 Mar 2017 01:30
Last Modified: 30 Nov 2020 16:27