Improved Random Forest For Feature Selection In Writer Identification

Sukor, Nooraziera Akmal (2015) Improved Random Forest For Feature Selection In Writer Identification. Masters thesis, Universiti Teknikal Malaysia Melaka.


Abstract

Writer Identification (WI) is the process of determining the writer of a given handwriting sample. A handwriting sample contains various types of features. These features are unique owing to the writer's characteristics and individuality, which makes the identification process challenging. Some features do not provide useful information and may decrease a classifier's performance; thus, a feature selection process is implemented in WI. Feature selection identifies and selects the most significant features from those present in handwriting documents and eliminates the irrelevant ones. In the WI framework, a discretization process is applied before feature selection; discretization has been shown to increase classification performance and improve identification performance in WI. An algorithm and framework of an Improved Random Forest (IRF) tree was applied for the feature selection process. An RF is a collection of tree predictors that ensembles decision tree models with a randomized selection of features at each split, and it uses Classification and Regression Trees (CART) during tree development. Important features are measured using Variable Importance (VI), while Mean Absolute Error (MAE) values are used to identify the variance between writers. The VI value is used for the splitting process in the tree, and the MAE value ensures that the intra-class (same writer) variance is lower than the inter-class (different writer) variance, because lower intra-class variance indicates accuracy to the real author. The number of selected features and the classification accuracy indicate the performance of the feature selection method. Experimental results show that the IRF tree on the discretized dataset identified the third feature (f3) as the most important feature, with an average classification accuracy of 99.19%. For the undiscretized dataset, the first feature (f1) and the third feature (f3) are the most important features, with an average classification accuracy of 40.79%.
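The thesis's Improved Random Forest itself is not reproduced here, but the general mechanism the abstract describes — a forest of CART trees with randomized feature selection at each split, ranked by Variable Importance — can be sketched with scikit-learn's standard `RandomForestClassifier`. The synthetic "writer" data and the feature labels f1–f4 below are invented purely for illustration and are not from the thesis.

```python
# Illustrative sketch only: standard random forest feature importance (VI),
# not the thesis's IRF algorithm. Data is synthetic: 3 "writers", 4 features,
# where only f3 (index 2) actually discriminates between writers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

n_per_writer = 50
X_parts, y = [], []
for writer in range(3):
    features = rng.normal(0.0, 1.0, size=(n_per_writer, 4))
    features[:, 2] += writer * 5.0   # make f3 carry the writer signal
    X_parts.append(features)
    y.extend([writer] * n_per_writer)
X = np.vstack(X_parts)
y = np.array(y)

# Forest of CART trees; max_features="sqrt" gives the randomized
# feature selection at each split that the abstract describes.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)

# Variable Importance per feature (mean decrease in impurity).
vi = rf.feature_importances_
ranking = np.argsort(vi)[::-1]
print("VI:", np.round(vi, 3))
print("Most important feature: f%d" % (ranking[0] + 1))
```

A feature selection step would then keep only the top-ranked features (here f3) and retrain the classifier on that reduced set, which is the role IRF plays in the WI framework above.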

Item Type: Thesis (Masters)
Uncontrolled Keywords: Writing Identification, Data processing, Neural networks (Computer science), Pattern recognition systems, Graphology, Improved Random Forest
Subjects: T Technology > T Technology (General)
T Technology > TA Engineering (General). Civil engineering (General)
Divisions: Library > Tesis > FTMK
Depositing User: Mohd Hannif Jamaludin
Date Deposited: 04 Aug 2016 03:45
Last Modified: 11 Nov 2020 13:06
URI: http://eprints.utem.edu.my/id/eprint/16842