Big data technology information extraction and fusion from non-homogenous web data sources

Lee, Qi Zian (2025) Big data technology information extraction and fusion from non-homogenous web data sources. Masters thesis, Universiti Teknikal Malaysia Melaka.

Abstract

Big data plays an ever-increasing role in various sectors of the economy. Despite the availability of big data technologies, many companies and organizations in Malaysia remain reluctant to adopt them. This study develops a web extraction framework that extracts data from the internet to assist the adoption of big data technology. Web scraping is a popular method for collecting data from websites because data on the internet is updated frequently, which makes it a good source of accurate information. Data analysis requires a large quantity of information to yield a good result. However, because websites are non-homogeneous, data drawn from different web sources can differ, making the quality of the data inconsistent. A previous study proposed a record linkage method to merge data from multiple websites. That method uses a deterministic technique, which merges records by comparing the strings of the matching variable. However, the deterministic technique matches only when the matching variable is an exact match. Deterministic matching therefore cannot account for dissimilarities such as spacing and letter case, which are common in web data because of its non-homogeneous nature. This study explores the use of a fuzzy matching technique for matching web data. Fuzzy matching uses the Levenshtein distance to calculate the similarity of strings, and a threshold decides how similar two strings must be to trigger a match. This enables fuzzy matching to match strings that match only partially rather than exactly. The study begins with a systematic review to determine the challenges of big data adoption and what data to extract. It applies the Technology-Organization-Environment (TOE) framework to examine the challenges faced by Malaysian organizations with regard to big data adoption.
After the systematic review, a web data extraction framework is developed to extract data that can assist big data adoption. The extracted data is then merged to enhance its quality. Deterministic matching and fuzzy matching are compared on their performance in merging web data. The comparison shows that fuzzy matching performs slightly better in merging web data, because it can match strings of the matching variable that differ in spacing and letter case. A survey case study carried out in this study also shows that the extracted data is very helpful to users when purchasing the required big data software on the software market.
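
The difference between the two matching techniques described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the normalisation of the Levenshtein distance by the longer string's length, the 0.8 threshold, and the sample product names are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalising by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def deterministic_match(a: str, b: str) -> bool:
    """Deterministic matching: the two strings must be identical."""
    return a == b


def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy matching: similarity must reach the (assumed) threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    # Hypothetical product names as they might appear on two different websites.
    pairs = [
        ("Apache Spark", "apache spark"),  # differs in letter case
        ("Apache Spark", "ApacheSpark"),   # differs in spacing
        ("Apache Spark", "Apache Kafka"),  # a different product
    ]
    for left, right in pairs:
        print(left, "|", right,
              "| deterministic:", deterministic_match(left, right),
              "| fuzzy:", fuzzy_match(left, right))
```

In this sketch the first two pairs fail the exact comparison but pass the fuzzy one, while the third pair fails both. The threshold controls the trade-off: a lower value merges more variant spellings but raises the risk of merging records for different products.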

Item Type: Thesis (Masters)
Uncontrolled Keywords: Big data, Information extraction, Data fusion, Fuzzy matching, Web data
Subjects: Q Science
Q Science > QA Mathematics
Divisions: Faculty of Information and Communication Technology
Depositing User: Norhairol Khalid
Date Deposited: 10 Oct 2025 07:58
Last Modified: 10 Oct 2025 07:58
URI: http://eprints.utem.edu.my/id/eprint/29010
