Emran, Nurul Akmar and Maskat, Ruhaila (2025) A review of visualization techniques for duplicate detection in cancer datasets. International Journal of Advanced Computer Science and Applications, 16 (9). pp. 620-628. ISSN 2156-5570
Text: 00282291020251218272384.pdf (339kB). Available under License Creative Commons Attribution.
Abstract
As clinical cancer research increasingly depends on large, diverse datasets, concerns about data duplication have grown. Duplicates can undermine data integrity, skew analytical results, and reduce the reproducibility of studies. This review explores how visualization can play a critical role in identifying and managing duplicates in non-image clinical cancer data. Drawing from literature in biomedical informatics, data quality, and visual analytics, it synthesizes current approaches and highlights key challenges. Using a scoping review methodology, we analyzed studies published over the past two decades, focusing on non-image clinical datasets. Studies were selected based on relevance to duplicate detection and visualization, excluding those centered on image or video data. Major datasets like The Cancer Genome Atlas (TCGA), The Cancer Imaging Archive (TCIA), and the North American Association of Central Cancer Registries (NAACCR) are examined to show how duplication occurs across genomic, clinical, and registry data. The review assesses existing visualization techniques based on their scalability, interactivity, integration with deduplication algorithms, and how well they address core data quality dimensions. While some tools offer scalable and interactive features, few provide clear visual representations of duplicates, especially those involving complex temporal and multidimensional patterns. Several methodological gaps are identified, including limited integration of data quality metrics, inadequate support for tracking changes over time, and a lack of standardized evaluation frameworks. To address these issues, the review advocates for the development of practical, user-friendly visualization tools that combine duplicate detection with key indicators of data quality. 
By offering a more complete and intuitive view of clinical datasets, such tools can help researchers and clinicians make better-informed decisions, ultimately improving the reliability and impact of cancer research. Bridging the gap between technical detection and visual understanding is essential for advancing data-driven healthcare and ensuring high-quality, reproducible outcomes.
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Duplicate detection, Data duplication, Visualization, Deduplication, TCGA, TCIA, NAACCR |
| Divisions: | Faculty of Information and Communication Technology |
| Depositing User: | Norfaradilla Idayu Ab. Ghafar |
| Date Deposited: | 05 Dec 2025 03:51 |
| Last Modified: | 05 Dec 2025 03:51 |
| URI: | http://eprints.utem.edu.my/id/eprint/29146 |
