Fayou, Sun and Meng, Zuqiang and Ngo, Hea Choon and Sek, Yong Wee (2024) Clustering swap prediction for image-text pre-training. Scientific Reports, 14 (1). ISSN 2045-2322
Abstract
It is essential to study pre-training strategies for multimodal models, since they have a clear impact on downstream tasks. Clustering learning has recently delivered noteworthy benefits across multiple methods. However, due to the availability of open image-text pairs, clustering learning is challenging to apply to multimodal models. In this paper, we propose an approach that uses a clustering swap prediction strategy to learn an image-text clustering embedding space through interaction prediction between image and text features. Unlike existing models with clustering learning, our method (Clus) allows for an open number of clusters for web-scale alt-text data. Furthermore, to train the image and text encoders efficiently, we introduce a distillation learning approach and evaluate the performance of the image encoder on downstream visual tasks. In addition, Clus is pre-trained end-to-end on large-scale image-text pairs. Specifically, both text and image serve as ground truth for swap prediction, enabling effective representation learning. Extensive experiments demonstrate that Clus achieves state-of-the-art performance on multiple downstream fine-tuning and zero-shot tasks (i.e., Image-Text Retrieval, VQA, NLVR2, Image Captioning, Object Detection, and Semantic Segmentation).
Item Type: | Article |
---|---|
Uncontrolled Keywords: | Cluster number, Clustering learning, Model pre-training, Swap prediction |
Divisions: | Faculty of Information and Communication Technology |
Depositing User: | Sabariah Ismail |
Date Deposited: | 25 Jul 2024 09:04 |
Last Modified: | 25 Jul 2024 09:04 |
URI: | http://eprints.utem.edu.my/id/eprint/27536 |