Sun, Fayou and Ngo, Hea Choon and Sek, Yong Wee and Zuqiang, Meng (2023) Associating multiple vision transformer layers for fine-grained image representation. AI OPEN, 4. pp. 130-136. ISSN 2666-6510
Abstract
Accurate discriminative region proposal plays an important role in fine-grained image recognition. The vision transformer (ViT) has achieved striking results in computer vision thanks to its innate multi-head self-attention mechanism. However, ViT's attention maps become increasingly similar after a certain depth, and because it relies on a classification token for prediction, it cannot effectively select discriminative image patches for fine-grained image classification. To accurately detect discriminative regions, we propose a novel network, AMTrans, which efficiently deepens the model to learn diverse features and exploits integrated raw attention maps to capture more salient features. Specifically, we employ DeepViT as the backbone to alleviate the attention collapse issue. We then fuse the attention weights of all heads within each layer to produce a per-layer attention weight map. After that, we alternately apply recurrent residual refinement blocks to enhance salient features and use a semantic grouping method to propose the discriminative feature region. Extensive experiments show that AMTrans achieves state-of-the-art performance under the same settings on four widely used fine-grained datasets: Stanford Cars, Stanford Dogs, CUB-200-2011, and ImageNet.
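This record reproduces only the abstract; implementation details are in the paper itself. As a rough illustration of the head-fusion and raw-attention-integration ideas the abstract mentions, the following is a minimal PyTorch sketch. The helper names, the averaging over heads, and the rollout-style residual integration (0.5·A + 0.5·I) are illustrative assumptions, not the paper's verified method.

```python
import torch

def fuse_layer_attention(attn: torch.Tensor) -> torch.Tensor:
    """Fuse the per-head attention weights of one transformer layer
    into a single attention weight map by averaging over heads.
    attn: (batch, heads, tokens, tokens) post-softmax attention.
    NOTE: averaging is an assumed fusion rule, not confirmed by the abstract.
    """
    return attn.mean(dim=1)  # (batch, tokens, tokens)

def integrate_raw_attention(per_layer_attn):
    """Integrate fused raw attention maps across layers with an
    attention-rollout-style product that accounts for residual
    connections (0.5 * A + 0.5 * I). This integration rule is an
    assumption for illustration, not necessarily the paper's method.
    """
    batch, _, tokens, _ = per_layer_attn[0].shape
    eye = torch.eye(tokens).expand(batch, tokens, tokens)
    rollout = eye.clone()
    for attn in per_layer_attn:
        fused = 0.5 * fuse_layer_attention(attn) + 0.5 * eye
        fused = fused / fused.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = fused @ rollout  # accumulate attention flow across layers
    return rollout

# Toy usage: 12 layers, 8 heads, 197 tokens (CLS + 14x14 patches).
layers = [torch.softmax(torch.randn(2, 8, 197, 197), dim=-1) for _ in range(12)]
patch_saliency = integrate_raw_attention(layers)[:, 0, 1:]  # CLS row -> patch scores
print(patch_saliency.shape)  # torch.Size([2, 196])
```

The CLS-token row of the integrated map gives a per-patch saliency score, which could then feed a region-selection step such as the semantic grouping the abstract describes.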
| Item Type | Article |
| --- | --- |
| Uncontrolled Keywords | Increasing vision transformer layers, Weight attention fusion, Feature detection enhancement |
| Divisions | Faculty of Information and Communication Technology |
| Depositing User | Norfaradilla Idayu Ab. Ghafar |
| Date Deposited | 06 Jan 2025 11:42 |
| Last Modified | 09 Jan 2025 15:30 |
| URI | http://eprints.utem.edu.my/id/eprint/28202 |