CTT: CNN Meets Transformer for Tracking

Bibliographic Details
Main authors: Yang, Chen; Zhang, Ximing; Song, Zongxi
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9105974/
https://www.ncbi.nlm.nih.gov/pubmed/35590900
http://dx.doi.org/10.3390/s22093210
Description
Summary: Siamese networks are one of the most popular directions in deep-learning-based visual object tracking. In Siamese networks, the feature pyramid network (FPN) performs feature fusion, and cross-correlation performs the matching of features extracted from the template and search branches. However, object tracking should focus on global and contextual dependencies. Hence, we introduce a carefully designed residual transformer structure, an encoder-decoder containing a self-attention mechanism, into our tracker as part of the neck. Under the encoder-decoder structure, the encoder promotes the interaction between the low-level features extracted by the CNN from the target and search branches to obtain global attention information, while the decoder replaces cross-correlation to send the global attention information into the head module. We add a spatial and channel attention component in the target branch, which further improves the accuracy and robustness of our proposed model at low cost. Finally, we evaluate our tracker CTT in detail on the GOT-10k, VOT2019, OTB-100, LaSOT, NfS, UAV123 and TrackingNet benchmarks, and our proposed method obtains results competitive with state-of-the-art algorithms.
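
The contrast the summary draws between cross-correlation matching and attention-based fusion can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_correlation`, `cross_attention`), the 1-D signals, and the two-token feature sets are all hypothetical, and the choice of search features as queries with template features as keys and values is an assumption made only for illustration.

```python
import math

def cross_correlation(template, search):
    """Slide a 1-D template over a 1-D search signal.

    Returns one similarity score per offset (a purely local match).
    """
    n, m = len(search), len(template)
    return [sum(template[j] * search[i + j] for j in range(m))
            for i in range(n - m + 1)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends to every key.

    Each output row is a weighted average of all value rows, so it can
    mix information from the whole key/value set (a global match).
    """
    d = len(keys[0])
    width = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(width)])
    return out

# Hypothetical toy data: a 2-sample template and a 4-sample search signal.
response = cross_correlation([1.0, 2.0], [0.0, 1.0, 2.0, 0.0])
best_offset = response.index(max(response))

# Hypothetical 2-D feature tokens for the search and template branches.
search_tokens = [[1.0, 0.0], [0.0, 1.0]]
template_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(search_tokens, template_tokens, template_tokens)
```

In this sketch the correlation response peaks at the offset where the template aligns with the search signal, while each attention output blends all template tokens with weights that sum to one. That blending is the sense in which attention can capture global dependencies that a sliding local match does not.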