Instance Sequence Queries for Video Instance Segmentation with Transformers
Existing methods for video instance segmentation (VIS) mostly rely on two strategies: (1) building sophisticated post-processing to associate frame-level segmentation results and (2) modeling a video clip as a 3D spatial-temporal volume with limited resolution and length due to memory constrain...
Main Authors: | Xu, Zhujun; Vivet, Damien |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8271470/ https://www.ncbi.nlm.nih.gov/pubmed/34209420 http://dx.doi.org/10.3390/s21134507 |
_version_ | 1783721009805787136 |
---|---|
author | Xu, Zhujun Vivet, Damien |
author_facet | Xu, Zhujun Vivet, Damien |
author_sort | Xu, Zhujun |
collection | PubMed |
description | Existing methods for video instance segmentation (VIS) mostly rely on two strategies: (1) building sophisticated post-processing to associate frame-level segmentation results and (2) modeling a video clip as a 3D spatial-temporal volume with limited resolution and length due to memory constraints. In this work, we propose a frame-to-frame method built upon transformers. We use a set of queries, called instance sequence queries (ISQs), to drive the transformer decoder and produce results at each frame. Each query represents one instance in a video clip. By extending the bipartite matching loss to two frames, our training procedure enables the decoder to adjust the ISQs during inference. The consistency of instances is preserved by the corresponding order between query slots and network outputs. As a result, there is no need for complex data association. On a TITAN Xp GPU, our method achieves a competitive 34.4% mAP at 33.5 FPS with ResNet-50 and 35.5% mAP at 26.6 FPS with ResNet-101 on the Youtube-VIS dataset. |
format | Online Article Text |
id | pubmed-8271470 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-82714702021-07-11 Instance Sequence Queries for Video Instance Segmentation with Transformers Xu, Zhujun Vivet, Damien Sensors (Basel) Article Existing methods for video instance segmentation (VIS) mostly rely on two strategies: (1) building a sophisticated post-processing to associate frame level segmentation results and (2) modeling a video clip as a 3D spatial-temporal volume with a limit of resolution and length due to memory constraints. In this work, we propose a frame-to-frame method built upon transformers. We use a set of queries, called instance sequence queries (ISQs), to drive the transformer decoder and produce results at each frame. Each query represents one instance in a video clip. By extending the bipartite matching loss to two frames, our training procedure enables the decoder to adjust the ISQs during inference. The consistency of instances is preserved by the corresponding order between query slots and network outputs. As a result, there is no need for complex data association. On TITAN Xp GPU, our method achieves a competitive 34.4% mAP at 33.5 FPS with ResNet-50 and 35.5% mAP at 26.6 FPS with ResNet-101 on the Youtube-VIS dataset. MDPI 2021-06-30 /pmc/articles/PMC8271470/ /pubmed/34209420 http://dx.doi.org/10.3390/s21134507 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Xu, Zhujun Vivet, Damien Instance Sequence Queries for Video Instance Segmentation with Transformers |
title | Instance Sequence Queries for Video Instance Segmentation with Transformers |
title_full | Instance Sequence Queries for Video Instance Segmentation with Transformers |
title_fullStr | Instance Sequence Queries for Video Instance Segmentation with Transformers |
title_full_unstemmed | Instance Sequence Queries for Video Instance Segmentation with Transformers |
title_short | Instance Sequence Queries for Video Instance Segmentation with Transformers |
title_sort | instance sequence queries for video instance segmentation with transformers |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8271470/ https://www.ncbi.nlm.nih.gov/pubmed/34209420 http://dx.doi.org/10.3390/s21134507 |
work_keys_str_mv | AT xuzhujun instancesequencequeriesforvideoinstancesegmentationwithtransformers AT vivetdamien instancesequencequeriesforvideoinstancesegmentationwithtransformers |
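
The description field in this record mentions extending the bipartite matching loss to two frames, so that each query slot stays tied to one instance. The record does not give the actual loss, so the following is only a minimal sketch of one plausible reading: the per-frame matching costs are summed and a single assignment is solved, which forces a query slot to match the same instance in both frames. The function name `match_two_frames`, the cost values, and the brute-force search are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import permutations


def match_two_frames(cost_a, cost_b):
    """Pick one query slot per instance, shared across two frames.

    cost_a[q][j] and cost_b[q][j] are hypothetical matching costs between
    query slot q and ground-truth instance j in frame A and frame B.
    Assumes there are at least as many query slots as instances.
    Returns (slots, total) where slots[j] is the query slot for instance j.
    """
    num_queries = len(cost_a)
    num_instances = len(cost_a[0])
    best_slots, best_cost = None, float("inf")
    # Brute force over ordered choices of distinct slots; only viable for
    # tiny examples like this one.
    for slots in permutations(range(num_queries), num_instances):
        total = sum(cost_a[q][j] + cost_b[q][j] for j, q in enumerate(slots))
        if total < best_cost:
            best_slots, best_cost = slots, total
    return list(best_slots), best_cost


# Illustrative costs: 3 query slots, 2 instances, 2 frames.
cost_frame_1 = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
cost_frame_2 = [[0.2, 0.7], [0.9, 0.1], [0.4, 0.6]]

assignment, total = match_two_frames(cost_frame_1, cost_frame_2)
print(assignment, round(total, 3))
```

Here slot 0 is matched to instance 0 and slot 1 to instance 1 in both frames, and slot 2 is left unmatched. For realistic numbers of queries, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` applied to the summed cost matrix would replace the brute-force loop.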