SIG-Former: monocular surgical instruction generation with transformers
PURPOSE: Automatic surgical instruction generation is a crucial part for intra-operative surgical assistance. However, understanding and translating surgical activities into human-like sentences are particularly challenging due to the complexity of surgical environment and the modal gap between imag...
Main Authors: | Zhang, Jinglu; Nie, Yinyu; Chang, Jian; Zhang, Jian Jun |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9652298/ https://www.ncbi.nlm.nih.gov/pubmed/35900645 http://dx.doi.org/10.1007/s11548-022-02718-9 |
_version_ | 1784828438547267584 |
---|---|
author | Zhang, Jinglu Nie, Yinyu Chang, Jian Zhang, Jian Jun |
author_facet | Zhang, Jinglu Nie, Yinyu Chang, Jian Zhang, Jian Jun |
author_sort | Zhang, Jinglu |
collection | PubMed |
description | PURPOSE: Automatic surgical instruction generation is a crucial part for intra-operative surgical assistance. However, understanding and translating surgical activities into human-like sentences are particularly challenging due to the complexity of surgical environment and the modal gap between images and natural languages. To this end, we introduce SIG-Former, a transformer-backboned generation network to predict surgical instructions from monocular RGB images. METHODS: Taking a surgical image as input, we first extract its visual attentive feature map with a fine-tuned ResNet-101 model, followed by transformer attention blocks to correspondingly model its visual representation, text embedding and visual–textual relational feature. To tackle the loss-metric inconsistency between training and inference in sequence generation, we additionally apply a self-critical reinforcement learning approach to directly optimize the CIDEr score after regular training. RESULTS: We validate our proposed method on DAISI dataset, which contains 290 clinical procedures from diverse medical subjects. Extensive experiments demonstrate that our method outperforms the baselines and achieves promising performance on both quantitative and qualitative evaluations. CONCLUSION: Our experiments demonstrate that SIG-Former is capable of mapping dependencies between visual feature and textual information. Besides, surgical instruction generation is still at its preliminary stage. Future works include collecting large clinical dataset, annotating more reference instructions and preparing pre-trained models on medical images. |
format | Online Article Text |
id | pubmed-9652298 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-9652298 2022-11-15 SIG-Former: monocular surgical instruction generation with transformers Zhang, Jinglu Nie, Yinyu Chang, Jian Zhang, Jian Jun Int J Comput Assist Radiol Surg Original Article PURPOSE: Automatic surgical instruction generation is a crucial part for intra-operative surgical assistance. However, understanding and translating surgical activities into human-like sentences are particularly challenging due to the complexity of surgical environment and the modal gap between images and natural languages. To this end, we introduce SIG-Former, a transformer-backboned generation network to predict surgical instructions from monocular RGB images. METHODS: Taking a surgical image as input, we first extract its visual attentive feature map with a fine-tuned ResNet-101 model, followed by transformer attention blocks to correspondingly model its visual representation, text embedding and visual–textual relational feature. To tackle the loss-metric inconsistency between training and inference in sequence generation, we additionally apply a self-critical reinforcement learning approach to directly optimize the CIDEr score after regular training. RESULTS: We validate our proposed method on DAISI dataset, which contains 290 clinical procedures from diverse medical subjects. Extensive experiments demonstrate that our method outperforms the baselines and achieves promising performance on both quantitative and qualitative evaluations. CONCLUSION: Our experiments demonstrate that SIG-Former is capable of mapping dependencies between visual feature and textual information. Besides, surgical instruction generation is still at its preliminary stage. Future works include collecting large clinical dataset, annotating more reference instructions and preparing pre-trained models on medical images.
Springer International Publishing 2022-07-28 2022 /pmc/articles/PMC9652298/ /pubmed/35900645 http://dx.doi.org/10.1007/s11548-022-02718-9 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Original Article Zhang, Jinglu Nie, Yinyu Chang, Jian Zhang, Jian Jun SIG-Former: monocular surgical instruction generation with transformers |
title | SIG-Former: monocular surgical instruction generation with transformers |
title_full | SIG-Former: monocular surgical instruction generation with transformers |
title_fullStr | SIG-Former: monocular surgical instruction generation with transformers |
title_full_unstemmed | SIG-Former: monocular surgical instruction generation with transformers |
title_short | SIG-Former: monocular surgical instruction generation with transformers |
title_sort | sig-former: monocular surgical instruction generation with transformers |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9652298/ https://www.ncbi.nlm.nih.gov/pubmed/35900645 http://dx.doi.org/10.1007/s11548-022-02718-9 |
work_keys_str_mv | AT zhangjinglu sigformermonocularsurgicalinstructiongenerationwithtransformers AT nieyinyu sigformermonocularsurgicalinstructiongenerationwithtransformers AT changjian sigformermonocularsurgicalinstructiongenerationwithtransformers AT zhangjianjun sigformermonocularsurgicalinstructiongenerationwithtransformers |
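
Note on the abstract's METHODS: the self-critical reinforcement learning step can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `toy_reward` and `self_critical_loss` are hypothetical, and the reward is a simple unigram-overlap F1 standing in for CIDEr, which the abstract names as the actual optimization target. The idea shown is that a sampled sentence is rewarded relative to the model's own greedy output, so the training signal follows the evaluation metric rather than token-level cross-entropy.

```python
import math
from collections import Counter


def toy_reward(candidate, reference):
    """Unigram-overlap F1 between two sentences.

    Assumption for illustration only: this is a stand-in for CIDEr,
    not an implementation of it.
    """
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def self_critical_loss(sample_log_probs, sampled, greedy, reference,
                       reward_fn=toy_reward):
    """Self-critical policy-gradient loss for one sampled sentence.

    sample_log_probs: log-probability of each token in the sampled sentence.
    The greedy decode acts as the baseline: the sample is reinforced only
    if it scores higher than what the model would output at test time.
    """
    advantage = reward_fn(sampled, reference) - reward_fn(greedy, reference)
    return -advantage * sum(sample_log_probs)


# Hypothetical example sentences, not taken from the DAISI dataset.
reference = "insert the needle"
sampled = "insert the needle"
greedy = "insert needle now"
log_probs = [math.log(0.5)] * 3
loss = self_critical_loss(log_probs, sampled, greedy, reference)
```

Here the sampled sentence beats the greedy baseline, so the loss is positive and minimizing it raises the probability of the sampled tokens. When the sample and the baseline score the same, the loss is zero and no update is made.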