
SIG-Former: monocular surgical instruction generation with transformers


Bibliographic Details
Main Authors: Zhang, Jinglu, Nie, Yinyu, Chang, Jian, Zhang, Jian Jun
Format: Online Article Text
Language: English
Published: Springer International Publishing 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9652298/
https://www.ncbi.nlm.nih.gov/pubmed/35900645
http://dx.doi.org/10.1007/s11548-022-02718-9
_version_ 1784828438547267584
author Zhang, Jinglu
Nie, Yinyu
Chang, Jian
Zhang, Jian Jun
author_facet Zhang, Jinglu
Nie, Yinyu
Chang, Jian
Zhang, Jian Jun
author_sort Zhang, Jinglu
collection PubMed
description PURPOSE: Automatic surgical instruction generation is a crucial part of intra-operative surgical assistance. However, understanding and translating surgical activities into human-like sentences is particularly challenging due to the complexity of the surgical environment and the modal gap between images and natural language. To this end, we introduce SIG-Former, a transformer-backboned generation network that predicts surgical instructions from monocular RGB images. METHODS: Taking a surgical image as input, we first extract its visual attentive feature map with a fine-tuned ResNet-101 model, followed by transformer attention blocks that correspondingly model its visual representation, text embedding and visual–textual relational feature. To tackle the loss-metric inconsistency between training and inference in sequence generation, we additionally apply a self-critical reinforcement learning approach to directly optimize the CIDEr score after regular training. RESULTS: We validate the proposed method on the DAISI dataset, which contains 290 clinical procedures from diverse medical subjects. Extensive experiments demonstrate that our method outperforms the baselines and achieves promising performance in both quantitative and qualitative evaluations. CONCLUSION: Our experiments demonstrate that SIG-Former is capable of mapping dependencies between visual features and textual information. However, surgical instruction generation is still at a preliminary stage. Future work includes collecting larger clinical datasets, annotating more reference instructions and preparing models pre-trained on medical images.
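For a concrete picture of the pipeline the METHODS paragraph describes, the following is a minimal PyTorch sketch: a fine-tuned ResNet-101 backbone produces a spatial feature map, and a transformer encoder-decoder attends over it while decoding instruction tokens. This is an illustrative assumption about the architecture's general shape, not the authors' implementation; all module names and hyperparameters are hypothetical.

# Hypothetical sketch of the METHODS pipeline: ResNet-101 features feed a
# transformer encoder-decoder that generates instruction tokens.
import torch
import torch.nn as nn
import torchvision.models as models

class InstructionGenerator(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        # Visual backbone: ResNet-101 with pooling/classifier removed, so an
        # image maps to a 2048-channel spatial feature map.
        backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)            # feature map -> model width
        self.embed = nn.Embedding(vocab_size, d_model)  # text embedding
        # Encoder attends over visual tokens; the decoder cross-attends to
        # them, modelling the visual-textual relation. Positional encodings
        # are omitted for brevity.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.cnn(images)                  # (B, 2048, h, w)
        feats = feats.flatten(2).transpose(1, 2)  # (B, h*w, 2048) visual tokens
        memory = self.proj(feats)                 # (B, h*w, d_model)
        tgt = self.embed(tokens)                  # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)     # causal mask for decoding
        out = self.transformer(memory, tgt, tgt_mask=mask)
        return self.head(out)                     # (B, T, vocab) token logits

# Usage with dummy inputs:
model = InstructionGenerator(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))

In the setting the abstract describes, such a decoder would first be trained with cross-entropy and then fine-tuned with the self-critical objective sketched next.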
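The self-critical step (directly optimizing CIDEr after regular training) follows the general shape of self-critical sequence training: the CIDEr score of a greedy rollout serves as the reward baseline for a sampled rollout, and REINFORCE weights the sampled tokens' log-probabilities by the advantage. A minimal sketch, assuming the per-token log-probabilities and per-caption CIDEr scores have already been produced by hypothetical decoding and scoring helpers:

# Hypothetical sketch of the self-critical objective: REINFORCE with the
# greedy caption's CIDEr score as the reward baseline.
import torch

def self_critical_loss(log_probs, sampled_cider, greedy_cider):
    # log_probs: (B, T) log-probabilities of the sampled tokens.
    # sampled_cider / greedy_cider: (B,) CIDEr of the sampled and
    # greedy-decoded captions against the reference instructions,
    # computed by an external scorer (not shown here).
    advantage = sampled_cider - greedy_cider  # reward relative to baseline
    # Gradient ascends the advantage-weighted log-likelihood of the sample.
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()

# Example with dummy values:
loss = self_critical_loss(torch.randn(4, 12).log_softmax(-1),
                          torch.tensor([0.8, 0.5, 0.9, 0.3]),
                          torch.tensor([0.6, 0.6, 0.7, 0.4]))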
format Online
Article
Text
id pubmed-9652298
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-9652298 2022-11-15 SIG-Former: monocular surgical instruction generation with transformers Zhang, Jinglu; Nie, Yinyu; Chang, Jian; Zhang, Jian Jun. Int J Comput Assist Radiol Surg, Original Article (abstract as in the description field above). Springer International Publishing 2022-07-28 2022 /pmc/articles/PMC9652298/ /pubmed/35900645 http://dx.doi.org/10.1007/s11548-022-02718-9 Text en © The Author(s) 2022. Open Access: licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, provided appropriate credit is given to the original author(s) and the source, a link to the licence is provided, and any changes made are indicated.
spellingShingle Original Article
Zhang, Jinglu
Nie, Yinyu
Chang, Jian
Zhang, Jian Jun
SIG-Former: monocular surgical instruction generation with transformers
title SIG-Former: monocular surgical instruction generation with transformers
title_full SIG-Former: monocular surgical instruction generation with transformers
title_fullStr SIG-Former: monocular surgical instruction generation with transformers
title_full_unstemmed SIG-Former: monocular surgical instruction generation with transformers
title_short SIG-Former: monocular surgical instruction generation with transformers
title_sort sig-former: monocular surgical instruction generation with transformers
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9652298/
https://www.ncbi.nlm.nih.gov/pubmed/35900645
http://dx.doi.org/10.1007/s11548-022-02718-9
work_keys_str_mv AT zhangjinglu sigformermonocularsurgicalinstructiongenerationwithtransformers
AT nieyinyu sigformermonocularsurgicalinstructiongenerationwithtransformers
AT changjian sigformermonocularsurgicalinstructiongenerationwithtransformers
AT zhangjianjun sigformermonocularsurgicalinstructiongenerationwithtransformers