SDA-CLIP: surgical visual domain adaptation using video and text labels

BACKGROUND: Surgical action recognition is an essential technology for context-aware autonomous surgery, but its accuracy is limited by the scale of clinical datasets. Leveraging surgical videos from virtual reality (VR) simulations to develop algorithms for the clinical domain, an approach known as domain adaptation, can effectively reduce the cost of data acquisition and annotation and protect patient privacy.

METHODS: We introduce a surgical domain adaptation method based on the contrastive language-image pretraining model (SDA-CLIP) to recognize cross-domain surgical actions. Specifically, we use a Vision Transformer (ViT) and a Transformer to extract video and text embeddings, respectively. The text embedding serves as a bridge between the VR and clinical domains, and inter- and intra-modality loss functions enforce consistency among embeddings of the same class. We evaluate our method on the MICCAI 2020 EndoVis Challenge SurgVisDom dataset.

RESULTS: SDA-CLIP achieved a weighted F1-score of 65.9% (+18.9%) on the hard domain adaptation task (trained only with VR data) and 84.4% (+4.4%) on the soft domain adaptation task (trained with VR and clinical-like data), outperforming the first-place team of the challenge by a significant margin.

CONCLUSIONS: The proposed SDA-CLIP model effectively extracts video scene information and textual semantic information, which greatly improves cross-domain surgical action recognition. The code is available at https://github.com/Lycus99/SDA-CLIP.
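
As a reading aid, the two loss functions named in the abstract can be sketched in a few lines of PyTorch. The snippet below is an illustrative reconstruction, not the authors' implementation (see https://github.com/Lycus99/SDA-CLIP for the official code); the function names, the temperature value, and the supervised-contrastive form of the intra-modality term are assumptions.

import torch
import torch.nn.functional as F

def inter_modality_loss(video_emb, text_emb, temperature=0.07):
    # CLIP-style symmetric contrastive loss between N video clips and the
    # text embeddings of their action labels; both inputs have shape (N, D).
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

def intra_modality_loss(video_emb, labels, temperature=0.07):
    # Supervised contrastive term: pull video embeddings of the same
    # surgical action class together, whether a clip comes from the VR
    # or the clinical(-like) domain.
    v = F.normalize(video_emb, dim=-1)
    sim = v @ v.T / temperature
    eye = torch.eye(v.size(0), dtype=torch.bool, device=v.device)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]) & ~eye  # same-class pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Average log-probability over each anchor's positives; anchors with
    # no positive pair are skipped.
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor[pos.any(1)].mean()

# Toy call: 4 clips, 512-d embeddings, 2 action classes. In SDA-CLIP the
# embeddings would come from the ViT video encoder and the Transformer
# text encoder, respectively.
video_emb = torch.randn(4, 512)
text_emb = torch.randn(4, 512)
labels = torch.tensor([0, 0, 1, 1])
loss = inter_modality_loss(video_emb, text_emb) + intra_modality_loss(video_emb, labels)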

Bibliographic Details
Main Authors: Li, Yuchong, Jia, Shuangfu, Song, Guangbi, Wang, Ping, Jia, Fucang
Format: Online Article Text
Language: English
Published: AME Publishing Company, 2023
Subjects: Original Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10585553/
https://www.ncbi.nlm.nih.gov/pubmed/37869278
http://dx.doi.org/10.21037/qims-23-376
Collection: PubMed
ID: pubmed-10585553
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: Quant Imaging Med Surg (Original Article)
Published Online: 2023-09-21; Issue Date: 2023-10-01
License: © 2023 Quantitative Imaging in Medicine and Surgery. All rights reserved. This is an Open Access article distributed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits non-commercial replication and distribution of the article provided that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and to the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/