
A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application

BACKGROUND: Using visual, biological, and electronic health records data as the sole input source, pretrained convolutional neural networks and conventional machine learning methods have been heavily employed for the identification of various malignancies. Initially, a series of preprocessing steps...


Bibliographic Details
Main authors: Mokoatle, Mpho; Marivate, Vukosi; Mapiye, Darlington; Bornman, Riana; Hayes, Vanessa. M.
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10037872/
https://www.ncbi.nlm.nih.gov/pubmed/36959534
http://dx.doi.org/10.1186/s12859-023-05235-x
author Mokoatle, Mpho
Marivate, Vukosi
Mapiye, Darlington
Bornman, Riana
Hayes, Vanessa. M.
collection PubMed
description BACKGROUND: Pretrained convolutional neural networks and conventional machine learning methods have been heavily employed to identify various malignancies, using visual, biological, or electronic health record data as the sole input source. Typically, a series of preprocessing and image segmentation steps is first performed to separate region-of-interest features from noisy ones. The extracted features are then passed to several machine learning and deep learning methods for cancer detection. METHODS: This work reviews all the methods that have been applied to develop machine learning algorithms that detect cancer. Although there are more than 100 types of cancer, this study examines research on only the four most common and prevalent cancers worldwide: lung, breast, prostate, and colorectal cancer. Next, using two state-of-the-art sentence transformers, SBERT (2019) and the unsupervised SimCSE (2021), this study proposes a new methodology for detecting cancer. The method requires raw DNA sequences of matched tumor/normal pairs as the only input. The learnt DNA representations retrieved from SBERT and SimCSE are then passed to machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification. As far as we are aware, SBERT and SimCSE transformers have not previously been applied to represent DNA sequences in cancer detection settings. RESULTS: XGBoost was the best-performing classifier, with the highest overall accuracy of 73 ± 0.13% using SBERT embeddings and 75 ± 0.12% using SimCSE embeddings. In light of these findings, it can be concluded that incorporating sentence representations from SimCSE’s sentence transformer only marginally improved the performance of machine learning models. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05235-x.
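The input stage described in the abstract (raw DNA sequences turned into sentence-like text, encoded, then handed to a classifier) might be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how sequences are tokenized, so the overlapping k-mer "sentence" is an assumption, and `toy_embed` is a hypothetical stand-in for a real SBERT or SimCSE encoder so that the example runs with the standard library alone. All function names here are invented for illustration.

```python
from typing import Callable, List, Sequence


def kmer_sentence(sequence: str, k: int = 6) -> str:
    """Turn a raw DNA string into a space-separated 'sentence' of overlapping k-mers.

    Assumption: k-mer tokenization is one common way to make DNA look like text;
    the abstract does not state which tokenization the study used.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def toy_embed(sentence: str, dim: int = 16) -> List[float]:
    """Placeholder encoder (NOT SBERT or SimCSE): normalised k-mer counts in hash buckets."""
    vec = [0.0] * dim
    tokens = sentence.split()
    for tok in tokens:
        # Deterministic bucket index, so results are stable across runs.
        bucket = sum(ord(c) * (i + 1) for i, c in enumerate(tok)) % dim
        vec[bucket] += 1.0
    total = len(tokens) or 1
    return [v / total for v in vec]


def featurize(
    sequences: Sequence[str],
    k: int = 6,
    embed: Callable[[str], List[float]] = toy_embed,
) -> List[List[float]]:
    """Map each DNA sequence to a fixed-length vector via k-mer sentence, then encoder."""
    return [embed(kmer_sentence(s, k)) for s in sequences]


# Hypothetical labelled examples (1 = tumor, 0 = normal); not data from the study.
samples = [("ACGTACGTAC", 1), ("ACGTTCGTAC", 0)]
X = featurize([seq for seq, _ in samples], k=4)
y = [label for _, label in samples]
```

In a real experiment, the `embed` argument would be replaced by a call into a trained sentence encoder, and `X` and `y` would be passed to a classifier such as XGBoost, Random Forest, or LightGBM, as the abstract describes.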
format Online
Article
Text
id pubmed-10037872
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10037872 2023-03-25 A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application Mokoatle, Mpho Marivate, Vukosi Mapiye, Darlington Bornman, Riana Hayes, Vanessa. M. BMC Bioinformatics Research BioMed Central 2023-03-23 /pmc/articles/PMC10037872/ /pubmed/36959534 http://dx.doi.org/10.1186/s12859-023-05235-x Text en © The Author(s) 2023
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application
topic Research