Deep active learning for classifying cancer pathology reports

BACKGROUND: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount o...

Bibliographic Details
Main Authors: De Angeli, Kevin, Gao, Shang, Alawad, Mohammed, Yoon, Hong-Jun, Schaefferkoetter, Noah, Wu, Xiao-Cheng, Durbin, Eric B., Doherty, Jennifer, Stroup, Antoinette, Coyle, Linda, Penberthy, Lynne, Tourassi, Georgia
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7941989/
https://www.ncbi.nlm.nih.gov/pubmed/33750288
http://dx.doi.org/10.1186/s12859-021-04047-1
_version_ 1783662228112670720
author De Angeli, Kevin
Gao, Shang
Alawad, Mohammed
Yoon, Hong-Jun
Schaefferkoetter, Noah
Wu, Xiao-Cheng
Durbin, Eric B.
Doherty, Jennifer
Stroup, Antoinette
Coyle, Linda
Penberthy, Lynne
Tourassi, Georgia
author_sort De Angeli, Kevin
collection PubMed
description BACKGROUND: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a convolutional neural network as the text classification model. RESULTS: We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner among the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that, compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes. CONCLUSIONS: Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04047-1.
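The marginal and ratio uncertainty sampling strategies named in the abstract can be illustrated with a minimal sketch. This record does not give the paper's exact formulas, so the code below assumes the common textbook definitions: the margin score is the gap between the two highest predicted class probabilities, and the ratio score is the second-highest probability divided by the highest. The function names, the toy `pool` of softmax outputs, and the batch size `k` are hypothetical and are not taken from the article.

```python
import random
from typing import List, Sequence


def margin_score(probs: Sequence[float]) -> float:
    """Gap between the two highest class probabilities (smaller = more uncertain).

    Assumes at least two classes and probabilities that sum to 1.
    """
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def ratio_score(probs: Sequence[float]) -> float:
    """Second-highest probability over the highest (closer to 1 = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return second / top


def select_by_margin(pool_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k unlabelled samples with the smallest margin."""
    order = sorted(range(len(pool_probs)), key=lambda i: margin_score(pool_probs[i]))
    return order[:k]


def select_by_ratio(pool_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k unlabelled samples with the largest ratio."""
    order = sorted(
        range(len(pool_probs)),
        key=lambda i: ratio_score(pool_probs[i]),
        reverse=True,
    )
    return order[:k]


def select_random(n: int, k: int, seed: int = 0) -> List[int]:
    """Baseline with no active learning: k indices drawn uniformly at random."""
    return random.Random(seed).sample(range(n), k)


# Hypothetical softmax outputs of a classifier on four unlabelled reports.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain
    [0.50, 0.49, 0.01],  # most uncertain between the top two classes
    [0.70, 0.20, 0.10],
]

margin_batch = select_by_margin(pool, k=2)
ratio_batch = select_by_ratio(pool, k=2)
random_batch = select_random(len(pool), k=2)
```

In an active learning loop of the kind the abstract describes, the selected indices would be sent to human annotators, the newly labelled samples added to the training set, and the model retrained before the next round of selection.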
format Online
Article
Text
id pubmed-7941989
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7941989 2021-03-10 Deep active learning for classifying cancer pathology reports De Angeli, Kevin Gao, Shang Alawad, Mohammed Yoon, Hong-Jun Schaefferkoetter, Noah Wu, Xiao-Cheng Durbin, Eric B. Doherty, Jennifer Stroup, Antoinette Coyle, Linda Penberthy, Lynne Tourassi, Georgia BMC Bioinformatics Research Article BioMed Central 2021-03-09 /pmc/articles/PMC7941989/ /pubmed/33750288 http://dx.doi.org/10.1186/s12859-021-04047-1 Text en © The Author(s) 2021 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Deep active learning for classifying cancer pathology reports
topic Research Article