
Screening PubMed abstracts: is class imbalance always a challenge to machine learning?

BACKGROUND: The growing number of medical literature and textual data in online repositories led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques and data preprocessing for class imb...


Bibliographic Details
Main Authors:	Lanera, Corrado; Berchialla, Paola; Sharma, Abhinav; Minto, Clara; Gregori, Dario; Baldi, Ileana
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Methodology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6896747/ https://www.ncbi.nlm.nih.gov/pubmed/31810495 http://dx.doi.org/10.1186/s13643-019-1245-8

_version_	1783476849234411520
author	Lanera, Corrado; Berchialla, Paola; Sharma, Abhinav; Minto, Clara; Gregori, Dario; Baldi, Ileana
author_facet	Lanera, Corrado; Berchialla, Paola; Sharma, Abhinav; Minto, Clara; Gregori, Dario; Baldi, Ileana
author_sort	Lanera, Corrado
collection	PubMed
description	BACKGROUND: The growing number of medical literature and textual data in online repositories led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques and data preprocessing for class imbalance to identify the outperforming strategy to screen articles in PubMed for inclusion in systematic reviews. METHODS: We trained four binary text classifiers (support vector machines, k-nearest neighbor, random forest, and elastic-net regularized generalized linear models) in combination with four techniques for class imbalance: random undersampling and oversampling with 50:50 and 35:65 positive to negative class ratios and none as a benchmark. We used textual data of 14 systematic reviews as case studies. Difference between cross-validated area under the receiver operating characteristic curve (AUC-ROC) for machine learning techniques with and without preprocessing (delta AUC) was estimated within each systematic review, separately for each classifier. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and strategy. RESULTS: Cross-validated AUC-ROC for machine learning techniques (excluding k-nearest neighbor) without preprocessing was prevalently above 90%. Except for k-nearest neighbor, machine learning techniques achieved the best improvement in conjunction with random oversampling 50:50 and random undersampling 35:65. CONCLUSIONS: Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
format	Online Article Text
id	pubmed-6896747
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-68967472019-12-11 Screening PubMed abstracts: is class imbalance always a challenge to machine learning? Lanera, Corrado Berchialla, Paola Sharma, Abhinav Minto, Clara Gregori, Dario Baldi, Ileana Syst Rev Methodology BACKGROUND: The growing number of medical literature and textual data in online repositories led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques and data preprocessing for class imbalance to identify the outperforming strategy to screen articles in PubMed for inclusion in systematic reviews. METHODS: We trained four binary text classifiers (support vector machines, k-nearest neighbor, random forest, and elastic-net regularized generalized linear models) in combination with four techniques for class imbalance: random undersampling and oversampling with 50:50 and 35:65 positive to negative class ratios and none as a benchmark. We used textual data of 14 systematic reviews as case studies. Difference between cross-validated area under the receiver operating characteristic curve (AUC-ROC) for machine learning techniques with and without preprocessing (delta AUC) was estimated within each systematic review, separately for each classifier. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and strategy. RESULTS: Cross-validated AUC-ROC for machine learning techniques (excluding k-nearest neighbor) without preprocessing was prevalently above 90%. Except for k-nearest neighbor, machine learning techniques achieved the best improvement in conjunction with random oversampling 50:50 and random undersampling 35:65. CONCLUSIONS: Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred. 
BioMed Central 2019-12-06 /pmc/articles/PMC6896747/ /pubmed/31810495 http://dx.doi.org/10.1186/s13643-019-1245-8 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Lanera, Corrado Berchialla, Paola Sharma, Abhinav Minto, Clara Gregori, Dario Baldi, Ileana Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
title	Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
title_full	Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
title_fullStr	Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
title_full_unstemmed	Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
title_short	Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
title_sort	screening pubmed abstracts: is class imbalance always a challenge to machine learning?
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6896747/ https://www.ncbi.nlm.nih.gov/pubmed/31810495 http://dx.doi.org/10.1186/s13643-019-1245-8
work_keys_str_mv	AT laneracorrado screeningpubmedabstractsisclassimbalancealwaysachallengetomachinelearning AT berchiallapaola screeningpubmedabstractsisclassimbalancealwaysachallengetomachinelearning AT sharmaabhinav screeningpubmedabstractsisclassimbalancealwaysachallengetomachinelearning AT mintoclara screeningpubmedabstractsisclassimbalancealwaysachallengetomachinelearning AT gregoridario screeningpubmedabstractsisclassimbalancealwaysachallengetomachinelearning AT baldiileana screeningpubmedabstractsisclassimbalancealwaysachallengetomachinelearning
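
The description field above mentions random undersampling and oversampling to 50:50 and 35:65 positive-to-negative ratios, and fixed-effect pooling of delta AUCs. The following is a minimal Python sketch of those two ideas, not the authors' implementation. The function names, the 0/1 label encoding, the assumption that positives are the minority class, and the use of inverse-variance weights for the fixed-effect model are all assumptions made here for illustration; the record does not state them.

```python
import random


def resample_to_ratio(labels, positive_share, method, seed=0):
    """Return row indices whose positive share is about `positive_share`.

    Hypothetical helper. Assumes labels are 0/1 and that positives (1)
    are the minority class.
    method="under": randomly drop negatives, keep every positive.
    method="over":  randomly duplicate positives, keep every negative.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if method == "under":
        # Number of negatives needed so that pos / (pos + neg) = positive_share.
        n_neg = round(len(pos) * (1 - positive_share) / positive_share)
        neg = rng.sample(neg, min(n_neg, len(neg)))
    elif method == "over":
        # Number of positives needed so that pos / (pos + neg) = positive_share.
        n_pos = round(len(neg) * positive_share / (1 - positive_share))
        extra = max(n_pos - len(pos), 0)
        pos = pos + [rng.choice(pos) for _ in range(extra)]
    else:
        raise ValueError("method must be 'under' or 'over'")
    idx = pos + neg
    rng.shuffle(idx)
    return idx


def pool_fixed_effect(deltas, std_errors):
    """Pool per-review delta AUCs with inverse-variance weights (assumed)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * d for w, d in zip(weights, deltas)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se


# Toy data, invented for illustration: 35 relevant and 965 irrelevant citations.
labels = [1] * 35 + [0] * 965
under_idx = resample_to_ratio(labels, 0.35, "under")
over_idx = resample_to_ratio(labels, 0.50, "over")
pooled, pooled_se = pool_fixed_effect([0.01, 0.03], [0.01, 0.01])
```

In a cross-validated setup, resampling of this kind would normally be applied to the training folds only, so that the AUC-ROC is still estimated on data with the original class distribution.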
