
Automating document classification with distant supervision to increase the efficiency of systematic reviews: A case study on identifying studies with HIV impacts on female sex workers

There remains a limited understanding of the HIV prevention and treatment needs among female sex workers in many parts of the world. Systematic reviews of existing literature can help fill this gap; however, well-done systematic reviews are time-demanding and labor-intensive. Here, we propose an aut...


Bibliographic Details
Main Authors: Li, Xiaoxiao; Zhang, Amy; Al-Zaidy, Rabah; Rao, Amrita; Baral, Stefan; Bao, Le; Giles, C. Lee
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9246134/
https://www.ncbi.nlm.nih.gov/pubmed/35771807
http://dx.doi.org/10.1371/journal.pone.0270034
_version_ 1784738901335736320
author Li, Xiaoxiao
Zhang, Amy
Al-Zaidy, Rabah
Rao, Amrita
Baral, Stefan
Bao, Le
Giles, C. Lee
author_sort Li, Xiaoxiao
collection PubMed
description There remains a limited understanding of the HIV prevention and treatment needs among female sex workers in many parts of the world. Systematic reviews of existing literature can help fill this gap; however, well-done systematic reviews are time-demanding and labor-intensive. Here, we propose an automatic document classification approach to systematic reviews to significantly reduce the effort of reviewing documents and to optimize empiric decision making. We first describe a manual document classification procedure that is used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a cluster analysis-based method, and a random forest approach that utilizes a large set of feature tokens. This approach is used to identify documents studying female sex workers that contain content relevant to either HIV or experienced violence. We compare the performance of the three classifiers by cross-validation in terms of the area under the receiver operating characteristic curve and the precision-recall plot, and find that the random forest approach reduces the amount of manual reading for our example by 80%; in a sensitivity analysis, we find that even when trained with only 10% of the data, the classifier can still avoid reading 75% of future documents (68% of total) while retaining 80% of relevant documents. In sum, the automated procedure of document classification presented here could improve both the precision and efficiency of systematic reviews and facilitate live reviews, where reviews are updated regularly. We expect to obtain a reasonable classifier by taking 20% of retrieved documents as training samples. The proposed classifier could also be used for more meaningfully assembling literature in other research areas and for rapid document screening on a tight schedule, such as COVID-related work during the crisis.
format Online
Article
Text
id pubmed-9246134
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9246134 2022-07-01 Automating document classification with distant supervision to increase the efficiency of systematic reviews: A case study on identifying studies with HIV impacts on female sex workers Li, Xiaoxiao Zhang, Amy Al-Zaidy, Rabah Rao, Amrita Baral, Stefan Bao, Le Giles, C. Lee PLoS One Research Article There remains a limited understanding of the HIV prevention and treatment needs among female sex workers in many parts of the world. Systematic reviews of existing literature can help fill this gap; however, well-done systematic reviews are time-demanding and labor-intensive. Here, we propose an automatic document classification approach to systematic reviews to significantly reduce the effort of reviewing documents and to optimize empiric decision making. We first describe a manual document classification procedure that is used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a cluster analysis-based method, and a random forest approach that utilizes a large set of feature tokens. This approach is used to identify documents studying female sex workers that contain content relevant to either HIV or experienced violence. We compare the performance of the three classifiers by cross-validation in terms of the area under the receiver operating characteristic curve and the precision-recall plot, and find that the random forest approach reduces the amount of manual reading for our example by 80%; in a sensitivity analysis, we find that even when trained with only 10% of the data, the classifier can still avoid reading 75% of future documents (68% of total) while retaining 80% of relevant documents. In sum, the automated procedure of document classification presented here could improve both the precision and efficiency of systematic reviews and facilitate live reviews, where reviews are updated regularly. We expect to obtain a reasonable classifier by taking 20% of retrieved documents as training samples. The proposed classifier could also be used for more meaningfully assembling literature in other research areas and for rapid document screening on a tight schedule, such as COVID-related work during the crisis. Public Library of Science 2022-06-30 /pmc/articles/PMC9246134/ /pubmed/35771807 http://dx.doi.org/10.1371/journal.pone.0270034 Text en © 2022 Li et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Automating document classification with distant supervision to increase the efficiency of systematic reviews: A case study on identifying studies with HIV impacts on female sex workers
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9246134/
https://www.ncbi.nlm.nih.gov/pubmed/35771807
http://dx.doi.org/10.1371/journal.pone.0270034