Automating document classification with distant supervision to increase the efficiency of systematic reviews: A case study on identifying studies with HIV impacts on female sex workers
There remains a limited understanding of the HIV prevention and treatment needs among female sex workers in many parts of the world. Systematic reviews of existing literature can help fill this gap; however, well-done systematic reviews are time-demanding and labor-intensive. Here, we propose an aut...
Main authors: | Li, Xiaoxiao; Zhang, Amy; Al-Zaidy, Rabah; Rao, Amrita; Baral, Stefan; Bao, Le; Giles, C. Lee |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2022 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9246134/ https://www.ncbi.nlm.nih.gov/pubmed/35771807 http://dx.doi.org/10.1371/journal.pone.0270034 |
_version_ | 1784738901335736320 |
---|---|
author | Li, Xiaoxiao Zhang, Amy Al-Zaidy, Rabah Rao, Amrita Baral, Stefan Bao, Le Giles, C. Lee |
author_sort | Li, Xiaoxiao |
collection | PubMed |
description | There remains a limited understanding of the HIV prevention and treatment needs among female sex workers in many parts of the world. Systematic reviews of existing literature can help fill this gap; however, well-done systematic reviews are time-demanding and labor-intensive. Here, we propose an automatic document classification approach to systematic reviews that significantly reduces the effort of reviewing documents and helps optimize empirical decision making. We first describe a manual document classification procedure that is used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a cluster analysis-based method, and a random forest approach that utilizes a large set of feature tokens. These classifiers are used to identify documents studying female sex workers that contain content relevant to either HIV or experienced violence. We compare the performance of the three classifiers by cross-validation, in terms of the area under the receiver operating characteristic curve and the precision-recall plot, and find that the random forest approach reduces the amount of manual reading for our example by 80%. In a sensitivity analysis, we find that even when trained with only 10% of the data, the classifier can still avoid reading 75% of future documents (68% of the total) while retaining 80% of relevant documents. In sum, the automated document classification procedure presented here could improve both the precision and efficiency of systematic reviews and facilitate live reviews, where reviews are updated regularly. We expect to obtain a reasonable classifier by taking 20% of retrieved documents as training samples. The proposed classifier could also be used for more meaningfully assembling literature in other research areas and for rapid document screening on a tight schedule, such as COVID-related work during the crisis. |
format | Online Article Text |
id | pubmed-9246134 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-9246134 2022-07-01 PLoS One Research Article Public Library of Science 2022-06-30 /pmc/articles/PMC9246134/ /pubmed/35771807 http://dx.doi.org/10.1371/journal.pone.0270034 Text en © 2022 Li et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
title | Automating document classification with distant supervision to increase the efficiency of systematic reviews: A case study on identifying studies with HIV impacts on female sex workers |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9246134/ https://www.ncbi.nlm.nih.gov/pubmed/35771807 http://dx.doi.org/10.1371/journal.pone.0270034 |
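
The workload figures quoted in the description can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' code: the function names `total_reading_saved` and `fraction_to_read`, and the toy scores and labels, are assumptions introduced here. It shows how skipping 75% of the documents that remain after a 10% training sample corresponds to roughly 68% of the whole collection, and how one might measure the share of a ranked list that has to be read to retain a target recall.

```python
import math


def total_reading_saved(train_fraction, skipped_future_fraction):
    """Share of ALL documents that never need manual reading.

    train_fraction: share of documents labelled by hand for training.
    skipped_future_fraction: share of the remaining documents the
        classifier lets reviewers skip.
    """
    future_fraction = 1.0 - train_fraction
    return future_fraction * skipped_future_fraction


def fraction_to_read(scores, labels, target_recall):
    """Fraction of documents to read, in descending score order,
    until target_recall of the relevant documents has been found.

    scores: classifier scores (higher means more likely relevant).
    labels: 1 for relevant, 0 for irrelevant (hypothetical toy data).
    """
    total_relevant = sum(labels)
    # Small epsilon guards against floating-point overshoot in the product.
    needed = math.ceil(target_recall * total_relevant - 1e-9)
    if needed <= 0 or not labels:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    found = 0
    for n_read, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= needed:
            return n_read / len(ranked)
    return 1.0


# Figures from the description: 10% training data, 75% of future documents skipped.
saved = total_reading_saved(0.10, 0.75)
print(f"Saved share of total: {saved:.3f}")  # about 0.675, i.e. roughly 68%

# Toy ranking (made-up values): 10 documents, 5 relevant, target recall 80%.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
print(fraction_to_read(toy_scores, toy_labels, 0.80))
```

In the toy ranking, four of the five relevant documents appear within the first five positions, so half of the list would be read to reach 80% recall; real savings depend entirely on how well the classifier ranks documents.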