
Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation

BACKGROUND: The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad key...

Bibliographic Details
Main authors: Seymour, Emily; Damle, Rohini; Sette, Alessandro; Peters, Bjoern
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3314711/
https://www.ncbi.nlm.nih.gov/pubmed/22182279
http://dx.doi.org/10.1186/1471-2105-12-482
_version_ 1782228134661193728
author Seymour, Emily
Damle, Rohini
Sette, Alessandro
Peters, Bjoern
author_facet Seymour, Emily
Damle, Rohini
Sette, Alessandro
Peters, Bjoern
author_sort Seymour, Emily
collection PubMed
description BACKGROUND: The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed, and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently manually classified into categories corresponding to different disease domains. Over the past four years, we have examined how to further improve this approach in order to enhance classification performance and to reduce the need for manual intervention. RESULTS: Utilizing 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that an SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions, with AUCs of 0.899 and 0.854, respectively. Next, using a non-hierarchical and a hierarchical application of SVM classifiers trained on 22,833 curatable abstracts manually classified into three levels of disease-specific categories, we demonstrated that a hierarchical application of SVM classifiers outperformed non-hierarchical SVM classifiers for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, cost sensitivity functions were developed to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved an overall category prediction accuracy of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively. CONCLUSIONS: A hierarchical application of SVM algorithms with cost-sensitive output weighting enabled high quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema, and the datasets we make available provide large scale real-life benchmark sets for method developers.
format Online
Article
Text
id pubmed-3314711
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-33147112012-03-29 Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation Seymour, Emily Damle, Rohini Sette, Alessandro Peters, Bjoern BMC Bioinformatics Research Article BACKGROUND: The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed, and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently manually classified into categories corresponding to different disease domains. Over the past four years, we have examined how to further improve this approach in order to enhance classification performance and to reduce the need for manual intervention. RESULTS: Utilizing 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that a SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions with an AUC of 0.899 and 0.854, respectively. Next, using a non-hierarchical and a hierarchical application of SVM classifiers trained on 22,833 curatable abstracts manually classified into three levels of disease specific categories we demonstrated that a hierarchical application of SVM classifiers outperformed non-hierarchical SVM classifiers for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, cost sensitivity functions were developed to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved an overall category prediction accuracy of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively. 
CONCLUSIONS: A hierarchical application of SVM algorithms with cost sensitive output weighting enabled high quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema and the datasets we make available provide large scale real-life benchmark sets for method developers. BioMed Central 2011-12-19 /pmc/articles/PMC3314711/ /pubmed/22182279 http://dx.doi.org/10.1186/1471-2105-12-482 Text en Copyright ©2011 Seymour et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Seymour, Emily
Damle, Rohini
Sette, Alessandro
Peters, Bjoern
Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation
title Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation
title_full Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation
title_fullStr Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation
title_full_unstemmed Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation
title_short Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation
title_sort cost sensitive hierarchical document classification to triage pubmed abstracts for manual curation
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3314711/
https://www.ncbi.nlm.nih.gov/pubmed/22182279
http://dx.doi.org/10.1186/1471-2105-12-482
work_keys_str_mv AT seymouremily costsensitivehierarchicaldocumentclassificationtotriagepubmedabstractsformanualcuration
AT damlerohini costsensitivehierarchicaldocumentclassificationtotriagepubmedabstractsformanualcuration
AT settealessandro costsensitivehierarchicaldocumentclassificationtotriagepubmedabstractsformanualcuration
AT petersbjoern costsensitivehierarchicaldocumentclassificationtotriagepubmedabstractsformanualcuration