
Literature classification for semi-automated updating of biological knowledgebases

BACKGROUND: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common typ...


Bibliographic Details
Main authors: Olsen, Lars Rønn; Johan Kudahl, Ulrich; Winther, Ole; Brusic, Vladimir
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852072/
https://www.ncbi.nlm.nih.gov/pubmed/24564403
http://dx.doi.org/10.1186/1471-2164-14-S5-S14
author Olsen, Lars Rønn
Johan Kudahl, Ulrich
Winther, Ole
Brusic, Vladimir
collection PubMed
description BACKGROUND: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. RESULTS: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. CONCLUSION: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.
format Online
Article
Text
id pubmed-3852072
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3852072 2013-12-20 Literature classification for semi-automated updating of biological knowledgebases Olsen, Lars Rønn Johan Kudahl, Ulrich Winther, Ole Brusic, Vladimir BMC Genomics Research BioMed Central 2013-10-16 /pmc/articles/PMC3852072/ /pubmed/24564403 http://dx.doi.org/10.1186/1471-2164-14-S5-S14 Text en Copyright © 2013 Olsen et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Literature classification for semi-automated updating of biological knowledgebases
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852072/
https://www.ncbi.nlm.nih.gov/pubmed/24564403
http://dx.doi.org/10.1186/1471-2164-14-S5-S14