eGenPub, a text mining system for extending computationally mapped bibliography for UniProt Knowledgebase by capturing centrality

UniProt Knowledgebase (UniProtKB) is a publicly available database with access to a vast amount of protein sequence and functional information. To widen the scope of the publications associated with a protein entry, UniProt has introduced the computationally mapped additional bibliography section, w...

Bibliographic Details
Main Authors: Ding, Ruoyao, Boutet, Emmanuel, Lieberherr, Damien, Schneider, Michel, Tognolli, Michael, Wu, Cathy H, Vijay-Shanker, K, Arighi, Cecilia N
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5691349/
https://www.ncbi.nlm.nih.gov/pubmed/29220476
http://dx.doi.org/10.1093/database/bax081
_version_ 1783279773973217280
author Ding, Ruoyao
Boutet, Emmanuel
Lieberherr, Damien
Schneider, Michel
Tognolli, Michael
Wu, Cathy H
Vijay-Shanker, K
Arighi, Cecilia N
author_facet Ding, Ruoyao
Boutet, Emmanuel
Lieberherr, Damien
Schneider, Michel
Tognolli, Michael
Wu, Cathy H
Vijay-Shanker, K
Arighi, Cecilia N
author_sort Ding, Ruoyao
collection PubMed
description UniProt Knowledgebase (UniProtKB) is a publicly available database with access to a vast amount of protein sequence and functional information. To widen the scope of the publications associated with a protein entry, UniProt has introduced the computationally mapped additional bibliography section, which includes literature collected from external sources. In this article, we describe a text mining system, eGenPub, which selects articles that are ‘about’ specific proteins and allows automatic identification of additional bibliography for given UniProt protein entries. Focusing on plant proteins initially, eGenPub utilizes a gene normalization tool called pGenN, and a trained support vector machine model, which achieves a precision of 95.3%, to predict whether an article, based on its abstract, should be linked to a given UniProt entry. We have conducted full-scale PubMed processing using eGenPub for eight common plant species. Altogether, 9025 articles are identified as relevant bibliography for 4752 UniProt entries, among which 5252 are additional papers not in the existing publication section. This newly computationally mapped additional bibliography from eGenPub is being integrated into the UniProt production pipeline and can be accessed via the UniProtKB protein entry publication view.
format Online
Article
Text
id pubmed-5691349
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-56913492017-12-11 eGenPub, a text mining system for extending computationally mapped bibliography for UniProt Knowledgebase by capturing centrality Ding, Ruoyao Boutet, Emmanuel Lieberherr, Damien Schneider, Michel Tognolli, Michael Wu, Cathy H Vijay-Shanker, K Arighi, Cecilia N Database (Oxford) Original Article UniProt Knowledgebase (UniProtKB) is a publicly available database with access to a vast amount of protein sequence and functional information. To widen the scope of the publications associated with a protein entry, UniProt has introduced the computationally mapped additional bibliography section, which includes literature collected from external sources. In this article, we describe a text mining system, eGenPub, which selects articles that are ‘about’ specific proteins and allows automatic identification of additional bibliography for given UniProt protein entries. Focusing on plant proteins initially, eGenPub utilizes a gene normalization tool called pGenN, and a trained support vector machine model, which achieves a precision of 95.3%, to predict whether an article, based on its abstract, should be linked to a given UniProt entry. We have conducted full-scale PubMed processing using eGenPub for eight common plant species. Altogether, 9025 articles are identified as relevant bibliography for 4752 UniProt entries, among which 5252 are additional papers not in the existing publication section. This newly computationally mapped additional bibliography from eGenPub is being integrated into the UniProt production pipeline and can be accessed via the UniProtKB protein entry publication view. Oxford University Press 2017-11-13 /pmc/articles/PMC5691349/ /pubmed/29220476 http://dx.doi.org/10.1093/database/bax081 Text en © The Author(s) 2017. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Ding, Ruoyao
Boutet, Emmanuel
Lieberherr, Damien
Schneider, Michel
Tognolli, Michael
Wu, Cathy H
Vijay-Shanker, K
Arighi, Cecilia N
eGenPub, a text mining system for extending computationally mapped bibliography for UniProt Knowledgebase by capturing centrality
title eGenPub, a text mining system for extending computationally mapped bibliography for UniProt Knowledgebase by capturing centrality
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5691349/
https://www.ncbi.nlm.nih.gov/pubmed/29220476
http://dx.doi.org/10.1093/database/bax081