
pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature

BACKGROUND: Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation i...


Bibliographic Details
Main authors: Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4530884/
https://www.ncbi.nlm.nih.gov/pubmed/26258475
http://dx.doi.org/10.1371/journal.pone.0135305
author Ding, Ruoyao
Arighi, Cecilia N.
Lee, Jung-Youn
Wu, Cathy H.
Vijay-Shanker, K.
collection PubMed
description BACKGROUND: Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. METHODS: In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these phases. RESULTS: We evaluated the performance of pGenN on an in-house, expertly annotated corpus consisting of 104 plant-relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).
format Online
Article
Text
id pubmed-4530884
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4530884 2015-08-24 pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature Ding, Ruoyao Arighi, Cecilia N. Lee, Jung-Youn Wu, Cathy H. Vijay-Shanker, K. PLoS One Research Article Public Library of Science 2015-08-10 /pmc/articles/PMC4530884/ /pubmed/26258475 http://dx.doi.org/10.1371/journal.pone.0135305 Text en © 2015 Ding et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4530884/
https://www.ncbi.nlm.nih.gov/pubmed/26258475
http://dx.doi.org/10.1371/journal.pone.0135305
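Note on the reported scores: the abstract gives an F-value together with precision and recall. As a minimal sketch, assuming the F-value is the usual balanced F-measure (the harmonic mean of precision and recall, which this record does not state explicitly), the relationship can be checked as follows. The function name `f_measure` and its `beta` parameter are illustrative and are not taken from pGenN.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (balanced F1 when beta == 1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both inputs are zero; by convention the F-measure is 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Values as printed in the abstract (already rounded to one decimal place).
reported_precision = 0.909
reported_recall = 0.872

f1 = f_measure(reported_precision, reported_recall)
print(f"F1 recomputed from the rounded precision and recall: {f1:.1%}")
```

Recomputing from the rounded figures gives roughly 89.0%, a tenth of a point above the reported 88.9%. A gap of that size is plausibly a rounding effect in the published precision and recall, although this record alone cannot confirm that.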