A simple approach for protein name identification: prospects and limits

BACKGROUND: Significant parts of biological knowledge are available only as unstructured text in articles of biomedical journals. By automatically identifying gene and gene product (protein) names and mapping these to unique database identifiers, it becomes possible to extract and integrate informat...

Bibliographic Details
Main authors:	Fundel, Katrin; Güttler, Daniel; Zimmer, Ralf; Apostolakis, Joannis
Format:	Text
Language:	English
Published:	BioMed Central 2005
Subjects:	Report
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1869007/ https://www.ncbi.nlm.nih.gov/pubmed/15960827 http://dx.doi.org/10.1186/1471-2105-6-S1-S15

_version_	1782133425745952768
author	Fundel, Katrin Güttler, Daniel Zimmer, Ralf Apostolakis, Joannis
author_facet	Fundel, Katrin Güttler, Daniel Zimmer, Ralf Apostolakis, Joannis
author_sort	Fundel, Katrin
collection	PubMed
description	BACKGROUND: Significant parts of biological knowledge are available only as unstructured text in articles of biomedical journals. By automatically identifying gene and gene product (protein) names and mapping these to unique database identifiers, it becomes possible to extract and integrate information from articles and various data sources. We present a simple and efficient approach that identifies gene and protein names in texts and returns database identifiers for matches. It has been evaluated in the recent BioCreAtIvE entity extraction and mention normalization task by an independent jury. METHODS: Our approach is based on the use of synonym lists that map the unique database identifiers for each gene/protein to the different synonym names. For yeast and mouse, synonym lists were used as provided by the organizers, who generated them from public model organism databases. The synonym list for fly was generated directly from the corresponding organism database. The lists were then extensively curated in a largely automated procedure and matched against MEDLINE abstracts by exact text matching. Rule-based and support vector machine-based post filters were designed and applied to improve precision. RESULTS: Our procedure showed high recall and precision with F-measures of 0.897 for yeast and 0.764/0.773 for mouse in the BioCreAtIvE assessment (Task 1B) and 0.768 for fly in a post-evaluation. CONCLUSION: The results were close to the best over all submissions. Depending on the synonym properties, it can be crucial to consider context and to filter out erroneous matches. This is especially important for fly, which has a very challenging nomenclature for the protein name identification task. Here, the support vector machine-based post filter proved to be very effective.
format	Text
id	pubmed-1869007
institution	National Center for Biotechnology Information
language	English
publishDate	2005
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1869007 2007-05-18 BMC Bioinformatics Report BioMed Central 2005-05-24 /pmc/articles/PMC1869007/ /pubmed/15960827 http://dx.doi.org/10.1186/1471-2105-6-S1-S15 Text en Copyright © 2005 Fundel et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A simple approach for protein name identification: prospects and limits
title_sort	simple approach for protein name identification: prospects and limits
topic	Report
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1869007/ https://www.ncbi.nlm.nih.gov/pubmed/15960827 http://dx.doi.org/10.1186/1471-2105-6-S1-S15