
Extraction of human kinase mutations from literature, databases and genotyping studies

Bibliographic Details
Main authors:	Krallinger, Martin; Izarzugaza, Jose MG; Rodriguez-Penagos, Carlos; Valencia, Alfonso
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2745582/ https://www.ncbi.nlm.nih.gov/pubmed/19758464 http://dx.doi.org/10.1186/1471-2105-10-S8-S1

_version_	1782171977336750080
author	Krallinger, Martin; Izarzugaza, Jose MG; Rodriguez-Penagos, Carlos; Valencia, Alfonso
author_sort	Krallinger, Martin
collection	PubMed
description	BACKGROUND: There is considerable interest in characterizing the biological role of specific protein residue substitutions through mutagenesis experiments. Additionally, recent efforts to detect disease-associated SNPs have motivated both the manual annotation and the automatic extraction of naturally occurring sequence variations from the literature, especially for protein families that play a significant role in signaling processes, such as kinases. Systematic integration and comparison of kinase mutation information from multiple sources, covering literature, manually annotated databases and large-scale experiments, can result in a more comprehensive view of the functional, structural and disease-associated aspects of protein sequence variants. Previously published mutation extraction approaches did not sufficiently distinguish between two fundamentally different categories of variation origin, namely naturally occurring mutations and induced mutations generated through in vitro experiments.
RESULTS: We present a literature mining pipeline for the automatic extraction and disambiguation of single-point mutation mentions from both abstracts and full-text articles, followed by a sequence validation check to link mutations to their corresponding kinase protein sequences. Each mutation is scored according to whether it corresponds to an induced mutation or a natural sequence variant. We were able to provide direct literature links for a considerable fraction of previously annotated kinase mutations, thus enabling more efficient interpretation of their biological characterization and experimental context. To test the capabilities of the presented pipeline, the mutations in the protein kinase domain of the kinase family were analyzed. Using our literature extraction system, we recovered a total of 643 mutation-protein associations from PubMed abstracts and 6,970 from a large collection of full-text articles. When compared to state-of-the-art annotation databases and high-throughput genotyping studies, the mutation mentions extracted from the literature overlap to a good extent with the existing knowledge bases, whereas the remaining mentions suggest new mutation records that were not previously annotated in the databases.
CONCLUSION: Using the proposed residue disambiguation and classification approach, we were able to differentiate between natural variant and mutagenesis types of mutations with an accuracy of 93.88%. The resulting system is useful for constructing a Gold Standard set of literature-extracted mutations with minimal manual curation effort by human experts, providing direct pointers to relevant evidence sentences. Our system is able to recover mutations from the literature that are not present in state-of-the-art databases. Manual validation by human experts of a subset of 100 literature-extracted mutations from PubMed abstracts shows that almost three quarters (72%) of the extracted mutations were correct, and more than half of these had not been previously annotated in databases.
format	Text
id	pubmed-2745582
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2745582 2009-09-18 Extraction of human kinase mutations from literature, databases and genotyping studies Krallinger, Martin; Izarzugaza, Jose MG; Rodriguez-Penagos, Carlos; Valencia, Alfonso BMC Bioinformatics Research BioMed Central 2009-08-27 /pmc/articles/PMC2745582/ /pubmed/19758464 http://dx.doi.org/10.1186/1471-2105-10-S8-S1 Text en Copyright © 2009 Krallinger et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Extraction of human kinase mutations from literature, databases and genotyping studies
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2745582/ https://www.ncbi.nlm.nih.gov/pubmed/19758464 http://dx.doi.org/10.1186/1471-2105-10-S8-S1
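
The description in this record mentions two steps that a short example can make concrete: extracting single-point mutation mentions from text, and checking each mention against a protein sequence. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the regular expression covers just the simple one-letter notation (wild-type residue, position, mutant residue), the function names are hypothetical, and the sequence is a made-up toy string rather than a real kinase. The scoring of induced versus natural variants is not modeled.

```python
import re

# One-letter amino acid codes. Assumption: only the simple
# "wild-type residue, position, mutant residue" notation is handled.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])\b")


def extract_mutations(text):
    """Return (wild_type, position, mutant) tuples for each mention in text."""
    return [(wt, int(pos), mt) for wt, pos, mt in MUTATION_RE.findall(text)]


def validate_against_sequence(mutation, sequence):
    """Check that the sequence has the wild-type residue at the 1-based position."""
    wild_type, position, _mutant = mutation
    return 1 <= position <= len(sequence) and sequence[position - 1] == wild_type


# Hypothetical toy sequence and sentence, used only for illustration.
toy_sequence = "MKVLAT"
sentence = "The V3E substitution and the G5D change were tested."

mentions = extract_mutations(sentence)
validated = [m for m in mentions if validate_against_sequence(m, toy_sequence)]
```

Here `mentions` holds both candidate mutations, while `validated` keeps only the one whose wild-type residue matches the toy sequence at the stated position. A check of this kind is one plausible way to link a mention to a specific protein sequence and to discard mentions that belong to a different protein or numbering scheme.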
