
FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining

BACKGROUND: For automated reading of scientific publications to extract useful information about molecular mechanisms it is critical that genes, proteins and other entities be correctly associated with uniform identifiers, a process known as named entity linking or “grounding.” Correct grounding is...


Bibliographic Details
Main Authors:	Bachman, John A., Gyori, Benjamin M., Sorger, Peter K.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Database
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022344/ https://www.ncbi.nlm.nih.gov/pubmed/29954318 http://dx.doi.org/10.1186/s12859-018-2211-5

author	Bachman, John A. Gyori, Benjamin M. Sorger, Peter K.
collection	PubMed
description	BACKGROUND: For automated reading of scientific publications to extract useful information about molecular mechanisms it is critical that genes, proteins and other entities be correctly associated with uniform identifiers, a process known as named entity linking or “grounding.” Correct grounding is essential for resolving relationships among mined information, curated interaction databases, and biological datasets. The accuracy of this process is largely dependent on the availability of machine-readable resources associating synonyms and abbreviations commonly found in biomedical literature with uniform identifiers. RESULTS: In a task involving automated reading of ∼215,000 articles using the REACH event extraction software we found that grounding was disproportionately inaccurate for multi-protein families (e.g., “AKT”) and complexes with multiple subunits (e.g., “NF-κB”). To address this problem we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ∼54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15% to 71%). The hierarchical organization of entities in FamPlex also made it possible to integrate otherwise unconnected mechanistic information across families, subfamilies, and individual proteins. Applications of FamPlex to the TRIPS/DRUM reading system and the BioCreative VI Bioentity Normalization Task dataset demonstrated the utility of FamPlex in other settings. CONCLUSION: FamPlex is an effective resource for improving named entity recognition, grounding, and relationship resolution in automated reading of biomedical text. The content in FamPlex is available in both tabular and Open Biomedical Ontology formats at https://github.com/sorgerlab/famplex under the Creative Commons CC0 license and has been integrated into the TRIPS/DRUM and REACH reading systems.
format	Online Article Text
id	pubmed-6022344
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6022344 2018-07-09 Bachman, John A. Gyori, Benjamin M. Sorger, Peter K. BMC Bioinformatics Database BioMed Central 2018-06-28 /pmc/articles/PMC6022344/ /pubmed/29954318 http://dx.doi.org/10.1186/s12859-018-2211-5 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining
topic	Database
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022344/ https://www.ncbi.nlm.nih.gov/pubmed/29954318 http://dx.doi.org/10.1186/s12859-018-2211-5
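
The abstract describes two ideas that a small example may make more concrete: grounding text strings to uniform identifiers, and multi-level, hierarchical membership of genes in families. The following is a minimal Python sketch under stated assumptions, not the FamPlex implementation or its file format. The entries in `GROUNDING_MAP` and `RELATIONS`, the `FPLX` and `HGNC` namespace labels as used here, and the helper names `build_children`, `gene_members` and `ground` are all hypothetical toy choices made for illustration.

```python
from collections import defaultdict

# Toy grounding map (illustrative, not copied from FamPlex):
# text string as found in literature -> (namespace, identifier).
GROUNDING_MAP = {
    "AKT": ("FPLX", "AKT"),
    "Akt1": ("HGNC", "AKT1"),
    "ERK": ("FPLX", "ERK"),
    "MAPK": ("FPLX", "MAPK"),
}

# Toy child -> parent relations (illustrative). A family may contain
# genes or other families, which gives multi-level membership.
RELATIONS = [
    (("HGNC", "AKT1"), "isa", ("FPLX", "AKT")),
    (("HGNC", "AKT2"), "isa", ("FPLX", "AKT")),
    (("HGNC", "AKT3"), "isa", ("FPLX", "AKT")),
    (("HGNC", "MAPK1"), "isa", ("FPLX", "ERK")),
    (("HGNC", "MAPK3"), "isa", ("FPLX", "ERK")),
    (("FPLX", "ERK"), "isa", ("FPLX", "MAPK")),
]


def build_children(relations):
    """Index the relations as parent -> set of direct children."""
    children = defaultdict(set)
    for child, _relation, parent in relations:
        children[parent].add(child)
    return children


def gene_members(entity, children):
    """Return the leaf (gene-level) descendants of an entity.

    An entity with no children is treated as its own only member.
    Assumes the relations contain no cycles.
    """
    direct = children.get(entity)
    if not direct:
        return {entity}
    genes = set()
    for child in direct:
        genes |= gene_members(child, children)
    return genes


def ground(text):
    """Look up a text string; return None if it is not in the map."""
    return GROUNDING_MAP.get(text)


if __name__ == "__main__":
    children = build_children(RELATIONS)
    entity = ground("MAPK")
    print(entity, sorted(gene_members(entity, children)))
```

In this sketch, a statement extracted about the family-level entity can be related to statements about its subfamilies and individual genes by walking the same relation table, which is the kind of integration across levels that the abstract attributes to the hierarchical organization.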

