Identifying Human Phenotype Terms by Combining Machine Learning and Validation Rules

Named-Entity Recognition is commonly used to identify biological entities such as proteins, genes, and chemical compounds found in scientific articles. The Human Phenotype Ontology (HPO) is an ontology that provides a standardized vocabulary for phenotypic abnormalities found in human diseases. This...

Bibliographic Details
Main authors: Lobo, Manuel, Lamurias, Andre, Couto, Francisco M.
Format: Online Article Text
Language: English
Published: Hindawi 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5700471/
https://www.ncbi.nlm.nih.gov/pubmed/29250549
http://dx.doi.org/10.1155/2017/8565739
_version_ 1783281124855775232
author Lobo, Manuel
Lamurias, Andre
Couto, Francisco M.
collection PubMed
description Named-Entity Recognition is commonly used to identify biological entities such as proteins, genes, and chemical compounds found in scientific articles. The Human Phenotype Ontology (HPO) is an ontology that provides a standardized vocabulary for phenotypic abnormalities found in human diseases. This article presents the Identifying Human Phenotypes (IHP) system, tuned to recognize HPO entities in unstructured text. IHP uses Stanford CoreNLP for text processing and applies Conditional Random Fields trained with a rich feature set, which includes linguistic, orthographic, morphologic, lexical, and context features created for the machine learning-based classifier. However, the main novelty of IHP is its validation step based on a set of carefully crafted manual rules, such as the negative connotation analysis, that combined with a dictionary can filter incorrectly identified entities, find missed entities, and combine adjacent entities. The performance of IHP was evaluated using the recently published HPO Gold Standardized Corpora (GSC), where the system Bio-LarK CR obtained the best F-measure of 0.56. IHP achieved an F-measure of 0.65 on the GSC. Due to inconsistencies found in the GSC, an extended version of the GSC was created, adding 881 entities and modifying 4 entities. IHP achieved an F-measure of 0.863 on the new GSC.
format Online
Article
Text
id pubmed-5700471
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-5700471 2017-12-17 Lobo, Manuel Lamurias, Andre Couto, Francisco M. Biomed Res Int Research Article Hindawi 2017 2017-11-09 /pmc/articles/PMC5700471/ /pubmed/29250549 http://dx.doi.org/10.1155/2017/8565739 Text en Copyright © 2017 Manuel Lobo et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Identifying Human Phenotype Terms by Combining Machine Learning and Validation Rules
topic Research Article
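
The description above says that IHP's validation step combines manual rules with a dictionary to filter incorrectly identified entities, find missed entities, and combine adjacent entities. The following is a minimal Python sketch of that idea only, not the authors' implementation: the names `DICTIONARY`, `REJECT`, `text_of`, and `validate` are hypothetical, the two small term sets are invented for illustration and are not taken from the HPO, and the candidate spans stand in for the output of the machine learning classifier.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # token indices, end-exclusive

# Hypothetical mini-dictionary of phenotype terms (not the real HPO).
DICTIONARY = {"short stature", "hearing loss", "cleft palate"}
# Hypothetical list of generic words that a validation rule would reject.
REJECT = {"patients", "disease"}


def text_of(tokens: List[str], span: Span) -> str:
    """Return the lowercased text covered by a token span."""
    return " ".join(tokens[span[0]:span[1]]).lower()


def validate(tokens: List[str], spans: List[Span]) -> List[Span]:
    """Post-process candidate entity spans with three dictionary-backed rules."""
    # Rule 1: filter candidates whose text is on the reject list.
    kept = [s for s in sorted(spans) if text_of(tokens, s) not in REJECT]

    # Rule 2: combine adjacent candidates when the joined text is a known term.
    merged: List[Span] = []
    for s in kept:
        if (merged and merged[-1][1] == s[0]
                and text_of(tokens, (merged[-1][0], s[1])) in DICTIONARY):
            merged[-1] = (merged[-1][0], s[1])
        else:
            merged.append(s)

    # Rule 3: find missed entities by looking up uncovered n-grams,
    # longest first, in the dictionary.
    covered = {i for a, b in merged for i in range(a, b)}
    max_len = max(len(term.split()) for term in DICTIONARY)
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            if (covered.isdisjoint(range(i, i + n))
                    and text_of(tokens, (i, i + n)) in DICTIONARY):
                merged.append((i, i + n))
                covered.update(range(i, i + n))
    return sorted(merged)


tokens = "Patients show short stature and hearing loss".split()
candidates = [(0, 1), (2, 3), (3, 4)]  # stand-in for classifier output
result = validate(tokens, candidates)
print([text_of(tokens, s) for s in result])
```

In this toy input the candidate "Patients" is filtered out, the adjacent candidates "short" and "stature" are combined into one entity, and "hearing loss" is recovered by dictionary lookup. The rules described in the article, such as the negative connotation analysis, are richer than these three checks.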