
A Hybrid Model for Family History Information Identification and Relation Extraction: Development and Evaluation of an End-to-End Information Extraction System

BACKGROUND: Family history information is important to assess the risk of inherited medical conditions. Natural language processing has the potential to extract this information from unstructured free-text notes to improve patient care and decision making. We describe the end-to-end information extr...


Bibliographic Details
Main Authors:	Kim, Youngjun; Heider, Paul M; Lally, Isabel RH; Meystre, Stéphane M
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2021
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8103307/ https://www.ncbi.nlm.nih.gov/pubmed/33885370 http://dx.doi.org/10.2196/22797

author	Kim, Youngjun; Heider, Paul M; Lally, Isabel RH; Meystre, Stéphane M
author_sort	Kim, Youngjun
collection	PubMed
description	BACKGROUND: Family history information is important to assess the risk of inherited medical conditions. Natural language processing has the potential to extract this information from unstructured free-text notes to improve patient care and decision making. We describe the end-to-end information extraction system the Medical University of South Carolina team developed when participating in the 2019 National Natural Language Processing Clinical Challenge (n2c2)/Open Health Natural Language Processing (OHNLP) shared task. OBJECTIVE: This task involves identifying mentions of family members and observations in electronic health record text notes and recognizing the 2 types of relations (family member-living status relations and family member-observation relations). Our system aims to achieve a high level of performance by integrating heuristics and advanced information extraction methods. Our efforts also include improving the performance of 2 subtasks by exploiting additional labeled data and clinical text-based embedding models. METHODS: We present a hybrid method that combines machine learning and rule-based approaches. We implemented an end-to-end system with multiple information extraction and attribute classification components. For entity identification, we trained bidirectional long short-term memory deep learning models. These models incorporated static word embeddings and context-dependent embeddings. We created a voting ensemble that combined the predictions of all individual models. For relation extraction, we trained 2 relation extraction models. The first model determined the living status of each family member. The second model identified observations associated with each family member. We implemented online gradient descent models to extract related entity pairs. As part of postchallenge efforts, we used the BioCreative/OHNLP 2018 corpus and trained new models with the union of these 2 datasets. We also pretrained language models using clinical notes from the Medical Information Mart for Intensive Care (MIMIC-III) clinical database. RESULTS: The voting ensemble achieved better performance than individual classifiers. In the entity identification task, our top-performing system reached a precision of 78.90% and a recall of 83.84%. Our natural language processing system for entity identification took 3rd place out of 17 teams in the challenge. We ranked 4th out of 9 teams in the relation extraction task. Our system substantially benefited from the combination of the 2 datasets. Compared to our official submission with F1 scores of 81.30% and 64.94% for entity identification and relation extraction, respectively, the revised system yielded significantly better performance (P<.05) with F1 scores of 86.02% and 72.48%, respectively. CONCLUSIONS: We demonstrated that a hybrid model could be used to successfully extract family history information recorded in unstructured free-text notes. In this study, our approach to entity identification as a sequence labeling problem produced satisfactory results. Our postchallenge efforts significantly improved performance by leveraging additional labeled data and using word vector representations learned from large collections of clinical notes.
format	Online Article Text
id	pubmed-8103307
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-8103307 2021-05-12 JMIR Med Inform Original Paper JMIR Publications 2021-04-22 /pmc/articles/PMC8103307/ /pubmed/33885370 http://dx.doi.org/10.2196/22797 Text en ©Youngjun Kim, Paul M Heider, Isabel RH Lally, Stéphane M Meystre. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 22.04.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
title	A Hybrid Model for Family History Information Identification and Relation Extraction: Development and Evaluation of an End-to-End Information Extraction System
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8103307/ https://www.ncbi.nlm.nih.gov/pubmed/33885370 http://dx.doi.org/10.2196/22797
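
The precision, recall, and F1 figures quoted in the description field can be cross-checked with a few lines of arithmetic. The sketch below is not taken from the paper. It assumes that the reported precision (78.90%) and recall (83.84%) belong to the same official entity identification submission as the 81.30% F1 score, and it uses the standard definition of F1 as the harmonic mean of precision and recall. The function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for entity identification.
# Assumption: both numbers describe the official submission.
precision = 0.7890
recall = 0.8384

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 81.30%"
```

Under that assumption, the computed value agrees with the 81.30% F1 score reported for the official entity identification submission.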
