
Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets

OBJECTIVES: Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction—extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standard...


Bibliographic Details
Main Authors:	Vashishth, Shikhar, Newman-Griffis, Denis, Joshi, Rishabh, Dutt, Ritam, Rosé, Carolyn P.
Format:	Online Article Text
Language:	English
Published:	2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8952339/ https://www.ncbi.nlm.nih.gov/pubmed/34390853 http://dx.doi.org/10.1016/j.jbi.2021.103880

author	Vashishth, Shikhar Newman-Griffis, Denis Joshi, Rishabh Dutt, Ritam Rosé, Carolyn P.
author_sort	Vashishth, Shikhar
collection	PubMed
description	OBJECTIVES: Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction: extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types. METHODS: We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the problem of overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention. We present MedType, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WikiMed and PubMedDS, two large-scale datasets for medical entity linking. RESULTS: Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F-1. Further, pretraining MedType on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text. CONCLUSIONS: Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text. We make our source code and novel datasets publicly available to foster reproducible research.
format	Online Article Text
id	pubmed-8952339
institution	National Center for Biotechnology Information
language	English
publishDate	2021
record_format	MEDLINE/PubMed
spelling	pubmed-8952339 2022-03-25 Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets Vashishth, Shikhar Newman-Griffis, Denis Joshi, Rishabh Dutt, Ritam Rosé, Carolyn P. J Biomed Inform Article 2021-09 2021-08-12 /pmc/articles/PMC8952339/ /pubmed/34390853 http://dx.doi.org/10.1016/j.jbi.2021.103880 Text en https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title	Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets
topic	Article
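
The candidate filtering step described in the abstract (generate candidates, drop those whose semantic type does not match the predicted type of the mention, then pick the best remaining one) can be illustrated with a minimal Python sketch. This is not the MedType implementation: the `Candidate` class, the concept identifiers, the type labels, the scores, and the fallback to the unfiltered list when no candidate survives are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Candidate:
    concept_id: str                  # placeholder ID, not a real vocabulary code
    name: str
    semantic_types: FrozenSet[str]   # hypothetical type labels
    score: float                     # score assigned by the candidate generator


def filter_by_type(candidates: List[Candidate],
                   predicted_types: FrozenSet[str]) -> List[Candidate]:
    """Keep candidates that share at least one type with the prediction.

    Assumption: if nothing survives, fall back to the unfiltered list so
    that a wrong type prediction cannot remove every candidate.
    """
    kept = [c for c in candidates if c.semantic_types & predicted_types]
    return kept or list(candidates)


def link(candidates: List[Candidate],
         predicted_types: FrozenSet[str]) -> Optional[Candidate]:
    """Pick the highest-scoring candidate after semantic type filtering."""
    pool = filter_by_type(candidates, predicted_types)
    return max(pool, key=lambda c: c.score) if pool else None


# Toy candidates for the mention "cold" in a sentence about an infection.
candidates = [
    Candidate("C-0001", "cold (temperature)", frozenset({"Phenomenon"}), 0.9),
    Candidate("C-0002", "common cold", frozenset({"Disease"}), 0.8),
]

best = link(candidates, frozenset({"Disease"}))
print(best.name)  # common cold
```

Without the filter, the higher-scoring but irrelevant "cold (temperature)" candidate would be selected; with it, only candidates of the predicted type compete in the final selection step.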
