
Extracting medication information from unstructured public health data: a demonstration on data from population-based and tertiary-based samples

BACKGROUND: Unstructured data from clinical epidemiological studies can be valuable and easy to obtain. However, it requires further extraction and processing for data analysis. Doing this manually is labor-intensive, slow and subject to error. In this study, we propose an automation framework for e...


Bibliographic Details
Main Authors:	Chen, Robert; Ho, Joyce C.; Lin, Jin-Mann S.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7559204/ https://www.ncbi.nlm.nih.gov/pubmed/33059588 http://dx.doi.org/10.1186/s12874-020-01131-7

author	Chen, Robert; Ho, Joyce C.; Lin, Jin-Mann S.
collection	PubMed
description	BACKGROUND: Unstructured data from clinical epidemiological studies can be valuable and easy to obtain. However, it requires further extraction and processing before it can be analyzed. Doing this manually is labor-intensive, slow and subject to error. In this study, we propose an automation framework for extracting and processing unstructured data.
METHODS: The proposed automation framework consisted of two natural language processing (NLP) based tools for unstructured text data, one for medications and one for reasons for medication use. We first checked spelling using a spell-check program trained on publicly available knowledge sources and then applied NLP techniques. We mapped medication names into generic names using vocabulary from publicly available knowledge sources. We used WHO’s Anatomical Therapeutic Chemical (ATC) classification system to map generic medication names to medication classes. We processed the reasons for medication use with the Lancaster stemmer method and then grouped and mapped them to disease classes based on organ systems. Finally, we demonstrated this automation framework on two data sources for Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS): tertiary-based (n = 378) and population-based (n = 664) samples.
RESULTS: A total of 8681 raw medication records were used for this demonstration. The 1266 distinct medication names (omitting supplements) were condensed to 89 ATC classification system categories. The 1432 distinct raw reasons for medication use were condensed to 65 categories via NLP. Compared to completion of the entire process manually, our automation process reduced the number of terms requiring manual labor for mapping by 84.4% for medications and 59.4% for reasons for medication use. Additionally, this process improved the precision of the mapped results.
CONCLUSIONS: Our automation framework demonstrates the usefulness of NLP strategies even when there is no established mapping database. For a less established database (e.g., reasons for medication use), the method is easily modifiable as new knowledge sources for mapping are introduced. The capability to condense large features into interpretable ones will be valuable for subsequent analytical studies involving techniques such as machine learning and data mining.
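
The medication branch described under METHODS (spell-check, map to a generic name, map to an ATC class) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-entry vocabularies, the function names `correct_spelling` and `map_medication`, and the similarity cutoff of 0.8 are all assumptions, and fuzzy matching from Python's standard `difflib` module stands in for the trained spell-check program mentioned in the abstract.

```python
from difflib import get_close_matches

# Toy vocabularies, assumed for illustration only; the study drew its
# vocabulary from publicly available knowledge sources.
BRAND_TO_GENERIC = {
    "advil": "ibuprofen",
    "zoloft": "sertraline",
    "tylenol": "acetaminophen",
}
# Generic name -> ATC pharmacological subgroup code.
GENERIC_TO_ATC = {
    "ibuprofen": "M01A",
    "sertraline": "N06A",
    "acetaminophen": "N02B",
}
VOCAB = sorted(set(BRAND_TO_GENERIC) | set(GENERIC_TO_ATC))


def correct_spelling(raw, cutoff=0.8):
    """Return the closest vocabulary term, or None if nothing is close enough."""
    token = raw.strip().lower()
    matches = get_close_matches(token, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def map_medication(raw):
    """Map a raw free-text entry to (generic name, ATC code).

    Returns None when no vocabulary term is close enough, which flags the
    entry for manual review.
    """
    term = correct_spelling(raw)
    if term is None:
        return None
    generic = BRAND_TO_GENERIC.get(term, term)
    return generic, GENERIC_TO_ATC.get(generic)


if __name__ == "__main__":
    for entry in ["Zolft", " ibuprofin ", "Advil", "xyz"]:
        print(repr(entry), "->", map_medication(entry))
```

Entries that return `None` are the ones left for manual mapping. The reasons-for-use branch would follow the same shape, with a stemming step (the abstract names the Lancaster stemmer, which NLTK provides as `nltk.stem.LancasterStemmer`) applied before grouping the stems into disease classes.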
format	Online Article Text
id	pubmed-7559204
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7559204 2020-10-15 Extracting medication information from unstructured public health data: a demonstration on data from population-based and tertiary-based samples. Chen, Robert; Ho, Joyce C.; Lin, Jin-Mann S. BMC Med Res Methodol. Research Article. BioMed Central 2020-10-15 /pmc/articles/PMC7559204/ /pubmed/33059588 http://dx.doi.org/10.1186/s12874-020-01131-7 Text en © The Author(s) 2020. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Extracting medication information from unstructured public health data: a demonstration on data from population-based and tertiary-based samples
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7559204/ https://www.ncbi.nlm.nih.gov/pubmed/33059588 http://dx.doi.org/10.1186/s12874-020-01131-7

