
Extracting Structured Genotype Information from Free-Text HLA Reports Using a Rule-Based Approach

BACKGROUND: Human leukocyte antigen (HLA) typing is important for transplant patients to prevent a severe mismatch reaction, and the result can also support the diagnosis of various diseases or the prediction of drug side effects. However, such secondary applications of HLA typing results are limited bec...


Bibliographic Details
Main authors:	Lee, Kye Hwa; Kim, Hyo Jung; Kim, Yi-Jun; Kim, Ju Han; Song, Eun Young
Format:	Online Article Text
Language:	English
Published:	The Korean Academy of Medical Sciences 2020
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7105511/ https://www.ncbi.nlm.nih.gov/pubmed/32233158 http://dx.doi.org/10.3346/jkms.2020.35.e78

_version_	1783512418753708032
author	Lee, Kye Hwa; Kim, Hyo Jung; Kim, Yi-Jun; Kim, Ju Han; Song, Eun Young
author_sort	Lee, Kye Hwa
collection	PubMed
description	BACKGROUND: Human leukocyte antigen (HLA) typing is important for transplant patients to prevent a severe mismatch reaction, and the result can also support the diagnosis of various diseases or the prediction of drug side effects. However, such secondary applications of HLA typing results are limited because they are typically provided in free-text format or PDFs on electronic medical records. We here propose a method to convert HLA genotype information stored in an unstructured format into a reusable structured format by extracting serotype/allele information. METHODS: We queried HLA typing reports from the clinical data warehouse of Seoul National University Hospital (SUPPREME) from 2000 to 2018 as a rule-development data set (64,024 reports) and from the most recent year (6,181 reports) as a test set. We applied a rule-based natural language processing approach, using Python regular expressions, to extract 1) the number of patients in the report, 2) clinical characteristics such as the indication for HLA testing, and 3) precise HLA genotypes. The performance of the rules and code was evaluated by comparing the results extracted from the test set with a validation set generated by manual curation. RESULTS: Among the 11,287 development-set reports and 1,107 test-set reports describing HLA typing for a single patient, iterative rule generation yielded 124 extraction rules and 8 cleaning rules for HLA genotypes. Application of these rules extracted HLA genotypes with 0.892–0.999 precision and 0.795–0.998 recall for the five HLA genes. The precision and recall of the extraction rules for the number of patients in a report were 0.997 and 0.994, and those for the clinical variable extraction were 0.997 and 0.992, respectively. All extracted HLA alleles and serotypes were transformed according to formal HLA nomenclature by the cleaning rules. CONCLUSION: The rule-based HLA genotype extraction method shows reliable accuracy. We believe that a significant number of patients will benefit when this under-used genetic information is returned to them.
format	Online Article Text
id	pubmed-7105511
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	The Korean Academy of Medical Sciences
record_format	MEDLINE/PubMed
spelling	pubmed-7105511 2020-04-06 Extracting Structured Genotype Information from Free-Text HLA Reports Using a Rule-Based Approach Lee, Kye Hwa; Kim, Hyo Jung; Kim, Yi-Jun; Kim, Ju Han; Song, Eun Young J Korean Med Sci Original Article The Korean Academy of Medical Sciences 2020-02-12 /pmc/articles/PMC7105511/ /pubmed/32233158 http://dx.doi.org/10.3346/jkms.2020.35.e78 Text en © 2020 The Korean Academy of Medical Sciences. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Extracting Structured Genotype Information from Free-Text HLA Reports Using a Rule-Based Approach
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7105511/ https://www.ncbi.nlm.nih.gov/pubmed/32233158 http://dx.doi.org/10.3346/jkms.2020.35.e78
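The abstract in the description field says that Python regular expressions extract HLA genotypes from free-text reports and that cleaning rules convert them to formal nomenclature, but the record does not reproduce any of the 124 extraction or 8 cleaning rules. The following is only a minimal sketch of that general idea with one hypothetical extraction rule and one hypothetical cleaning rule. The locus list, the pattern, and the function names `clean_allele`, `extract_genotypes`, and `precision_recall` are assumptions made for illustration, not the authors' code, and serotype mentions are not handled.

```python
import re

# Assumption: the abstract mentions "five HLA genes" without naming them;
# these five loci are chosen for illustration only.
LOCI = ("DRB1", "DQB1", "A", "B", "C")

# Hypothetical extraction rule: optional "HLA-" prefix, a locus, "*",
# then digit fields written with colons (02:01) or compactly (0201).
ALLELE_RE = re.compile(
    r"(?:HLA-)?\b(?P<locus>" + "|".join(LOCI) + r")\s*\*\s*(?P<fields>\d+(?::\d+)*)",
    re.IGNORECASE,
)


def clean_allele(locus, fields):
    """Hypothetical cleaning rule: rewrite a mention as HLA-<locus>*<f1>:<f2>..."""
    if ":" in fields:
        parts = fields.split(":")
    elif len(fields) % 2 == 0:
        # Compact legacy spelling such as 0201: split into two-digit fields.
        parts = [fields[i:i + 2] for i in range(0, len(fields), 2)]
    else:
        return None  # ambiguous digit run; leave for manual review
    return "HLA-" + locus.upper() + "*" + ":".join(parts)


def extract_genotypes(report):
    """Return a dict mapping each locus to the cleaned alleles found in the text."""
    found = {}
    for match in ALLELE_RE.finditer(report):
        allele = clean_allele(match["locus"], match["fields"])
        if allele is not None:
            found.setdefault(match["locus"].upper(), []).append(allele)
    return found


def precision_recall(extracted, curated):
    """Compare a set of extracted alleles against a manually curated set."""
    true_positives = len(extracted & curated)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up report text, not taken from the study data.
    text = "HLA-A*02:01, A*2402 / B*15:01 b*5801; DRB1*04:05 DRB1*1501"
    print(extract_genotypes(text))
```

A single pattern like this would miss many real-world spellings, which is presumably why the study needed an iteratively developed rule set. The `precision_recall` helper mirrors the evaluation described in the abstract, where extracted results are compared with a manually curated validation set.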
