Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation

Bibliographic Details
Main authors: Li, Shicheng; Deng, Lizong; Zhang, Xu; Chen, Luming; Yang, Tao; Qi, Yifan; Jiang, Taijiao
Format: Online Article Text
Language: English
Published: JMIR Publications 2022
Subjects: Original Paper
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9206202/
https://www.ncbi.nlm.nih.gov/pubmed/35657661
http://dx.doi.org/10.2196/37213
author Li, Shicheng
Deng, Lizong
Zhang, Xu
Chen, Luming
Yang, Tao
Qi, Yifan
Jiang, Taijiao
collection PubMed
description BACKGROUND: Phenotype information in electronic health records (EHRs) is mainly recorded in unstructured free text, which cannot be directly used for clinical research. EHR-based deep-phenotyping methods can structure phenotype information in EHRs with high fidelity, which has made them a focus of medical informatics. However, developing a deep-phenotyping method for non-English EHRs (eg, Chinese EHRs) is challenging: although numerous EHR resources exist in China, the fine-grained annotation data suitable for developing such methods are limited, so any method must work in a low-resource scenario. OBJECTIVE: In this study, we aimed to develop a deep-phenotyping method with good generalization ability for Chinese EHRs based on limited fine-grained annotation data. METHODS: The core of the methodology was to identify linguistic patterns of phenotype descriptions in Chinese EHRs with a sequence motif discovery tool and then perform deep phenotyping of Chinese EHRs by recognizing those patterns in free text. Specifically, 1000 Chinese EHRs were manually annotated based on a fine-grained information model, PhenoSSU (Semantic Structured Unit of Phenotypes). The annotated data set was randomly divided into a training set (n=700, 70%) and a testing set (n=300, 30%). Linguistic patterns were mined in three steps. First, free text in the training set was encoded as single-letter sequences (P: phenotype, A: attribute). Second, a biological sequence analysis tool, MEME (Multiple EM for Motif Elicitation), was used to identify motifs in the single-letter sequences. Finally, the identified motifs were reduced to a series of regular expressions representing linguistic patterns of PhenoSSU instances in Chinese EHRs. Based on the discovered linguistic patterns, we developed a deep-phenotyping method for Chinese EHRs that combines a deep learning–based method for named entity recognition with a pattern recognition–based method for attribute prediction. RESULTS: In total, 51 statistically significant sequence motifs were mined from the 700 Chinese EHRs in the training set and combined into six regular expressions. These six regular expressions could be learned from a mean of 134 (SD 9.7) annotated EHRs in the training set. The deep-phenotyping algorithm recognized PhenoSSU instances with an overall accuracy of 0.844 on the test set. For the subtask of entity recognition, the algorithm achieved an F1 score of 0.898 with a Bidirectional Encoder Representations from Transformers (BERT)–bidirectional long short-term memory (BiLSTM)–conditional random field (CRF) model; for the subtask of attribute prediction, it achieved a weighted accuracy of 0.940 with the linguistic pattern–based method. CONCLUSIONS: We developed a simple but effective strategy to perform deep phenotyping of Chinese EHRs with limited fine-grained annotation data. Our work will promote the secondary use of Chinese EHRs and may inspire similar efforts in other non–English-speaking countries.
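The encoding and pattern-matching steps described in METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the letters P and A come from the abstract, but the "O" letter for all other tokens, the function names `encode` and `find_units`, and the regular expression `A*PA*` are assumptions made for illustration. In particular, the pattern is not one of the six published expressions, and the MEME motif discovery step is not reproduced here.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (text, tag)

# "P" and "A" follow the abstract; "O" for any other token is an assumption.
LETTER = {"phenotype": "P", "attribute": "A"}

def encode(tokens: List[Token]) -> str:
    """Encode tagged tokens as a single-letter sequence, one letter per token."""
    return "".join(LETTER.get(tag, "O") for _, tag in tokens)

# Hypothetical pattern: one phenotype with directly adjacent attributes.
PATTERN = re.compile(r"A*PA*")

def find_units(tokens: List[Token]) -> List[Tuple[str, List[str]]]:
    """Group each phenotype with the attributes the pattern attaches to it."""
    units = []
    for match in PATTERN.finditer(encode(tokens)):
        # Letter positions map one-to-one onto token positions.
        span = tokens[match.start():match.end()]
        phenotype = next(text for text, tag in span if tag == "phenotype")
        attributes = [text for text, tag in span if tag == "attribute"]
        units.append((phenotype, attributes))
    return units

example = [
    ("severe", "attribute"),
    ("cough", "phenotype"),
    ("and", "other"),
    ("fever", "phenotype"),
    ("mild", "attribute"),
]
print(encode(example))      # APOPA
print(find_units(example))  # [('cough', ['severe']), ('fever', ['mild'])]
```

Because each letter stands for exactly one token, a regular-expression match on the letter sequence can be sliced straight back into the token list, which is what makes the single-letter encoding convenient for both motif discovery and attribute assignment.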
format Online
Article
Text
id pubmed-9206202
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-9206202 2022-06-19 Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation Li, Shicheng; Deng, Lizong; Zhang, Xu; Chen, Luming; Yang, Tao; Qi, Yifan; Jiang, Taijiao J Med Internet Res Original Paper JMIR Publications 2022-06-03 /pmc/articles/PMC9206202/ /pubmed/35657661 http://dx.doi.org/10.2196/37213 Text en ©Shicheng Li, Lizong Deng, Xu Zhang, Luming Chen, Tao Yang, Yifan Qi, Taijiao Jiang. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 03.06.2022.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
title Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9206202/
https://www.ncbi.nlm.nih.gov/pubmed/35657661
http://dx.doi.org/10.2196/37213