Efficient and Accurate Extracting of Unstructured EHRs on Cancer Therapy Responses for the Development of RECIST Natural Language Processing Tools: Part I, the Corpus

PURPOSE: Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous, and the data are crude, heterogeneous, incomplete, and largely unstructured, presenting challenges to effective analyses for timely, reliable results. Particularly, resea...

Bibliographic Details
Main authors: Li, Yalun; Luo, Yung-Hung; Wampfler, Jason A.; Rubinstein, Samuel M.; Tiryaki, Firat; Ashok, Kumar; Warner, Jeremy L.; Xu, Hua; Yang, Ping
Format: Online Article Text
Language: English
Published: American Society of Clinical Oncology 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7265793/
https://www.ncbi.nlm.nih.gov/pubmed/32364754
http://dx.doi.org/10.1200/CCI.19.00147
_version_ 1783541190549831680
author Li, Yalun
Luo, Yung-Hung
Wampfler, Jason A.
Rubinstein, Samuel M.
Tiryaki, Firat
Ashok, Kumar
Warner, Jeremy L.
Xu, Hua
Yang, Ping
author_sort Li, Yalun
collection PubMed
description PURPOSE: Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous, and the data are crude, heterogeneous, incomplete, and largely unstructured, which poses challenges for analyses that must deliver timely, reliable results. In particular, research on clinical notes relevant to patient care and outcomes has seldom been conducted because data extraction and accurate annotation are complex. RECIST is a set of widely accepted research criteria to evaluate tumor response in patients undergoing antineoplastic therapy. The aim of this study was to identify textual sources for RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for the development of natural language processing tools. METHODS: We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic’s EHRs from 622 randomly selected patients who had signed a research authorization. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, time-interval filters, and note-type filters to identify RECIST information, and we established a gold standard data set for patient outcome research. RESULTS: Keywords reduced the clinical notes to 37,406; restricting to four note types within 12 months postdiagnosis further reduced the number to 5,005 notes, which were manually annotated and covered 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation) contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors.
CONCLUSION: We have established a gold standard data set to accommodate development of biomedical informatics tools in accelerating research into antineoplastic therapeutic response.
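The filtering cascade described in the abstract (keywords, then note type, then a 12-month postdiagnosis window) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword list, the four note-type names, the Note fields, and the helper names are all hypothetical, because the abstract does not state them, and the 12-month window is approximated as 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical values: the abstract names neither the keywords nor the note types.
KEYWORDS = ("partial response", "complete response", "stable disease", "progression")
NOTE_TYPES = {"progress note", "oncology consult", "radiology report", "hospital summary"}
WINDOW = timedelta(days=365)  # assumption: "12 months postdiagnosis" taken as 365 days


@dataclass(frozen=True)
class Note:
    patient_id: str
    note_type: str
    note_date: date
    text: str


def has_keyword(note: Note) -> bool:
    """Case-insensitive substring match against the predefined keywords."""
    lowered = note.text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def in_window(note: Note, diagnosis_date: date) -> bool:
    """True when the note falls between diagnosis and diagnosis + WINDOW."""
    return diagnosis_date <= note.note_date <= diagnosis_date + WINDOW


def filter_notes(notes, diagnosis_dates):
    """Keep notes that pass the keyword, note-type, and time-interval filters."""
    kept = []
    for note in notes:
        diagnosis_date = diagnosis_dates.get(note.patient_id)
        if diagnosis_date is None:
            continue  # no diagnosis date on file, so the window cannot be evaluated
        if (has_keyword(note)
                and note.note_type in NOTE_TYPES
                and in_window(note, diagnosis_date)):
            kept.append(note)
    return kept


def case_coverage(kept, all_patient_ids):
    """Fraction of patients with at least one note surviving the filters."""
    covered = {note.patient_id for note in kept} & set(all_patient_ids)
    return len(covered) / len(all_patient_ids)


# Toy data, invented for illustration only.
diagnosis_dates = {"p1": date(2015, 1, 10), "p2": date(2016, 3, 1)}
notes = [
    Note("p1", "progress note", date(2015, 4, 2), "CT shows partial response to carboplatin."),
    Note("p1", "nursing note", date(2015, 4, 3), "Partial response discussed with patient."),
    Note("p2", "progress note", date(2018, 1, 1), "Disease progression noted."),
    Note("p2", "radiology report", date(2016, 5, 1), "No acute findings."),
]
kept = filter_notes(notes, diagnosis_dates)
coverage = case_coverage(kept, ["p1", "p2"])
```

In the toy data only the first note survives: the second has an excluded note type, the third falls outside the window, and the fourth contains no keyword. The same coverage calculation applied to the reported counts gives 609 / 622, which rounds to the 97.9% stated in the abstract, and the training and validation cases sum to the full set (503 + 106 = 609).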
format Online
Article
Text
id pubmed-7265793
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher American Society of Clinical Oncology
record_format MEDLINE/PubMed
spelling pubmed-7265793 2021-05-04 JCO Clin Cancer Inform ORIGINAL REPORTS American Society of Clinical Oncology 2020-05-04 /pmc/articles/PMC7265793/ /pubmed/32364754 http://dx.doi.org/10.1200/CCI.19.00147 Text en © 2020 by American Society of Clinical Oncology. Licensed under the Creative Commons Attribution 4.0 License: https://creativecommons.org/licenses/by/4.0/
title Efficient and Accurate Extracting of Unstructured EHRs on Cancer Therapy Responses for the Development of RECIST Natural Language Processing Tools: Part I, the Corpus
topic ORIGINAL REPORTS
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7265793/
https://www.ncbi.nlm.nih.gov/pubmed/32364754
http://dx.doi.org/10.1200/CCI.19.00147