Natural Language Processing for Surveillance of Cervical and Anal Cancer and Precancer: Algorithm Development and Split-Validation Study

Bibliographic Details
Main authors: Oliveira, Carlos R; Niccolai, Patrick; Ortiz, Anette Michelle; Sheth, Sangini S; Shapiro, Eugene D; Niccolai, Linda M; Brandt, Cynthia A
Format: Online Article Text
Language: English
Published: JMIR Publications 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671846/
https://www.ncbi.nlm.nih.gov/pubmed/32469840
http://dx.doi.org/10.2196/20826
author Oliveira, Carlos R
Niccolai, Patrick
Ortiz, Anette Michelle
Sheth, Sangini S
Shapiro, Eugene D
Niccolai, Linda M
Brandt, Cynthia A
collection PubMed
description
BACKGROUND: Accurate identification of new diagnoses of human papillomavirus–associated cancers and precancers is an important step toward the development of strategies that optimize the use of human papillomavirus vaccines. The diagnosis of human papillomavirus cancers hinges on a histopathologic report, which is typically stored in electronic medical records as free-form, or unstructured, narrative text. Previous efforts to perform surveillance for human papillomavirus cancers have relied on the manual review of pathology reports to extract diagnostic information, a process that is both labor- and resource-intensive. Natural language processing can be used to automate the structuring and extraction of clinical data from unstructured narrative text in medical records and may provide a practical and effective method for identifying patients with vaccine-preventable human papillomavirus disease for surveillance and research.
OBJECTIVE: This study's objective was to develop and assess the accuracy of a natural language processing algorithm for the identification of individuals with cancer or precancer of the cervix and anus.
METHODS: A pipeline-based natural language processing algorithm was developed, which incorporated machine learning and rule-based methods to extract diagnostic elements from the narrative pathology reports. To test the algorithm’s classification accuracy, we used a split-validation study design. Full-length cervical and anal pathology reports were randomly selected from 4 clinical pathology laboratories. Two study team members, blinded to the classifications produced by the natural language processing algorithm, manually and independently reviewed all reports and classified them at the document level according to 2 domains (diagnosis and human papillomavirus testing results). Using the manual review as the gold standard, the algorithm’s performance was evaluated using standard measurements of accuracy, recall, precision, and F-measure.
RESULTS: The natural language processing algorithm’s performance was validated on 949 pathology reports. The algorithm demonstrated accurate identification of abnormal cytology, histology, and positive human papillomavirus tests with accuracies greater than 0.91. Precision was lowest for anal histology reports (0.87, 95% CI 0.59-0.98) and highest for cervical cytology (0.98, 95% CI 0.95-0.99). The natural language processing algorithm missed 2 out of the 15 abnormal anal histology reports, which led to a relatively low recall (0.68, 95% CI 0.43-0.87).
CONCLUSIONS: This study outlines the development and validation of a freely available and easily implementable natural language processing algorithm that can automate the extraction and classification of clinical data from cervical and anal cytology and histology.
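The evaluation named in the abstract (accuracy, recall, precision, and F-measure against a manual gold standard) can be illustrated with a minimal Python sketch. This is not the authors' code: the `Metrics` class, the `evaluate` function, and the toy labels are hypothetical, and the sketch assumes each report has been reduced to a single abnormal/normal flag at the document level.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f_measure: float


def evaluate(gold: list[bool], predicted: list[bool]) -> Metrics:
    """Compare algorithm labels with manual-review labels (True = abnormal)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Tally the four cells of the 2x2 confusion matrix.
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    tn = sum((not g) and (not p) for g, p in zip(gold, predicted))

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure here is the balanced F1: harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f_measure)


if __name__ == "__main__":
    # Invented toy labels for illustration only; not data from the study.
    gold = [True, True, True, False, False, False, False, True]
    predicted = [True, True, False, False, False, True, False, True]
    print(evaluate(gold, predicted))
```

With few positive cases, a single missed report moves recall substantially, which is one reason small strata tend to come with wide confidence intervals.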
format Online
Article
Text
id pubmed-7671846
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-7671846 2020-11-20
Natural Language Processing for Surveillance of Cervical and Anal Cancer and Precancer: Algorithm Development and Split-Validation Study
Oliveira, Carlos R; Niccolai, Patrick; Ortiz, Anette Michelle; Sheth, Sangini S; Shapiro, Eugene D; Niccolai, Linda M; Brandt, Cynthia A
JMIR Med Inform
Original Paper
JMIR Publications 2020-11-03
/pmc/articles/PMC7671846/ /pubmed/32469840 http://dx.doi.org/10.2196/20826
Text en
©Carlos R Oliveira, Patrick Niccolai, Anette Michelle Ortiz, Sangini S Sheth, Eugene D Shapiro, Linda M Niccolai, Cynthia A Brandt. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 03.11.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
title Natural Language Processing for Surveillance of Cervical and Anal Cancer and Precancer: Algorithm Development and Split-Validation Study
topic Original Paper