
Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm

OBJECTIVE: Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem. MATERIALS AND METHODS: SCENT employs hiera...


Bibliographic Details
Main authors: Strauss, Justin A; Chao, Chun R; Kwan, Marilyn L; Ahmed, Syed A; Schottinger, Joanne E; Quinn, Virginia P
Format: Online Article Text
Language: English
Published: BMJ Group 2013
Subjects: Research and Applications
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638182/
https://www.ncbi.nlm.nih.gov/pubmed/22822041
http://dx.doi.org/10.1136/amiajnl-2012-000928
author Strauss, Justin A
Chao, Chun R
Kwan, Marilyn L
Ahmed, Syed A
Schottinger, Joanne E
Quinn, Virginia P
collection PubMed
description OBJECTIVE: Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem. MATERIALS AND METHODS: SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report. RESULTS: Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups. DISCUSSION: Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability. CONCLUSION: SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.
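The counts quoted in the abstract can be turned into pooled accuracy measures with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' SAS code: it pools the breast and prostate groups, treats new primary and recurrent cancers as one "malignant" class, and assumes that every missed malignant case was labelled benign. The abstract does not state any of this, so the results are a rough cross-check rather than the per-group figures reported in the paper.

```python
# Counts quoted in the abstract, pooled across both cancer groups.
new_primary_found, new_primary_total = 51, 54
recurrent_found, recurrent_total = 60, 61
false_positives, benign_total = 3, 792

# Collapse new primary and recurrent cases into one "malignant" class.
true_positives = new_primary_found + recurrent_found
malignant_total = new_primary_total + recurrent_total

# ASSUMPTION: every missed malignant case was classified as benign.
false_negatives = malignant_total - true_positives
true_negatives = benign_total - false_positives

metrics = {
    "sensitivity": true_positives / malignant_total,
    "specificity": true_negatives / benign_total,
    "ppv": true_positives / (true_positives + false_positives),
    "npv": true_negatives / (true_negatives + false_negatives),
}

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions all four pooled values come out above 94%, which is consistent with the statement in the abstract.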
format Online
Article
Text
id pubmed-3638182
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BMJ Group
record_format MEDLINE/PubMed
spelling pubmed-3638182 2014-03-08 J Am Med Inform Assoc Research and Applications BMJ Group 2013 2012-08-02 /pmc/articles/PMC3638182/ /pubmed/22822041 http://dx.doi.org/10.1136/amiajnl-2012-000928 Text en Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/3.0/ and http://creativecommons.org/licenses/by-nc/3.0/legalcode
title Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm
topic Research and Applications