Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease

BACKGROUND: Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free text clinical notes. A key r...

Bibliographic Details
Main Authors: South, Brett R, Shen, Shuying, Jones, Makoto, Garvin, Jennifer, Samore, Matthew H, Chapman, Wendy W, Gundlapalli, Adi V
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2745683/
https://www.ncbi.nlm.nih.gov/pubmed/19761566
http://dx.doi.org/10.1186/1471-2105-10-S9-S12
_version_ 1782171985610014720
author South, Brett R
Shen, Shuying
Jones, Makoto
Garvin, Jennifer
Samore, Matthew H
Chapman, Wendy W
Gundlapalli, Adi V
author_facet South, Brett R
Shen, Shuying
Jones, Makoto
Garvin, Jennifer
Samore, Matthew H
Chapman, Wendy W
Gundlapalli, Adi V
author_sort South, Brett R
collection PubMed
description BACKGROUND: Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free text clinical notes. A key required step for IE is the manual annotation of clinical corpora and the creation of a reference standard for (1) training and validation tasks and (2) to focus and clarify NLP system requirements. These tasks are time consuming, expensive, and require considerable effort on the part of human reviewers. METHODS: Using a set of clinical documents from the VA EMR for a particular use case of interest we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute level. We estimate concept yield in terms of annotated concepts within specific note sections and document types. RESULTS: Annotator agreement at the document level for documents that contained concepts of interest for IBD using estimated Kappa statistic (95% CI) was very high at 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and family history and plan note sections produced the lowest concept yield. 
CONCLUSION: Challenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use case specific note types and sections, especially in cases where NLP systems will be used to extract information from large repositories of electronic clinical note documents.
format Text
id pubmed-2745683
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-27456832009-09-18 Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease South, Brett R Shen, Shuying Jones, Makoto Garvin, Jennifer Samore, Matthew H Chapman, Wendy W Gundlapalli, Adi V BMC Bioinformatics Proceedings BACKGROUND: Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free text clinical notes. A key required step for IE is the manual annotation of clinical corpora and the creation of a reference standard for (1) training and validation tasks and (2) to focus and clarify NLP system requirements. These tasks are time consuming, expensive, and require considerable effort on the part of human reviewers. METHODS: Using a set of clinical documents from the VA EMR for a particular use case of interest we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute level. We estimate concept yield in terms of annotated concepts within specific note sections and document types. RESULTS: Annotator agreement at the document level for documents that contained concepts of interest for IBD using estimated Kappa statistic (95% CI) was very high at 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. 
Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and family history and plan note sections produced the lowest concept yield. CONCLUSION: Challenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use case specific note types and sections, especially in cases where NLP systems will be used to extract information from large repositories of electronic clinical note documents. BioMed Central 2009-09-17 /pmc/articles/PMC2745683/ /pubmed/19761566 http://dx.doi.org/10.1186/1471-2105-10-S9-S12 Text en Copyright © 2009 South et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Proceedings
South, Brett R
Shen, Shuying
Jones, Makoto
Garvin, Jennifer
Samore, Matthew H
Chapman, Wendy W
Gundlapalli, Adi V
Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
title Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
title_full Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
title_fullStr Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
title_full_unstemmed Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
title_short Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
title_sort developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2745683/
https://www.ncbi.nlm.nih.gov/pubmed/19761566
http://dx.doi.org/10.1186/1471-2105-10-S9-S12
work_keys_str_mv AT southbrettr developingamanuallyannotatedclinicaldocumentcorpustoidentifyphenotypicinformationforinflammatoryboweldisease
AT shenshuying developingamanuallyannotatedclinicaldocumentcorpustoidentifyphenotypicinformationforinflammatoryboweldisease
AT jonesmakoto developingamanuallyannotatedclinicaldocumentcorpustoidentifyphenotypicinformationforinflammatoryboweldisease
AT garvinjennifer developingamanuallyannotatedclinicaldocumentcorpustoidentifyphenotypicinformationforinflammatoryboweldisease
AT samorematthewh developingamanuallyannotatedclinicaldocumentcorpustoidentifyphenotypicinformationforinflammatoryboweldisease
AT chapmanwendyw developingamanuallyannotatedclinicaldocumentcorpustoidentifyphenotypicinformationforinflammatoryboweldisease
AT gundlapalliadiv developingamanuallyannotatedclinicaldocumentcorpustoidentifyphenotypicinformationforinflammatoryboweldisease