Pooling annotated corpora for clinical concept extraction

BACKGROUND: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other i...

Bibliographic Details
Main authors: Wagholikar, Kavishwar B; Torii, Manabu; Jonnalagadda, Siddhartha R; Liu, Hongfang
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3599895/
https://www.ncbi.nlm.nih.gov/pubmed/23294871
http://dx.doi.org/10.1186/2041-1480-4-3
author Wagholikar, Kavishwar B
Torii, Manabu
Jonnalagadda, Siddhartha R
Liu, Hongfang
collection PubMed
description BACKGROUND: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, creating the annotations is costly and labor-intensive. A potential alternative is to reuse existing corpora from other institutions by pooling them with local corpora to train machine taggers. In this paper we investigate this approach by pooling corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for the recognition of medical problems. Both corpora were annotated for medical problems, but under different guidelines. The taggers were built with an existing tagging system, MedTagger, which consists of dictionary lookup, part-of-speech (POS) tagging, and machine learning for named entity prediction and concept extraction. We hope that this work will serve as a useful case study for facilitating the reuse of annotated corpora across institutions.
RESULTS: We found that pooling was effective when the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling diminished, however, as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling.
CONCLUSIONS: The effectiveness of pooling corpora depends on several factors, including the compatibility of annotation guidelines, the distribution of report types, and the sizes of the local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed by further studies on different corpora.
To facilitate the pooling and reuse of annotated corpora, we suggest that (i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline difference partly identified in this paper; (ii) corpora should be annotated with a two-pass method that focuses first on concept recognition and then on normalization to existing ontologies; and (iii) metadata such as the report type should be recorded during the annotation process.
format Online
Article
Text
id pubmed-3599895
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3599895 2013-03-23 Pooling annotated corpora for clinical concept extraction
Wagholikar, Kavishwar B
Torii, Manabu
Jonnalagadda, Siddhartha R
Liu, Hongfang
J Biomed Semantics
Research
BioMed Central 2013-01-08
/pmc/articles/PMC3599895/ /pubmed/23294871 http://dx.doi.org/10.1186/2041-1480-4-3
Text en
Copyright ©2013 Wagholikar et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Pooling annotated corpora for clinical concept extraction
topic Research