Towards comprehensive syntactic and semantic annotations of the clinical narrative

OBJECTIVE: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components. METHODS: Manual annotation of a clinical narrative corpus of 127 606 tokens foll...

Bibliographic Details
Main authors: Albright, Daniel; Lanfranchi, Arrick; Fredriksen, Anwen; Styler, William F; Warner, Colin; Hwang, Jena D; Choi, Jinho D; Dligach, Dmitriy; Nielsen, Rodney D; Martin, James; Ward, Wayne; Palmer, Martha; Savova, Guergana K
Format: Online Article Text
Language: English
Published: BMJ Publishing Group 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756257/
https://www.ncbi.nlm.nih.gov/pubmed/23355458
http://dx.doi.org/10.1136/amiajnl-2012-001317
_version_ 1782282065583013888
author Albright, Daniel
Lanfranchi, Arrick
Fredriksen, Anwen
Styler, William F
Warner, Colin
Hwang, Jena D
Choi, Jinho D
Dligach, Dmitriy
Nielsen, Rodney D
Martin, James
Ward, Wayne
Palmer, Martha
Savova, Guergana K
author_sort Albright, Daniel
collection PubMed
description OBJECTIVE: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components. METHODS: Manual annotation of a clinical narrative corpus of 127 606 tokens following the Treebank schema for syntactic information, PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed. RESULTS: The final corpus consists of 13 091 sentences containing 1772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28 539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891–0.931), NE (0.697–0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations. CONCLUSIONS: This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would have been previously impossible.
format Online
Article
Text
id pubmed-3756257
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BMJ Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-37562572013-12-11 Towards comprehensive syntactic and semantic annotations of the clinical narrative Albright, Daniel Lanfranchi, Arrick Fredriksen, Anwen Styler, William F Warner, Colin Hwang, Jena D Choi, Jinho D Dligach, Dmitriy Nielsen, Rodney D Martin, James Ward, Wayne Palmer, Martha Savova, Guergana K J Am Med Inform Assoc Research and Applications OBJECTIVE: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components. METHODS: Manual annotation of a clinical narrative corpus of 127 606 tokens following the Treebank schema for syntactic information, PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed. RESULTS: The final corpus consists of 13 091 sentences containing 1772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28 539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891–0.931), NE (0.697–0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations. CONCLUSIONS: This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. 
The corpus creation and NLP components provide a resource for research and application development that would have been previously impossible. BMJ Publishing Group 2013-09 2013-01-25 /pmc/articles/PMC3756257/ /pubmed/23355458 http://dx.doi.org/10.1136/amiajnl-2012-001317 Text en Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/
title Towards comprehensive syntactic and semantic annotations of the clinical narrative
topic Research and Applications
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756257/
https://www.ncbi.nlm.nih.gov/pubmed/23355458
http://dx.doi.org/10.1136/amiajnl-2012-001317