Concept selection for phenotypes and diseases using learn to rank

BACKGROUND: Phenotypes form the basis for determining the existence of a disease against the given evidence. Much of this evidence though remains locked away in text – scientific articles, clinical trial reports and electronic patient records (EPR) – where authors use the full expressivity of human...

Bibliographic Details
Main Authors: Collier, Nigel, Oellrich, Anika, Groza, Tudor
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4450611/
https://www.ncbi.nlm.nih.gov/pubmed/26034558
http://dx.doi.org/10.1186/s13326-015-0019-z
collection PubMed
description BACKGROUND: Phenotypes form the basis for determining the existence of a disease against the given evidence. Much of this evidence though remains locked away in text – scientific articles, clinical trial reports and electronic patient records (EPR) – where authors use the full expressivity of human language to report their observations. RESULTS: In this paper we exploit a combination of off-the-shelf tools for extracting a machine understandable representation of phenotypes and other related concepts that concern the diagnosis and treatment of diseases. These are tested against a gold standard EPR collection that has been annotated with Unified Medical Language System (UMLS) concept identifiers: the ShARE/CLEF 2013 corpus for disorder detection. We evaluate four pipelines as stand-alone systems and then attempt to optimise semantic-type based performance using several learn-to-rank (LTR) approaches – three pairwise and one listwise. We observed that whilst overall Apache cTAKES tended to outperform other stand-alone systems on a strong recall (R = 0.57), precision was low (P = 0.09) leading to low-to-moderate F1 measure (F1 = 0.16). Moreover, there is substantial variation in system performance across semantic types for disorders. For example, the concept Findings (T033) seemed to be very challenging for all systems. Combining systems within LTR improved F1 substantially (F1 = 0.24) particularly for Disease or syndrome (T047) and Anatomical abnormality (T190). Whilst recall is improved markedly, precision remains a challenge (P = 0.15, R = 0.59).
format Online
Article
Text
id pubmed-4450611
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4450611 2015-06-02 J Biomed Semantics Research Article BioMed Central 2015-06-01 /pmc/articles/PMC4450611/ /pubmed/26034558 http://dx.doi.org/10.1186/s13326-015-0019-z Text en
© Collier et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
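
The precision, recall and F1 figures quoted in the abstract can be cross-checked with the standard definition of F1 as the harmonic mean of precision and recall. The sketch below is only an illustration: the function name `f1_score` is a hypothetical helper and is not taken from the article. It assumes the reported F1 is the usual balanced F-measure, which the abstract does not state explicitly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs as reported in the abstract of this record.
ctakes_f1 = f1_score(0.09, 0.57)  # stand-alone Apache cTAKES
ltr_f1 = f1_score(0.15, 0.59)     # systems combined with learn-to-rank

print(round(ctakes_f1, 2), round(ltr_f1, 2))  # prints: 0.16 0.24
```

Under that assumption, both rounded values agree with the F1 scores given in the abstract (0.16 and 0.24).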