Using text mining techniques to extract phenotypic information from the PhenoCHF corpus

Bibliographic Details
Main authors: Alnazzawi, Noha, Thompson, Paul, Batista-Navarro, Riza, Ananiadou, Sophia
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects: Proceedings
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4474585/
https://www.ncbi.nlm.nih.gov/pubmed/26099853
http://dx.doi.org/10.1186/1472-6947-15-S2-S3
author Alnazzawi, Noha
Thompson, Paul
Batista-Navarro, Riza
Ananiadou, Sophia
collection PubMed
description BACKGROUND: Phenotypic information locked away in unstructured narrative text presents significant barriers to information accessibility, both for clinical practitioners and for computerised applications used for clinical research purposes. Text mining (TM) techniques have previously been applied successfully to extract different types of information from text in the biomedical domain. They have the potential to be extended to allow the extraction of information relating to phenotypes from free text.
METHODS: To stimulate the development of TM systems that are able to extract phenotypic information from text, we have created a new corpus (PhenoCHF) that is annotated by domain experts with several types of phenotypic information relating to congestive heart failure. To ensure that systems developed using the corpus are robust to multiple text types, it integrates text from heterogeneous sources, i.e., electronic health records (EHRs) and scientific articles from the literature. We have developed several different phenotype extraction methods to demonstrate the utility of the corpus, and tested these methods on a further corpus, i.e., ShARe/CLEF 2013.
RESULTS: Evaluation of our automated methods showed that PhenoCHF can facilitate the training of reliable phenotype extraction systems, which are robust to variations in text type. These results have been reinforced by evaluating our trained systems on the ShARe/CLEF corpus, which contains clinical records of various types. Like other studies within the biomedical domain, we found that solutions based on conditional random fields produced the best results, when coupled with a rich feature set.
CONCLUSIONS: PhenoCHF is the first annotated corpus aimed at encoding detailed phenotypic information. The unique heterogeneous composition of the corpus has been shown to be advantageous in the training of systems that can accurately extract phenotypic information from a range of different text types. Although the scope of our annotation is currently limited to a single disease, the promising results achieved can stimulate further work into the extraction of phenotypic information for other diseases. The PhenoCHF annotation guidelines and annotations are publicly available at https://code.google.com/p/phenochf-corpus.
format Online
Article
Text
id pubmed-4474585
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4474585 2015-06-25 BMC Med Inform Decis Mak Proceedings BioMed Central 2015-06-15 /pmc/articles/PMC4474585/ /pubmed/26099853 http://dx.doi.org/10.1186/1472-6947-15-S2-S3 Text en
Copyright © 2015 Alnazzawi et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Using text mining techniques to extract phenotypic information from the PhenoCHF corpus
topic Proceedings
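
The abstract reports that conditional random fields coupled with a rich feature set gave the best results, but the record does not list the features or label set used. As a rough illustration only, the sketch below shows the kind of input such a sequence tagger typically consumes: BIO tags derived from annotated spans, and a per-token feature dictionary. The label name PHENOTYPE, the example sentence, and every feature shown are assumptions made for this sketch, not details taken from PhenoCHF.

```python
from typing import Dict, List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def bio_tags(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-index spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the mention
    return tags


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build an illustrative feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "suffix3": tok[-3:].lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        # Neighbouring words give the tagger local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# Invented example sentence; PHENOTYPE is a placeholder label.
tokens = "Patient has congestive heart failure and hypertension .".split()
spans = [(2, 5, "PHENOTYPE"), (6, 7, "PHENOTYPE")]

tags = bio_tags(tokens, spans)
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Lists of feature dictionaries paired with tag sequences like these are a common input format for CRF toolkits; the tagger itself is deliberately left out here because the record does not say which implementation the authors used.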