
Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts


Bibliographic Details
Main Authors:	Verspoor, Karin M.; Heo, Go Eun; Kang, Keun Young; Song, Min
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4959367/ https://www.ncbi.nlm.nih.gov/pubmed/27454860 http://dx.doi.org/10.1186/s12911-016-0294-3

_version_	1782444391945732096
author	Verspoor, Karin M.; Heo, Go Eun; Kang, Keun Young; Song, Min
collection	PubMed
description	BACKGROUND: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. METHODS: In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. RESULTS: For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78–0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. CONCLUSIONS: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.
format	Online Article Text
id	pubmed-4959367
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4959367 2016-08-01 Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts Verspoor, Karin M.; Heo, Go Eun; Kang, Keun Young; Song, Min BMC Med Inform Decis Mak Research BioMed Central 2016-07-18 /pmc/articles/PMC4959367/ /pubmed/27454860 http://dx.doi.org/10.1186/s12911-016-0294-3 Text en © Verspoor. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4959367/ https://www.ncbi.nlm.nih.gov/pubmed/27454860 http://dx.doi.org/10.1186/s12911-016-0294-3
work_keys_str_mv	AT verspoorkarinm establishingabaselineforliteraturemininghumangeneticvariantsandtheirrelationshipstodiseasecohorts AT heogoeun establishingabaselineforliteraturemininghumangeneticvariantsandtheirrelationshipstodiseasecohorts AT kangkeunyoung establishingabaselineforliteraturemininghumangeneticvariantsandtheirrelationshipstodiseasecohorts AT songmin establishingabaselineforliteraturemininghumangeneticvariantsandtheirrelationshipstodiseasecohorts
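
The abstract compares relation extraction against a co-occurrence baseline and reports a Precision-weighted F-score. The following is a minimal Python sketch of both ideas, not the PKDE4J implementation. It assumes that the baseline predicts a relation for every pair of entities mentioned in the same sentence, and that "Precision-weighted" corresponds to an F-beta score with beta below 1 (beta = 0.5 is used here as an assumption; the record does not state the value). The entity names and gold relations are toy placeholders, not data from the Variome corpus.

```python
from itertools import combinations


def cooccurrence_pairs(sentences):
    """Predict a relation for every unordered pair of distinct entities
    that appear in the same sentence (hypothetical baseline)."""
    predicted = set()
    for entities in sentences:
        for a, b in combinations(sorted(set(entities)), 2):
            predicted.add(frozenset((a, b)))
    return predicted


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights Precision more heavily than Recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score(predicted, gold, beta=0.5):
    """Return (precision, recall, f_beta) for predicted vs. gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_beta(precision, recall, beta)


# Toy input: each inner list holds the entities annotated in one sentence.
sentences = [
    ["gene_A", "mutation_1", "disease_X"],
    ["gene_B", "cohort_1"],
]

# Toy gold standard: only two of the co-occurring pairs are true relations.
gold = {
    frozenset(("gene_A", "mutation_1")),
    frozenset(("mutation_1", "disease_X")),
}

predicted = cooccurrence_pairs(sentences)
precision, recall, f_half = score(predicted, gold, beta=0.5)
print(f"P={precision:.2f} R={recall:.2f} F0.5={f_half:.3f}")
```

In this toy case the baseline recovers every gold relation (Recall 1.0) but half of its four predictions are spurious (Precision 0.5), which is the trade-off the abstract describes: co-occurrence is hard to beat on Recall, and type-based constraints on which entity pairs may be related are one simple way to raise Precision.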
