SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature

BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variations influencing common diseases and phenotypes. Recently, some corpora and methods have been developed with the purpose of extracting mutations and diseases from texts. However, there is no availa...

Bibliographic Details
Main authors:	Bokharaeian, Behrouz; Diaz, Alberto; Taghizadeh, Nasrin; Chitsaz, Hamidreza; Chavoshinejad, Ramyar
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5383945/ https://www.ncbi.nlm.nih.gov/pubmed/28388928 http://dx.doi.org/10.1186/s13326-017-0116-2

author	Bokharaeian, Behrouz; Diaz, Alberto; Taghizadeh, Nasrin; Chitsaz, Hamidreza; Chavoshinejad, Ramyar
author_sort	Bokharaeian, Behrouz
collection	PubMed
description	BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variation influencing common diseases and phenotypes. Recently, several corpora and methods have been developed for extracting mutations and diseases from text. However, no available corpus for extracting associations from text is annotated with linguistic-based negation, modality markers, neutral candidates, and the confidence level of associations. METHOD: This research presents the steps taken to produce the SNPPhenA corpus. They include automatic Named Entity Recognition (NER) followed by manual annotation of SNP and phenotype names, annotation of the SNP-phenotype associations and their level of confidence, and annotation of modality markers. Moreover, the corpus was annotated with negation scopes and cues as well as neutral candidates, which play a crucial role in handling negation and modality in extraction tasks. RESULT: Agreement between annotators was measured with Cohen’s Kappa coefficient, and the resulting scores indicated the reliability of the corpus. The Kappa score was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Basic statistics of the annotated features of the corpus are also presented, along with the results of our first experiments on extracting ranked SNP-phenotype associations. The prepared guideline documents make the corpus more convenient to use. The corpus, guidelines, and inter-annotator agreement analysis are available on the corpus website: http://nil.fdi.ucm.es/?q=node/639. CONCLUSION: Specifying the confidence degree of SNP-phenotype associations from articles helps identify the strength of associations, which could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors. Moreover, our first experiments with the corpus show that linguistic-based confidence, alongside other non-linguistic features, can be used to estimate the strength of the observed SNP-phenotype associations. Trial Registration: Not Applicable. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13326-017-0116-2) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5383945
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5383945 2017-04-10 SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature Bokharaeian, Behrouz; Diaz, Alberto; Taghizadeh, Nasrin; Chitsaz, Hamidreza; Chavoshinejad, Ramyar J Biomed Semantics Research BioMed Central 2017-04-07 /pmc/articles/PMC5383945/ /pubmed/28388928 http://dx.doi.org/10.1186/s13326-017-0116-2 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5383945/ https://www.ncbi.nlm.nih.gov/pubmed/28388928 http://dx.doi.org/10.1186/s13326-017-0116-2
