
OSIRISv1.2: A named entity recognition system for sequence variants of genes in biomedical literature

BACKGROUND: Single Nucleotide Polymorphisms, among other types of sequence variants, constitute key elements in genetic epidemiology and pharmacogenomics. While sequence data about genetic variation is found in databases such as dbSNP, clues about the functional and phenotypic consequences of the var...


Bibliographic Details
Main authors: Furlong, Laura I, Dach, Holger, Hofmann-Apitius, Martin, Sanz, Ferran
Format: Text
Language: English
Published: BioMed Central 2008
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2277400/
https://www.ncbi.nlm.nih.gov/pubmed/18251998
http://dx.doi.org/10.1186/1471-2105-9-84
author Furlong, Laura I
Dach, Holger
Hofmann-Apitius, Martin
Sanz, Ferran
collection PubMed
description BACKGROUND: Single Nucleotide Polymorphisms, among other types of sequence variants, constitute key elements in genetic epidemiology and pharmacogenomics. While sequence data about genetic variation is found in databases such as dbSNP, clues about the functional and phenotypic consequences of the variations are generally found in the biomedical literature. The identification of the relevant documents and the extraction of the information from them are hampered by the large size of literature databases and the lack of a widely accepted standard notation for biomedical entities. Thus, automatic systems for the identification of citations of allelic variants of genes in biomedical texts are required. RESULTS: Our group has previously reported the development of OSIRIS, a system aimed at the retrieval of literature about allelic variants of genes. Here we describe the development of a new version of OSIRIS (OSIRISv1.2), which incorporates a new entity recognition module and is built on top of a local mirror of the MEDLINE collection and HgenetInfoDB, a database that collects data on human gene sequence variations. The new entity recognition module is based on a pattern-based search algorithm for the identification of variation terms in the texts and their mapping to dbSNP identifiers. The performance of OSIRISv1.2 was evaluated on a manually annotated corpus, resulting in 99% precision, 82% recall, and an F-score of 0.89. As an example, the application of the system for collecting literature citations for the allelic variants of genes related to the diseases intracranial aneurysm and breast cancer is presented. CONCLUSION: OSIRISv1.2 can be used to link literature references to dbSNP database entries with high accuracy, and is therefore suitable for collecting current knowledge on gene sequence variations and supporting the functional annotation of variation databases.
The application of OSIRISv1.2 in combination with controlled vocabularies like MeSH provides a way to identify associations of biomedical interest, such as those that relate SNPs with diseases.
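The abstract above mentions a pattern-based search for variation terms and an evaluation in terms of precision, recall, and F-score. The following is a minimal Python sketch of both ideas, not the OSIRISv1.2 implementation: the two regular expressions, the function names, and the example sentence are assumptions made for illustration, and the F-score is assumed to be the usual balanced harmonic mean of precision and recall. The record does not describe the actual patterns or how mentions are mapped to dbSNP identifiers, so that step is not shown.

```python
import re

# Hypothetical patterns for illustration only; NOT the patterns used by OSIRISv1.2.
RS_ID = re.compile(r"\brs\d+\b")                   # dbSNP-style identifiers, e.g. rs1801133
SUBSTITUTION = re.compile(r"\b[ACGT]\d+[ACGT]\b")  # nucleotide substitutions, e.g. C677T


def find_variation_terms(text):
    """Return (matched_text, start, end) tuples for candidate mentions, in text order."""
    hits = []
    for pattern in (RS_ID, SUBSTITUTION):
        for match in pattern.finditer(text):
            hits.append((match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])


def precision_recall_f1(predicted, gold):
    """Compare two collections of annotations; return (precision, recall, balanced F-score)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative sentence, not taken from the evaluation corpus.
sentence = "The MTHFR C677T polymorphism (rs1801133) was genotyped in all patients."
mentions = [term for term, _, _ in find_variation_terms(sentence)]
print(mentions)
```

In practice, the set of predicted mentions would be compared against manual annotations with `precision_recall_f1`; a real system would also need many more patterns (for example, amino acid changes and free-text descriptions) than the two shown here.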
format Text
id pubmed-2277400
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2277400 2008-04-01 OSIRISv1.2: A named entity recognition system for sequence variants of genes in biomedical literature Furlong, Laura I Dach, Holger Hofmann-Apitius, Martin Sanz, Ferran BMC Bioinformatics Methodology Article BioMed Central 2008-02-05 /pmc/articles/PMC2277400/ /pubmed/18251998 http://dx.doi.org/10.1186/1471-2105-9-84 Text en Copyright © 2008 Furlong et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title OSIRISv1.2: A named entity recognition system for sequence variants of genes in biomedical literature
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2277400/
https://www.ncbi.nlm.nih.gov/pubmed/18251998
http://dx.doi.org/10.1186/1471-2105-9-84