Annotating genes and genomes with DNA sequences extracted from biomedical articles

Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and...

Bibliographic Details
Main authors: Haeussler, Maximilian; Gerner, Martin; Bergman, Casey M.
Format: Text
Language: English
Published: Oxford University Press, 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3065681/
https://www.ncbi.nlm.nih.gov/pubmed/21325301
http://dx.doi.org/10.1093/bioinformatics/btr043
_version_ 1782201012906360832
author Haeussler, Maximilian
Gerner, Martin
Bergman, Casey M.
author_facet Haeussler, Maximilian
Gerner, Martin
Bergman, Casey M.
author_sort Haeussler, Maximilian
collection PubMed
description Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study. Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments. Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data. Availability and implementation: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org. Contact: maximilianh@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
format Text
id pubmed-3065681
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3065681 2011-03-30 Annotating genes and genomes with DNA sequences extracted from biomedical articles Haeussler, Maximilian Gerner, Martin Bergman, Casey M. Bioinformatics Original Papers Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study. Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments. Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data. Availability and implementation: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org. Contact: maximilianh@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2011-04-01 2011-02-16 /pmc/articles/PMC3065681/ /pubmed/21325301 http://dx.doi.org/10.1093/bioinformatics/btr043 Text en © The Author(s) 2011. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/2.5 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Haeussler, Maximilian
Gerner, Martin
Bergman, Casey M.
Annotating genes and genomes with DNA sequences extracted from biomedical articles
title Annotating genes and genomes with DNA sequences extracted from biomedical articles
title_sort annotating genes and genomes with dna sequences extracted from biomedical articles
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3065681/
https://www.ncbi.nlm.nih.gov/pubmed/21325301
http://dx.doi.org/10.1093/bioinformatics/btr043
work_keys_str_mv AT haeusslermaximilian annotatinggenesandgenomeswithdnasequencesextractedfrombiomedicalarticles
AT gernermartin annotatinggenesandgenomeswithdnasequencesextractedfrombiomedicalarticles
AT bergmancaseym annotatinggenesandgenomeswithdnasequencesextractedfrombiomedicalarticles
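
The abstract in this record describes extracting DNA sequences from article text before mapping them to genomic databases. The extraction step can be illustrated with a minimal Python sketch. This is not the text2genome implementation: the function name, the regular expression, and the 20-nucleotide minimum length are all assumptions chosen for illustration, and the mapping step (aligning candidates against a genome) is left out entirely.

```python
import re

# Hypothetical threshold (an assumption, not taken from the paper): shorter
# runs of A/C/G/T are too likely to be fragments of ordinary English words.
MIN_LENGTH = 20

# A run of nucleotide letters, possibly broken up by whitespace, as happens
# when a long sequence is printed in blocks of ten or wrapped across lines.
_CANDIDATE = re.compile(r"[ACGT\s]+", re.IGNORECASE)


def extract_dna_candidates(text, min_length=MIN_LENGTH):
    """Return uppercase A/C/G/T strings of at least min_length found in text."""
    candidates = []
    for match in _CANDIDATE.finditer(text):
        # Drop the internal whitespace so blocked sequences are rejoined.
        sequence = "".join(match.group().split()).upper()
        if len(sequence) >= min_length:
            candidates.append(sequence)
    return candidates


if __name__ == "__main__":
    sample = (
        "The forward primer 5'-ATGGCCAAGT TGACCAGTGC CGTTCC-3' "
        "was used to amplify the cat gene."
    )
    print(extract_dna_candidates(sample))
```

Short words made only of the letters a, c, g and t (such as "cat") fall below the length threshold and are ignored. A real pipeline would then need to align each candidate against reference genomes to decide which gene and organism it belongs to.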