
Realistic artificial DNA sequences as negative controls for computational genomics

A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences ar...


Bibliographic Details
Main authors: Caballero, Juan; Smit, Arian F. A.; Hood, Leroy; Glusman, Gustavo
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects: Methods Online
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4081056/
https://www.ncbi.nlm.nih.gov/pubmed/24803667
http://dx.doi.org/10.1093/nar/gku356
author Caballero, Juan
Smit, Arian F. A.
Hood, Leroy
Glusman, Gustavo
collection PubMed
description A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences are generally taken from regions of the genome presumed to be intergenic, or generated synthetically by ‘shuffling’ real sequences. This last method can lead to underestimation of false-positive rates. We developed a new method for generating artificial sequences that are modeled after real intergenic sequences in terms of composition, complexity and interspersed repeat content. These artificial sequences can serve as an inexhaustible source of high-quality negative controls. We used artificial sequences to evaluate the false-positive rates of a set of programs for detecting interspersed repeats, ab initio prediction of coding genes, transcribed regions and non-coding genes. We found that RepeatMasker is more accurate than PClouds, Augustus has the lowest false-positive rate of the coding gene prediction programs tested, and Infernal has a low false-positive rate for non-coding gene detection. A web service, source code and the models for human and many other species are freely available at http://repeatmasker.org/garlic/.
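The 'shuffling' approach mentioned in the description can be illustrated with a minimal Python sketch. This is not the authors' method (their models and source code are at the URL given in the description); it only shows a plain mononucleotide shuffle, which keeps base composition but not the order of bases. The function names and the toy sequence are hypothetical and chosen for illustration.

```python
import random
from collections import Counter
from typing import Optional


def shuffle_sequence(seq: str, seed: Optional[int] = None) -> str:
    """Return a random permutation of the bases in seq (mononucleotide shuffle)."""
    rng = random.Random(seed)  # local generator so the result is reproducible per seed
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def dinucleotide_counts(seq: str) -> Counter:
    """Count overlapping dinucleotides, a simple measure of local sequence structure."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


# Hypothetical toy sequence containing a homopolymer run and a CG repeat.
real = "ACGT" * 5 + "AAAAAAAAAA" + "CGCGCGCGCG"
background = shuffle_sequence(real, seed=0)

# Base composition is identical by construction; dinucleotide counts generally are not.
print(Counter(real) == Counter(background))
print(dinucleotide_counts(real))
print(dinucleotide_counts(background))
```

Comparing the two dinucleotide tables shows what a simple shuffle preserves and what it discards: the counts of each base match exactly, while runs and repeats present in the input are typically broken up.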
format Online
Article
Text
id pubmed-4081056
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4081056 2014-07-10
Realistic artificial DNA sequences as negative controls for computational genomics
Caballero, Juan; Smit, Arian F. A.; Hood, Leroy; Glusman, Gustavo
Nucleic Acids Res, Methods Online
Oxford University Press, 2014-08-01, 2014-05-06
Text, en
© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Realistic artificial DNA sequences as negative controls for computational genomics
topic Methods Online