Realistic artificial DNA sequences as negative controls for computational genomics

A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences ar...

Bibliographic Details
Main authors:	Caballero, Juan; Smit, Arian F. A.; Hood, Leroy; Glusman, Gustavo
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2014
Subjects:	Methods Online
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4081056/ https://www.ncbi.nlm.nih.gov/pubmed/24803667 http://dx.doi.org/10.1093/nar/gku356

author	Caballero, Juan; Smit, Arian F. A.; Hood, Leroy; Glusman, Gustavo
collection	PubMed
description	A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences are generally taken from regions of the genome presumed to be intergenic, or generated synthetically by ‘shuffling’ real sequences. This last method can lead to underestimation of false-positive rates. We developed a new method for generating artificial sequences that are modeled after real intergenic sequences in terms of composition, complexity and interspersed repeat content. These artificial sequences can serve as an inexhaustible source of high-quality negative controls. We used artificial sequences to evaluate the false-positive rates of a set of programs for detecting interspersed repeats, ab initio prediction of coding genes, transcribed regions and non-coding genes. We found that RepeatMasker is more accurate than PClouds, Augustus has the lowest false-positive rate of the coding gene prediction programs tested, and Infernal has a low false-positive rate for non-coding gene detection. A web service, source code and the models for human and many other species are freely available at http://repeatmasker.org/garlic/.
format	Online Article Text
id	pubmed-4081056
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-4081056 2014-07-10 Realistic artificial DNA sequences as negative controls for computational genomics Caballero, Juan; Smit, Arian F. A.; Hood, Leroy; Glusman, Gustavo Nucleic Acids Res Methods Online Oxford University Press 2014-08-01 2014-05-06 /pmc/articles/PMC4081056/ /pubmed/24803667 http://dx.doi.org/10.1093/nar/gku356 Text en © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Realistic artificial DNA sequences as negative controls for computational genomics
topic	Methods Online
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4081056/ https://www.ncbi.nlm.nih.gov/pubmed/24803667 http://dx.doi.org/10.1093/nar/gku356
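
The description above contrasts 'shuffling' real sequences with generating artificial sequences modeled on real intergenic sequence. The sketch below illustrates that contrast in a simplified form, and it is not the authors' method: it swaps in a plain k-th order Markov chain, which preserves local (k+1)-mer composition but does not model complexity or interspersed repeat content. All function names are hypothetical, and k is assumed to be at least 1.

```python
import random
from collections import Counter, defaultdict


def shuffle_sequence(seq, rng):
    """Mononucleotide shuffle: keeps base counts, destroys all local context."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def train_markov(seq, k):
    """Count which base follows each k-mer in the template (assumes k >= 1)."""
    model = defaultdict(Counter)
    for i in range(len(seq) - k):
        model[seq[i:i + k]][seq[i + k]] += 1
    if not model:
        raise ValueError("template must be longer than k")
    return model


def generate_markov(model, k, length, rng):
    """Emit a sequence whose (k+1)-mer statistics approximate the template."""
    contexts = sorted(model)
    out = list(rng.choice(contexts))
    while len(out) < length:
        counts = model.get("".join(out[-k:]))
        if not counts:
            # Dead end (context seen only at the template's end):
            # restart from a randomly chosen observed context.
            out.extend(rng.choice(contexts))
            continue
        bases = sorted(counts)
        weights = [counts[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])
```

For example, calling `train_markov(template, 2)` on a real intergenic template and passing the result to `generate_markov(model, 2, 1000, random.Random(0))` would yield a 1000-base sequence, whereas `shuffle_sequence` only permutes the template's own bases. Either output could be fed to a prediction tool as a negative control; the Markov version retains more of the template's local composition.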
