Realistic artificial DNA sequences as negative controls for computational genomics
A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences ar...
Main authors: | Caballero, Juan; Smit, Arian F. A.; Hood, Leroy; Glusman, Gustavo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2014 |
Subjects: | Methods Online |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4081056/ https://www.ncbi.nlm.nih.gov/pubmed/24803667 http://dx.doi.org/10.1093/nar/gku356 |
_version_ | 1782324055912742912 |
---|---|
author | Caballero, Juan Smit, Arian F. A. Hood, Leroy Glusman, Gustavo |
author_facet | Caballero, Juan Smit, Arian F. A. Hood, Leroy Glusman, Gustavo |
author_sort | Caballero, Juan |
collection | PubMed |
description | A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences are generally taken from regions of the genome presumed to be intergenic, or generated synthetically by ‘shuffling’ real sequences. This last method can lead to underestimation of false-positive rates. We developed a new method for generating artificial sequences that are modeled after real intergenic sequences in terms of composition, complexity and interspersed repeat content. These artificial sequences can serve as an inexhaustible source of high-quality negative controls. We used artificial sequences to evaluate the false-positive rates of a set of programs for detecting interspersed repeats, ab initio prediction of coding genes, transcribed regions and non-coding genes. We found that RepeatMasker is more accurate than PClouds, Augustus has the lowest false-positive rate of the coding gene prediction programs tested, and Infernal has a low false-positive rate for non-coding gene detection. A web service, source code and the models for human and many other species are freely available at http://repeatmasker.org/garlic/. |
format | Online Article Text |
id | pubmed-4081056 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-40810562014-07-10 Realistic artificial DNA sequences as negative controls for computational genomics Caballero, Juan Smit, Arian F. A. Hood, Leroy Glusman, Gustavo Nucleic Acids Res Methods Online A common practice in computational genomic analysis is to use a set of ‘background’ sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such ‘background’ sequences are generally taken from regions of the genome presumed to be intergenic, or generated synthetically by ‘shuffling’ real sequences. This last method can lead to underestimation of false-positive rates. We developed a new method for generating artificial sequences that are modeled after real intergenic sequences in terms of composition, complexity and interspersed repeat content. These artificial sequences can serve as an inexhaustible source of high-quality negative controls. We used artificial sequences to evaluate the false-positive rates of a set of programs for detecting interspersed repeats, ab initio prediction of coding genes, transcribed regions and non-coding genes. We found that RepeatMasker is more accurate than PClouds, Augustus has the lowest false-positive rate of the coding gene prediction programs tested, and Infernal has a low false-positive rate for non-coding gene detection. A web service, source code and the models for human and many other species are freely available at http://repeatmasker.org/garlic/. Oxford University Press 2014-08-01 2014-05-06 /pmc/articles/PMC4081056/ /pubmed/24803667 http://dx.doi.org/10.1093/nar/gku356 Text en © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methods Online Caballero, Juan Smit, Arian F. A. Hood, Leroy Glusman, Gustavo Realistic artificial DNA sequences as negative controls for computational genomics |
title | Realistic artificial DNA sequences as negative controls for computational genomics |
title_full | Realistic artificial DNA sequences as negative controls for computational genomics |
title_fullStr | Realistic artificial DNA sequences as negative controls for computational genomics |
title_full_unstemmed | Realistic artificial DNA sequences as negative controls for computational genomics |
title_short | Realistic artificial DNA sequences as negative controls for computational genomics |
title_sort | realistic artificial dna sequences as negative controls for computational genomics |
topic | Methods Online |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4081056/ https://www.ncbi.nlm.nih.gov/pubmed/24803667 http://dx.doi.org/10.1093/nar/gku356 |
work_keys_str_mv | AT caballerojuan realisticartificialdnasequencesasnegativecontrolsforcomputationalgenomics AT smitarianfa realisticartificialdnasequencesasnegativecontrolsforcomputationalgenomics AT hoodleroy realisticartificialdnasequencesasnegativecontrolsforcomputationalgenomics AT glusmangustavo realisticartificialdnasequencesasnegativecontrolsforcomputationalgenomics |