Repeat or not repeat?—Statistical validation of tandem repeat prediction in genomic sequences

Tandem repeats (TRs) represent one of the most prevalent features of genomic sequences. Due to their abundance and functional significance, a plethora of detection tools has been devised over the last two decades. Despite the longstanding interest, TR detection is still not resolved. Our large-scale...

Bibliographic Details
Main authors: Schaper, Elke; Kajava, Andrey V.; Hauser, Alain; Anisimova, Maria
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects: Computational Biology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3488214/
https://www.ncbi.nlm.nih.gov/pubmed/22923522
http://dx.doi.org/10.1093/nar/gks726
author Schaper, Elke
Kajava, Andrey V.
Hauser, Alain
Anisimova, Maria
author_sort Schaper, Elke
collection PubMed
description Tandem repeats (TRs) represent one of the most prevalent features of genomic sequences. Due to their abundance and functional significance, a plethora of detection tools has been devised over the last two decades. Despite the longstanding interest, TR detection is still not resolved. Our large-scale tests reveal that current detectors produce different, often nonoverlapping inferences, reflecting characteristics of the underlying algorithms rather than the true distribution of TRs in genomic data. Our simulations show that the power of detecting TRs depends on the degree of their divergence and on repeat characteristics such as the length of the minimal repeat unit and the number of units in tandem. To reconcile the diverse predictions of current algorithms, we propose and evaluate several statistical criteria for measuring the quality of predicted repeat units. In particular, we propose a model-based phylogenetic classifier, entailing a maximum-likelihood estimation of the repeat divergence. Applied in conjunction with state-of-the-art detectors, our statistical classification scheme for inferred repeats allows false-positive predictions to be filtered out. Since different algorithms appear to specialize in predicting TRs with certain properties, we advise applying multiple detectors with subsequent filtering to obtain the most complete set of genuine repeats.
format Online
Article
Text
id pubmed-3488214
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3488214 2012-11-06 Repeat or not repeat?—Statistical validation of tandem repeat prediction in genomic sequences Schaper, Elke Kajava, Andrey V. Hauser, Alain Anisimova, Maria Nucleic Acids Res Computational Biology Oxford University Press 2012-11 2012-08-24 /pmc/articles/PMC3488214/ /pubmed/22923522 http://dx.doi.org/10.1093/nar/gks726 Text en © The Author(s) 2012. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Repeat or not repeat?—Statistical validation of tandem repeat prediction in genomic sequences
topic Computational Biology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3488214/
https://www.ncbi.nlm.nih.gov/pubmed/22923522
http://dx.doi.org/10.1093/nar/gks726