Optimal choice of word length when comparing two Markov sequences using a χ²-statistic
BACKGROUND: Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or...
Main Authors: | Bai, Xin; Tang, Kujin; Ren, Jie; Waterman, Michael; Sun, Fengzhu |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | Research |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5629589/ https://www.ncbi.nlm.nih.gov/pubmed/28984181 http://dx.doi.org/10.1186/s12864-017-4020-z |
_version_ | 1783269073910497280 |
---|---|
author | Bai, Xin Tang, Kujin Ren, Jie Waterman, Michael Sun, Fengzhu |
author_facet | Bai, Xin Tang, Kujin Ren, Jie Waterman, Michael Sun, Fengzhu |
author_sort | Bai, Xin |
collection | PubMed |
description | BACKGROUND: Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ²-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies. RESULTS: We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r₁ and r₂, respectively. We show through both simulations and theoretical studies that the optimal k = max(r₁, r₂) + 1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains. CONCLUSION: Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-4020-z) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5629589 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5629589 2017-10-13 Optimal choice of word length when comparing two Markov sequences using a χ²-statistic Bai, Xin Tang, Kujin Ren, Jie Waterman, Michael Sun, Fengzhu BMC Genomics Research BACKGROUND: Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ²-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies. RESULTS: We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r₁ and r₂, respectively. We show through both simulations and theoretical studies that the optimal k = max(r₁, r₂) + 1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains. CONCLUSION: Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-4020-z) contains supplementary material, which is available to authorized users.
BioMed Central 2017-10-03 /pmc/articles/PMC5629589/ /pubmed/28984181 http://dx.doi.org/10.1186/s12864-017-4020-z Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Bai, Xin Tang, Kujin Ren, Jie Waterman, Michael Sun, Fengzhu Optimal choice of word length when comparing two Markov sequences using a χ²-statistic |
title | Optimal choice of word length when comparing two Markov sequences using a χ²-statistic |
title_full | Optimal choice of word length when comparing two Markov sequences using a χ²-statistic |
title_fullStr | Optimal choice of word length when comparing two Markov sequences using a χ²-statistic |
title_full_unstemmed | Optimal choice of word length when comparing two Markov sequences using a χ²-statistic |
title_short | Optimal choice of word length when comparing two Markov sequences using a χ²-statistic |
title_sort | optimal choice of word length when comparing two markov sequences using a χ(2)-statistic |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5629589/ https://www.ncbi.nlm.nih.gov/pubmed/28984181 http://dx.doi.org/10.1186/s12864-017-4020-z |
work_keys_str_mv | AT baixin optimalchoiceofwordlengthwhencomparingtwomarkovsequencesusingach2statistic AT tangkujin optimalchoiceofwordlengthwhencomparingtwomarkovsequencesusingach2statistic AT renjie optimalchoiceofwordlengthwhencomparingtwomarkovsequencesusingach2statistic AT watermanmichael optimalchoiceofwordlengthwhencomparingtwomarkovsequencesusingach2statistic AT sunfengzhu optimalchoiceofwordlengthwhencomparingtwomarkovsequencesusingach2statistic |
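
The word-length rule stated in the description, k = max(r₁, r₂) + 1, can be illustrated with a short Python sketch. Only that rule comes from the record. The function names (`optimal_word_length`, `count_words`, `chi2_homogeneity`) are hypothetical, and the statistic shown is a plain Pearson χ² for homogeneity of k-word counts, swapped in for illustration; it is not the likelihood-ratio-based statistic studied in the article. Overlapping words are also not independent, so the value should not be read against a standard χ² table.

```python
from collections import Counter


def optimal_word_length(r1: int, r2: int) -> int:
    """Word length from the record's description: k = max(r1, r2) + 1."""
    return max(r1, r2) + 1


def count_words(seq: str, k: int) -> Counter:
    """Count overlapping words (k-tuples) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def chi2_homogeneity(seq_a: str, seq_b: str, k: int) -> float:
    """Pearson chi-square comparing k-word counts of two sequences.

    Illustrative assumption only: this is NOT the statistic from the
    article, just a simple homogeneity test on word counts.
    """
    counts_a, counts_b = count_words(seq_a, k), count_words(seq_b, k)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("both sequences must contain at least one k-word")
    total = n_a + n_b
    stat = 0.0
    for word in set(counts_a) | set(counts_b):
        pooled = counts_a[word] + counts_b[word]
        expected_a = pooled * n_a / total  # expected count under homogeneity
        expected_b = pooled * n_b / total
        stat += (counts_a[word] - expected_a) ** 2 / expected_a
        stat += (counts_b[word] - expected_b) ** 2 / expected_b
    return stat


if __name__ == "__main__":
    # Hypothetical Markov orders for the two sequences.
    k = optimal_word_length(1, 2)
    print(k, chi2_homogeneity("ACGTACGTAGCTAGCT", "AAACCCGGGTTTACGT", k))
```

When the Markov orders are unknown, the description notes that they have to be estimated first; the estimates would then take the place of `r1` and `r2` here.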