A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring...
Main authors: | Luczak, Brian B; James, Benjamin T; Girgis, Hani Z |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2017 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6781583/ https://www.ncbi.nlm.nih.gov/pubmed/29220512 http://dx.doi.org/10.1093/bib/bbx161 |
_version_ | 1783457399518003200 |
---|---|
author | Luczak, Brian B James, Benjamin T Girgis, Hani Z |
author_facet | Luczak, Brian B James, Benjamin T Girgis, Hani Z |
author_sort | Luczak, Brian B |
collection | PubMed |
description | MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. RESULTS: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. AVAILABILITY: The source code of the benchmarking tool is available as Supplementary Materials. |
format | Online Article Text |
id | pubmed-6781583 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-6781583 2019-10-18 A survey and evaluations of histogram-based statistics in alignment-free sequence comparison Luczak, Brian B James, Benjamin T Girgis, Hani Z Brief Bioinform Paper MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. RESULTS: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. AVAILABILITY: The source code of the benchmarking tool is available as Supplementary Materials. Oxford University Press 2017-12-06 /pmc/articles/PMC6781583/ /pubmed/29220512 http://dx.doi.org/10.1093/bib/bbx161 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Paper Luczak, Brian B James, Benjamin T Girgis, Hani Z A survey and evaluations of histogram-based statistics in alignment-free sequence comparison |
title | A survey and evaluations of histogram-based statistics in alignment-free sequence comparison |
title_sort | survey and evaluations of histogram-based statistics in alignment-free sequence comparison |
topic | Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6781583/ https://www.ncbi.nlm.nih.gov/pubmed/29220512 http://dx.doi.org/10.1093/bib/bbx161 |
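
The description above says the statistics are calculated on two k-mer histograms and that multiplicative combinations with the length difference or Earth Mover's distance correlate well with identity scores. The following is a minimal Python sketch of that idea, not the authors' benchmarking tool. All function names are hypothetical, the DNA alphabet `ACGT` is assumed, and the Earth Mover's distance is assumed to be the one-dimensional form over lexicographically ordered bins (the sum of absolute cumulative differences); the paper may define it differently.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: DNA sequences


def kmer_histogram(seq, k):
    """Count every overlapping k-mer of seq; bins are in lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    hist = [0] * len(index)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in index:  # skip words containing other symbols, e.g. N
            hist[index[word]] += 1
    return hist


def manhattan(h1, h2):
    """Sum of absolute bin-wise differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def euclidean(h1, h2):
    """Square root of the sum of squared bin-wise differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


def earth_movers(h1, h2):
    """Assumed 1-D form: sum of absolute cumulative differences over the bins."""
    total, carried = 0, 0
    for a, b in zip(h1, h2):
        carried += a - b
        total += abs(carried)
    return total


def length_difference(s1, s2):
    """Absolute difference between the two sequence lengths."""
    return abs(len(s1) - len(s2))


def paired(stat_a, stat_b):
    """Multiplicative combination of two already computed statistics."""
    return stat_a * stat_b


if __name__ == "__main__":
    s1, s2 = "ACGTACGTAC", "ACGTTCGTA"
    h1, h2 = kmer_histogram(s1, 3), kmer_histogram(s2, 3)
    print("Manhattan:", manhattan(h1, h2))
    print("Euclidean:", euclidean(h1, h2))
    print("EMD (assumed form):", earth_movers(h1, h2))
    print("Manhattan x length difference:",
          paired(manhattan(h1, h2), length_difference(s1, s2)))
```

Each histogram has 4^k bins, so a shorter word length shrinks memory and per-comparison time exponentially, which is in line with the record's remark that shorter words reduce the memory requirement and increase speed.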