A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring...

Bibliographic Details
Main authors: Luczak, Brian B; James, Benjamin T; Girgis, Hani Z
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6781583/
https://www.ncbi.nlm.nih.gov/pubmed/29220512
http://dx.doi.org/10.1093/bib/bbx161
author Luczak, Brian B
James, Benjamin T
Girgis, Hani Z
collection PubMed
description MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. RESULTS: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. AVAILABILITY: The source code of the benchmarking tool is available as Supplementary Materials.
format Online
Article
Text
id pubmed-6781583
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6781583 2019-10-18 A survey and evaluations of histogram-based statistics in alignment-free sequence comparison Luczak, Brian B James, Benjamin T Girgis, Hani Z Brief Bioinform Paper Oxford University Press 2017-12-06 /pmc/articles/PMC6781583/ /pubmed/29220512 http://dx.doi.org/10.1093/bib/bbx161 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
topic Paper
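
The abstract in this record describes statistics that are calculated on two k-mer histograms representing two sequences, and multiplicative combinations of such statistics with the sequence length difference. The following is a minimal Python sketch of that general idea, not the authors' benchmarking tool. The function names, the choice of Manhattan and Euclidean distance as example statistics, the ACGT alphabet, and the "+ 1" offset in the paired statistic are all assumptions made for illustration; the exact definitions evaluated in the paper may differ.

```python
from collections import Counter
from itertools import product


def kmer_histogram(seq, k):
    """Count every overlapping k-mer of seq, in a fixed bin order over ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # A fixed bin order makes two histograms comparable position by position.
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def manhattan(h1, h2):
    """Sum of absolute differences between corresponding bins."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def euclidean(h1, h2):
    """Square root of the summed squared differences between bins."""
    return sum((a - b) ** 2 for a, b in zip(h1, h2)) ** 0.5


def length_difference(s1, s2):
    """Absolute difference between the two sequence lengths."""
    return abs(len(s1) - len(s2))


def paired_statistic(s1, s2, k=2):
    """Hypothetical multiplicative combination of two statistics.

    The '+ 1' is an assumption of this sketch: it keeps the product from
    collapsing to zero when the two sequences have equal length.
    """
    h1, h2 = kmer_histogram(s1, k), kmer_histogram(s2, k)
    return manhattan(h1, h2) * (1 + length_difference(s1, s2))


if __name__ == "__main__":
    query, candidate = "ACGTACGTAC", "ACGTTCGTACGG"
    hq, hc = kmer_histogram(query, 2), kmer_histogram(candidate, 2)
    print("Manhattan:", manhattan(hq, hc))
    print("Euclidean:", euclidean(hq, hc))
    print("Paired:", paired_statistic(query, candidate, 2))
```

Each histogram has 4^k bins, so a shorter word length k gives a smaller histogram and a faster comparison, which is the memory and speed trade-off the abstract mentions.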