
A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring...

Bibliographic Details
Main authors:	Luczak, Brian B, James, Benjamin T, Girgis, Hani Z
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2017
Subjects:	Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6781583/ https://www.ncbi.nlm.nih.gov/pubmed/29220512 http://dx.doi.org/10.1093/bib/bbx161

_version_	1783457399518003200
author	Luczak, Brian B James, Benjamin T Girgis, Hani Z
author_facet	Luczak, Brian B James, Benjamin T Girgis, Hani Z
author_sort	Luczak, Brian B
collection	PubMed
description	MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. RESULTS: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. AVAILABILITY: The source code of the benchmarking tool is available as Supplementary Materials.
format	Online Article Text
id	pubmed-6781583
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6781583 2019-10-18 A survey and evaluations of histogram-based statistics in alignment-free sequence comparison Luczak, Brian B James, Benjamin T Girgis, Hani Z Brief Bioinform Paper Oxford University Press 2017-12-06 /pmc/articles/PMC6781583/ /pubmed/29220512 http://dx.doi.org/10.1093/bib/bbx161 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Paper Luczak, Brian B James, Benjamin T Girgis, Hani Z A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
title	A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
title_full	A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
title_fullStr	A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
title_full_unstemmed	A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
title_short	A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
title_sort	survey and evaluations of histogram-based statistics in alignment-free sequence comparison
topic	Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6781583/ https://www.ncbi.nlm.nih.gov/pubmed/29220512 http://dx.doi.org/10.1093/bib/bbx161
work_keys_str_mv	AT luczakbrianb asurveyandevaluationsofhistogrambasedstatisticsinalignmentfreesequencecomparison AT jamesbenjamint asurveyandevaluationsofhistogrambasedstatisticsinalignmentfreesequencecomparison AT girgishaniz asurveyandevaluationsofhistogrambasedstatisticsinalignmentfreesequencecomparison AT luczakbrianb surveyandevaluationsofhistogrambasedstatisticsinalignmentfreesequencecomparison AT jamesbenjamint surveyandevaluationsofhistogrambasedstatisticsinalignmentfreesequencecomparison AT girgishaniz surveyandevaluationsofhistogrambasedstatisticsinalignmentfreesequencecomparison
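
The description field in this record says that the statistics are computed on two k-mer histograms, one per sequence, and that multiplicative combinations involving the sequence length difference perform well. The following is a minimal Python sketch of that idea, not the authors' benchmarking tool. The function names, the choice of Manhattan distance as the single statistic, and the `1 + length difference` factor (used so that equal-length sequences do not force the product to zero) are all illustrative assumptions that do not come from the record.

```python
from collections import Counter


def kmer_histogram(seq, k):
    """Count every overlapping k-mer (word of length k) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan(h1, h2):
    """Sum of absolute count differences over all k-mers seen in either histogram."""
    # Counter returns 0 for a k-mer that is missing from one histogram.
    return sum(abs(h1[w] - h2[w]) for w in set(h1) | set(h2))


def length_difference(a, b):
    """Absolute difference between the two sequence lengths."""
    return abs(len(a) - len(b))


def paired_statistic(a, b, k=3):
    """Hypothetical multiplicative combination of two single statistics.

    Assumption: the factor (1 + length difference) is an illustrative
    choice, not a formula taken from the paper.
    """
    h1, h2 = kmer_histogram(a, k), kmer_histogram(b, k)
    return manhattan(h1, h2) * (1 + length_difference(a, b))


if __name__ == "__main__":
    query = "ACGTACGTAC"
    candidates = ["ACGTACGTAC", "ACGTACGTTC", "TTTTGGGGCCCCAAAA"]
    # Smaller values mean more similar; identical sequences score 0.
    for c in candidates:
        print(c, paired_statistic(query, c, k=3))
```

Building each histogram takes time linear in the sequence length, which is why such statistics can serve as a fast filter before running a quadratic alignment on the surviving candidates.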
