Interpreting alignment-free sequence comparison: what makes a score a good score?

Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs fro...

Bibliographic Details
Main Authors:	Swain, Martin T, Vickers, Martin
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Standard Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9442500/ https://www.ncbi.nlm.nih.gov/pubmed/36071721 http://dx.doi.org/10.1093/nargab/lqac062

author	Swain, Martin T Vickers, Martin
collection	PubMed
description	Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
format	Online Article Text
id	pubmed-9442500
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9442500 2022-09-06 Interpreting alignment-free sequence comparison: what makes a score a good score? Swain, Martin T Vickers, Martin NAR Genom Bioinform Standard Article Oxford University Press 2022-09-05 /pmc/articles/PMC9442500/ /pubmed/36071721 http://dx.doi.org/10.1093/nargab/lqac062 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Interpreting alignment-free sequence comparison: what makes a score a good score?
topic	Standard Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9442500/ https://www.ncbi.nlm.nih.gov/pubmed/36071721 http://dx.doi.org/10.1093/nargab/lqac062
