LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes

BACKGROUND: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expe...

Bibliographic Details
Main authors:	Tian, Long; Mazloom, Reza; Heath, Lenwood S.; Vinatzer, Boris A.
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2021
Subjects:	Bioinformatics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8000461/ https://www.ncbi.nlm.nih.gov/pubmed/33828908 http://dx.doi.org/10.7717/peerj.10906

author	Tian, Long; Mazloom, Reza; Heath, Lenwood S.; Vinatzer, Boris A.
collection	PubMed
description	BACKGROUND: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the level of accuracy of alignment-based methods. METHODS: Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow takes advantage of the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome and the precision of the alignment-based pyani software to precisely compute ANI between the query genome and the most similar genome identified by sourmash. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared the needed time and the generated similarity matrices with other tools. RESULTS: LINflow is up to 150 times faster than pyani and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, ANI values occasionally depart from the ANI values computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects when new genome sequences need to be regularly added to an existing dataset.
format	Online Article Text
id	pubmed-8000461
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-8000461 2021-04-06 PeerJ Bioinformatics PeerJ Inc. 2021-03-24 /pmc/articles/PMC8000461/ /pubmed/33828908 http://dx.doi.org/10.7717/peerj.10906 Text en © 2021 Tian et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8000461/ https://www.ncbi.nlm.nih.gov/pubmed/33828908 http://dx.doi.org/10.7717/peerj.10906
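
The workflow summarized in the description field (find the nearest genome with a fast method, compute precise ANI against that one genome only, store the result as a LIN, then infer the remaining pairwise values from the LINs) can be illustrated with a minimal Python sketch. This is not the LINflow implementation. The threshold list, the function names, and the rule that a shared LIN prefix maps to an ANI interval are assumptions made for illustration, and `fast_similarity` and `precise_ani` are caller-supplied stand-ins for sourmash and pyani rather than calls into either tool.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical ANI thresholds in percent, one per LIN position (illustrative only).
THRESHOLDS: List[float] = [70.0, 80.0, 90.0, 95.0, 99.0]

LIN = Tuple[int, ...]
Similarity = Callable[[str, str], float]


def shared_prefix_length(a: LIN, b: LIN) -> int:
    """Count the leading positions at which two LINs agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def assign_lin(ani: Optional[float], nearest: Optional[LIN], existing: List[LIN]) -> LIN:
    """Derive a LIN for a new genome from its ANI to the nearest stored genome."""
    if ani is None or nearest is None:
        # First genome in the dataset: all positions start at zero.
        return tuple(0 for _ in THRESHOLDS)
    # Number of leading positions inherited = number of consecutive thresholds met.
    keep = 0
    for threshold in THRESHOLDS:
        if ani < threshold:
            break
        keep += 1
    if keep == len(THRESHOLDS):
        return nearest  # every threshold met: same LIN as the neighbour
    prefix = nearest[:keep]
    # Take the next unused number at the first differing position.
    used = [lin[keep] for lin in existing if lin[:keep] == prefix]
    rest = (0,) * (len(THRESHOLDS) - keep - 1)
    return prefix + (max(used) + 1,) + rest


def add_genome(name: str, db: Dict[str, LIN],
               fast_similarity: Similarity, precise_ani: Similarity) -> LIN:
    """Add one genome: a fast search over all genomes, then one precise comparison."""
    if not db:
        db[name] = assign_lin(None, None, [])
        return db[name]
    nearest = max(db, key=lambda other: fast_similarity(name, other))
    ani = precise_ani(name, nearest)
    db[name] = assign_lin(ani, db[nearest], list(db.values()))
    return db[name]


def inferred_ani_bounds(a: LIN, b: LIN) -> Tuple[float, float]:
    """Infer an ANI interval for a pair from the length of their shared LIN prefix."""
    k = shared_prefix_length(a, b)
    lower = THRESHOLDS[k - 1] if k > 0 else 0.0
    upper = THRESHOLDS[k] if k < len(THRESHOLDS) else 100.0
    return lower, upper


if __name__ == "__main__":
    # Toy ANI table standing in for both the fast and the precise method.
    toy = {frozenset(("A", "B")): 96.0,
           frozenset(("A", "C")): 85.0,
           frozenset(("B", "C")): 85.5}
    lookup = lambda x, y: toy[frozenset((x, y))]
    lins: Dict[str, LIN] = {}
    for genome in ("A", "B", "C"):
        print(genome, add_genome(genome, lins, lookup, lookup))
    print(inferred_ani_bounds(lins["A"], lins["C"]))
```

In this sketch each added genome triggers only one precise comparison, which is where a speed-up over all-against-all alignment would come from. Pairs that were never compared directly receive an interval rather than a measured value, which is consistent with the description's note that inferred ANI values occasionally depart from those computed by pyani.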
