
A fast and efficient algorithm for DNA sequence similarity identification

DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for sma...


Bibliographic Details
Main authors:	Uddin, Machbah; Islam, Mohammad Khairul; Hassan, Md. Rakib; Jahan, Farah; Baek, Joong Hwan
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2022
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9395857/ https://www.ncbi.nlm.nih.gov/pubmed/36035628 http://dx.doi.org/10.1007/s40747-022-00846-y

_version_	1784771795020152832
author	Uddin, Machbah; Islam, Mohammad Khairul; Hassan, Md. Rakib; Jahan, Farah; Baek, Joong Hwan
author_facet	Uddin, Machbah; Islam, Mohammad Khairul; Hassan, Md. Rakib; Jahan, Farah; Baek, Joong Hwan
author_sort	Uddin, Machbah
collection	PubMed
description	DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and weaker performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D k-mer count matrix inspired by the CGR approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for the k-mers, and we develop an efficient system for finding the positions of k-mers in the count matrix. We apply our system to six different datasets. We achieve the top rank for two benchmark datasets from AFproject and 100% accuracy for two datasets (16S ribosomal, 18 Eutherian), and we improve on time complexity and memory consumption relative to existing studies on two further datasets (HEV, HIV-1). The comparative results on the benchmark datasets and against existing studies therefore demonstrate that our method is highly effective, efficient, and accurate, and that it can be used with a high level of confidence for DNA sequence similarity measurement.
format	Online Article Text
id	pubmed-9395857
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-9395857 2022-08-23 A fast and efficient algorithm for DNA sequence similarity identification Uddin, Machbah; Islam, Mohammad Khairul; Hassan, Md. Rakib; Jahan, Farah; Baek, Joong Hwan Complex Intell Systems Original Article
Springer International Publishing 2022-08-23 2023 /pmc/articles/PMC9395857/ /pubmed/36035628 http://dx.doi.org/10.1007/s40747-022-00846-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	A fast and efficient algorithm for DNA sequence similarity identification
title_sort	fast and efficient algorithm for dna sequence similarity identification
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9395857/ https://www.ncbi.nlm.nih.gov/pubmed/36035628 http://dx.doi.org/10.1007/s40747-022-00846-y
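
The description above mentions a 2D count matrix inspired by the CGR (chaos game representation) approach and a way of finding each entry's position in that matrix. The following is only a minimal Python sketch of that general idea, not the authors' implementation. It assumes that the elided "[Formula: see text]" refers to k-mers, that each base maps to a hypothetical quadrant (the `CORNER` table is an illustrative choice), and that a plain Euclidean distance on normalized counts stands in for the pairwise distance step. The matrix shrinking and the dynamic choice of k are not shown.

```python
import math

# Hypothetical quadrant assignment (row bit, column bit) for each base.
# The record does not state the layout the paper uses.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_position(kmer):
    """Return the (row, col) cell of a k-mer in a 2^k x 2^k matrix.

    Each base contributes one bit to the row index and one bit to the
    column index, so the position is found in O(k) without searching.
    """
    row = col = 0
    for base in kmer:
        r, c = CORNER[base]
        row = row * 2 + r
        col = col * 2 + c
    return row, col


def count_matrix(seq, k):
    """Count every k-mer of seq into a 2^k x 2^k matrix (list of lists)."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous symbols such as N.
        if all(base in CORNER for base in kmer):
            row, col = kmer_position(kmer)
            matrix[row][col] += 1
    return matrix


def euclidean_distance(m1, m2):
    """Euclidean distance between two count matrices after normalizing
    each to relative frequencies (an assumed stand-in distance)."""
    t1 = sum(map(sum, m1)) or 1
    t2 = sum(map(sum, m2)) or 1
    return math.sqrt(sum(
        (a / t1 - b / t2) ** 2
        for row1, row2 in zip(m1, m2)
        for a, b in zip(row1, row2)
    ))
```

Under these assumptions, `count_matrix(seq, k)` would be computed once per sequence, and `euclidean_distance` applied to every pair to fill a distance matrix that a tree-building method could then consume.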
