A fast and efficient algorithm for DNA sequence similarity identification

DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for sma...

Bibliographic Details
Main Authors: Uddin, Machbah, Islam, Mohammad Khairul, Hassan, Md. Rakib, Jahan, Farah, Baek, Joong Hwan
Format: Online Article Text
Language: English
Published: Springer International Publishing 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9395857/
https://www.ncbi.nlm.nih.gov/pubmed/36035628
http://dx.doi.org/10.1007/s40747-022-00846-y
author Uddin, Machbah
Islam, Mohammad Khairul
Hassan, Md. Rakib
Jahan, Farah
Baek, Joong Hwan
author_facet Uddin, Machbah
Islam, Mohammad Khairul
Hassan, Md. Rakib
Jahan, Farah
Baek, Joong Hwan
author_sort Uddin, Machbah
collection PubMed
description DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and weaker performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D [Formula: see text] count matrix inspired by the CGR approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for [Formula: see text]. We develop an efficient system for finding the positions of [Formula: see text] in the count matrix. We apply our system to six different datasets. We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16S ribosomal, 18 Eutherian), and improved time complexity and memory consumption relative to existing studies on the HEV and HIV-1 datasets. The comparative results on the benchmark datasets and against existing studies therefore demonstrate that our method is highly effective, efficient, and accurate, and that it can be used with a high level of confidence for DNA sequence similarity measurement.
format Online
Article
Text
id pubmed-9395857
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-9395857 2022-08-23 A fast and efficient algorithm for DNA sequence similarity identification Uddin, Machbah Islam, Mohammad Khairul Hassan, Md. Rakib Jahan, Farah Baek, Joong Hwan Complex Intell Systems Original Article DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and weaker performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D [Formula: see text] count matrix inspired by the CGR approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for [Formula: see text]. We develop an efficient system for finding the positions of [Formula: see text] in the count matrix. We apply our system to six different datasets. We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16S ribosomal, 18 Eutherian), and improved time complexity and memory consumption relative to existing studies on the HEV and HIV-1 datasets. The comparative results on the benchmark datasets and against existing studies therefore demonstrate that our method is highly effective, efficient, and accurate, and that it can be used with a high level of confidence for DNA sequence similarity measurement.
Springer International Publishing 2022-08-23 2023 /pmc/articles/PMC9395857/ /pubmed/36035628 http://dx.doi.org/10.1007/s40747-022-00846-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle Original Article
Uddin, Machbah
Islam, Mohammad Khairul
Hassan, Md. Rakib
Jahan, Farah
Baek, Joong Hwan
A fast and efficient algorithm for DNA sequence similarity identification
title A fast and efficient algorithm for DNA sequence similarity identification
title_full A fast and efficient algorithm for DNA sequence similarity identification
title_fullStr A fast and efficient algorithm for DNA sequence similarity identification
title_full_unstemmed A fast and efficient algorithm for DNA sequence similarity identification
title_short A fast and efficient algorithm for DNA sequence similarity identification
title_sort fast and efficient algorithm for dna sequence similarity identification
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9395857/
https://www.ncbi.nlm.nih.gov/pubmed/36035628
http://dx.doi.org/10.1007/s40747-022-00846-y
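
The description in this record outlines an alignment-free pipeline: build a 2D count matrix inspired by CGR, then compare sequences by pairwise distance. The following is a minimal Python sketch of that general idea, not the authors' implementation. It assumes the elided "[Formula: see text]" refers to k-mers; the corner assignment in `CORNER`, the fixed `k`, the Euclidean distance, and all function names are hypothetical choices made for illustration. The neighbor-based matrix shrinking and the dynamic choice of k mentioned in the description are not reproduced here.

```python
from itertools import combinations
from math import sqrt

# Hypothetical corner assignment (row bit, column bit) for each nucleotide.
# The record does not state which layout the authors use.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def kmer_position(kmer):
    """Map a k-mer to a (row, col) cell of a 2^k x 2^k grid.

    Each base contributes one bit to the row and one to the column,
    so the cell is computed directly without searching the matrix.
    """
    row = col = 0
    for base in kmer:
        r, c = CORNER[base]
        row = row * 2 + r
        col = col * 2 + c
    return row, col


def count_matrix(seq, k):
    """Count every k-mer of seq into a 2^k x 2^k matrix."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous symbols such as N.
        if all(base in CORNER for base in kmer):
            row, col = kmer_position(kmer)
            matrix[row][col] += 1
    return matrix


def euclidean_distance(m1, m2):
    """Euclidean distance between two count matrices after normalizing
    each to relative frequencies (an assumed choice of distance)."""
    t1 = sum(map(sum, m1)) or 1
    t2 = sum(map(sum, m2)) or 1
    return sqrt(sum((a / t1 - b / t2) ** 2
                    for r1, r2 in zip(m1, m2)
                    for a, b in zip(r1, r2)))


def pairwise_distances(seqs, k):
    """Return {(name_a, name_b): distance} for every pair of sequences."""
    mats = {name: count_matrix(s, k) for name, s in seqs.items()}
    return {(a, b): euclidean_distance(mats[a], mats[b])
            for a, b in combinations(mats, 2)}
```

In a fuller pipeline, the distance table returned by `pairwise_distances` would be handed to a tree-building method to obtain the phylogenetic tree the description refers to; which distance and tree methods work best is something the record attributes to the article itself.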