A fast and efficient algorithm for DNA sequence similarity identification
DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for sma...
Main authors: | Uddin, Machbah; Islam, Mohammad Khairul; Hassan, Md. Rakib; Jahan, Farah; Baek, Joong Hwan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9395857/ https://www.ncbi.nlm.nih.gov/pubmed/36035628 http://dx.doi.org/10.1007/s40747-022-00846-y |
_version_ | 1784771795020152832 |
---|---|
author | Uddin, Machbah; Islam, Mohammad Khairul; Hassan, Md. Rakib; Jahan, Farah; Baek, Joong Hwan |
author_facet | Uddin, Machbah; Islam, Mohammad Khairul; Hassan, Md. Rakib; Jahan, Farah; Baek, Joong Hwan |
author_sort | Uddin, Machbah |
collection | PubMed |
description | DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, low precision, and poor performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D k-mer count matrix inspired by the chaos game representation (CGR) approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for the k-mers, and we develop an efficient system for finding the positions of k-mers in the count matrix. We apply our system to six different datasets. We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16S Ribosomal, 18 Eutherian), and a milestone for time complexity and memory consumption in comparison to the existing study datasets (HEV, HIV-1). The comparative results on the benchmark datasets and against existing studies therefore demonstrate that our method is highly effective, efficient, and accurate. Thus, our method can be used with a high level of confidence for DNA sequence similarity measurement. |
format | Online Article Text |
id | pubmed-9395857 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-9395857 2022-08-23 A fast and efficient algorithm for DNA sequence similarity identification. Uddin, Machbah; Islam, Mohammad Khairul; Hassan, Md. Rakib; Jahan, Farah; Baek, Joong Hwan. Complex Intell Systems. Original Article. Springer International Publishing 2022-08-23 2023 /pmc/articles/PMC9395857/ /pubmed/36035628 http://dx.doi.org/10.1007/s40747-022-00846-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. |
title | A fast and efficient algorithm for DNA sequence similarity identification |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9395857/ https://www.ncbi.nlm.nih.gov/pubmed/36035628 http://dx.doi.org/10.1007/s40747-022-00846-y |
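
The description in this record mentions a 2D k-mer count matrix inspired by CGR, a way of locating each k-mer in that matrix, and a pairwise distance between sequences. The following is a minimal Python sketch of that general idea only, not the authors' implementation. The corner assignment in `CORNER`, the function names, the choice of the first base as the coarsest quadrant, and the use of Euclidean distance on normalized counts are all assumptions made for illustration; the matrix-shrinking step and the dynamic choice of k described in the abstract are not reproduced here.

```python
import math

# Assumed quadrant assignment for each base (one common CGR-style layout);
# the layout used in the paper may differ.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_position(kmer):
    """Return the (row, col) cell of a k-mer in a 2^k x 2^k grid.

    Each base picks one of four quadrants, recursively, so every k-mer
    maps to a distinct cell. Here the first base selects the coarsest
    quadrant (an assumption; CGR conventions vary).
    """
    row = col = 0
    for base in kmer:
        x, y = CORNER[base]
        row = (row << 1) | y
        col = (col << 1) | x
    return row, col


def count_matrix(seq, k):
    """Count every overlapping k-mer of seq into a 2^k x 2^k matrix."""
    size = 1 << k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER for base in kmer):
            continue  # skip windows containing ambiguous bases such as N
        r, c = kmer_position(kmer)
        matrix[r][c] += 1
    return matrix


def euclidean_distance(m1, m2):
    """Euclidean distance between two count matrices after normalizing
    each to relative frequencies (an illustrative choice of distance)."""
    t1 = sum(map(sum, m1)) or 1
    t2 = sum(map(sum, m2)) or 1
    return math.sqrt(sum((a / t1 - b / t2) ** 2
                         for r1, r2 in zip(m1, m2)
                         for a, b in zip(r1, r2)))
```

For example, `euclidean_distance(count_matrix(s1, 3), count_matrix(s2, 3))` gives one entry of a pairwise distance matrix for two sequences `s1` and `s2`; such a matrix could then be passed to a tree-building method of the reader's choice. Normalizing by the total count keeps sequences of different lengths comparable.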