SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning
Multiple sequence alignment (MSA) is an integral part of molecular biology, but handling a massive number of large sequences is still a bottleneck for most state-of-the-art software tools. Knowledge-driven algorithms utilizing features of input sequences, such as high similarity in case of DNA...
Main Authors: | Vineetha, V., Biji, C. L., Nair, Achuthsankar S. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6488671/ https://www.ncbi.nlm.nih.gov/pubmed/31036850 http://dx.doi.org/10.1038/s41598-019-42966-5 |
_version_ | 1783414687026642944 |
---|---|
author | Vineetha, V. Biji, C. L. Nair, Achuthsankar S. |
author_facet | Vineetha, V. Biji, C. L. Nair, Achuthsankar S. |
author_sort | Vineetha, V. |
collection | PubMed |
description | Multiple sequence alignment (MSA) is an integral part of molecular biology, but handling a massive number of large sequences is still a bottleneck for most state-of-the-art software tools. Knowledge-driven algorithms that exploit features of the input sequences, such as the high similarity typical of DNA sequences, can improve the efficiency of DNA MSA and so assist in phylogenetic tree construction, comparative genomics, etc. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm uses a suffix tree to identify common substrings and a modified Needleman-Wunsch algorithm for pairwise alignments. To improve the efficiency of pairwise alignments, a knowledge base is created and supervised learning with a nearest-neighbor algorithm is used to guide the alignment. The algorithm provided linear complexity, O(m), compared to O(m²). Compared with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided a 50% improvement in memory utilization when processing human mitochondrial genomes (mt. genomes, 100x, 1.1 GB), with better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on the big data framework Apache Spark to improve scalability. The source code and test data are available at: https://sourceforge.net/projects/spark-msna/. |
format | Online Article Text |
id | pubmed-6488671 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-6488671 2019-05-16 SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning Vineetha, V. Biji, C. L. Nair, Achuthsankar S. Sci Rep Article Multiple sequence alignment (MSA) is an integral part of molecular biology, but handling a massive number of large sequences is still a bottleneck for most state-of-the-art software tools. Knowledge-driven algorithms that exploit features of the input sequences, such as the high similarity typical of DNA sequences, can improve the efficiency of DNA MSA and so assist in phylogenetic tree construction, comparative genomics, etc. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm uses a suffix tree to identify common substrings and a modified Needleman-Wunsch algorithm for pairwise alignments. To improve the efficiency of pairwise alignments, a knowledge base is created and supervised learning with a nearest-neighbor algorithm is used to guide the alignment. The algorithm provided linear complexity, O(m), compared to O(m²). Compared with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided a 50% improvement in memory utilization when processing human mitochondrial genomes (mt. genomes, 100x, 1.1 GB), with better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on the big data framework Apache Spark to improve scalability. The source code and test data are available at: https://sourceforge.net/projects/spark-msna/. 
Nature Publishing Group UK 2019-04-29 /pmc/articles/PMC6488671/ /pubmed/31036850 http://dx.doi.org/10.1038/s41598-019-42966-5 Text en © The Author(s) 2019 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Vineetha, V. Biji, C. L. Nair, Achuthsankar S. SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning |
title | SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning |
title_full | SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning |
title_fullStr | SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning |
title_full_unstemmed | SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning |
title_short | SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning |
title_sort | spark-msna: efficient algorithm on apache spark for aligning multiple similar dna/rna sequences with supervised learning |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6488671/ https://www.ncbi.nlm.nih.gov/pubmed/31036850 http://dx.doi.org/10.1038/s41598-019-42966-5 |
work_keys_str_mv | AT vineethav sparkmsnaefficientalgorithmonapachesparkforaligningmultiplesimilardnarnasequenceswithsupervisedlearning AT bijicl sparkmsnaefficientalgorithmonapachesparkforaligningmultiplesimilardnarnasequenceswithsupervisedlearning AT nairachuthsankars sparkmsnaefficientalgorithmonapachesparkforaligningmultiplesimilardnarnasequenceswithsupervisedlearning |
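
The description above contrasts linear complexity with the O(m²) cost of pairwise alignment. For orientation, here is a minimal sketch of the textbook Needleman-Wunsch global alignment, whose dynamic-programming table is what produces that quadratic cost. This is the standard algorithm, not the modified version the record describes, and it does not include the suffix tree, the knowledge base, or the nearest-neighbor step. The function name `needleman_wunsch` and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are assumptions chosen for illustration only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Textbook global alignment. Returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not taken from the article.
    """
    m, n = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    # Filling this (m+1) x (n+1) table is the quadratic step.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("ACGT", "AGT")
    print(s)
    print(x)
    print(y)
```

Because both time and memory grow with the product of the two sequence lengths, this baseline becomes expensive for genome-scale inputs, which is the motivation the description gives for exploiting similarity between sequences.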