Generator based approach to analyze mutations in genomic datasets
In contrast to the conventional approach of directly comparing genomic sequences using sequence alignment tools, we propose a computational approach that performs comparisons between sequence generators. These sequence generators are learned via a data-driven approach that empirically computes the s...
Main authors: | Jain, Siddharth; Xiao, Xiongye; Bogdan, Paul; Bruck, Jehoshua |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8548350/ https://www.ncbi.nlm.nih.gov/pubmed/34702945 http://dx.doi.org/10.1038/s41598-021-00609-8 |
_version_ | 1784590553842712576 |
---|---|
author | Jain, Siddharth Xiao, Xiongye Bogdan, Paul Bruck, Jehoshua |
author_sort | Jain, Siddharth |
collection | PubMed |
description | In contrast to the conventional approach of directly comparing genomic sequences using sequence alignment tools, we propose a computational approach that performs comparisons between sequence generators. These sequence generators are learned via a data-driven approach that empirically computes the state machine generating the genomic sequence of interest. As the state-machine-based generator of the sequence is independent of the sequence length, it provides us with an efficient method to compute the statistical distance between large sets of genomic sequences. Moreover, our technique provides a fast and efficient method to cluster large datasets of genomic sequences, characterize their temporal and spatial evolution in a continuous manner, and gain insight into locality-sensitive information about the sequences, all without any need for alignment. Furthermore, we show that the technique can be used to detect local regions with mutation activity, which can then be applied to aid alignment techniques for the fast discovery of mutations. To demonstrate the efficacy of our technique on real genomic data, we cluster different strains of SARS-CoV-2 viral sequences, characterize their evolution, and identify regions of the viral sequence with mutations. |
format | Online Article Text |
id | pubmed-8548350 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8548350 2021-10-27 Generator based approach to analyze mutations in genomic datasets Jain, Siddharth Xiao, Xiongye Bogdan, Paul Bruck, Jehoshua Sci Rep Article Nature Publishing Group UK 2021-10-26 /pmc/articles/PMC8548350/ /pubmed/34702945 http://dx.doi.org/10.1038/s41598-021-00609-8 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. |
title | Generator based approach to analyze mutations in genomic datasets |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8548350/ https://www.ncbi.nlm.nih.gov/pubmed/34702945 http://dx.doi.org/10.1038/s41598-021-00609-8 |