A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data
The exponential growth of genome sequences available has spurred research on pattern detection with the aim of extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology in order to reconstruct the phylogenetic history of taxa. Yet, minin...
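Main Authors: | Pechlivanis, Nikolaos; Togkousidis, Anastasios; Tsagiopoulou, Maria; Sgardelis, Stefanos; Kappas, Ilias; Psomopoulos, Fotis |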
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2021 |
Subjects: | Genetics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8194296/ https://www.ncbi.nlm.nih.gov/pubmed/34122498 http://dx.doi.org/10.3389/fgene.2021.618170 |
_version_ | 1783706390627352576 |
---|---|
author | Pechlivanis, Nikolaos; Togkousidis, Anastasios; Tsagiopoulou, Maria; Sgardelis, Stefanos; Kappas, Ilias; Psomopoulos, Fotis |
author_sort | Pechlivanis, Nikolaos |
collection | PubMed |
description | The exponential growth in the number of available genome sequences has spurred research on pattern detection with the aim of extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology to reconstruct the phylogenetic history of taxa. Yet mining information from the plethora of biological data and delineating species on a genetic basis still proves to be an extremely difficult problem. Multiple algorithms and techniques have been developed to approach the problem multidimensionally. Here, we propose a computational framework for identifying potentially meaningful features based on k-mers retrieved from unaligned sequence data. Specifically, we have developed a process that uses unsupervised learning techniques to identify characteristic k-mers of the input dataset across a range of k-values and within a reasonable time frame. We use these k-mers as features for clustering the input sequences and for identifying differences between the distributions of k-mers across the dataset. The developed algorithm is part of an innovative and highly promising approach both to the problem of grouping sequence data based on their inherent characteristic features and to the study of changes in the distributions of k-mers as the k-value varies within a range of values. Our framework is developed entirely in Python as open-source software licensed under the MIT License and is freely available at https://github.com/BiodataAnalysisGroup/kmerAnalyzer. |
format | Online Article Text |
id | pubmed-8194296 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-8194296 2021-06-12 A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data Pechlivanis, Nikolaos; Togkousidis, Anastasios; Tsagiopoulou, Maria; Sgardelis, Stefanos; Kappas, Ilias; Psomopoulos, Fotis Front Genet Genetics Frontiers Media S.A. 2021-05-28 /pmc/articles/PMC8194296/ /pubmed/34122498 http://dx.doi.org/10.3389/fgene.2021.618170 Text en Copyright © 2021 Pechlivanis, Togkousidis, Tsagiopoulou, Sgardelis, Kappas and Psomopoulos. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Genetics Pechlivanis, Nikolaos Togkousidis, Anastasios Tsagiopoulou, Maria Sgardelis, Stefanos Kappas, Ilias Psomopoulos, Fotis A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data |
title | A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data |
title_sort | computational framework for pattern detection on unaligned sequences: an application on sars-cov-2 data |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8194296/ https://www.ncbi.nlm.nih.gov/pubmed/34122498 http://dx.doi.org/10.3389/fgene.2021.618170 |
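
The description above outlines extracting k-mers from unaligned sequences across a range of k-values and using them as features for comparing sequences. The following is a minimal sketch of that general idea, not the kmerAnalyzer implementation: the function names (`count_kmers`, `kmer_profile`, `euclidean`), the toy sequences, the choice to skip windows containing non-ACGT symbols, and the use of Euclidean distance are all assumptions made for illustration. It uses only the Python standard library.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def count_kmers(sequence, k):
    """Count overlapping k-mers, skipping windows with non-ACGT symbols."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def kmer_profile(sequence, k):
    """Relative k-mer frequencies as a vector over all 4**k possible k-mers."""
    counts = count_kmers(sequence, k)
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy sequences, not real SARS-CoV-2 data.
sequences = {
    "seq_a": "ACGTACGTACGT",
    "seq_b": "ACGTACGTACGA",
    "seq_c": "TTTTGGGGCCCC",
}

# Compare profiles across a small range of k-values.
for k in range(2, 4):
    profiles = {name: kmer_profile(seq, k) for name, seq in sequences.items()}
    d_ab = euclidean(profiles["seq_a"], profiles["seq_b"])
    d_ac = euclidean(profiles["seq_a"], profiles["seq_c"])
    print(f"k={k}: d(a,b)={d_ab:.3f}, d(a,c)={d_ac:.3f}")
```

The resulting frequency vectors could be passed to any unsupervised clustering method. Note that the vector length grows as 4**k, so a dense representation like this one is only practical for small k.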