A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data

The exponential growth in available genome sequences has spurred research on pattern detection aimed at extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology to reconstruct the phylogenetic history of taxa. Yet, minin...

Bibliographic Details
Main authors: Pechlivanis, Nikolaos; Togkousidis, Anastasios; Tsagiopoulou, Maria; Sgardelis, Stefanos; Kappas, Ilias; Psomopoulos, Fotis
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8194296/
https://www.ncbi.nlm.nih.gov/pubmed/34122498
http://dx.doi.org/10.3389/fgene.2021.618170
author Pechlivanis, Nikolaos
Togkousidis, Anastasios
Tsagiopoulou, Maria
Sgardelis, Stefanos
Kappas, Ilias
Psomopoulos, Fotis
author_sort Pechlivanis, Nikolaos
collection PubMed
description The exponential growth in available genome sequences has spurred research on pattern detection aimed at extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology to reconstruct the phylogenetic history of taxa. Yet mining information from the plethora of biological data and delineating species on a genetic basis remains an extremely difficult problem. Multiple algorithms and techniques have been developed to approach the problem multidimensionally. Here, we propose a computational framework for identifying potentially meaningful features based on k-mers retrieved from unaligned sequence data. Specifically, we have developed a process that uses unsupervised learning techniques to identify characteristic k-mers of the input dataset across a range of k-values and within a reasonable time frame. We use these k-mers as features for clustering the input sequences and for identifying differences between the distributions of k-mers across the dataset. The developed algorithm is part of an innovative and promising approach both to grouping sequence data based on their inherent characteristic features and to studying how the distributions of k-mers change as the k-value varies within a range. Our framework is developed entirely in Python as open-source software under the MIT License and is freely available at https://github.com/BiodataAnalysisGroup/kmerAnalyzer.
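The k-mer feature extraction that the description outlines can be illustrated with a minimal sketch. This is not the kmerAnalyzer code or its API: the function names, the toy sequences, and the use of frequency variance to rank k-mers are all assumptions made for illustration, standing in for the unsupervised selection step the abstract mentions without detailing.

```python
from collections import Counter
from itertools import product
from statistics import pvariance


def kmer_frequencies(seq, k):
    """Return relative frequencies of the overlapping k-mers in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Drop k-mers that contain ambiguous bases such as N.
    counts = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def feature_matrix(seqs, k):
    """Build one frequency row per sequence over all 4**k possible k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    rows = []
    for seq in seqs:
        freqs = kmer_frequencies(seq, k)
        rows.append([freqs.get(kmer, 0.0) for kmer in kmers])
    return kmers, rows


def most_variable_kmers(seqs, k, top=3):
    """Rank k-mers by the variance of their frequency across sequences.

    Variance ranking is a hypothetical stand-in, not the paper's criterion.
    """
    kmers, rows = feature_matrix(seqs, k)
    variances = [pvariance(col) for col in zip(*rows)]
    ranked = sorted(zip(kmers, variances), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    toy = ["ACGTACGTACGT", "ACGTACGAACGT", "TTTTGGGGCCCC"]  # made-up sequences
    for k in range(2, 5):  # scan a small range of k-values
        print(k, most_variable_kmers(toy, k))
```

The rows returned by `feature_matrix` could then be passed to any clustering routine. Because the number of columns grows as 4^k, a practical implementation would likely need a sparse representation for larger k-values.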
format Online
Article
Text
id pubmed-8194296
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8194296 2021-06-12 A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data. Pechlivanis, Nikolaos; Togkousidis, Anastasios; Tsagiopoulou, Maria; Sgardelis, Stefanos; Kappas, Ilias; Psomopoulos, Fotis. Front Genet. Genetics. Frontiers Media S.A.
2021-05-28 /pmc/articles/PMC8194296/ /pubmed/34122498 http://dx.doi.org/10.3389/fgene.2021.618170 Text en Copyright © 2021 Pechlivanis, Togkousidis, Tsagiopoulou, Sgardelis, Kappas and Psomopoulos. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8194296/
https://www.ncbi.nlm.nih.gov/pubmed/34122498
http://dx.doi.org/10.3389/fgene.2021.618170