
MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification

BACKGROUND: Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the m...


Bibliographic Details
Main Authors: Fiscon, Giulia; Weitschek, Emanuel; Cella, Eleonora; Lo Presti, Alessandra; Giovanetti, Marta; Babakir-Mina, Muhammed; Ciotti, Marco; Ciccozzi, Massimo; Pierangeli, Alessandra; Bertolazzi, Paola; Felici, Giovanni
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5139023/
https://www.ncbi.nlm.nih.gov/pubmed/27980679
http://dx.doi.org/10.1186/s13040-016-0116-2
_version_ 1782472167945928704
author Fiscon, Giulia
Weitschek, Emanuel
Cella, Eleonora
Lo Presti, Alessandra
Giovanetti, Marta
Babakir-Mina, Muhammed
Ciotti, Marco
Ciccozzi, Massimo
Pierangeli, Alessandra
Bertolazzi, Paola
Felici, Giovanni
collection PubMed
description BACKGROUND: Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple alignments-, phylogenetic trees-, statistical- and character-based methods. RESULTS: We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences. CONCLUSIONS: We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and moreover help to characterize different genomic regions. Bounding their length to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived by the relative location and abundance of such subsequences along the different regions. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13040-016-0116-2) contains supplementary material, which is available to authorized users.
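The core idea in the abstract, short subsequences of bounded length that separate one species from the others, can be illustrated with a minimal sketch. This is not the MISSEL genetic algorithm: it swaps in exhaustive k-mer enumeration, which is only practical for small inputs. The function names (`kmers`, `discriminating_kmers`) and the toy sequences are hypothetical and do not come from the article; each class is assumed to contain at least one sequence.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminating_kmers(by_class: Dict[str, List[str]], k: int) -> Dict[str, Set[str]]:
    """For each class, find k-mers present in every sequence of that class
    and absent from every sequence of all other classes.

    Assumption: every class has at least one sequence.
    """
    shared: Dict[str, Set[str]] = {}  # k-mers common to all sequences of a class
    seen: Dict[str, Set[str]] = {}    # k-mers occurring anywhere in a class
    for label, seqs in by_class.items():
        sets = [kmers(s, k) for s in seqs]
        shared[label] = set.intersection(*sets)
        seen[label] = set.union(*sets)

    result: Dict[str, Set[str]] = {}
    for label in by_class:
        others: Set[str] = set()
        for other, ks in seen.items():
            if other != label:
                others |= ks
        result[label] = shared[label] - others
    return result


# Toy data (hypothetical, not real viral sequences).
toy = {
    "A": ["ACGTAC", "TACGTA"],
    "B": ["GTACCA", "CCATAC"],
}
result = discriminating_kmers(toy, k=3)
print(sorted(result["A"]))  # ['ACG', 'CGT']
print(sorted(result["B"]))  # ['CCA']
```

In the toy data, `TAC` occurs in every sequence of both classes and `GTA` also appears in class B, so neither counts as discriminating. Enumeration like this grows quickly with sequence length and k, which is one reason a search heuristic such as the genetic algorithm described in the abstract may be preferred on real genomes.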
format Online
Article
Text
id pubmed-5139023
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5139023 2016-12-15
MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification
Fiscon, Giulia; Weitschek, Emanuel; Cella, Eleonora; Lo Presti, Alessandra; Giovanetti, Marta; Babakir-Mina, Muhammed; Ciotti, Marco; Ciccozzi, Massimo; Pierangeli, Alessandra; Bertolazzi, Paola; Felici, Giovanni
BioData Min
Research
BACKGROUND: Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple alignments-, phylogenetic trees-, statistical- and character-based methods. RESULTS: We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences. CONCLUSIONS: We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and moreover help to characterize different genomic regions. Bounding their length to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived by the relative location and abundance of such subsequences along the different regions. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13040-016-0116-2) contains supplementary material, which is available to authorized users.
BioMed Central 2016-12-06
/pmc/articles/PMC5139023/
/pubmed/27980679
http://dx.doi.org/10.1186/s13040-016-0116-2
Text
en
© The Author(s) 2016. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5139023/
https://www.ncbi.nlm.nih.gov/pubmed/27980679
http://dx.doi.org/10.1186/s13040-016-0116-2