
Accurate and fast clade assignment via deep learning and frequency chaos game representation

BACKGROUND: Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in history. Several databases and tools have been created to ke...


Bibliographic Details
Main Authors: Avila Cartes, Jorge; Anand, Santosh; Ciccolella, Simone; Bonizzoni, Paola; Della Vedova, Gianluca
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9795481/
https://www.ncbi.nlm.nih.gov/pubmed/36576129
http://dx.doi.org/10.1093/gigascience/giac119
author Avila Cartes, Jorge
Anand, Santosh
Ciccolella, Simone
Bonizzoni, Paola
Della Vedova, Gianluca
collection PubMed
description BACKGROUND: Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is expanding every day. A challenging task is the development of fast and accurate tools that can distinguish between the different SARS-CoV-2 variants and assign them to a clade.
RESULTS: In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences, which we implement in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieved an [Formula: see text] overall accuracy, while a similar tool, Covidex, obtained a [Formula: see text] overall accuracy. As far as we know, our method is the first to use deep learning and FCGR for intraspecies classification. Furthermore, by using feature importance methods, CouGaR-g makes it possible to identify k-mers that match SARS-CoV-2 marker variants.
CONCLUSIONS: By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, with comparable running times; the improvement is also due to training on a much larger dataset. Our method, implemented in CouGaR-g, is able to detect k-mers that capture relevant biological information that distinguishes the clades, known as marker variants.
AVAILABILITY: The trained models can be tested online by providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.
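To illustrate the representation named in the abstract, the following is a minimal sketch of how an FCGR matrix could be built from a sequence: every k-mer is mapped to one cell of a 2^k x 2^k grid, and the cell stores the k-mer count. The corner assignment (A, C, G, T to the four corners), the skipping of k-mers with ambiguous bases, and the function names `kmer_to_pixel` and `fcgr` are assumptions made for this example; they are not taken from the CouGaR-g code.

```python
# Assumed corner assignment: one common chaos game convention,
# not necessarily the one used by CouGaR-g.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_pixel(kmer):
    """Map a k-mer to a (row, col) cell of a 2^k x 2^k grid."""
    row = col = 0
    for i, base in enumerate(kmer):
        x, y = CORNERS[base]
        # Later bases set the higher bits, so the last base selects
        # the coarsest quadrant, as in the chaos game construction.
        col += x << i
        row += y << i
    return row, col


def fcgr(sequence, k):
    """Count k-mer occurrences into a 2^k x 2^k matrix (list of lists)."""
    size = 1 << k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if any(base not in CORNERS for base in kmer):
            continue
        row, col = kmer_to_pixel(kmer)
        matrix[row][col] += 1
    return matrix


if __name__ == "__main__":
    for line in fcgr("ACGTACGTNACG", 2):
        print(line)
```

The resulting matrix can be treated as a single-channel image, which is the kind of input a CNN classifier such as the one described above would consume.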
format Online
Article
Text
id pubmed-9795481
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9795481 2022-12-28 Gigascience Research Oxford University Press 2022-12-28 /pmc/articles/PMC9795481/ /pubmed/36576129 http://dx.doi.org/10.1093/gigascience/giac119 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Accurate and fast clade assignment via deep learning and frequency chaos game representation
topic Research