Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models

Bibliographic Details
Main Authors: Sadad, Tariq; Aurangzeb, Raja Atif; Safran, Mejdl; Imran; Alfarhood, Sultan; Kim, Jungsuk
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10216192/
https://www.ncbi.nlm.nih.gov/pubmed/37238994
http://dx.doi.org/10.3390/biomedicines11051323
author Sadad, Tariq
Aurangzeb, Raja Atif
Safran, Mejdl
Imran,
Alfarhood, Sultan
Kim, Jungsuk
author_sort Sadad, Tariq
collection PubMed
description Viruses infect millions of people worldwide each year, and some can lead to cancer or increase the risk of cancer. As viruses have highly mutable genomes, new viruses may emerge in the future, such as COVID-19 and influenza. Traditional virology relies on predefined rules to identify viruses, but new viruses may be completely or partially divergent from the reference genome, rendering statistical methods and similarity calculations insufficient for all genome sequences. Identifying DNA/RNA-based viral sequences is a crucial step in differentiating different types of lethal pathogens, including their variants and strains. While various tools in bioinformatics can align them, expert biologists are required to interpret the results. Computational virology is a scientific field that studies viruses, their origins, and drug discovery, where machine learning plays a crucial role in extracting domain- and task-specific features to tackle this challenge. This paper proposes a genome analysis system that uses advanced deep learning to identify dozens of viruses. The system uses nucleotide sequences from the NCBI GenBank database and a BERT tokenizer to extract features from the sequences by breaking them down into tokens. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a scratch BERT architecture specifically designed for DNA analysis, which is used to learn the next codons unsupervised, and a classifier that identifies important features and understands the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences.
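The abstract above says sequences are broken into tokens before being fed to the BERT model, but this record does not give the tokenizer's details. The following is only a minimal sketch of one common approach, assuming non-overlapping codon-sized 3-mers and a small special-token vocabulary; the helper names `kmer_tokens`, `build_vocab`, and `encode` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Assumed BERT-style special tokens; the paper's actual vocabulary is not given here.
SPECIALS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")


def kmer_tokens(sequence, k=3, stride=3):
    """Split a nucleotide sequence into k-mers (stride=k gives codon-like chunks)."""
    seq = sequence.upper().replace("U", "T")  # map RNA onto the DNA alphabet
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(token_lists, specials=SPECIALS):
    """Assign integer ids: special tokens first, then k-mers by frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    # Sort by descending count, then alphabetically, so ids are deterministic.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Wrap token ids in [CLS] ... [SEP]; unseen k-mers map to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab["[CLS]"]] + [vocab.get(t, unk) for t in tokens] + [vocab["[SEP]"]]


if __name__ == "__main__":
    tokens = kmer_tokens("ATGGCCAUU")
    vocab = build_vocab([tokens])
    print(tokens)
    print(encode(tokens, vocab))
```

Setting `stride=1` yields overlapping k-mers instead, another common choice for DNA language models; which variant the authors used cannot be determined from this record.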
format Online
Article
Text
id pubmed-10216192
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10216192 2023-05-27 Biomedicines Article
MDPI 2023-04-28 /pmc/articles/PMC10216192/ /pubmed/37238994 http://dx.doi.org/10.3390/biomedicines11051323 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10216192/
https://www.ncbi.nlm.nih.gov/pubmed/37238994
http://dx.doi.org/10.3390/biomedicines11051323