Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models
Viruses infect millions of people worldwide each year, and some can lead to cancer or increase the risk of cancer. As viruses have highly mutable genomes, new viruses may emerge in the future, such as COVID-19 and influenza. Traditional virology relies on predefined rules to identify viruses, but ne...
Main authors: | Sadad, Tariq; Aurangzeb, Raja Atif; Safran, Mejdl; Imran; Alfarhood, Sultan; Kim, Jungsuk |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10216192/ https://www.ncbi.nlm.nih.gov/pubmed/37238994 http://dx.doi.org/10.3390/biomedicines11051323 |
_version_ | 1785048239303557120 |
---|---|
author | Sadad, Tariq; Aurangzeb, Raja Atif; Safran, Mejdl; Imran; Alfarhood, Sultan; Kim, Jungsuk |
author_facet | Sadad, Tariq; Aurangzeb, Raja Atif; Safran, Mejdl; Imran; Alfarhood, Sultan; Kim, Jungsuk |
author_sort | Sadad, Tariq |
collection | PubMed |
description | Viruses infect millions of people worldwide each year, and some can lead to cancer or increase the risk of cancer. As viruses have highly mutable genomes, new viruses may emerge in the future, such as COVID-19 and influenza. Traditional virology relies on predefined rules to identify viruses, but new viruses may be completely or partially divergent from the reference genome, rendering statistical methods and similarity calculations insufficient for all genome sequences. Identifying DNA/RNA-based viral sequences is a crucial step in differentiating different types of lethal pathogens, including their variants and strains. While various tools in bioinformatics can align them, expert biologists are required to interpret the results. Computational virology is a scientific field that studies viruses, their origins, and drug discovery, where machine learning plays a crucial role in extracting domain- and task-specific features to tackle this challenge. This paper proposes a genome analysis system that uses advanced deep learning to identify dozens of viruses. The system uses nucleotide sequences from the NCBI GenBank database and a BERT tokenizer to extract features from the sequences by breaking them down into tokens. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a scratch BERT architecture specifically designed for DNA analysis, which is used to learn the next codons unsupervised, and a classifier that identifies important features and understands the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences. |
format | Online Article Text |
id | pubmed-10216192 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10216192 2023-05-27 Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models Sadad, Tariq Aurangzeb, Raja Atif Safran, Mejdl Imran, Alfarhood, Sultan Kim, Jungsuk Biomedicines Article Viruses infect millions of people worldwide each year, and some can lead to cancer or increase the risk of cancer. As viruses have highly mutable genomes, new viruses may emerge in the future, such as COVID-19 and influenza. Traditional virology relies on predefined rules to identify viruses, but new viruses may be completely or partially divergent from the reference genome, rendering statistical methods and similarity calculations insufficient for all genome sequences. Identifying DNA/RNA-based viral sequences is a crucial step in differentiating different types of lethal pathogens, including their variants and strains. While various tools in bioinformatics can align them, expert biologists are required to interpret the results. Computational virology is a scientific field that studies viruses, their origins, and drug discovery, where machine learning plays a crucial role in extracting domain- and task-specific features to tackle this challenge. This paper proposes a genome analysis system that uses advanced deep learning to identify dozens of viruses. The system uses nucleotide sequences from the NCBI GenBank database and a BERT tokenizer to extract features from the sequences by breaking them down into tokens. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a scratch BERT architecture specifically designed for DNA analysis, which is used to learn the next codons unsupervised, and a classifier that identifies important features and understands the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences. MDPI 2023-04-28 /pmc/articles/PMC10216192/ /pubmed/37238994 http://dx.doi.org/10.3390/biomedicines11051323 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Sadad, Tariq Aurangzeb, Raja Atif Safran, Mejdl Imran, Alfarhood, Sultan Kim, Jungsuk Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models |
title | Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models |
title_full | Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models |
title_fullStr | Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models |
title_full_unstemmed | Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models |
title_short | Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models |
title_sort | classification of highly divergent viruses from dna/rna sequence using transformer-based models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10216192/ https://www.ncbi.nlm.nih.gov/pubmed/37238994 http://dx.doi.org/10.3390/biomedicines11051323 |
work_keys_str_mv | AT sadadtariq classificationofhighlydivergentvirusesfromdnarnasequenceusingtransformerbasedmodels AT aurangzebrajaatif classificationofhighlydivergentvirusesfromdnarnasequenceusingtransformerbasedmodels AT safranmejdl classificationofhighlydivergentvirusesfromdnarnasequenceusingtransformerbasedmodels AT imran classificationofhighlydivergentvirusesfromdnarnasequenceusingtransformerbasedmodels AT alfarhoodsultan classificationofhighlydivergentvirusesfromdnarnasequenceusingtransformerbasedmodels AT kimjungsuk classificationofhighlydivergentvirusesfromdnarnasequenceusingtransformerbasedmodels |
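
Illustrative note on the description field: the abstract says nucleotide sequences are broken into tokens before being passed to a BERT-style model, and that the model learns codons. This record does not say how the authors' tokenizer works, so the following is only a minimal sketch under stated assumptions: non-overlapping 3-nucleotide (codon-sized) tokens, RNA `U` mapped to `T`, and a BERT-like set of special tokens. The function names `codon_tokens`, `build_vocab`, and `encode` are hypothetical and are not taken from the paper.

```python
from itertools import product

# Assumed BERT-style special tokens; the paper's actual vocabulary is not given in this record.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def codon_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into non-overlapping tokens of length k."""
    seq = sequence.upper().replace("U", "T")  # assumption: fold RNA into the DNA alphabet
    usable = len(seq) - len(seq) % k          # drop a trailing incomplete token
    return [seq[i:i + k] for i in range(0, usable, k)]


def build_vocab(k: int = 3) -> dict[str, int]:
    """Map the special tokens plus every possible A/C/G/T k-mer to an integer id."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + kmers)}


def encode(sequence: str, vocab: dict[str, int], k: int = 3) -> list[int]:
    """Convert a sequence to ids, wrapped in [CLS] ... [SEP]; unknown tokens map to [UNK]."""
    unk = vocab["[UNK]"]
    ids = [vocab.get(tok, unk) for tok in codon_tokens(sequence, k)]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


if __name__ == "__main__":
    vocab = build_vocab()
    print(codon_tokens("atgGCCuaa"))
    print(encode("ATGGCCTAA", vocab))
```

Tokens containing ambiguity codes such as `N` fall back to `[UNK]` in this sketch; a real pipeline might instead filter or mask such positions.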