Exploring the Potential of GANs in Biological Sequence Analysis
SIMPLE SUMMARY: This work deals with class imbalance issues associated with the bio-sequence datasets by employing a generative adversarial model (GAN) to improve their machine-learning-based classification performance. GAN is used to generate synthetic sequence data, which is very similar to real d...
Main authors: | Murad, Taslim; Ali, Sarwan; Patterson, Murray |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10295061/ https://www.ncbi.nlm.nih.gov/pubmed/37372139 http://dx.doi.org/10.3390/biology12060854 |
_version_ | 1785063330785787904 |
---|---|
author | Murad, Taslim; Ali, Sarwan; Patterson, Murray |
author_facet | Murad, Taslim; Ali, Sarwan; Patterson, Murray |
author_sort | Murad, Taslim |
collection | PubMed |
description | SIMPLE SUMMARY: This work addresses class imbalance in bio-sequence datasets by employing a generative adversarial network (GAN) to improve machine-learning-based classification performance. The GAN generates synthetic sequence data that closely resemble real data, and these synthetic data are used to tackle the data imbalance challenge. The experimental results on four distinct datasets demonstrate that GANs can improve the overall classification performance. This kind of analytical (classification) information can improve our understanding of the viruses associated with the sequences, which can be used to build prevention mechanisms that mitigate the impact of the viruses. ABSTRACT: Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and in building prevention mechanisms to limit their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for biological sequence analysis that effectively analyze the functions and structures of the sequences. However, these ML-based methods face challenges with data imbalance, which is generally associated with biological sequence datasets and hinders their performance. Various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data; however, they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handling the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are utilized to generate synthetic data that closely resemble real data; these generated data can therefore be employed to enhance the ML models’ performance by eliminating the class imbalance problem in biological sequence analysis. We perform four distinct classification tasks using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host), and our results illustrate that GANs can improve the overall classification performance. |
format | Online Article Text |
id | pubmed-10295061 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10295061 2023-06-28 Exploring the Potential of GANs in Biological Sequence Analysis Murad, Taslim; Ali, Sarwan; Patterson, Murray Biology (Basel) Article MDPI 2023-06-14 /pmc/articles/PMC10295061/ /pubmed/37372139 http://dx.doi.org/10.3390/biology12060854 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Murad, Taslim Ali, Sarwan Patterson, Murray Exploring the Potential of GANs in Biological Sequence Analysis |
title | Exploring the Potential of GANs in Biological Sequence Analysis |
title_full | Exploring the Potential of GANs in Biological Sequence Analysis |
title_fullStr | Exploring the Potential of GANs in Biological Sequence Analysis |
title_full_unstemmed | Exploring the Potential of GANs in Biological Sequence Analysis |
title_short | Exploring the Potential of GANs in Biological Sequence Analysis |
title_sort | exploring the potential of gans in biological sequence analysis |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10295061/ https://www.ncbi.nlm.nih.gov/pubmed/37372139 http://dx.doi.org/10.3390/biology12060854 |
work_keys_str_mv | AT muradtaslim exploringthepotentialofgansinbiologicalsequenceanalysis AT alisarwan exploringthepotentialofgansinbiologicalsequenceanalysis AT pattersonmurray exploringthepotentialofgansinbiologicalsequenceanalysis |
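
The abstract notes that SMOTE-style methods rely on local information rather than the overall class distribution. As a rough illustration of what "local" means here, the following minimal sketch creates synthetic minority samples by interpolating between a point and one of its nearest minority neighbours. It is not the authors' implementation and does not sketch the GAN; the function name `smote_like_oversample`, its parameters, and the toy feature vectors are all hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style).

    Assumes `minority` holds at least two numeric feature tuples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # The new point lies on the segment between base and neighbour,
        # so it reflects only the local neighbourhood of the base point.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D feature vectors for a minority class (not data from the paper).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=6)
```

Every synthetic point produced this way is a convex combination of two nearby real points, which is why such methods cannot place samples in regions of the class distribution that no local pair of neighbours spans.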