Exploring the Potential of GANs in Biological Sequence Analysis

SIMPLE SUMMARY: This work deals with class imbalance issues associated with the bio-sequence datasets by employing a generative adversarial model (GAN) to improve their machine-learning-based classification performance. GAN is used to generate synthetic sequence data, which is very similar to real d...

Bibliographic Details
Main Authors: Murad, Taslim; Ali, Sarwan; Patterson, Murray
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10295061/
https://www.ncbi.nlm.nih.gov/pubmed/37372139
http://dx.doi.org/10.3390/biology12060854
author Murad, Taslim
Ali, Sarwan
Patterson, Murray
collection PubMed
description SIMPLE SUMMARY: This work addresses the class imbalance commonly found in biological sequence datasets by using a generative adversarial network (GAN) to improve machine-learning-based classification performance. The GAN generates synthetic sequence data that closely resemble real data, and these synthetic data are used to tackle the data imbalance challenge. Experimental results on four distinct datasets demonstrate that GANs can improve overall classification performance. Such classification results can improve our understanding of the viruses associated with the sequences, which in turn can inform prevention mechanisms that curb the impact of those viruses.
ABSTRACT: Biological sequence analysis is an essential step toward a deeper understanding of the underlying functions, structures, and behaviors of sequences. It can help identify the characteristics of the associated organisms, such as viruses, and support prevention mechanisms that curb their spread and impact, since viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) provides new tools for analyzing the functions and structures of biological sequences. However, ML-based methods are hindered by the data imbalance that is common in biological sequence datasets. Existing strategies for this issue, such as the SMOTE algorithm, create synthetic data but focus on local information rather than the overall class distribution. In this work, we explore a novel approach to the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are used to generate synthetic data that closely resemble real data; these generated data can then improve ML models' performance by removing the class imbalance in biological sequence analysis. We perform four distinct classification tasks on four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host), and our results illustrate that GANs can improve overall classification performance.
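The abstract contrasts GAN-based generation with SMOTE, which it describes as relying on local information. The sketch below illustrates only that local, SMOTE-style interpolation step; it is not the authors' GAN pipeline. It assumes the sequences have already been converted to fixed-length numeric feature vectors, which the record does not specify, and the function name `smote_like_oversample` and its parameters are hypothetical.

```python
import math
import random
from collections import Counter


def smote_like_oversample(X, y, minority_label, k=2, seed=0):
    """Oversample one class by interpolating between a sample and one of its
    k nearest same-class neighbours (the core idea behind SMOTE).

    Assumption: each row of X is a fixed-length numeric feature vector.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")

    counts = Counter(y)
    # Number of synthetic points needed to match the largest class.
    needed = max(counts.values()) - counts[minority_label]

    X_out, y_out = list(X), list(y)
    for _ in range(needed):
        base = rng.choice(minority)
        # Local step: only the k closest minority points are considered,
        # so the overall class distribution is never modelled.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        X_out.append([b + t * (o - b) for b, o in zip(base, other)])
        y_out.append(minority_label)
    return X_out, y_out


# Toy example with made-up 2-D vectors: 3 "rare" vs 5 "common" samples.
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
         [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [5.2, 5.1], [5.0, 4.8]]
y_toy = ["rare"] * 3 + ["common"] * 5
X_bal, y_bal = smote_like_oversample(X_toy, y_toy, "rare")
print(Counter(y_bal))
```

Every synthetic point lies on a line segment between two existing minority samples, which is why such methods cannot produce samples outside the neighbourhoods already observed. A generative model trained on the whole class is one way to avoid that restriction, which is the direction the abstract explores.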
format Online
Article
Text
id pubmed-10295061
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10295061 2023-06-28 Exploring the Potential of GANs in Biological Sequence Analysis Murad, Taslim Ali, Sarwan Patterson, Murray Biology (Basel) Article MDPI 2023-06-14 /pmc/articles/PMC10295061/ /pubmed/37372139 http://dx.doi.org/10.3390/biology12060854 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Exploring the Potential of GANs in Biological Sequence Analysis
topic Article