Exploring the Potential of GANs in Biological Sequence Analysis

SIMPLE SUMMARY: This work deals with class imbalance issues associated with the bio-sequence datasets by employing a generative adversarial model (GAN) to improve their machine-learning-based classification performance. GAN is used to generate synthetic sequence data, which is very similar to real d...

Bibliographic Details
Main authors:	Murad, Taslim; Ali, Sarwan; Patterson, Murray
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10295061/ https://www.ncbi.nlm.nih.gov/pubmed/37372139 http://dx.doi.org/10.3390/biology12060854

_version_	1785063330785787904
author	Murad, Taslim; Ali, Sarwan; Patterson, Murray
author_facet	Murad, Taslim; Ali, Sarwan; Patterson, Murray
author_sort	Murad, Taslim
collection	PubMed
description	SIMPLE SUMMARY: This work addresses the class imbalance common in biological sequence datasets by using a generative adversarial network (GAN) to improve machine-learning-based classification performance. The GAN generates synthetic sequence data that closely resembles real data, and these data are used to tackle the imbalance. Experimental results on four distinct datasets demonstrate that GANs can improve overall classification performance. This kind of analytical (classification) information can improve our understanding of the viruses associated with the sequences, which can in turn be used to build prevention mechanisms that counter the impact of the viruses.
ABSTRACT: Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help identify the characteristics of the associated organisms, such as viruses, and build prevention mechanisms to curb their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for analyzing the functions and structures of the sequences effectively. However, these ML-based methods face challenges with data imbalance, which is common in biological sequence datasets and hinders their performance. Various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, but they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are used to generate synthetic data that closely resembles real data; these generated data can then be used to enhance the ML models' performance by eliminating the class imbalance problem in biological sequence analysis. We perform four distinct classification tasks using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host), and our results illustrate that GANs can improve the overall classification performance.
format	Online Article Text
id	pubmed-10295061
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10295061 2023-06-28 Exploring the Potential of GANs in Biological Sequence Analysis Murad, Taslim; Ali, Sarwan; Patterson, Murray Biology (Basel) Article MDPI 2023-06-14 /pmc/articles/PMC10295061/ /pubmed/37372139 http://dx.doi.org/10.3390/biology12060854 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Murad, Taslim Ali, Sarwan Patterson, Murray Exploring the Potential of GANs in Biological Sequence Analysis
title	Exploring the Potential of GANs in Biological Sequence Analysis
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10295061/ https://www.ncbi.nlm.nih.gov/pubmed/37372139 http://dx.doi.org/10.3390/biology12060854
work_keys_str_mv	AT muradtaslim exploringthepotentialofgansinbiologicalsequenceanalysis AT alisarwan exploringthepotentialofgansinbiologicalsequenceanalysis AT pattersonmurray exploringthepotentialofgansinbiologicalsequenceanalysis
