Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy

Bibliographic Details
Main Authors: Bonidia, Robson P.; Avila Santos, Anderson P.; de Almeida, Breno L. S.; Stadler, Peter F.; Nunes da Rocha, Ulisses; Sanches, Danilo S.; de Carvalho, André C. P. L. F.
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9601431/
https://www.ncbi.nlm.nih.gov/pubmed/37420418
http://dx.doi.org/10.3390/e24101398
author Bonidia, Robson P.
Avila Santos, Anderson P.
de Almeida, Breno L. S.
Stadler, Peter F.
Nunes da Rocha, Ulisses
Sanches, Danilo S.
de Carvalho, André C. P. L. F.
author_sort Bonidia, Robson P.
collection PubMed
description In recent years, there has been an exponential growth in sequencing projects due to accelerated technological advances, leading to a significant increase in the amount of data and resulting in new challenges for biological sequence analysis. Consequently, the use of techniques capable of analyzing large amounts of data has been explored, such as machine learning (ML) algorithms. ML algorithms are being used to analyze and classify biological sequences, despite the intrinsic difficulty in extracting and finding representative biological sequence methods suitable for them. Thereby, extracting numerical features to represent sequences makes it statistically feasible to use universal concepts from Information Theory, such as Tsallis and Shannon entropy. In this study, we propose a novel Tsallis entropy-based feature extractor to provide useful information to classify biological sequences. To assess its relevance, we prepared five case studies: (1) an analysis of the entropic index q; (2) performance testing of the best entropic indices on new datasets; (3) a comparison made with Shannon entropy and (4) generalized entropies; (5) an investigation of the Tsallis entropy in the context of dimensionality reduction. As a result, our proposal proved to be effective, being superior to Shannon entropy and robust in terms of generalization, and also potentially representative for collecting information in fewer dimensions compared with methods such as Singular Value Decomposition and Uniform Manifold Approximation and Projection.
format Online
Article
Text
id pubmed-9601431
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9601431 2022-10-27 Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy Bonidia, Robson P.; Avila Santos, Anderson P.; de Almeida, Breno L. S.; Stadler, Peter F.; Nunes da Rocha, Ulisses; Sanches, Danilo S.; de Carvalho, André C. P. L. F. Entropy (Basel) Article
MDPI 2022-10-01 /pmc/articles/PMC9601431/ /pubmed/37420418 http://dx.doi.org/10.3390/e24101398 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy
topic Article