Gene expression data classification using topology and machine learning models
BACKGROUND: Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dea...
Main authors: | Dey, Tamal K., Mandal, Sayan, Mukherjee, Soham |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9121583/ https://www.ncbi.nlm.nih.gov/pubmed/35596135 http://dx.doi.org/10.1186/s12859-022-04704-z |
_version_ | 1784711183062794240 |
---|---|
author | Dey, Tamal K. Mandal, Sayan Mukherjee, Soham |
author_facet | Dey, Tamal K. Mandal, Sayan Mukherjee, Soham |
author_sort | Dey, Tamal K. |
collection | PubMed |
description | BACKGROUND: Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high-dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from its predecessors in two respects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors which are used for classification. In contrast, this work involves curating relevant features to obtain somewhat better representatives with the help of TDA. These representatives of the entire data facilitate better comprehension of the phenotype labels. (2) Most of the earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data and said barcodes. RESULTS: The topology-relevant curated data that we obtain provide an improvement in shallow-learning as well as deep-learning-based supervised classification. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts accordingly. CONCLUSIONS: In this work, we engender representative persistent cycles to discern the gene expression data. These cycles allow us to directly procure genes entailed in similar processes. |
format | Online Article Text |
id | pubmed-9121583 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9121583 2022-05-21 Gene expression data classification using topology and machine learning models Dey, Tamal K. Mandal, Sayan Mukherjee, Soham BMC Bioinformatics Research
BioMed Central 2022-05-20 /pmc/articles/PMC9121583/ /pubmed/35596135 http://dx.doi.org/10.1186/s12859-022-04704-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Dey, Tamal K. Mandal, Sayan Mukherjee, Soham Gene expression data classification using topology and machine learning models |
title | Gene expression data classification using topology and machine learning models |
title_full | Gene expression data classification using topology and machine learning models |
title_fullStr | Gene expression data classification using topology and machine learning models |
title_full_unstemmed | Gene expression data classification using topology and machine learning models |
title_short | Gene expression data classification using topology and machine learning models |
title_sort | gene expression data classification using topology and machine learning models |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9121583/ https://www.ncbi.nlm.nih.gov/pubmed/35596135 http://dx.doi.org/10.1186/s12859-022-04704-z |
work_keys_str_mv | AT deytamalk geneexpressiondataclassificationusingtopologyandmachinelearningmodels AT mandalsayan geneexpressiondataclassificationusingtopologyandmachinelearningmodels AT mukherjeesoham geneexpressiondataclassificationusingtopologyandmachinelearningmodels |