
STR-based feature extraction and selection for genetic feature discovery in neurological disease genes

Bibliographic Details
Main Authors: Dhaliwal, Jasbir, Wagner, John
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9922266/
https://www.ncbi.nlm.nih.gov/pubmed/36774368
http://dx.doi.org/10.1038/s41598-023-29376-4
author Dhaliwal, Jasbir
Wagner, John
collection PubMed
description Gene expression, often determined by single nucleotide polymorphisms, short repeated sequences known as short tandem repeats (STRs), structural variants, and environmental factors, provides the means for an organism to produce the gene products it needs to live. Variation in expression levels, sometimes known as enrichment patterns, has been associated with disease progression. STR enrichment patterns have therefore recently gained interest as potential genetic markers for disease progression. However, to the best of our knowledge, no study has evaluated STRs, particularly trinucleotide sequences, as machine learning features for classifying neurological disease genes with the aim of discovering genetic features. In this paper, we propose a new metric and a novel feature extraction and selection algorithm, based on statistically significant STR-based features and their respective enrichment patterns, to create a statistically significant feature set. The proposed metric shows that genes in the neurological disease family have a non-random AA, AT, TA, TG, and TT enrichment pattern. This result is important because it supports prior research establishing that certain trinucleotides, such as AAT, ATA, ATT, TAT, and TTA, are favored during protein misfolding. In contrast, trinucleotides such as TAA, TAG, and TGA, which are stop codons, are favored during premature termination codon mutations. This suggests that the metric can potentially identify patterns that may be genetic features in a sample of neurological genes. Moreover, the practical performance and high prediction results of the statistically significant STR-based feature set indicate that variations in STR enrichment patterns can distinguish neurological disease genes. In conclusion, the proposed approach may be able to discover differential genetic features for other diseases.
format Online
Article
Text
id pubmed-9922266
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9922266 2023-02-13 STR-based feature extraction and selection for genetic feature discovery in neurological disease genes Dhaliwal, Jasbir Wagner, John Sci Rep Article Nature Publishing Group UK 2023-02-11 /pmc/articles/PMC9922266/ /pubmed/36774368 http://dx.doi.org/10.1038/s41598-023-29376-4 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
title STR-based feature extraction and selection for genetic feature discovery in neurological disease genes
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9922266/
https://www.ncbi.nlm.nih.gov/pubmed/36774368
http://dx.doi.org/10.1038/s41598-023-29376-4
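
Illustrative note on the description field: the abstract refers to dinucleotide and trinucleotide enrichment patterns used as machine learning features, but this record does not define the authors' metric or algorithm. The following is only a minimal sketch of the general idea, assuming a simple observed-over-uniform-expectation ratio as a stand-in measure. The function names `kmer_frequencies` and `enrichment`, the uniform baseline, and the toy sequence are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int) -> dict[str, float]:
    """Relative frequency of every overlapping k-mer over the alphabet ACGT."""
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows that contain ambiguous bases such as N.
    valid = [w for w in windows if set(w) <= set("ACGT")]
    counts = Counter(valid)
    total = len(valid)
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def enrichment(freqs: dict[str, float], k: int) -> dict[str, float]:
    """Observed frequency divided by the uniform expectation 1 / 4**k.

    Assumption: a uniform baseline is used purely for illustration;
    it is not the metric proposed in the paper.
    """
    expected = 1 / 4 ** k
    return {kmer: f / expected for kmer, f in freqs.items()}


# Made-up AT-rich sequence, not real gene data.
toy_gene = "ATATATTTAATGTTAATATTA"

di = enrichment(kmer_frequencies(toy_gene, 2), 2)
tri = enrichment(kmer_frequencies(toy_gene, 3), 3)

# Dinucleotides ranked by enrichment ratio, highest first.
top_dinucleotides = sorted(di, key=di.get, reverse=True)[:5]
print(top_dinucleotides)

# The stop-codon trinucleotides named in the abstract.
print({codon: round(tri[codon], 2) for codon in ("TAA", "TAG", "TGA")})
```

Each gene would yield a fixed-length vector (16 dinucleotide and 64 trinucleotide values) that a classifier could consume. Testing which of those values are statistically significant, as the abstract describes, is a separate step not shown here.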