Comparison of machine learning and deep learning techniques in promoter prediction across diverse species
Gene promoters are the key DNA regulatory elements positioned around the transcription start sites and are responsible for regulating gene transcription process. Various alignment-based, signal-based and content-based approaches are reported for the prediction of promoters. However, since all promot...
Main authors: | Bhandari, Nikita; Khare, Satyajeet; Walambe, Rahee; Kotecha, Ketan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2021 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959599/ https://www.ncbi.nlm.nih.gov/pubmed/33817015 http://dx.doi.org/10.7717/peerj-cs.365 |
_version_ | 1783664984421564416 |
---|---|
author | Bhandari, Nikita Khare, Satyajeet Walambe, Rahee Kotecha, Ketan |
author_facet | Bhandari, Nikita Khare, Satyajeet Walambe, Rahee Kotecha, Ketan |
author_sort | Bhandari, Nikita |
collection | PubMed |
description | Gene promoters are key DNA regulatory elements positioned around transcription start sites and are responsible for regulating gene transcription. Various alignment-based, signal-based and content-based approaches have been reported for promoter prediction. However, because not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct higher eukaryotes, viz. yeast (Saccharomyces cerevisiae), plant (Arabidopsis thaliana) and human (Homo sapiens). We compared the one-hot vector encoding method with frequency-based tokenization (FBT) for data pre-processing on a 1-D Convolutional Neural Network (CNN) model. We found that FBT gives a shorter input dimension, reducing training time without affecting the sensitivity and specificity of classification. We employed deep learning techniques, mainly CNN and a recurrent neural network with Long Short-Term Memory (LSTM), and a random forest (RF) classifier for promoter classification at k-mer sizes of 2, 4 and 8. We found CNN to be superior both in distinguishing promoters from non-promoter sequences (binary classification) and in species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of a synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems. |
format | Online Article Text |
id | pubmed-7959599 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-7959599 2021-04-02 Comparison of machine learning and deep learning techniques in promoter prediction across diverse species Bhandari, Nikita Khare, Satyajeet Walambe, Rahee Kotecha, Ketan PeerJ Comput Sci Bioinformatics PeerJ Inc. 2021-02-09 /pmc/articles/PMC7959599/ /pubmed/33817015 http://dx.doi.org/10.7717/peerj-cs.365 Text en ©2021 Bhandari et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Bioinformatics Bhandari, Nikita Khare, Satyajeet Walambe, Rahee Kotecha, Ketan Comparison of machine learning and deep learning techniques in promoter prediction across diverse species |
title | Comparison of machine learning and deep learning techniques in promoter prediction across diverse species |
title_full | Comparison of machine learning and deep learning techniques in promoter prediction across diverse species |
title_fullStr | Comparison of machine learning and deep learning techniques in promoter prediction across diverse species |
title_full_unstemmed | Comparison of machine learning and deep learning techniques in promoter prediction across diverse species |
title_short | Comparison of machine learning and deep learning techniques in promoter prediction across diverse species |
title_sort | comparison of machine learning and deep learning techniques in promoter prediction across diverse species |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959599/ https://www.ncbi.nlm.nih.gov/pubmed/33817015 http://dx.doi.org/10.7717/peerj-cs.365 |
work_keys_str_mv | AT bhandarinikita comparisonofmachinelearninganddeeplearningtechniquesinpromoterpredictionacrossdiversespecies AT kharesatyajeet comparisonofmachinelearninganddeeplearningtechniquesinpromoterpredictionacrossdiversespecies AT walamberahee comparisonofmachinelearninganddeeplearningtechniquesinpromoterpredictionacrossdiversespecies AT kotechaketan comparisonofmachinelearninganddeeplearningtechniquesinpromoterpredictionacrossdiversespecies |
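
The description in this record names three pre-processing ideas: one-hot encoding, frequency-based tokenization (FBT) over k-mers, and a synthetic shuffled negative dataset. The record does not give the authors' exact procedure, so the following is only a minimal sketch under stated assumptions: k-mers are taken with an overlapping stride of 1, token ids are assigned by descending k-mer frequency, and a negative example is made by shuffling the nucleotides of a promoter sequence. All function names and the toy sequences are hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode each nucleotide as a 4-element indicator vector (length x 4)."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def kmers(seq, k):
    """Split a sequence into overlapping k-mers (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(seqs, k):
    """Map each k-mer to an integer id, most frequent first (ids start at 1)."""
    counts = Counter(km for s in seqs for km in kmers(s, k))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: i + 1 for i, km in enumerate(ranked)}


def tokenize(seq, vocab, k):
    """Replace each k-mer by its id; 0 marks k-mers absent from the vocabulary."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]


def shuffled_negative(seq, rng):
    """Build a synthetic negative with the same composition but shuffled order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Toy sequences for illustration only; they are not taken from the study.
promoters = ["TATAAAGGCC", "TATATAGGCA"]
vocab = build_vocab(promoters, k=2)
tokens = tokenize(promoters[0], vocab, k=2)
encoded = one_hot(promoters[0])
negative = shuffled_negative(promoters[0], random.Random(0))

print(len(encoded) * len(encoded[0]), "one-hot values vs", len(tokens), "tokens")
```

In this sketch a sequence of length n becomes n × 4 one-hot values but only n − k + 1 integer tokens, which is one way to read the shorter input dimension mentioned in the description. A tokenized sequence like this would typically feed an embedding layer ahead of a CNN or LSTM, though the record does not confirm that detail.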