
Comparison of machine learning and deep learning techniques in promoter prediction across diverse species

Gene promoters are the key DNA regulatory elements positioned around transcription start sites and are responsible for regulating the gene transcription process. Various alignment-based, signal-based and content-based approaches have been reported for the prediction of promoters. However, since all promot...


Bibliographic Details
Main authors: Bhandari, Nikita; Khare, Satyajeet; Walambe, Rahee; Kotecha, Ketan
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959599/
https://www.ncbi.nlm.nih.gov/pubmed/33817015
http://dx.doi.org/10.7717/peerj-cs.365
author Bhandari, Nikita
Khare, Satyajeet
Walambe, Rahee
Kotecha, Ketan
collection PubMed
description Gene promoters are the key DNA regulatory elements positioned around transcription start sites and are responsible for regulating the gene transcription process. Various alignment-based, signal-based and content-based approaches have been reported for the prediction of promoters. However, since not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct higher eukaryotes, viz. yeast (Saccharomyces cerevisiae), a plant (A. thaliana) and human (Homo sapiens). We compared the one-hot vector encoding method with frequency-based tokenization (FBT) for data pre-processing on a 1-D Convolutional Neural Network (CNN) model. We found that FBT gives a shorter input dimension, reducing the training time without affecting the sensitivity and specificity of classification. We employed deep learning techniques, mainly a CNN and a recurrent neural network with Long Short Term Memory (LSTM), together with a random forest (RF) classifier, for promoter classification at k-mer sizes of 2, 4 and 8. We found the CNN to be superior both in distinguishing promoters from non-promoter sequences (binary classification) and in species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of a synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
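The encoding and negative-set ideas in the description above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the abstract does not spell out how FBT builds its vocabulary, so the sketch assumes non-overlapping k-mers ranked by corpus frequency (id 1 is the most frequent, 0 is reserved for unseen k-mers), and it assumes the shuffled negatives are made by permuting the bases of each promoter. All function names and the toy sequences are hypothetical.

```python
import random
from collections import Counter


def kmers(seq, k):
    """Split seq into non-overlapping k-mers, dropping any short tail (assumption)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def one_hot(seq):
    """One 4-element vector per base, so a sequence of length n gives n x 4 values."""
    table = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
             "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
    return [table[base] for base in seq]


def build_vocab(seqs, k):
    """Rank k-mers by how often they occur; ties are broken alphabetically."""
    counts = Counter(tok for s in seqs for tok in kmers(s, k))
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(ranked, start=1)}


def encode(seq, vocab, k):
    """Map each k-mer to its integer id; unseen k-mers map to 0."""
    return [vocab.get(tok, 0) for tok in kmers(seq, k)]


def shuffled_negative(seq, rng):
    """Permute the bases, keeping composition but destroying motif order (assumption)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Toy sequences, invented for illustration only.
promoters = ["TATAAAGGCGCGTATA", "GGCGTATAAAGCTAGC"]
vocab = build_vocab(promoters, k=4)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
negatives = [shuffled_negative(s, rng) for s in promoters]
encoded = [encode(s, vocab, k=4) for s in promoters]
```

For a 16-base sequence, `one_hot` yields 16 vectors of 4 values, while tokenization at k = 4 yields 4 integers, which illustrates the shorter input dimension the description refers to.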
format Online
Article
Text
id pubmed-7959599
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-7959599 2021-04-02 Comparison of machine learning and deep learning techniques in promoter prediction across diverse species Bhandari, Nikita Khare, Satyajeet Walambe, Rahee Kotecha, Ketan PeerJ Comput Sci Bioinformatics
PeerJ Inc. 2021-02-09 /pmc/articles/PMC7959599/ /pubmed/33817015 http://dx.doi.org/10.7717/peerj-cs.365 Text en ©2021 Bhandari et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Comparison of machine learning and deep learning techniques in promoter prediction across diverse species
topic Bioinformatics