Comparison of machine learning and deep learning techniques in promoter prediction across diverse species


Bibliographic Details
Main Authors:	Bhandari, Nikita; Khare, Satyajeet; Walambe, Rahee; Kotecha, Ketan
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2021
Subjects:	Bioinformatics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959599/ https://www.ncbi.nlm.nih.gov/pubmed/33817015 http://dx.doi.org/10.7717/peerj-cs.365

author	Bhandari, Nikita; Khare, Satyajeet; Walambe, Rahee; Kotecha, Ketan
collection	PubMed
description	Gene promoters are key DNA regulatory elements positioned around transcription start sites and are responsible for regulating the gene transcription process. Various alignment-based, signal-based and content-based approaches have been reported for promoter prediction. However, since not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct higher eukaryotes, viz. yeast (Saccharomyces cerevisiae), plant (Arabidopsis thaliana) and human (Homo sapiens). We compared one-hot vector encoding with frequency-based tokenization (FBT) for data pre-processing on a 1-D convolutional neural network (CNN) model. We found that FBT gives a shorter input dimension, reducing the training time without affecting the sensitivity and specificity of classification. We employed deep learning techniques, mainly a CNN and a recurrent neural network with long short-term memory (LSTM), and a random forest (RF) classifier for promoter classification at k-mer sizes of 2, 4 and 8. We found the CNN to be superior in classifying promoters versus non-promoter sequences (binary classification) as well as in species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of a synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
format	Online Article Text
id	pubmed-7959599
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-7959599 2021-04-02 PeerJ Comput Sci PeerJ Inc. 2021-02-09 /pmc/articles/PMC7959599/ /pubmed/33817015 http://dx.doi.org/10.7717/peerj-cs.365 Text en ©2021 Bhandari et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title	Comparison of machine learning and deep learning techniques in promoter prediction across diverse species
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959599/ https://www.ncbi.nlm.nih.gov/pubmed/33817015 http://dx.doi.org/10.7717/peerj-cs.365
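
The description in this record contrasts one-hot encoding with frequency-based tokenization (FBT) over k-mers and mentions a synthetic shuffled negative dataset. The record does not give implementation details, so the following is only a minimal sketch under stated assumptions: k-mers are taken with an overlapping stride of 1, FBT is read as "assign each k-mer an integer ranked by how often it occurs in the corpus", and negatives are made by shuffling the bases of a promoter sequence. All function names and the toy sequences are hypothetical and are not taken from the article.

```python
import random
from collections import Counter


def kmers(seq, k):
    """Split a sequence into overlapping k-mers (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(seqs, k):
    """Rank k-mers by corpus frequency: 1 = most frequent, 0 reserved for unseen."""
    counts = Counter(km for s in seqs for km in kmers(s, k))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: rank for rank, km in enumerate(ordered, start=1)}


def encode_fbt(seq, vocab, k):
    """Encode a sequence as one integer token per k-mer."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]


def encode_one_hot(seq):
    """Encode a sequence as one 4-element vector per base (unknown base = zeros)."""
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [table.get(base, [0, 0, 0, 0]) for base in seq]


def shuffled_negative(seq, rng):
    """Build a synthetic negative by shuffling bases; composition is preserved."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Toy sequences, invented for illustration only.
toy = ["TATAAAGGCT", "GGCTATAAAT"]
vocab = build_vocab(toy, k=2)
fbt = encode_fbt(toy[0], vocab, k=2)
one_hot = encode_one_hot(toy[0])
negative = shuffled_negative(toy[0], random.Random(0))

# Compare how many numbers each encoding feeds to a model for the same sequence.
print("FBT tokens:", len(fbt))
print("one-hot values:", len(one_hot) * 4)
print("shuffled negative:", negative)
```

Under these assumptions the token sequence holds one integer per k-mer, while one-hot holds four values per base, which is consistent with the shorter input dimension the description attributes to FBT. Other readings of FBT (for example, non-overlapping k-mers) would change the lengths but not the general idea.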