SpliceFinder: ab initio prediction of splice sites using convolutional neural network
BACKGROUND: Identifying splice sites is a necessary step to analyze the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent on splice sites, and many other patterns are also on splice sites with important biological functions. Meanwhile, the dinucleotides occur frequen...
Main Authors: | Wang, Ruohan; Wang, Zishuai; Wang, Jianping; Li, Shuaicheng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6933889/ https://www.ncbi.nlm.nih.gov/pubmed/31881982 http://dx.doi.org/10.1186/s12859-019-3306-3 |
_version_ | 1783483297016315904 |
---|---|
author | Wang, Ruohan; Wang, Zishuai; Wang, Jianping; Li, Shuaicheng |
author_facet | Wang, Ruohan; Wang, Zishuai; Wang, Jianping; Li, Shuaicheng |
author_sort | Wang, Ruohan |
collection | PubMed |
description | BACKGROUND: Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur at splice sites. However, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. Such an approach reduces false positives; however, it misses non-canonical splice sites. RESULT: We have designed SpliceFinder, based on a convolutional neural network (CNN), to predict splice sites. To achieve ab initio prediction, we used human genomic data to train our neural network. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN obtains a classification accuracy of 90.25%, which is 10% higher than the existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall higher than 0.8. SpliceFinder also captures non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. CONCLUSION: Based on CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder. |
format | Online Article Text |
id | pubmed-6933889 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6933889 2019-12-30 SpliceFinder: ab initio prediction of splice sites using convolutional neural network Wang, Ruohan; Wang, Zishuai; Wang, Jianping; Li, Shuaicheng BMC Bioinformatics Methodology BioMed Central 2019-12-27 /pmc/articles/PMC6933889/ /pubmed/31881982 http://dx.doi.org/10.1186/s12859-019-3306-3 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Wang, Ruohan Wang, Zishuai Wang, Jianping Li, Shuaicheng SpliceFinder: ab initio prediction of splice sites using convolutional neural network |
title | SpliceFinder: ab initio prediction of splice sites using convolutional neural network |
topic | Methodology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6933889/ https://www.ncbi.nlm.nih.gov/pubmed/31881982 http://dx.doi.org/10.1186/s12859-019-3306-3 |
work_keys_str_mv | AT wangruohan splicefinderabinitiopredictionofsplicesitesusingconvolutionalneuralnetwork AT wangzishuai splicefinderabinitiopredictionofsplicesitesusingconvolutionalneuralnetwork AT wangjianping splicefinderabinitiopredictionofsplicesitesusingconvolutionalneuralnetwork AT lishuaicheng splicefinderabinitiopredictionofsplicesitesusingconvolutionalneuralnetwork |
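
The abstract states that SpliceFinder locates splice sites on long genomic sequences with a sliding window. The following is a minimal Python sketch of that general idea only, not the authors' implementation: the window length, the label names ("donor", "acceptor", "none"), and the helper names one_hot, scan, and toy_classifier are all hypothetical. In particular, toy_classifier is a placeholder for a trained CNN; it simply checks for GT or AG at the window centre, which is exactly the canonical-dimer shortcut the paper aims to go beyond.

```python
from typing import Callable, List, Tuple

BASES = "ACGT"


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Characters outside ACGT (for example N) become all-zero rows.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def scan(
    sequence: str,
    window: int,
    classify: Callable[[List[List[int]]], str],
    step: int = 1,
) -> List[Tuple[int, str]]:
    """Slide a fixed-length window along the sequence.

    Returns (centre_position, label) for every window whose label is not
    "none". Positions are 0-based indices into the full sequence.
    """
    hits = []
    for start in range(0, len(sequence) - window + 1, step):
        label = classify(one_hot(sequence[start:start + window]))
        if label != "none":
            hits.append((start + window // 2, label))
    return hits


def toy_classifier(encoded: List[List[int]]) -> str:
    """Hypothetical stand-in for a trained model (assumption, not the paper's CNN).

    Looks only at the two bases at the window centre.
    """
    a, g, t = [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
    mid = len(encoded) // 2
    if encoded[mid] == g and encoded[mid + 1] == t:
        return "donor"
    if encoded[mid] == a and encoded[mid + 1] == g:
        return "acceptor"
    return "none"


if __name__ == "__main__":
    for position, label in scan("CCCCGTCCCCAGCCCC", 4, toy_classifier):
        print(position, label)
```

In practice the classify argument would wrap a trained model's prediction call, and the window would be far longer than the 4 bases used here for illustration.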