
Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data

BACKGROUND: Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines developed by various computing techniques are available for gene prediction. However, these systems have yet to accurately predict all or even most of the p...


Bibliographic Details
Main Authors:	Chan, Kuang-Lim; Rosli, Rozana; Tatarinova, Tatiana V.; Hogan, Michael; Firdaus-Raih, Mohd; Low, Eng-Ti Leslie
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333190/ https://www.ncbi.nlm.nih.gov/pubmed/28466793 http://dx.doi.org/10.1186/s12859-016-1426-6

_version_	1782511682216525824
author	Chan, Kuang-Lim Rosli, Rozana Tatarinova, Tatiana V. Hogan, Michael Firdaus-Raih, Mohd Low, Eng-Ti Leslie
author_facet	Chan, Kuang-Lim Rosli, Rozana Tatarinova, Tatiana V. Hogan, Michael Firdaus-Raih, Mohd Low, Eng-Ti Leslie
author_sort	Chan, Kuang-Lim
collection	PubMed
description	BACKGROUND: Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines developed using various computing techniques are available for gene prediction. However, these systems have yet to accurately predict all or even most of the protein-coding regions. Furthermore, none of the currently available gene-finders has a universal Hidden Markov Model (HMM) that can perform gene prediction for all organisms equally well in an automatic fashion. RESULTS: We present an automated gene prediction pipeline, Seqping, that uses self-training HMM models and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by the MAKER2 program to combine predictions from the three tools in association with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO’s plantae dataset. Our evaluation shows that Seqping was able to generate better gene predictions compared to three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) using their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 for nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 for nucleotide structure). CONCLUSIONS: Seqping provides researchers with a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available HMMs.
format	Online Article Text
id	pubmed-5333190
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-53331902017-03-06 Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data Chan, Kuang-Lim Rosli, Rozana Tatarinova, Tatiana V. Hogan, Michael Firdaus-Raih, Mohd Low, Eng-Ti Leslie BMC Bioinformatics Research BACKGROUND: Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines developed by various computing techniques are available for gene prediction. However, these systems have yet to accurately predict all or even most of the protein-coding regions. Furthermore, none of the currently available gene-finders has a universal Hidden Markov Model (HMM) that can perform gene prediction for all organisms equally well in an automatic fashion. RESULTS: We present an automated gene prediction pipeline, Seqping that uses self-training HMM models and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by MAKER2 program to combine predictions from the three tools in association with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO’s plantae dataset. Our evaluation shows that Seqping was able to generate better gene predictions compared to three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) using their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 nucleotide structure). 
CONCLUSIONS: Seqping provides researchers a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available HMMs. BioMed Central 2017-01-27 /pmc/articles/PMC5333190/ /pubmed/28466793 http://dx.doi.org/10.1186/s12859-016-1426-6 Text en © The Author(s). 2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Chan, Kuang-Lim Rosli, Rozana Tatarinova, Tatiana V. Hogan, Michael Firdaus-Raih, Mohd Low, Eng-Ti Leslie Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data
title	Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333190/ https://www.ncbi.nlm.nih.gov/pubmed/28466793 http://dx.doi.org/10.1186/s12859-016-1426-6

