GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins
We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET we proposed a m...
Main authors: | Brůna, Tomáš; Lomsadze, Alexandre; Borodovsky, Mark |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222226/ https://www.ncbi.nlm.nih.gov/pubmed/32440658 http://dx.doi.org/10.1093/nargab/lqaa026 |
_version_ | 1783533526441787392 |
---|---|
author | Brůna, Tomáš Lomsadze, Alexandre Borodovsky, Mark |
author_facet | Brůna, Tomáš Lomsadze, Alexandre Borodovsky, Mark |
author_sort | Brůna, Tomáš |
collection | PubMed |
description | We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET we proposed a method of integration of unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode). Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while the GeneMark-EP+ showed higher accuracy than GeneMark-ET. We have observed that the most pronounced improvements in gene prediction accuracy happened in large eukaryotic genomes. |
format | Online Article Text |
id | pubmed-7222226 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-72222262020-05-19 GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins Brůna, Tomáš Lomsadze, Alexandre Borodovsky, Mark NAR Genom Bioinform Methart We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET we proposed a method of integration of unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode). Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while the GeneMark-EP+ showed higher accuracy than GeneMark-ET. We have observed that the most pronounced improvements in gene prediction accuracy happened in large eukaryotic genomes. Oxford University Press 2020-05-13 /pmc/articles/PMC7222226/ /pubmed/32440658 http://dx.doi.org/10.1093/nargab/lqaa026 Text en © The Author(s) 2019. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. 
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Methart Brůna, Tomáš Lomsadze, Alexandre Borodovsky, Mark GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins |
title | GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins |
topic | Methart |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222226/ https://www.ncbi.nlm.nih.gov/pubmed/32440658 http://dx.doi.org/10.1093/nargab/lqaa026 |