Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes

BACKGROUND: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron struc...

Bibliographic Details
Main authors: Meyer, Corentin; Scalzitti, Nicolas; Jeannin-Girardon, Anne; Collet, Pierre; Poch, Olivier; Thompson, Julie D.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7656754/
https://www.ncbi.nlm.nih.gov/pubmed/33172385
http://dx.doi.org/10.1186/s12859-020-03855-1
author Meyer, Corentin
Scalzitti, Nicolas
Jeannin-Girardon, Anne
Collet, Pierre
Poch, Olivier
Thompson, Julie D.
author_sort Meyer, Corentin
collection PubMed
description BACKGROUND: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. RESULTS: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins. CONCLUSIONS: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
format Online
Article
Text
id pubmed-7656754
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7656754 2020-11-13 BMC Bioinformatics Research Article BioMed Central 2020-11-10 /pmc/articles/PMC7656754/ /pubmed/33172385 http://dx.doi.org/10.1186/s12859-020-03855-1 Text en © The Author(s) 2020. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7656754/
https://www.ncbi.nlm.nih.gov/pubmed/33172385
http://dx.doi.org/10.1186/s12859-020-03855-1