Addressing Statistical Biases in Nucleotide-Derived Protein Databases for Proteogenomic Search Strategies

Proteogenomics has the potential to advance genome annotation through high quality peptide identifications derived from mass spectrometry experiments, which demonstrate a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of g...

Bibliographic Details
Main Authors:	Blakeley, Paul, Overton, Ian M., Hubbard, Simon J.
Format:	Online Article Text
Language:	English
Published:	American Chemical Society 2012
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3703792/ https://www.ncbi.nlm.nih.gov/pubmed/23025403 http://dx.doi.org/10.1021/pr300411q

_version_	1782275943485669376
author	Blakeley, Paul Overton, Ian M. Hubbard, Simon J.
author_facet	Blakeley, Paul Overton, Ian M. Hubbard, Simon J.
author_sort	Blakeley, Paul
collection	PubMed
description	Proteogenomics has the potential to advance genome annotation through high quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structure that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates, leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. The source of this error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five “incorrect” targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives.
format	Online Article Text
id	pubmed-3703792
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	American Chemical Society
record_format	MEDLINE/PubMed
spelling	pubmed-3703792 2013-07-08 Addressing Statistical Biases in Nucleotide-Derived Protein Databases for Proteogenomic Search Strategies Blakeley, Paul Overton, Ian M. Hubbard, Simon J. J Proteome Res American Chemical Society 2012-10-02 2012-11-02 /pmc/articles/PMC3703792/ /pubmed/23025403 http://dx.doi.org/10.1021/pr300411q Text en Copyright © 2012 American Chemical Society
spellingShingle	Blakeley, Paul Overton, Ian M. Hubbard, Simon J. Addressing Statistical Biases in Nucleotide-Derived Protein Databases for Proteogenomic Search Strategies
title	Addressing Statistical Biases in Nucleotide-Derived Protein Databases for Proteogenomic Search Strategies
title_sort	addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3703792/ https://www.ncbi.nlm.nih.gov/pubmed/23025403 http://dx.doi.org/10.1021/pr300411q
work_keys_str_mv	AT blakeleypaul addressingstatisticalbiasesinnucleotidederivedproteindatabasesforproteogenomicsearchstrategies AT overtonianm addressingstatisticalbiasesinnucleotidederivedproteindatabasesforproteogenomicsearchstrategies AT hubbardsimonj addressingstatisticalbiasesinnucleotidederivedproteindatabasesforproteogenomicsearchstrategies
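
The abstract in this record refers to target:decoy estimation of the false discovery rate for peptide spectrum matches. As a rough illustration only, and not the authors' pipeline, the following Python sketch shows one common way such an estimate can be computed. It assumes each PSM is a hypothetical `(score, is_decoy)` pair, that higher scores are better, and that FDR at a threshold is approximated as decoy hits divided by target hits. The function names are invented for this example.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (search score, True if the match is to a decoy sequence)


def target_decoy_qvalues(psms: List[PSM]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    Assumption: FDR at a score threshold is estimated as
    (#decoys at or above threshold) / (#targets at or above threshold).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Walk down the ranked list, accumulating target and decoy counts.
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this threshold or any more permissive one.
    qvals = [0.0] * len(fdrs)
    running_min = float("inf")
    for i in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        qvals[i] = running_min

    return [(s, d, q) for (s, d), q in zip(ranked, qvals)]


def accepted_targets(psms: List[PSM], threshold: float = 0.01) -> int:
    """Count target PSMs whose q-value is at or below the threshold."""
    return sum(
        1
        for _s, is_decoy, q in target_decoy_qvalues(psms)
        if not is_decoy and q <= threshold
    )


# Toy data, invented for illustration: four target hits and one decoy hit.
example = [(10.0, False), (9.0, False), (8.0, False), (7.0, True), (6.0, False)]
print(accepted_targets(example, threshold=0.01))
print(accepted_targets(example, threshold=0.25))
```

Under this kind of estimate, a database that contributes more high-scoring decoy hits for the same set of target hits raises the estimated FDR at a given score, so fewer PSMs pass a fixed threshold. That is the general mechanism the abstract attributes to inflated six-frame databases.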

Similar Items