
On the Impact of the Pangenome and Annotation Discrepancies While Building Protein Sequence Databases for Bacteria Proteogenomics

In proteomics, peptide information within mass spectrometry (MS) data from a specific organism sample is routinely matched against a protein sequence database that best represents that organism. However, if the species/strain in the sample is unknown or genetically poorly characterized, it becomes ch...


Bibliographic Details
Main authors: Machado, Karla C. T., Fortuin, Suereta, Tomazella, Gisele Guicardi, Fonseca, Andre F., Warren, Robin Mark, Wiker, Harald G., de Souza, Sandro Jose, de Souza, Gustavo Antonio
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2019
Subjects: Microbiology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6596428/
https://www.ncbi.nlm.nih.gov/pubmed/31281302
http://dx.doi.org/10.3389/fmicb.2019.01410
_version_ 1783430507958108160
author Machado, Karla C. T.
Fortuin, Suereta
Tomazella, Gisele Guicardi
Fonseca, Andre F.
Warren, Robin Mark
Wiker, Harald G.
de Souza, Sandro Jose
de Souza, Gustavo Antonio
collection PubMed
description In proteomics, peptide information within mass spectrometry (MS) data from a specific organism sample is routinely matched against a protein sequence database that best represents that organism. However, if the species/strain in the sample is unknown or genetically poorly characterized, it becomes challenging to determine which database can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and interesting genetic features, such as the existence of pan- and core genes within a species, are revealed, we questioned how efficiently such merging strategies report relevant information. To test this, we constructed databases containing conserved and unique sequences for 10 different species. Features relevant to probabilistic protein identification by proteomics were then monitored. As expected, the increase in database complexity correlates with pangenomic complexity. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity. We further tested database performance using MS data from eight clinical strains of M. tuberculosis and from two published datasets from Staphylococcus aureus. We show that with an approach in which database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases.
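The idea of controlling database size by removing repeated identical tryptic sequences across strains can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`tryptic_digest`, `merge_strains`), the `min_len` cutoff, the cleavage rule (after K or R, not before P, no missed cleavages) and the toy sequences are all assumptions made for illustration.

```python
import re


def tryptic_digest(protein, min_len=7):
    """In-silico trypsin digest (assumed rule): cleave after K or R
    unless the next residue is P; no missed cleavages are generated."""
    # Zero-width split points require Python 3.7 or later.
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    # Drop short fragments (and the empty string left by a C-terminal K/R).
    return [p for p in peptides if len(p) >= min_len]


def merge_strains(proteomes, min_len=7):
    """Merge several strains into one non-redundant peptide index.

    proteomes: dict mapping a strain name to a list of protein sequences.
    Returns (index, total), where index maps each unique peptide to the
    set of strains containing it, and total counts peptides before
    deduplication.
    """
    index = {}
    total = 0
    for strain, proteins in proteomes.items():
        for protein in proteins:
            for peptide in tryptic_digest(protein, min_len):
                total += 1
                index.setdefault(peptide, set()).add(strain)
    return index, total


if __name__ == "__main__":
    # Hypothetical toy sequences: two strains differing only at the C-terminus.
    toy = {
        "strain_1": ["AAAKCCCRPDDDKEEE"],
        "strain_2": ["AAAKCCCRPDDDKFFF"],
    }
    index, total = merge_strains(toy, min_len=3)
    print(f"{total} peptides before merging, {len(index)} unique after")
    for peptide, strains in sorted(index.items()):
        print(peptide, sorted(strains))
```

In this sketch, peptides shared by every strain are stored once, so the merged index grows only with strain-specific sequences rather than with the number of strains added.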
format Online
Article
Text
id pubmed-6596428
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-6596428 2019-07-05 On the Impact of the Pangenome and Annotation Discrepancies While Building Protein Sequence Databases for Bacteria Proteogenomics Machado, Karla C. T.; Fortuin, Suereta; Tomazella, Gisele Guicardi; Fonseca, Andre F.; Warren, Robin Mark; Wiker, Harald G.; de Souza, Sandro Jose; de Souza, Gustavo Antonio Front Microbiol Microbiology Frontiers Media S.A.
2019-06-20 /pmc/articles/PMC6596428/ /pubmed/31281302 http://dx.doi.org/10.3389/fmicb.2019.01410 Text en Copyright © 2019 Machado, Fortuin, Tomazella, Fonseca, Warren, Wiker, de Souza and de Souza. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title On the Impact of the Pangenome and Annotation Discrepancies While Building Protein Sequence Databases for Bacteria Proteogenomics
topic Microbiology