
Optimizing metaproteomics database construction: lessons from a study of the vaginal microbiome

Metaproteomics, a method for untargeted, high-throughput identification of proteins in complex samples, provides functional information about microbial communities and can tie functions to specific taxa. Metaproteomics often generates less data than other omics techniques, but analytical workflows c...


Bibliographic Details
Main Authors: Lee, Elliot M., Srinivasan, Sujatha, Purvine, Samuel O., Fiedler, Tina L., Leiser, Owen P., Proll, Sean C., Minot, Samuel S., Deatherage Kaiser, Brooke L., Fredricks, David N.
Format: Online Article Text
Language: English
Published: American Society for Microbiology 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10469846/
https://www.ncbi.nlm.nih.gov/pubmed/37350639
http://dx.doi.org/10.1128/msystems.00678-22
author Lee, Elliot M.
Srinivasan, Sujatha
Purvine, Samuel O.
Fiedler, Tina L.
Leiser, Owen P.
Proll, Sean C.
Minot, Samuel S.
Deatherage Kaiser, Brooke L.
Fredricks, David N.
collection PubMed
description Metaproteomics, a method for untargeted, high-throughput identification of proteins in complex samples, provides functional information about microbial communities and can tie functions to specific taxa. Metaproteomics often generates less data than other omics techniques, but analytical workflows can be improved to increase usable data in metaproteomic outputs. Identification of peptides in the metaproteomic analysis is performed by comparing mass spectra of sample peptides to a reference database of protein sequences. Although these protein databases are an integral part of the metaproteomic analysis, few studies have explored how database composition impacts peptide identification. Here, we used cervicovaginal lavage (CVL) samples from a study of bacterial vaginosis (BV) to compare the performance of databases built using six different strategies. We evaluated broad versus sample-matched databases, as well as databases populated with proteins translated from metagenomic sequencing of the same samples versus sequences from public repositories. Smaller sample-matched databases performed significantly better, driven by the statistical constraints on large databases. Additionally, large databases attributed up to 34% of significant bacterial hits to taxa absent from the sample, as determined orthogonally by 16S rRNA gene sequencing. We also tested a set of hybrid databases which included bacterial proteins from NCBI RefSeq and translated bacterial genes from the samples. These hybrid databases had the best overall performance, identifying 1,068 unique human and 1,418 unique bacterial proteins, ~30% more than a database populated with proteins from typical vaginal bacteria and fungi. Our findings can help guide the optimal identification of proteins while maintaining statistical power for reaching biological conclusions. 
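The hybrid strategy described above combines public reference proteins with proteins translated from the samples' own metagenomes. A minimal sketch of that merging step follows, assuming both sources are available as FASTA text and that entries with identical sequences should be collapsed to one. The function names, headers, and sequences are hypothetical placeholders, not the authors' pipeline or data.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_databases(*sources):
    """Merge FASTA sources, keeping the first header seen for each sequence."""
    merged = {}  # sequence -> header; dicts preserve insertion order
    for text in sources:
        for header, seq in parse_fasta(text):
            merged.setdefault(seq.upper(), header)
    return [(header, seq) for seq, header in merged.items()]


# Hypothetical inputs: two public entries and two translated metagenomic genes,
# one of which duplicates a public sequence.
refseq = ">ref_A hypothetical public entry\nMKTAYIAK\n>ref_B hypothetical public entry\nMSLLNNQE\n"
metagenome = ">mg_1 hypothetical translated gene\nMKTAYIAK\n>mg_2 hypothetical translated gene\nMPEPTIDE\n"
hybrid = merge_databases(refseq, metagenome)
```

Here `hybrid` holds three entries, because `mg_1` repeats the sequence of `ref_A` and is dropped. Collapsing exact duplicates keeps the search space from growing without adding identifiable proteins, which matters given the size penalty the abstract reports for large databases.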
IMPORTANCE: Metaproteomic analysis can provide valuable insights into the functions of microbial and cellular communities by identifying a broad, untargeted set of proteins. The databases used in the analysis of metaproteomic data influence results by defining what proteins can be identified. Moreover, the size of the database impacts the number of identifications after accounting for false discovery rates (FDRs). Few studies have tested the performance of different strategies for building a protein database to identify proteins from metaproteomic data and those that have largely focused on highly diverse microbial communities. We tested a range of databases on CVL samples and found that a hybrid sample-matched approach, using publicly available proteins from organisms present in the samples, as well as proteins translated from metagenomic sequencing of the samples, had the best performance. However, our results also suggest that public sequence databases will continue to improve as more bacterial genomes are published.
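The link between database size and identifications after FDR control can be illustrated with a generic target-decoy calculation. The sketch below is an assumption-laden toy, not the scoring or filtering used in the study: it takes hypothetical (score, is_decoy) pairs, estimates FDR as decoys divided by targets above a cutoff, and counts the targets accepted at the most permissive cutoff that stays within the limit. Score ties are ignored for simplicity.

```python
def accepted_targets(psms, max_fdr=0.01):
    """Count target matches accepted at the most permissive score cutoff
    whose estimated FDR (decoys / targets at or above it) is <= max_fdr.

    psms: iterable of (score, is_decoy) pairs; a higher score is better.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the largest target count seen while FDR is within limit.
        if targets and decoys / targets <= max_fdr:
            best = targets
    return best


# Hypothetical scores: 100 target matches with scores 100, 99, ..., 1.
target_hits = [(s, False) for s in range(100, 0, -1)]

# Contrived "small database": a single low-scoring decoy match.
small = target_hits + [(0.5, True)]

# Contrived "large database": a bigger search space produces more and
# higher-scoring decoy matches by chance.
large = target_hits + [(50.5, True), (40.5, True), (30.5, True)]

print(accepted_targets(small))  # 100
print(accepted_targets(large))  # 50
```

With the same target matches, the extra decoys in the second case push the 1% cutoff higher and halve the accepted identifications. The decoy placement is invented for illustration; real searches show the same direction of effect with far less tidy numbers.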
format Online
Article
Text
id pubmed-10469846
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Society for Microbiology
record_format MEDLINE/PubMed
spelling pubmed-104698462023-09-01 Optimizing metaproteomics database construction: lessons from a study of the vaginal microbiome Lee, Elliot M. Srinivasan, Sujatha Purvine, Samuel O. Fiedler, Tina L. Leiser, Owen P. Proll, Sean C. Minot, Samuel S. Deatherage Kaiser, Brooke L. Fredricks, David N. mSystems Research Article Metaproteomics, a method for untargeted, high-throughput identification of proteins in complex samples, provides functional information about microbial communities and can tie functions to specific taxa. Metaproteomics often generates less data than other omics techniques, but analytical workflows can be improved to increase usable data in metaproteomic outputs. Identification of peptides in the metaproteomic analysis is performed by comparing mass spectra of sample peptides to a reference database of protein sequences. Although these protein databases are an integral part of the metaproteomic analysis, few studies have explored how database composition impacts peptide identification. Here, we used cervicovaginal lavage (CVL) samples from a study of bacterial vaginosis (BV) to compare the performance of databases built using six different strategies. We evaluated broad versus sample-matched databases, as well as databases populated with proteins translated from metagenomic sequencing of the same samples versus sequences from public repositories. Smaller sample-matched databases performed significantly better, driven by the statistical constraints on large databases. Additionally, large databases attributed up to 34% of significant bacterial hits to taxa absent from the sample, as determined orthogonally by 16S rRNA gene sequencing. We also tested a set of hybrid databases which included bacterial proteins from NCBI RefSeq and translated bacterial genes from the samples. 
These hybrid databases had the best overall performance, identifying 1,068 unique human and 1,418 unique bacterial proteins, ~30% more than a database populated with proteins from typical vaginal bacteria and fungi. Our findings can help guide the optimal identification of proteins while maintaining statistical power for reaching biological conclusions. IMPORTANCE: Metaproteomic analysis can provide valuable insights into the functions of microbial and cellular communities by identifying a broad, untargeted set of proteins. The databases used in the analysis of metaproteomic data influence results by defining what proteins can be identified. Moreover, the size of the database impacts the number of identifications after accounting for false discovery rates (FDRs). Few studies have tested the performance of different strategies for building a protein database to identify proteins from metaproteomic data and those that have largely focused on highly diverse microbial communities. We tested a range of databases on CVL samples and found that a hybrid sample-matched approach, using publicly available proteins from organisms present in the samples, as well as proteins translated from metagenomic sequencing of the samples, had the best performance. However, our results also suggest that public sequence databases will continue to improve as more bacterial genomes are published. American Society for Microbiology 2023-06-23 /pmc/articles/PMC10469846/ /pubmed/37350639 http://dx.doi.org/10.1128/msystems.00678-22 Text en Copyright © 2023 Lee et al. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
title Optimizing metaproteomics database construction: lessons from a study of the vaginal microbiome
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10469846/
https://www.ncbi.nlm.nih.gov/pubmed/37350639
http://dx.doi.org/10.1128/msystems.00678-22