Critical evaluation of the use of artificial data for machine learning based de novo peptide identification

Proteins are essential components of all living cells and so the study of their in situ expression, proteomics, has wide reaching applications. Peptide identification in proteomics typically relies on matching high resolution tandem mass spectra to a protein database but can also be performed de nov...

Bibliographic Details
Main authors: McDonnell, Kevin; Howley, Enda; Abram, Florence
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10165132/
https://www.ncbi.nlm.nih.gov/pubmed/37168871
http://dx.doi.org/10.1016/j.csbj.2023.04.014
_version_ 1785038205078208512
author McDonnell, Kevin
Howley, Enda
Abram, Florence
author_facet McDonnell, Kevin
Howley, Enda
Abram, Florence
author_sort McDonnell, Kevin
collection PubMed
description Proteins are essential components of all living cells and so the study of their in situ expression, proteomics, has wide reaching applications. Peptide identification in proteomics typically relies on matching high resolution tandem mass spectra to a protein database but can also be performed de novo. While artificial spectra have been successfully incorporated into database search pipelines to increase peptide identification rates, little work has been done to investigate the utility of artificial spectra in the context of de novo peptide identification. Here, we perform a critical analysis of the use of artificial data for the training and evaluation of de novo peptide identification algorithms. First, we classify the different fragment ion types present in real spectra and then estimate the number of spurious matches using random peptides. We then categorise the different types of noise present in real spectra. Finally, we transfer this knowledge to artificial data and test the performance of a state-of-the-art de novo peptide identification algorithm trained using artificial spectra with and without relevant noise addition. Noise supplementation increased artificial training data performance from 30% to 77% of real training data peptide recall. While real data performance was not fully replicated, this work provides the first steps towards an artificial spectrum framework for the training and evaluation of de novo peptide identification algorithms. Further enhanced artificial spectra may allow for more in depth analysis of de novo algorithms as well as alleviating the reliance on database searches for training data.
format Online
Article
Text
id pubmed-10165132
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-10165132 2023-05-09 Critical evaluation of the use of artificial data for machine learning based de novo peptide identification. McDonnell, Kevin; Howley, Enda; Abram, Florence. Comput Struct Biotechnol J. Research Article.
Research Network of Computational and Structural Biotechnology 2023-04-17 /pmc/articles/PMC10165132/ /pubmed/37168871 http://dx.doi.org/10.1016/j.csbj.2023.04.014 Text en © 2023 The Authors. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).
title Critical evaluation of the use of artificial data for machine learning based de novo peptide identification
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10165132/
https://www.ncbi.nlm.nih.gov/pubmed/37168871
http://dx.doi.org/10.1016/j.csbj.2023.04.014