
On the choice of negative examples for prediction of host-pathogen protein interactions

As practitioners of machine learning in the area of bioinformatics we know that the quality of the results crucially depends on the quality of our labeled data. While there is a tendency to focus on the quality of positive examples, the negative examples are equally as important. In this opinion pap...


Bibliographic Details
Main Authors:	Neumann, Don; Roy, Soumyadip; Minhas, Fayyaz Ul Amir Afsar; Ben-Hur, Asa
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Bioinformatics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9798088/ https://www.ncbi.nlm.nih.gov/pubmed/36591335 http://dx.doi.org/10.3389/fbinf.2022.1083292

_version_	1784860830513233920
author	Neumann, Don; Roy, Soumyadip; Minhas, Fayyaz Ul Amir Afsar; Ben-Hur, Asa
author_facet	Neumann, Don; Roy, Soumyadip; Minhas, Fayyaz Ul Amir Afsar; Ben-Hur, Asa
author_sort	Neumann, Don
collection	PubMed
description	As practitioners of machine learning in the area of bioinformatics we know that the quality of the results crucially depends on the quality of our labeled data. While there is a tendency to focus on the quality of positive examples, the negative examples are equally as important. In this opinion paper we revisit the problem of choosing negative examples for the task of predicting protein-protein interactions, either among proteins of a given species or for host-pathogen interactions and describe important issues that are prevalent in the current literature. The challenge in creating datasets for this task is the noisy nature of the experimentally derived interactions and the lack of information on non-interacting proteins. A standard approach is to choose random pairs of non-interacting proteins as negative examples. Since the interactomes of all species are only partially known, this leads to a very small percentage of false negatives. This is especially true for host-pathogen interactions. To address this perceived issue, some researchers have chosen to select negative examples as pairs of proteins whose sequence similarity to the positive examples is sufficiently low. This clearly reduces the chance for false negatives, but also makes the problem much easier than it really is, leading to over-optimistic accuracy estimates. We demonstrate the effect of this form of bias using a selection of recent protein interaction prediction methods of varying complexity, and urge researchers to pay attention to the details of generating their datasets for potential biases like this.
format	Online Article Text
id	pubmed-9798088
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-9798088 2022-12-30 Front Bioinform Bioinformatics Frontiers Media S.A. 2022-12-15 /pmc/articles/PMC9798088/ /pubmed/36591335 http://dx.doi.org/10.3389/fbinf.2022.1083292 Text en Copyright © 2022 Neumann, Roy, Minhas and Ben-Hur. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	On the choice of negative examples for prediction of host-pathogen protein interactions
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9798088/ https://www.ncbi.nlm.nih.gov/pubmed/36591335 http://dx.doi.org/10.3389/fbinf.2022.1083292
