Techniques to cope with missing data in host–pathogen protein interaction prediction

Motivation: Approaches that use supervised machine learning techniques for protein–protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host...

Bibliographic Details
Main authors: Kshirsagar, Meghana; Carbonell, Jaime; Klein-Seetharaman, Judith
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects: Original Papers
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436802/
https://www.ncbi.nlm.nih.gov/pubmed/22962468
http://dx.doi.org/10.1093/bioinformatics/bts375
collection PubMed
description Motivation: Approaches that use supervised machine learning techniques for protein–protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host–pathogen PPI datasets have a large fraction of missing values, in the range of 58–85%, which makes it challenging to apply machine learning algorithms. Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross-species information in combination with machine learning techniques such as group lasso with ℓ1/ℓ2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example, Salmonella–human PPI prediction, we obtain 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 points in F1 score over the next best technique. We also successfully apply our method to Yersinia–human PPI prediction, demonstrating the generality of our approach. Availability: Predicted interactions, datasets and features are available at: http://www.cs.cmu.edu/~mkshirsa/eccb2012_paper46.html. Contact: judithks@cs.cmu.edu Supplementary Information: Supplementary data are available at Bioinformatics online.
id pubmed-3436802
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-3436802 2012-12-12 Bioinformatics Original Papers Oxford University Press 2012-09-15 2012-09-03 /pmc/articles/PMC3436802/ /pubmed/22962468 http://dx.doi.org/10.1093/bioinformatics/bts375 Text en © The Author(s) (2012). Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Original Papers