Techniques to cope with missing data in host–pathogen protein interaction prediction
Motivation: Approaches that use supervised machine learning techniques for protein–protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host...
Main Authors: | Kshirsagar, Meghana; Carbonell, Jaime; Klein-Seetharaman, Judith |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2012 |
Subjects: | Original Papers |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436802/ https://www.ncbi.nlm.nih.gov/pubmed/22962468 http://dx.doi.org/10.1093/bioinformatics/bts375 |
_version_ | 1782242700973572096 |
---|---|
author | Kshirsagar, Meghana Carbonell, Jaime Klein-Seetharaman, Judith |
author_facet | Kshirsagar, Meghana Carbonell, Jaime Klein-Seetharaman, Judith |
author_sort | Kshirsagar, Meghana |
collection | PubMed |
description | Motivation: Approaches that use supervised machine learning techniques for protein–protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host–pathogen PPI datasets have a large fraction of missing values, in the range of 58–85%, which makes it challenging to apply machine learning algorithms. Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross-species information in combination with machine learning techniques like Group lasso with ℓ1/ℓ2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella–human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 points in F1 score over the next best technique. We also apply our method to Yersinia–human PPI prediction successfully, demonstrating the generality of our approach. Availability: Predicted interactions, datasets and features are available at: http://www.cs.cmu.edu/~mkshirsa/eccb2012_paper46.html. Contact: judithks@cs.cmu.edu Supplementary Information: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-3436802 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-3436802 2012-12-12 Techniques to cope with missing data in host–pathogen protein interaction prediction Kshirsagar, Meghana Carbonell, Jaime Klein-Seetharaman, Judith Bioinformatics Original Papers Motivation: Approaches that use supervised machine learning techniques for protein–protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host–pathogen PPI datasets have a large fraction of missing values, in the range of 58–85%, which makes it challenging to apply machine learning algorithms. Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross-species information in combination with machine learning techniques like Group lasso with ℓ1/ℓ2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella–human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 points in F1 score over the next best technique. We also apply our method to Yersinia–human PPI prediction successfully, demonstrating the generality of our approach. Availability: Predicted interactions, datasets and features are available at: http://www.cs.cmu.edu/~mkshirsa/eccb2012_paper46.html. Contact: judithks@cs.cmu.edu Supplementary Information: Supplementary data are available at Bioinformatics online. Oxford University Press 2012-09-15 2012-09-03 /pmc/articles/PMC3436802/ /pubmed/22962468 http://dx.doi.org/10.1093/bioinformatics/bts375 Text en © The Author(s) (2012). Published by Oxford University Press.
http://creativecommons.org/licenses/by/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Kshirsagar, Meghana Carbonell, Jaime Klein-Seetharaman, Judith Techniques to cope with missing data in host–pathogen protein interaction prediction |
title | Techniques to cope with missing data in host–pathogen protein interaction prediction |
title_full | Techniques to cope with missing data in host–pathogen protein interaction prediction |
title_fullStr | Techniques to cope with missing data in host–pathogen protein interaction prediction |
title_full_unstemmed | Techniques to cope with missing data in host–pathogen protein interaction prediction |
title_short | Techniques to cope with missing data in host–pathogen protein interaction prediction |
title_sort | techniques to cope with missing data in host–pathogen protein interaction prediction |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436802/ https://www.ncbi.nlm.nih.gov/pubmed/22962468 http://dx.doi.org/10.1093/bioinformatics/bts375 |
work_keys_str_mv | AT kshirsagarmeghana techniquestocopewithmissingdatainhostpathogenproteininteractionprediction AT carbonelljaime techniquestocopewithmissingdatainhostpathogenproteininteractionprediction AT kleinseetharamanjudith techniquestocopewithmissingdatainhostpathogenproteininteractionprediction |
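
Note on the reported metrics: the abstract quotes 77.6% precision and 84% recall for Salmonella–human PPI prediction and compares methods by F1 score. As a minimal sketch of how those two figures combine, the snippet below applies the standard F1 definition (the harmonic mean of precision and recall). The function name `f1_score` is illustrative and not taken from the paper, and the result is computed from the rounded values quoted in the abstract, so it may differ slightly from the figure the authors report.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract (assumption: fractions of 1, not percent).
precision = 0.776
recall = 0.84

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")
```

Because F1 is a harmonic mean, it always lies between the two inputs and sits closer to the smaller of them, which is why it is a common single-number summary when precision and recall differ.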