
Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles

BACKGROUND: Experimentally verified protein-protein interactions (PPI) cannot be easily retrieved by researchers unless they are stored in PPI databases. The curation of such databases can be made faster by ranking newly-published articles' relevance to PPI, a task which we approach here by des...


Bibliographic Details
Main Authors:	Tsai, Richard Tzong-Han, Hung, Hsi-Chuan, Dai, Hong-Jie, Lin, Yi-Wen, Hsu, Wen-Lian
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2259404/ https://www.ncbi.nlm.nih.gov/pubmed/18315856 http://dx.doi.org/10.1186/1471-2105-9-S1-S3

author	Tsai, Richard Tzong-Han Hung, Hsi-Chuan Dai, Hong-Jie Lin, Yi-Wen Hsu, Wen-Lian
collection	PubMed
description	BACKGROUND: Experimentally verified protein-protein interactions (PPI) cannot be easily retrieved by researchers unless they are stored in PPI databases. The curation of such databases can be made faster by ranking newly-published articles' relevance to PPI, a task which we approach here by designing a machine-learning-based PPI classifier. All classifiers require labeled data, and the more labeled data available, the more reliable they become. Although many PPI databases with large numbers of labeled articles are available, incorporating these databases into the base training data may actually reduce classification performance since the supplementary databases may not annotate exactly the same PPI types as the base training data. Our first goal in this paper is to find a method of selecting likely positive data from such supplementary databases. Only extracting likely positive data, however, will bias the classification model unless sufficient negative data is also added. Unfortunately, negative data is very hard to obtain because there are no resources that compile such information. Therefore, our second aim is to select such negative data from unlabeled PubMed data. Thirdly, we explore how to exploit these likely positive and negative data. And lastly, we look at the somewhat unrelated question of which term-weighting scheme is most effective for identifying PPI-related articles. RESULTS: To evaluate the performance of our PPI text classifier, we conducted experiments based on the BioCreAtIvE-II IAS dataset. Our results show that adding likely-labeled data generally increases AUC by 3~6%, indicating better ranking ability. Our experiments also show that our newly-proposed term-weighting scheme has the highest AUC among all common weighting schemes. Our final model achieves an F-measure and AUC 2.9% and 5.0% higher than those of the top-ranking system in the IAS challenge. 
CONCLUSION: Our experiments demonstrate the effectiveness of integrating unlabeled and likely labeled data to augment a PPI text classification system. Our mixed model is suitable for ranking purposes whereas our hierarchical model is better for filtering. In addition, our results indicate that supervised weighting schemes outperform unsupervised ones. Our newly-proposed weighting scheme, TFBRF, which considers documents that do not contain the target word, avoids some of the biases found in traditional weighting schemes. Our experiment results show TFBRF to be the most effective among several other top weighting schemes.
format	Text
id	pubmed-2259404
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2259404 2008-03-04 Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles Tsai, Richard Tzong-Han Hung, Hsi-Chuan Dai, Hong-Jie Lin, Yi-Wen Hsu, Wen-Lian BMC Bioinformatics Proceedings BioMed Central 2008-02-13 /pmc/articles/PMC2259404/ /pubmed/18315856 http://dx.doi.org/10.1186/1471-2105-9-S1-S3 Text en Copyright © 2008 Tsai et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles
topic	Proceedings
