Predicting mostly disordered proteins by using structure-unknown protein data

BACKGROUND: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The...

Bibliographic Details
Main Authors:	Shimizu, Kana; Muraoka, Yoichi; Hirose, Shuichi; Tomii, Kentaro; Noguchi, Tamotsu
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1838436/ https://www.ncbi.nlm.nih.gov/pubmed/17338828 http://dx.doi.org/10.1186/1471-2105-8-78

author	Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu
author_facet	Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu
author_sort	Shimizu, Kana
collection	PubMed
description	BACKGROUND: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The structural distribution of proteins in nature can therefore be inferred to differ from that of proteins whose structures have been determined experimentally. We know many more protein sequences than we do protein structures, and many of the known sequences can be expected to be those of disordered proteins. Thus it would be efficient to use the information of structure-unknown proteins in order to avoid training data sparseness. We propose a novel method for predicting which proteins are mostly disordered by using spectral graph transducer and training with a huge amount of structure-unknown sequences as well as structure-known sequences. RESULTS: When the proposed method was evaluated on data that included 82 disordered proteins and 526 ordered proteins, its sensitivity was 0.723 and its specificity was 0.977. It resulted in a Matthews correlation coefficient 0.202 points higher than that obtained using FoldIndex, 0.221 points higher than that obtained using the method based on plotting hydrophobicity against the number of contacts and 0.07 points higher than that obtained using support vector machines (SVMs). To examine robustness against training data sparseness, we investigated the correlation between two results obtained when the method was trained on different datasets and tested on the same dataset. The correlation coefficient for the proposed method is 0.14 higher than that for the method using SVMs. 
When the proposed SGT-based method was compared with four per-residue predictors (VL3, GlobPlot, DISOPRED2 and IUPred (long)), its sensitivity was 0.834 for disordered proteins, which is 0.052–0.523 higher than that of the per-residue predictors, and its specificity was 0.991 for ordered proteins, which is 0.036–0.153 higher than that of the per-residue predictors. The proposed method was also evaluated on data that included 417 partially disordered proteins. It predicted the frequency of disordered proteins to be 1.95% for the proteins with 5%–10% disordered sequences, 1.46% for the proteins with 10%–20% disordered sequences and 16.57% for proteins with 20%–40% disordered sequences. CONCLUSION: The proposed method, which utilizes the information of structure-unknown data, predicts disordered proteins more accurately than other methods and is less affected by training data sparseness.
format	Text
id	pubmed-1838436
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-18384362007-04-04 Predicting mostly disordered proteins by using structure-unknown protein data Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu BMC Bioinformatics Methodology Article BACKGROUND: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The structural distribution of proteins in nature can therefore be inferred to differ from that of proteins whose structures have been determined experimentally. We know many more protein sequences than we do protein structures, and many of the known sequences can be expected to be those of disordered proteins. Thus it would be efficient to use the information of structure-unknown proteins in order to avoid training data sparseness. We propose a novel method for predicting which proteins are mostly disordered by using spectral graph transducer and training with a huge amount of structure-unknown sequences as well as structure-known sequences. RESULTS: When the proposed method was evaluated on data that included 82 disordered proteins and 526 ordered proteins, its sensitivity was 0.723 and its specificity was 0.977. It resulted in a Matthews correlation coefficient 0.202 points higher than that obtained using FoldIndex, 0.221 points higher than that obtained using the method based on plotting hydrophobicity against the number of contacts and 0.07 points higher than that obtained using support vector machines (SVMs). To examine robustness against training data sparseness, we investigated the correlation between two results obtained when the method was trained on different datasets and tested on the same dataset. The correlation coefficient for the proposed method is 0.14 higher than that for the method using SVMs. 
When the proposed SGT-based method was compared with four per-residue predictors (VL3, GlobPlot, DISOPRED2 and IUPred (long)), its sensitivity was 0.834 for disordered proteins, which is 0.052–0.523 higher than that of the per-residue predictors, and its specificity was 0.991 for ordered proteins, which is 0.036–0.153 higher than that of the per-residue predictors. The proposed method was also evaluated on data that included 417 partially disordered proteins. It predicted the frequency of disordered proteins to be 1.95% for the proteins with 5%–10% disordered sequences, 1.46% for the proteins with 10%–20% disordered sequences and 16.57% for proteins with 20%–40% disordered sequences. CONCLUSION: The proposed method, which utilizes the information of structure-unknown data, predicts disordered proteins more accurately than other methods and is less affected by training data sparseness. BioMed Central 2007-03-06 /pmc/articles/PMC1838436/ /pubmed/17338828 http://dx.doi.org/10.1186/1471-2105-8-78 Text en Copyright © 2007 Shimizu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu Predicting mostly disordered proteins by using structure-unknown protein data
title	Predicting mostly disordered proteins by using structure-unknown protein data
title_full	Predicting mostly disordered proteins by using structure-unknown protein data
title_fullStr	Predicting mostly disordered proteins by using structure-unknown protein data
title_full_unstemmed	Predicting mostly disordered proteins by using structure-unknown protein data
title_short	Predicting mostly disordered proteins by using structure-unknown protein data
title_sort	predicting mostly disordered proteins by using structure-unknown protein data
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1838436/ https://www.ncbi.nlm.nih.gov/pubmed/17338828 http://dx.doi.org/10.1186/1471-2105-8-78
work_keys_str_mv	AT shimizukana predictingmostlydisorderedproteinsbyusingstructureunknownproteindata AT muraokayoichi predictingmostlydisorderedproteinsbyusingstructureunknownproteindata AT hiroseshuichi predictingmostlydisorderedproteinsbyusingstructureunknownproteindata AT tomiikentaro predictingmostlydisorderedproteinsbyusingstructureunknownproteindata AT noguchitamotsu predictingmostlydisorderedproteinsbyusingstructureunknownproteindata
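
The RESULTS portion of the description field reports sensitivity, specificity, and differences in Matthews correlation coefficient (MCC), but not the MCC itself. The following is a minimal sketch of how these three measures relate, assuming the reported sensitivity (0.723) and specificity (0.977) apply directly to the 82 disordered and 526 ordered proteins. The function names are hypothetical, and the confusion-matrix counts are fractional expected values rather than counts taken from the article, so the resulting MCC is only an approximation.

```python
import math


def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Derive expected (possibly fractional) confusion-matrix counts.

    Assumption: the rates apply uniformly to the stated class sizes.
    """
    tp = sensitivity * n_pos   # disordered proteins predicted as disordered
    fn = n_pos - tp            # disordered proteins predicted as ordered
    tn = specificity * n_neg   # ordered proteins predicted as ordered
    fp = n_neg - tn            # ordered proteins predicted as disordered
    return tp, fp, tn, fn


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is taken as 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Figures quoted in the abstract: 82 disordered, 526 ordered proteins.
tp, fp, tn, fn = confusion_from_rates(0.723, 0.977, n_pos=82, n_neg=526)
approx_mcc = matthews_cc(tp, fp, tn, fn)
print(f"approximate MCC: {approx_mcc:.3f}")
```

Because the dataset is imbalanced (roughly one disordered protein for every six ordered ones), MCC is a more informative single-number summary than accuracy, which may be why the abstract reports comparisons in MCC points.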

Similar Items