Predicting mostly disordered proteins by using structure-unknown protein data

BACKGROUND: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The...

Bibliographic Details
Main authors: Shimizu, Kana; Muraoka, Yoichi; Hirose, Shuichi; Tomii, Kentaro; Noguchi, Tamotsu
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1838436/
https://www.ncbi.nlm.nih.gov/pubmed/17338828
http://dx.doi.org/10.1186/1471-2105-8-78
author Shimizu, Kana
Muraoka, Yoichi
Hirose, Shuichi
Tomii, Kentaro
Noguchi, Tamotsu
collection PubMed
description BACKGROUND: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. Far more structures are known for ordered proteins than for disordered proteins, so the structural distribution of proteins in nature can be inferred to differ from that of proteins whose structures have been determined experimentally. Many more protein sequences are known than protein structures, and many of the known sequences can be expected to be those of disordered proteins. Using the information in structure-unknown proteins should therefore help to avoid training data sparseness. We propose a novel method for predicting which proteins are mostly disordered by using a spectral graph transducer (SGT) trained on a large number of structure-unknown sequences as well as structure-known sequences.
RESULTS: When the proposed method was evaluated on data that included 82 disordered proteins and 526 ordered proteins, its sensitivity was 0.723 and its specificity was 0.977. Its Matthews correlation coefficient was 0.202 points higher than that obtained using FoldIndex, 0.221 points higher than that obtained using the method based on plotting hydrophobicity against the number of contacts, and 0.07 points higher than that obtained using support vector machines (SVMs). To examine robustness against training data sparseness, we investigated the correlation between two results obtained when the method was trained on different datasets and tested on the same dataset. The correlation coefficient for the proposed method was 0.14 higher than that for the method using SVMs.
When the proposed SGT-based method was compared with four per-residue predictors (VL3, GlobPlot, DISOPRED2 and IUPred (long)), its sensitivity was 0.834 for disordered proteins, which is 0.052–0.523 higher than that of the per-residue predictors, and its specificity was 0.991 for ordered proteins, which is 0.036–0.153 higher than that of the per-residue predictors. The proposed method was also evaluated on data that included 417 partially disordered proteins. The fraction of proteins it predicted to be disordered was 1.95% for proteins with 5%–10% disordered sequence, 1.46% for proteins with 10%–20% disordered sequence and 16.57% for proteins with 20%–40% disordered sequence.
CONCLUSION: The proposed method, which utilizes the information of structure-unknown data, predicts disordered proteins more accurately than other methods and is less affected by training data sparseness.
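The abstract reports sensitivity, specificity and differences in the Matthews correlation coefficient (MCC), but not how the three relate. The sketch below is an illustration only, not the authors' code: it rebuilds an approximate confusion matrix from the reported sensitivity (0.723), specificity (0.977) and class sizes (82 disordered, 526 ordered) and computes the MCC from it. The function names are hypothetical, and the resulting counts are fractional because the reported rates are presumably averaged over evaluation runs, so the value it yields is an estimate and is not a figure stated in the record.

```python
import math


def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Return expected (tp, fp, tn, fn) implied by the rates and class sizes.

    Counts are kept as floats: this is an assumption-laden reconstruction,
    not the confusion matrix published by the authors.
    """
    tp = sensitivity * n_pos      # disordered proteins predicted disordered
    fn = n_pos - tp               # disordered proteins predicted ordered
    tn = specificity * n_neg      # ordered proteins predicted ordered
    fp = n_neg - tn               # ordered proteins predicted disordered
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when a marginal sum is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Values taken from the abstract; "positive" here means "mostly disordered".
    counts = confusion_from_rates(0.723, 0.977, n_pos=82, n_neg=526)
    print("approximate MCC: %.3f" % mcc(*counts))
```

Because ordered proteins outnumber disordered ones by more than six to one in this dataset, MCC is a more informative summary than accuracy, which a predictor could inflate by labelling every protein as ordered.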
format Text
id pubmed-1838436
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1838436 2007-04-04 Predicting mostly disordered proteins by using structure-unknown protein data Shimizu, Kana Muraoka, Yoichi Hirose, Shuichi Tomii, Kentaro Noguchi, Tamotsu BMC Bioinformatics Methodology Article BioMed Central 2007-03-06 /pmc/articles/PMC1838436/ /pubmed/17338828 http://dx.doi.org/10.1186/1471-2105-8-78 Text en Copyright © 2007 Shimizu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Predicting mostly disordered proteins by using structure-unknown protein data
topic Methodology Article