On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides
BACKGROUND: Amyloids are proteins capable of forming aberrant intramolecular contact sites, characteristic of beta zipper configuration. Amyloids can underlie serious health conditions, e.g. Alzheimer’s or Parkinson’s diseases. It has been proposed that short segments of amino acids can be responsib...
Main authors: | Kotulska, Malgorzata, Unold, Olgierd |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879009/ https://www.ncbi.nlm.nih.gov/pubmed/24305169 http://dx.doi.org/10.1186/1471-2105-14-351 |
_version_ | 1782297903177400320 |
---|---|
author | Kotulska, Malgorzata Unold, Olgierd |
author_facet | Kotulska, Malgorzata Unold, Olgierd |
author_sort | Kotulska, Malgorzata |
collection | PubMed |
description | BACKGROUND: Amyloids are proteins capable of forming aberrant intramolecular contact sites, characteristic of the beta zipper configuration. Amyloids can underlie serious health conditions, e.g. Alzheimer’s or Parkinson’s disease. It has been proposed that short segments of amino acids can be responsible for protein amyloidogenicity, but no more than two hundred such hexapeptides have been found experimentally. The authors of the computational tool Pafig published in BMC Bioinformatics a method for extending the amyloid hexapeptide dataset that could be used for training and testing models. They assumed that all hexapeptides belonging to an amyloid protein can be regarded as amylopositive, while those from proteins never reported as amyloid are always amylonegative. Here we show why the above-described method of extending datasets is wrong and discuss why the incorrect data could lead to falsely correct classification. RESULTS: The amyloid classification of hexapeptides by Pafig was compared with the classification results from different state-of-the-art computational methods, and the outputs of all methods were studied by clustering analysis. The clustering methods show that Pafig is an outlier with regard to the other approaches. Our study of the statistical patterns of its training and testing datasets showed a strong bias towards the STVIIE hexapeptide in their positive part. Different statistical patterns of seemingly amylopositive and amylonegative hexapeptides allow for a repeatable classification, which is not related to the amyloid propensity of the hexapeptides. CONCLUSIONS: Our study on recognition of amyloid hexapeptides showed that the occurrence of incidental patterns in wrongly selected datasets can produce falsely correct classification results. The assumption that all hexapeptides belonging to an amyloid protein can be regarded as amylopositive and those from proteins never reported as amyloid are always amylonegative is not supported by any other computational method. This is in line with experimental observations that the amyloid propensity of a full protein can result from only one amyloidogenic fragment in this protein, while an amyloidogenic part that is well hidden inside the protein may never lead to fibril formation. This leads to the conclusion that Pafig does not provide correct classification with regard to amyloidogenicity. |
format | Online Article Text |
id | pubmed-3879009 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3879009 2014-01-08 On the amyloid datasets used for training PAFIG how (not) to extend the experimental dataset of hexapeptides Kotulska, Malgorzata Unold, Olgierd BMC Bioinformatics Correspondence BioMed Central 2013-12-04 /pmc/articles/PMC3879009/ /pubmed/24305169 http://dx.doi.org/10.1186/1471-2105-14-351 Text en Copyright © 2013 Kotulska and Unold; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Correspondence Kotulska, Malgorzata Unold, Olgierd On the amyloid datasets used for training PAFIG how (not) to extend the experimental dataset of hexapeptides |
title | On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides |
title_sort | on the amyloid datasets used for training pafig how (not) to extend the experimental dataset of hexapeptides |
topic | Correspondence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879009/ https://www.ncbi.nlm.nih.gov/pubmed/24305169 http://dx.doi.org/10.1186/1471-2105-14-351 |