
On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides

BACKGROUND: Amyloids are proteins capable of forming aberrant intramolecular contact sites, characteristic of beta zipper configuration. Amyloids can underlie serious health conditions, e.g. Alzheimer’s or Parkinson’s diseases. It has been proposed that short segments of amino acids can be responsib...


Bibliographic Details
Main authors: Kotulska, Malgorzata; Unold, Olgierd
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879009/
https://www.ncbi.nlm.nih.gov/pubmed/24305169
http://dx.doi.org/10.1186/1471-2105-14-351
_version_ 1782297903177400320
author Kotulska, Malgorzata
Unold, Olgierd
collection PubMed
description BACKGROUND: Amyloids are proteins capable of forming aberrant intramolecular contact sites, characteristic of beta zipper configuration. Amyloids can underlie serious health conditions, e.g. Alzheimer’s or Parkinson’s diseases. It has been proposed that short segments of amino acids can be responsible for protein amyloidogenicity, but no more than two hundred such hexapeptides have been experimentally found. The authors of the computational tool Pafig published in BMC Bioinformatics a method for extending the amyloid hexapeptide dataset that could be used for training and testing models. They assumed that all hexapeptides belonging to an amyloid protein can be regarded as amylopositive, while those from proteins never reported as amyloid are always amylonegative. Here we show why the above-described method of extending datasets is wrong and discuss the reasons why the incorrect data could lead to falsely correct classification.
RESULTS: The amyloid classification of hexapeptides by Pafig was compared with the classification results from different state-of-the-art computational methods, and the outputs of all methods were studied by clustering analysis. The clustering methods show that Pafig is an outlier with regard to other approaches. Our study of the statistical patterns of its training and testing datasets showed a strong bias towards the STVIIE hexapeptide in their positive part. Different statistical patterns of seemingly amylo-positive and amylo-negative hexapeptides allow for a repeatable classification, which is not related to the amyloid propensity of the hexapeptides.
CONCLUSIONS: Our study on recognition of amyloid hexapeptides showed that the occurrence of incidental patterns in wrongly selected datasets can produce falsely correct results of classification. The assumption that all hexapeptides belonging to an amyloid protein can be regarded as amylopositive and those from proteins never reported as amyloid are always amylonegative is not supported by any other computational method. This is in line with experimental observations that the amyloid propensity of a full protein can result from only one amyloidogenic fragment in this protein, while the occurrence of an amyloidogenic part that is well hidden inside the protein may never lead to fibril formation. This leads to the conclusion that Pafig does not provide correct classification with regard to amyloidogenicity.
format Online
Article
Text
id pubmed-3879009
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3879009 2014-01-08 On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides Kotulska, Malgorzata Unold, Olgierd BMC Bioinformatics Correspondence BioMed Central 2013-12-04 /pmc/articles/PMC3879009/ /pubmed/24305169 http://dx.doi.org/10.1186/1471-2105-14-351 Text en Copyright © 2013 Kotulska and Unold; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides
topic Correspondence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879009/
https://www.ncbi.nlm.nih.gov/pubmed/24305169
http://dx.doi.org/10.1186/1471-2105-14-351