On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides

BACKGROUND: Amyloids are proteins capable of forming aberrant intramolecular contact sites, characteristic of beta zipper configuration. Amyloids can underlie serious health conditions, e.g. Alzheimer’s or Parkinson’s diseases. It has been proposed that short segments of amino acids can be responsib...

Bibliographic Details
Main authors:	Kotulska, Malgorzata; Unold, Olgierd
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Correspondence
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879009/ https://www.ncbi.nlm.nih.gov/pubmed/24305169 http://dx.doi.org/10.1186/1471-2105-14-351

author	Kotulska, Malgorzata; Unold, Olgierd
collection	PubMed
description	BACKGROUND: Amyloids are proteins capable of forming aberrant intramolecular contact sites, characteristic of beta zipper configuration. Amyloids can underlie serious health conditions, e.g. Alzheimer’s or Parkinson’s disease. It has been proposed that short segments of amino acids can be responsible for protein amyloidogenicity, but no more than two hundred such hexapeptides have been experimentally found. The authors of the computational tool Pafig published a method in BMC Bioinformatics for extending the amyloid hexapeptide dataset that could be used for training and testing models. They assumed that all hexapeptides belonging to an amyloid protein can be regarded as amylopositive, while those from proteins never reported as amyloid are always amylonegative. Here we show why the above-described method of extending datasets is wrong and discuss the reasons why the incorrect data could lead to falsely correct classification. RESULTS: The amyloid classification of hexapeptides by Pafig was compared with the classification results from different state-of-the-art computational methods, and the outputs of all methods were studied by clustering analysis. The clustering methods show that Pafig is an outlier with regard to other approaches. Our study of the statistical patterns of its training and testing datasets showed a strong bias towards the STVIIE hexapeptide in their positive part. Different statistical patterns of seemingly amylopositive and amylonegative hexapeptides allow for a repeatable classification, which is not related to the amyloid propensity of the hexapeptides. CONCLUSIONS: Our study on recognition of amyloid hexapeptides showed that the occurrence of incidental patterns in wrongly selected datasets can produce falsely correct results of classification. The assumption that all hexapeptides belonging to an amyloid protein can be regarded as amylopositive and those from proteins never reported as amyloid are always amylonegative is not supported by any other computational method. This is in line with experimental observations that the amyloid propensity of a full protein can result from only one amyloidogenic fragment in this protein, while the occurrence of an amyloidogenic part that is well hidden inside the protein may never lead to fibril formation. This leads to the conclusion that Pafig does not provide correct classification with regard to amyloidogenicity.
format	Online Article Text
id	pubmed-3879009
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3879009 2014-01-08 On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides Kotulska, Malgorzata; Unold, Olgierd BMC Bioinformatics Correspondence BioMed Central 2013-12-04 /pmc/articles/PMC3879009/ /pubmed/24305169 http://dx.doi.org/10.1186/1471-2105-14-351 Text en Copyright © 2013 Kotulska and Unold; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides
topic	Correspondence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879009/ https://www.ncbi.nlm.nih.gov/pubmed/24305169 http://dx.doi.org/10.1186/1471-2105-14-351

Similar Items

On the amyloid datasets used for training PAFIG - how (not) to extend the experimental dataset of hexapeptides