Automated Alphabet Reduction for Protein Datasets
BACKGROUND: We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in struc...
Main authors: | Bacardit, Jaume; Stout, Michael; Hirst, Jonathan D; Valencia, Alfonso; Smith, Robert E; Krasnogor, Natalio |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2009 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2646702/ https://www.ncbi.nlm.nih.gov/pubmed/19126227 http://dx.doi.org/10.1186/1471-2105-10-6 |
_version_ | 1782164879678898176 |
---|---|
author | Bacardit, Jaume Stout, Michael Hirst, Jonathan D Valencia, Alfonso Smith, Robert E Krasnogor, Natalio |
author_facet | Bacardit, Jaume Stout, Michael Hirst, Jonathan D Valencia, Alfonso Smith, Robert E Krasnogor, Natalio |
author_sort | Bacardit, Jaume |
collection | PubMed |
description | BACKGROUND: We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques. RESULTS: We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to that obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets were compared against other reduced alphabets taken from the literature or human-designed, outperforming them. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All the above process had been performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a, much richer, protein representation based on evolutionary information for the prediction of the same two features. Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even if they were obtained using a simple representation, are also able to capture the crucial information needed for state-of-the-art protein representations. 
CONCLUSION: Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data. |
format | Text |
id | pubmed-2646702 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-26467022009-02-24 Automated Alphabet Reduction for Protein Datasets Bacardit, Jaume Stout, Michael Hirst, Jonathan D Valencia, Alfonso Smith, Robert E Krasnogor, Natalio BMC Bioinformatics Research Article BACKGROUND: We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques. RESULTS: We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to that obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets were compared against other reduced alphabets taken from the literature or human-designed, outperforming them. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All the above process had been performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a, much richer, protein representation based on evolutionary information for the prediction of the same two features. 
Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even if they were obtained using a simple representation, are also able to capture the crucial information needed for state-of-the-art protein representations. CONCLUSION: Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data. BioMed Central 2009-01-06 /pmc/articles/PMC2646702/ /pubmed/19126227 http://dx.doi.org/10.1186/1471-2105-10-6 Text en Copyright © 2009 Bacardit et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Bacardit, Jaume Stout, Michael Hirst, Jonathan D Valencia, Alfonso Smith, Robert E Krasnogor, Natalio Automated Alphabet Reduction for Protein Datasets |
title | Automated Alphabet Reduction for Protein Datasets |
title_full | Automated Alphabet Reduction for Protein Datasets |
title_fullStr | Automated Alphabet Reduction for Protein Datasets |
title_full_unstemmed | Automated Alphabet Reduction for Protein Datasets |
title_short | Automated Alphabet Reduction for Protein Datasets |
title_sort | automated alphabet reduction for protein datasets |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2646702/ https://www.ncbi.nlm.nih.gov/pubmed/19126227 http://dx.doi.org/10.1186/1471-2105-10-6 |
work_keys_str_mv | AT bacarditjaume automatedalphabetreductionforproteindatasets AT stoutmichael automatedalphabetreductionforproteindatasets AT hirstjonathand automatedalphabetreductionforproteindatasets AT valenciaalfonso automatedalphabetreductionforproteindatasets AT smithroberte automatedalphabetreductionforproteindatasets AT krasnogornatalio automatedalphabetreductionforproteindatasets |