Machine learning on normalized protein sequences
BACKGROUND: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current me...
Main Authors: | Heider, Dominik; Verheyen, Jens; Hoffmann, Daniel |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2011 |
Subjects: | Short Report |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3079662/ https://www.ncbi.nlm.nih.gov/pubmed/21453485 http://dx.doi.org/10.1186/1756-0500-4-94 |
_version_ | 1782202039898472448 |
---|---|
author | Heider, Dominik Verheyen, Jens Hoffmann, Daniel |
author_facet | Heider, Dominik Verheyen, Jens Hoffmann, Daniel |
author_sort | Heider, Dominik |
collection | PubMed |
description | BACKGROUND: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. FINDINGS: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%. CONCLUSIONS: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length. |
format | Text |
id | pubmed-3079662 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-30796622011-04-20 Machine learning on normalized protein sequences Heider, Dominik Verheyen, Jens Hoffmann, Daniel BMC Res Notes Short Report BACKGROUND: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. FINDINGS: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%. CONCLUSIONS: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length. BioMed Central 2011-03-31 /pmc/articles/PMC3079662/ /pubmed/21453485 http://dx.doi.org/10.1186/1756-0500-4-94 Text en Copyright ©2011 Heider et al; licensee BioMed Central Ltd. 
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Short Report Heider, Dominik Verheyen, Jens Hoffmann, Daniel Machine learning on normalized protein sequences |
title | Machine learning on normalized protein sequences |
title_full | Machine learning on normalized protein sequences |
title_fullStr | Machine learning on normalized protein sequences |
title_full_unstemmed | Machine learning on normalized protein sequences |
title_short | Machine learning on normalized protein sequences |
title_sort | machine learning on normalized protein sequences |
topic | Short Report |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3079662/ https://www.ncbi.nlm.nih.gov/pubmed/21453485 http://dx.doi.org/10.1186/1756-0500-4-94 |
work_keys_str_mv | AT heiderdominik machinelearningonnormalizedproteinsequences AT verheyenjens machinelearningonnormalizedproteinsequences AT hoffmanndaniel machinelearningonnormalizedproteinsequences |
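
The description in this record proposes normalizing sequences to a uniform length by linear interpolation before classification. The sketch below illustrates only that normalization step, under stated assumptions: the record does not say how residues are turned into numbers, so the Kyte-Doolittle hydropathy scale is used here purely as an example encoding, and the names `HYDROPATHY`, `encode`, and `normalize_length` are hypothetical and not taken from the article. The random forest classification step is not shown.

```python
from typing import Dict, List

# ASSUMPTION: residues are encoded numerically with the Kyte-Doolittle
# hydropathy scale. The record does not name the descriptor used.
HYDROPATHY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence: str) -> List[float]:
    """Map each amino acid letter to its numeric descriptor value."""
    return [HYDROPATHY[residue] for residue in sequence.upper()]


def normalize_length(values: List[float], target_len: int) -> List[float]:
    """Resample `values` to `target_len` points by linear interpolation.

    Sample positions are spread evenly over the original index range,
    so the first and last values are preserved when target_len >= 2.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot interpolate an empty sequence")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if n == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first value.
        return [values[0]] * target_len

    result: List[float] = []
    for i in range(target_len):
        # Fractional position in the original sequence.
        pos = i * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        result.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return result


if __name__ == "__main__":
    # Two sequences of different length become vectors of equal length.
    for seq in ("MKVLA", "MKVLAGGSTPE"):
        print(seq, normalize_length(encode(seq), 8))
```

After this step every sequence is a fixed-length numeric vector, which is the input shape that a standard classifier such as a random forest expects.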