
Machine learning on normalized protein sequences

BACKGROUND: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current me...


Bibliographic Details
Main Authors:	Heider, Dominik; Verheyen, Jens; Hoffmann, Daniel
Format:	Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Short Report
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3079662/ https://www.ncbi.nlm.nih.gov/pubmed/21453485 http://dx.doi.org/10.1186/1756-0500-4-94

_version_	1782202039898472448
author	Heider, Dominik Verheyen, Jens Hoffmann, Daniel
author_facet	Heider, Dominik Verheyen, Jens Hoffmann, Daniel
author_sort	Heider, Dominik
collection	PubMed
description	BACKGROUND: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. FINDINGS: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%. CONCLUSIONS: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
format	Text
id	pubmed-3079662
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3079662 2011-04-20 Machine learning on normalized protein sequences Heider, Dominik Verheyen, Jens Hoffmann, Daniel BMC Res Notes Short Report BioMed Central 2011-03-31 /pmc/articles/PMC3079662/ /pubmed/21453485 http://dx.doi.org/10.1186/1756-0500-4-94 Text en Copyright ©2011 Heider et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Short Report Heider, Dominik Verheyen, Jens Hoffmann, Daniel Machine learning on normalized protein sequences
title	Machine learning on normalized protein sequences
title_full	Machine learning on normalized protein sequences
title_fullStr	Machine learning on normalized protein sequences
title_full_unstemmed	Machine learning on normalized protein sequences
title_short	Machine learning on normalized protein sequences
title_sort	machine learning on normalized protein sequences
topic	Short Report
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3079662/ https://www.ncbi.nlm.nih.gov/pubmed/21453485 http://dx.doi.org/10.1186/1756-0500-4-94
work_keys_str_mv	AT heiderdominik machinelearningonnormalizedproteinsequences AT verheyenjens machinelearningonnormalizedproteinsequences AT hoffmanndaniel machinelearningonnormalizedproteinsequences
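
The description field proposes normalizing sequences to a uniform length by linear interpolation before classification. A minimal sketch of that idea follows. It makes assumptions the record does not confirm: residues are encoded numerically with the Kyte-Doolittle hydropathy scale, and the function names `encode` and `normalize_length` are hypothetical. The random forest classification step is not shown.

```python
# Assumption: residues are encoded with the Kyte-Doolittle hydropathy scale.
# The catalog record does not say which numerical encoding the authors used.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each amino acid letter to its hydropathy value (hypothetical helper)."""
    return [HYDROPATHY[aa] for aa in sequence.upper()]


def normalize_length(values, target_length):
    """Resample values onto target_length evenly spaced points by linear interpolation."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot interpolate an empty sequence")
    if target_length < 1:
        raise ValueError("target_length must be at least 1")
    if n == 1 or target_length == 1:
        # Nothing to interpolate between: repeat the first value.
        return [values[0]] * target_length
    # Distance, in original index units, between consecutive output points.
    step = (n - 1) / (target_length - 1)
    result = []
    for i in range(target_length):
        pos = i * step
        left = min(int(pos), n - 2)   # index of the left neighbour
        frac = pos - left             # fractional offset towards the right neighbour
        result.append(values[left] + frac * (values[left + 1] - values[left]))
    return result


if __name__ == "__main__":
    # Two sequences of different length become feature vectors of equal length.
    short = normalize_length(encode("MKVL"), 8)
    longer = normalize_length(encode("MKVLAAGIVGLL"), 8)
    print(len(short), len(longer))
```

Vectors produced this way all have the same number of features, so sequences with insertions or deletions could be passed to a fixed-input classifier such as a random forest.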
