
Reproducing the manual annotation of multiple sequence alignments using a SVM classifier

Motivation: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, it is expected that even the best alignment will contain sites that do not respect the assumption of positional homology. Bec...


Bibliographic Details
Main authors: Blouin, Christian; Perry, Scott; Lavell, Allan; Susko, Edward; Roger, Andrew J.
Format: Text
Language: English
Published: Oxford University Press 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2778337/
https://www.ncbi.nlm.nih.gov/pubmed/19770262
http://dx.doi.org/10.1093/bioinformatics/btp552
author Blouin, Christian
Perry, Scott
Lavell, Allan
Susko, Edward
Roger, Andrew J.
collection PubMed
description Motivation: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, it is expected that even the best alignment will contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to manually remove them. Although considered necessary in some cases, manual editing is time-consuming and not reproducible. We present here an automated editing method based on the classification of ‘valid’ and ‘invalid’ sites. Results: A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that it is possible to retrain/extend the training of the classifier by providing examples of multiple sequence alignment (MSA) annotation. Near-optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments. Availability: This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch jobs is available at http://fester.cs.dal.ca/manuel. Contact: cblouin@cs.dal.ca Supplementary information: Supplementary data are available at Bioinformatics online.
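The classification idea in the description can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from MANUEL: the two per-column features (gap fraction and share of the most common residue), the toy annotated columns, the linear kernel, the training settings, and the function names column_features, train_linear_svm and classify are all hypothetical. The record does not state which features or kernel the authors used. The sketch trains a linear SVM objective (L2 penalty plus hinge loss) by plain subgradient descent using only the Python standard library.

```python
from collections import Counter

GAP = "-"

def column_features(column):
    """Return [gap fraction, share of rows holding the most common residue].

    These two features are an illustrative assumption, not MANUEL's feature set.
    """
    n = len(column)
    residues = [c for c in column if c != GAP]
    gap_fraction = (n - len(residues)) / n
    top_count = Counter(residues).most_common(1)[0][1] if residues else 0
    return [gap_fraction, top_count / n]

def decision(w, b, x):
    """Signed score of feature vector x under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=2000):
    """Minimise lam/2 * |w|^2 + mean hinge loss by full-batch subgradient descent.

    Labels in y must be +1 (valid site) or -1 (invalid site).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1:  # margin violated: hinge term is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def classify(w, b, column):
    """Label one alignment column as 'valid' or 'invalid'."""
    return "valid" if decision(w, b, column_features(column)) >= 0 else "invalid"

# Toy annotated columns, invented for illustration: +1 = valid, -1 = invalid.
annotated = [
    ("AAAA", 1), ("AAAG", 1), ("LLLL", 1), ("VVIV", 1),
    ("A--G", -1), ("-K--", -1), ("W-Y-", -1), ("--C-", -1),
]
X = [column_features(col) for col, _ in annotated]
y = [label for _, label in annotated]
w, b = train_linear_svm(X, y)
predictions = [classify(w, b, col) for col, _ in annotated]
print(predictions)
```

Retraining on a new set of manual annotations, as the description mentions, would in this sketch amount to calling train_linear_svm again with the additional labelled columns appended to X and y.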
format Text
id pubmed-2778337
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-2778337 2009-11-18 Reproducing the manual annotation of multiple sequence alignments using a SVM classifier Blouin, Christian Perry, Scott Lavell, Allan Susko, Edward Roger, Andrew J. Bioinformatics Original Papers Oxford University Press 2009-12-01 2009-09-21 /pmc/articles/PMC2778337/ /pubmed/19770262 http://dx.doi.org/10.1093/bioinformatics/btp552 Text en © The Author(s) 2009. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Reproducing the manual annotation of multiple sequence alignments using a SVM classifier
topic Original Papers