
IsoSVM – Distinguishing isoforms and paralogs on the protein level

BACKGROUND: Recent progress in cDNA and EST sequencing is yielding a deluge of sequence data. Like database search results and proteome databases, this data gives rise to inferred protein sequences without ready access to the underlying genomic data. Analysis of this information (e.g. for EST cluste...


Bibliographic Details
Main Authors: Spitzer, Michael; Lorkowski, Stefan; Cullen, Paul; Sczyrba, Alexander; Fuellen, Georg
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1431569/
https://www.ncbi.nlm.nih.gov/pubmed/16519805
http://dx.doi.org/10.1186/1471-2105-7-110
author Spitzer, Michael
Lorkowski, Stefan
Cullen, Paul
Sczyrba, Alexander
Fuellen, Georg
collection PubMed
description BACKGROUND: Recent progress in cDNA and EST sequencing is yielding a deluge of sequence data. Like database search results and proteome databases, this data gives rise to inferred protein sequences without ready access to the underlying genomic data. Analysis of this information (e.g. for EST clustering or phylogenetic reconstruction from proteome data) is hampered because it is not known if two protein sequences are isoforms (splice variants) or not (i.e. paralogs/orthologs). However, even without knowing the intron/exon structure, visual analysis of the pattern of similarity across the alignment of the two protein sequences is usually helpful since paralogs and orthologs feature substitutions with respect to each other, as opposed to isoforms, which do not. RESULTS: The IsoSVM tool introduces an automated approach to identifying isoforms on the protein level using a support vector machine (SVM) classifier. Based on three specific features used as input of the SVM classifier, it is possible to automatically identify isoforms with little effort and with an accuracy of more than 97%. We show that the SVM is superior to a radial basis function network and to a linear classifier. As an example application we use IsoSVM to estimate that a set of Xenopus laevis EST clusters consists of approximately 81% cases where sequences are each other's paralogs and 19% cases where sequences are each other's isoforms. The number of isoforms and paralogs in this allotetraploid species is of interest in the study of evolution. CONCLUSION: We developed an SVM classifier that can be used to distinguish isoforms from paralogs with high accuracy and without access to the genomic data. It can be used to analyze, for example, EST data and database search results. Our software is freely available on the Web, under the name IsoSVM.
format Text
id pubmed-1431569
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1431569 2006-04-21 IsoSVM – Distinguishing isoforms and paralogs on the protein level Spitzer, Michael Lorkowski, Stefan Cullen, Paul Sczyrba, Alexander Fuellen, Georg BMC Bioinformatics Research Article BioMed Central 2006-03-06 /pmc/articles/PMC1431569/ /pubmed/16519805 http://dx.doi.org/10.1186/1471-2105-7-110 Text en Copyright © 2006 Spitzer et al; licensee BioMed Central Ltd.
title IsoSVM – Distinguishing isoforms and paralogs on the protein level
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1431569/
https://www.ncbi.nlm.nih.gov/pubmed/16519805
http://dx.doi.org/10.1186/1471-2105-7-110