Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors

BACKGROUND: The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the adscription of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. La...

Bibliographic Details
Main authors: König, Caroline; Cárdenas, Martha I; Giraldo, Jesús; Alquézar, René; Vellido, Alfredo
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4587730/
https://www.ncbi.nlm.nih.gov/pubmed/26415951
http://dx.doi.org/10.1186/s12859-015-0731-9
author König, Caroline
Cárdenas, Martha I
Giraldo, Jesús
Alquézar, René
Vellido, Alfredo
collection PubMed
description BACKGROUND: The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the assignment of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors (GPCRs), which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences. RESULTS: In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of protein. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories. CONCLUSIONS: In quantitative classification problems, class labels are often assumed by default to be correct. Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for this purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0731-9) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4587730
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4587730 2015-09-30 Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors König, Caroline Cárdenas, Martha I Giraldo, Jesús Alquézar, René Vellido, Alfredo BMC Bioinformatics Research Article BioMed Central 2015-09-29 /pmc/articles/PMC4587730/ /pubmed/26415951 http://dx.doi.org/10.1186/s12859-015-0731-9 Text en © König et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4587730/
https://www.ncbi.nlm.nih.gov/pubmed/26415951
http://dx.doi.org/10.1186/s12859-015-0731-9