
Genes sharing the protein family domain decrease the performance of classification with RNA-seq genomic signatures

BACKGROUND: Our experience running various types of classification on the CAMDA neuroblastoma dataset has led us to conclude that the results are not always obvious and may differ depending on the type of analysis and the selection of genes used for classification. This paper aims in pointing ou...


Bibliographic Details
Main authors: Leśniewska, Anna, Zyprych-Walczak, Joanna, Szabelska-Beręsewicz, Alicja, Okoniewski, Michal J.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5822623/
https://www.ncbi.nlm.nih.gov/pubmed/29467011
http://dx.doi.org/10.1186/s13062-018-0205-x
author Leśniewska, Anna
Zyprych-Walczak, Joanna
Szabelska-Beręsewicz, Alicja
Okoniewski, Michal J.
collection PubMed
description BACKGROUND: Our experience running various types of classification on the CAMDA neuroblastoma dataset has led us to conclude that the results are not always obvious and may differ depending on the type of analysis and the selection of genes used for classification. This paper aims to point out several factors that may influence downstream machine learning analysis. In particular, those factors are: the type of primary analysis, the type of classifier, and the increased correlation between genes sharing a protein domain. They influence the analysis directly, but the interplay between them may also be important. We compiled a gene-domain database and used it to examine the differences between genes that share a domain and the rest of the genes in the datasets. RESULTS: Pairs of genes that share a domain have increased Spearman’s correlation coefficients of counts; genes sharing a domain are expected to have lower predictive power due to increased correlation, which in most cases shows up as a higher number of misclassified samples; classifier performance may vary depending on the method, but in most cases using genes sharing a domain in the training set results in a higher misclassification rate; increased correlation in genes sharing a domain most often results in worse classifier performance regardless of the primary analysis tools used, even if the primary analysis alignment yield varies. CONCLUSIONS: The effect of sharing a domain is likely more a result of real biological co-expression than of sequence similarity and artifacts of mapping and counting. Still, this is more difficult to conclude and needs further research. The effect is interesting in itself, but we also point out some practical aspects in which it may influence RNA sequencing analysis and RNA biomarker use.
In particular, it means that a gene signature biomarker set built from RNA-sequencing results should be depleted of genes sharing common domains, which may improve classification performance. REVIEWERS: This article was reviewed by Dimitar Vassiliev and Susmita Datta.
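The comparison described in the RESULTS (Spearman's correlation of counts for gene pairs that share a domain versus all other pairs) can be illustrated with a small sketch. This is not the authors' pipeline: the function names, gene names, domain labels, and count values below are made up for illustration, and the code assumes every gene has non-constant counts across samples.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def mean_correlation_by_group(counts, gene_domains):
    """Mean Spearman's rho for pairs sharing a domain and for all other pairs."""
    shared, other = [], []
    for g1, g2 in combinations(sorted(counts), 2):
        rho = spearman(counts[g1], counts[g2])
        if gene_domains.get(g1, set()) & gene_domains.get(g2, set()):
            shared.append(rho)
        else:
            other.append(rho)
    return (mean(shared) if shared else None,
            mean(other) if other else None)


# Hypothetical per-sample read counts (five samples per gene).
counts = {
    "geneA": [10, 20, 30, 40, 50],
    "geneB": [12, 25, 31, 47, 60],
    "geneC": [50, 10, 40, 20, 30],
}
# Hypothetical gene-to-domain annotation.
gene_domains = {
    "geneA": {"domain_X"},
    "geneB": {"domain_X"},
    "geneC": {"domain_Y"},
}

shared_mean, other_mean = mean_correlation_by_group(counts, gene_domains)
print("mean rho, pairs sharing a domain:", shared_mean)
print("mean rho, other pairs:", other_mean)
```

On real data the two groups would be compared over many thousands of pairs, and a signature could then be depleted of genes whose domain sets overlap before training a classifier.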
format Online
Article
Text
id pubmed-5822623
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5822623 2018-02-26 Genes sharing the protein family domain decrease the performance of classification with RNA-seq genomic signatures Leśniewska, Anna Zyprych-Walczak, Joanna Szabelska-Beręsewicz, Alicja Okoniewski, Michal J. Biol Direct Research BioMed Central 2018-02-21 /pmc/articles/PMC5822623/ /pubmed/29467011 http://dx.doi.org/10.1186/s13062-018-0205-x Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Genes sharing the protein family domain decrease the performance of classification with RNA-seq genomic signatures
topic Research