Combining DNA methylation and RNA sequencing data of cancer for supervised knowledge extraction

BACKGROUND: In the Next Generation Sequencing (NGS) era a large amount of biological data is being sequenced, analyzed, and stored in many public databases, whose interoperability is often required to allow an enhanced accessibility. The combination of heterogeneous NGS genomic data is an open chall...

Bibliographic Details
Main Authors: Cappelli, Eleonora; Felici, Giovanni; Weitschek, Emanuel
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6203208/
https://www.ncbi.nlm.nih.gov/pubmed/30386434
http://dx.doi.org/10.1186/s13040-018-0184-6
author Cappelli, Eleonora
Felici, Giovanni
Weitschek, Emanuel
collection PubMed
description BACKGROUND: In the Next Generation Sequencing (NGS) era, a large amount of biological data is being sequenced, analyzed, and stored in many public databases, whose interoperability is often required to improve accessibility. Combining heterogeneous NGS genomic data is an open challenge: analyzing data from different experiments is fundamental to the study of diseases. In this work, we propose combining DNA methylation and RNA sequencing NGS experiments at the gene level for supervised knowledge extraction in cancer. METHODS: We retrieve DNA methylation and RNA sequencing datasets from The Cancer Genome Atlas (TCGA), focusing on Breast Invasive Carcinoma (BRCA), Thyroid Carcinoma (THCA), and Kidney Renal Papillary Cell Carcinoma (KIRP). We combine the RNA sequencing gene expression values with the gene methylation quantity, a new measure that we define to represent the amount of methylation associated with a gene. Additionally, we propose analyzing the combined data with tree- and rule-based classification algorithms (C4.5, Random Forest, RIPPER, and CAMUR). RESULTS: We extract more than 15,000 classification models (composed of gene sets) that distinguish tumoral samples from normal ones with an average accuracy of 95%. From the integrated experiments we obtain about 5000 classification models that consider gene measures from both the RNA sequencing and the DNA methylation experiments. CONCLUSIONS: We compare the sets of genes obtained from the classifications on RNA sequencing and DNA methylation data with the genes obtained from the integration of the two experiments. The comparison yields several genes that are in common between the single experiments and the integrated ones (733 for BRCA, 35 for KIRP, and 861 for THCA) and 509 genes that are in common among the different experiments. Finally, we investigate possible relationships among the analyzed tumors by extracting a core set of 13 genes that appear in all tumors. A preliminary functional analysis confirms the relation of some of those genes (5 out of 13 and 279 out of 509) with cancer, suggesting that further studies should focus on the newly identified ones. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13040-018-0184-6) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6203208
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6203208 2018-11-01 Combining DNA methylation and RNA sequencing data of cancer for supervised knowledge extraction Cappelli, Eleonora Felici, Giovanni Weitschek, Emanuel BioData Min Methodology BioMed Central 2018-10-25 /pmc/articles/PMC6203208/ /pubmed/30386434 http://dx.doi.org/10.1186/s13040-018-0184-6 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Combining DNA methylation and RNA sequencing data of cancer for supervised knowledge extraction
topic Methodology