Using epigenomics data to predict gene expression in lung cancer

BACKGROUND: Epigenetic alterations are known to correlate with changes in gene expression among various diseases including cancers. However, quantitative models that accurately predict the up or down regulation of gene expression are currently lacking. METHODS: A new machine learning-based method of...

Bibliographic Details
Main authors:	Li, Jeffery; Ching, Travers; Huang, Sijia; Garmire, Lana X
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4402699/ https://www.ncbi.nlm.nih.gov/pubmed/25861082 http://dx.doi.org/10.1186/1471-2105-16-S5-S10

author	Li, Jeffery; Ching, Travers; Huang, Sijia; Garmire, Lana X
collection	PubMed
description	BACKGROUND: Epigenetic alterations are known to correlate with changes in gene expression in various diseases, including cancers. However, quantitative models that accurately predict the up- or down-regulation of gene expression are currently lacking. METHODS: A new machine learning-based method of gene expression prediction is developed in the context of lung cancer. This method uses Illumina Infinium HumanMethylation450K BeadChip CpG methylation array data from paired lung cancer and adjacent normal tissues in The Cancer Genome Atlas (TCGA), together with histone modification marker ChIP-Seq data from the ENCODE project, to predict the differential expression of RNA-Seq data in TCGA lung cancers. It considers a comprehensive list of 1424 features spanning four categories: CpG methylation, histone H3 methylation modification, nucleotide composition, and conservation. Various feature selection and classification methods are compared to select the best model over 10-fold cross-validation in the training data set. RESULTS: The best model, comprising 67 features, is chosen by ReliefF-based feature selection and random forest classification, with AUC = 0.864 from the 10-fold cross-validation of the training set and AUC = 0.836 from the testing set. The selected features cover all four data types, with histone H3 methylation modification (32 features) and CpG methylation (15 features) being the most abundant. In drop-off tests of features grouped by data type, removal of the CpG methylation features leads to the largest reduction in model performance. In the best model, 19 selected features are from the promoter regions (TSS200 and TSS1500), the highest count among all locations relative to transcripts. Sequential drop-off of CpG methylation features by region of the protein-coding transcripts shows that promoter regions contribute most significantly to the accurate prediction of gene expression. CONCLUSIONS: By considering a comprehensive list of epigenomic and genomic features, we have constructed an accurate model to predict transcriptomic differential expression, exemplified in lung cancer.
format	Online Article Text
id	pubmed-4402699
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4402699 2015-04-29 BMC Bioinformatics Proceedings BioMed Central 2015-03-18 /pmc/articles/PMC4402699/ /pubmed/25861082 http://dx.doi.org/10.1186/1471-2105-16-S5-S10 Text en Copyright © 2015 Li et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Using epigenomics data to predict gene expression in lung cancer
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4402699/ https://www.ncbi.nlm.nih.gov/pubmed/25861082 http://dx.doi.org/10.1186/1471-2105-16-S5-S10
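
The abstract describes ranking features with ReliefF and scoring a classifier by AUC. As a rough illustration only, the sketch below implements the simpler basic Relief variant (one nearest hit and one nearest miss per sample, two classes) and a pairwise AUC in plain Python. It is not the authors' pipeline: it omits the random forest and the 10-fold cross-validation, and the function names and toy data are hypothetical.

```python
def relief_weights(X, y):
    """Basic Relief for two classes; returns one weight per feature.

    Assumes each class has at least two samples (so a hit always exists).
    """
    n, d = len(X), len(X[0])
    # Per-feature range, so differences are on a comparable 0..1 scale.
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0
             for j in range(d)]

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def nearest(i, same_class):
        # Closest other sample (Manhattan distance) with the same
        # label (a "hit") or the opposite label (a "miss").
        cands = [k for k in range(n)
                 if k != i and (y[k] == y[i]) == same_class]
        return min(cands,
                   key=lambda k: sum(diff(X[i], X[k], j) for j in range(d)))

    w = [0.0] * d
    for i in range(n):
        hit, miss = nearest(i, True), nearest(i, False)
        for j in range(d):
            # Reward separation from the nearest miss,
            # penalize spread from the nearest hit.
            w[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n
    return w


def top_features(weights, k):
    """Indices of the k highest-weighted features."""
    return sorted(range(len(weights)), key=lambda j: -weights[j])[:k]


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: feature 0 tracks the label, feature 1 is noise.
X_toy = [[0.10, 0.5], [0.20, 0.1], [0.15, 0.9],
         [0.90, 0.4], [0.80, 0.8], [0.85, 0.2]]
y_toy = [0, 0, 0, 1, 1, 1]
w_toy = relief_weights(X_toy, y_toy)
```

On the toy data, the informative feature receives a positive weight and the noise feature a negative one, so `top_features(w_toy, 1)` keeps only feature 0. In a setting like the one in the abstract, the same ranking step would be applied to the 1424 candidate features before training a classifier on the retained subset.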
