
A merged lung cancer transcriptome dataset for clinical predictive modeling

The Gene Expression Omnibus (GEO) database is an excellent public source of whole transcriptomic profiles of multiple cancers. The main challenge is the limited accessibility of such large-scale genomic data to people without a background in bioinformatics or computer science. This presents difficul...


Bibliographic Details
Main authors: Lim, Su Bin, Tan, Swee Jin, Lim, Wan-Teck, Lim, Chwee Teck
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2018
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6057440/
https://www.ncbi.nlm.nih.gov/pubmed/30040079
http://dx.doi.org/10.1038/sdata.2018.136
author Lim, Su Bin
Tan, Swee Jin
Lim, Wan-Teck
Lim, Chwee Teck
collection PubMed
description The Gene Expression Omnibus (GEO) database is an excellent public source of whole transcriptomic profiles of multiple cancers. The main challenge is the limited accessibility of such large-scale genomic data to people without a background in bioinformatics or computer science. This presents difficulties in data analysis, sharing and visualization. Here, we present an integrated bioinformatics pipeline and a normalized dataset that has been preprocessed using a robust statistical methodology, allowing others to perform large-scale meta-analysis without having to conduct time-consuming data mining and statistical correction. Comprising 1,118 patient-derived samples, the normalized dataset includes primary non-small cell lung cancer (NSCLC) tumors and paired normal lung tissues from ten independent GEO datasets, facilitating differential expression analysis. The data has been merged, normalized, batch effect-corrected and filtered for genes with low variance via multiple open source R packages integrated into our workflow. Overall, this dataset (with associated clinical metadata) better represents the diseased population and serves as a powerful tool for early predictive biomarker discovery.
format Online
Article
Text
id pubmed-6057440
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-6057440 2018-07-27 Sci Data Data Descriptor Nature Publishing Group 2018-07-24 /pmc/articles/PMC6057440/ /pubmed/30040079 http://dx.doi.org/10.1038/sdata.2018.136 Text en Copyright © 2018, The Author(s) http://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files made available in this article.
title A merged lung cancer transcriptome dataset for clinical predictive modeling
topic Data Descriptor
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6057440/
https://www.ncbi.nlm.nih.gov/pubmed/30040079
http://dx.doi.org/10.1038/sdata.2018.136