A merged lung cancer transcriptome dataset for clinical predictive modeling
The Gene Expression Omnibus (GEO) database is an excellent public source of whole transcriptomic profiles of multiple cancers. The main challenge is the limited accessibility of such large-scale genomic data to people without a background in bioinformatics or computer science. This presents difficul...
Main authors: | Lim, Su Bin; Tan, Swee Jin; Lim, Wan-Teck; Lim, Chwee Teck |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group 2018 |
Subjects: | Data Descriptor |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6057440/ https://www.ncbi.nlm.nih.gov/pubmed/30040079 http://dx.doi.org/10.1038/sdata.2018.136 |
_version_ | 1783341527871782912 |
---|---|
author | Lim, Su Bin; Tan, Swee Jin; Lim, Wan-Teck; Lim, Chwee Teck |
author_facet | Lim, Su Bin; Tan, Swee Jin; Lim, Wan-Teck; Lim, Chwee Teck |
author_sort | Lim, Su Bin |
collection | PubMed |
description | The Gene Expression Omnibus (GEO) database is an excellent public source of whole transcriptomic profiles of multiple cancers. The main challenge is the limited accessibility of such large-scale genomic data to people without a background in bioinformatics or computer science. This presents difficulties in data analysis, sharing and visualization. Here, we present an integrated bioinformatics pipeline and a normalized dataset that has been preprocessed using a robust statistical methodology; allowing others to perform large-scale meta-analysis, without having to conduct time-consuming data mining and statistical correction. Comprising 1,118 patient-derived samples, the normalized dataset includes primary non-small cell lung cancer (NSCLC) tumors and paired normal lung tissues from ten independent GEO datasets, facilitating differential expression analysis. The data has been merged, normalized, batch effect-corrected and filtered for genes with low variance via multiple open source R packages integrated into our workflow. Overall this dataset (with associated clinical metadata) better represents the diseased population and serves as a powerful tool for early predictive biomarker discovery. |
format | Online Article Text |
id | pubmed-6057440 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Nature Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-6057440 2018-07-27 A merged lung cancer transcriptome dataset for clinical predictive modeling Lim, Su Bin; Tan, Swee Jin; Lim, Wan-Teck; Lim, Chwee Teck Sci Data Data Descriptor The Gene Expression Omnibus (GEO) database is an excellent public source of whole transcriptomic profiles of multiple cancers. The main challenge is the limited accessibility of such large-scale genomic data to people without a background in bioinformatics or computer science. This presents difficulties in data analysis, sharing and visualization. Here, we present an integrated bioinformatics pipeline and a normalized dataset that has been preprocessed using a robust statistical methodology; allowing others to perform large-scale meta-analysis, without having to conduct time-consuming data mining and statistical correction. Comprising 1,118 patient-derived samples, the normalized dataset includes primary non-small cell lung cancer (NSCLC) tumors and paired normal lung tissues from ten independent GEO datasets, facilitating differential expression analysis. The data has been merged, normalized, batch effect-corrected and filtered for genes with low variance via multiple open source R packages integrated into our workflow. Overall this dataset (with associated clinical metadata) better represents the diseased population and serves as a powerful tool for early predictive biomarker discovery. Nature Publishing Group 2018-07-24 /pmc/articles/PMC6057440/ /pubmed/30040079 http://dx.doi.org/10.1038/sdata.2018.136 Text en Copyright © 2018, The Author(s) http://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files made available in this article. |
spellingShingle | Data Descriptor Lim, Su Bin Tan, Swee Jin Lim, Wan-Teck Lim, Chwee Teck A merged lung cancer transcriptome dataset for clinical predictive modeling |
title | A merged lung cancer transcriptome dataset for clinical predictive modeling |
title_sort | merged lung cancer transcriptome dataset for clinical predictive modeling |
topic | Data Descriptor |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6057440/ https://www.ncbi.nlm.nih.gov/pubmed/30040079 http://dx.doi.org/10.1038/sdata.2018.136 |
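
The description above mentions merging several GEO datasets, correcting batch effects, and filtering out genes with low variance. The record does not say which R packages or parameters the authors used, so what follows is only a minimal Python sketch of the general idea, not the authors' pipeline. The function names, the toy values, and the variance cutoff are all hypothetical, and per-batch mean-centering is used here as a crude stand-in for a real batch-effect correction method.

```python
from statistics import mean, pvariance


def merge_batches(batches):
    """Merge per-batch expression tables on the genes shared by all batches.

    `batches` maps a batch name to {gene: [one value per sample]}.
    Each gene is mean-centered within its batch before concatenation.
    This is an illustrative simplification, not a published correction method.
    """
    shared = set.intersection(*(set(genes) for genes in batches.values()))
    merged = {gene: [] for gene in sorted(shared)}
    for name in sorted(batches):
        for gene in merged:
            values = batches[name][gene]
            center = mean(values)
            merged[gene].extend(v - center for v in values)
    return merged


def filter_low_variance(merged, min_variance):
    """Keep only genes whose variance across all merged samples meets the cutoff."""
    return {g: v for g, v in merged.items() if pvariance(v) >= min_variance}


# Toy data (hypothetical values, not taken from the dataset).
toy = {
    "batch_a": {"GENE1": [5.0, 7.0], "GENE2": [2.0, 2.0], "GENE3": [1.0, 3.0]},
    "batch_b": {"GENE1": [10.0, 14.0], "GENE2": [8.0, 8.0]},
}
merged = merge_batches(toy)  # GENE3 is dropped: it is missing from batch_b
kept = filter_low_variance(merged, min_variance=0.5)  # hypothetical cutoff
```

In this toy run, `GENE2` is constant within each batch, so after centering it has zero variance and is filtered out, while `GENE1` is retained. A real analysis would use an established batch-correction method and a cutoff chosen from the data.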