Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering
Data quality is a recognized problem for high-throughput genomics platforms, as evinced by the proliferation of methods attempting to filter out lower quality data points. Different filtering methods lead to discordant results, raising the question, which methods are best? Astonishingly, little comp...
Main authors: | McDade, Kevin K.; Chandran, Uma; Day, Roger S. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Libertas Academica 2015 |
Subjects: | Original Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4686346/ https://www.ncbi.nlm.nih.gov/pubmed/26715829 http://dx.doi.org/10.4137/CIN.S33076 |
_version_ | 1782406429678764032 |
---|---|
author | McDade, Kevin K.; Chandran, Uma; Day, Roger S. |
author_sort | McDade, Kevin K. |
collection | PubMed |
description | Data quality is a recognized problem for high-throughput genomics platforms, as evinced by the proliferation of methods attempting to filter out lower quality data points. Different filtering methods lead to discordant results, raising the question: which methods are best? Astonishingly, little computational support is offered to analysts to decide which filtering methods are optimal for the research question at hand. To evaluate them, we begin with a pair of expression data sets, transcriptomic and proteomic, on the same samples. The pair of data sets forms a test-bed for the evaluation. Identifier mapping between the data sets creates a collection of feature pairs, with correlations calculated for each pair. To evaluate a filtering strategy, we estimate posterior probabilities for the correctness of probesets accepted by the method. An analyst can set expected utilities that represent the trade-off between the quality and quantity of accepted features. We tested nine published probeset filtering methods and combination strategies. We used two test-beds from cancer studies providing transcriptomic and proteomic data. For reasonable utility settings, the Jetset filtering method was optimal for probeset filtering on both test-beds, even though the assay platforms differed between them. Further intersection with a second filtering method was indicated on one test-bed but not the other. |
format | Online Article Text |
id | pubmed-4686346 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | Libertas Academica |
record_format | MEDLINE/PubMed |
spelling | pubmed-4686346 2015-12-29 Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering McDade, Kevin K.; Chandran, Uma; Day, Roger S. Cancer Inform Original Research Libertas Academica 2015-12-16 /pmc/articles/PMC4686346/ /pubmed/26715829 http://dx.doi.org/10.4137/CIN.S33076 Text en © 2015 the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article published under the Creative Commons CC-BY-NC 3.0 license. |
spellingShingle | Original Research McDade, Kevin K. Chandran, Uma Day, Roger S. Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering |
title | Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering |
title_sort | improving cancer gene expression data quality through a tcga data-driven evaluation of identifier filtering |
topic | Original Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4686346/ https://www.ncbi.nlm.nih.gov/pubmed/26715829 http://dx.doi.org/10.4137/CIN.S33076 |
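
The evaluation outlined in the description (correlate each mapped transcript/protein pair, then score a filtering method by expected utility over the pairs it accepts) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the data layout, and the utility values are all hypothetical, and the posterior probabilities are taken as given inputs rather than estimated as in the article.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pair_correlations(mrna, protein, id_map):
    """Correlate each mapped (probeset, protein) pair across the shared samples.

    mrna and protein map a feature ID to a list of per-sample values;
    id_map is an iterable of (probeset_id, protein_id) pairs (assumed layout).
    """
    return {(ps, pr): pearson(mrna[ps], protein[pr]) for ps, pr in id_map}


def expected_utility(accepted, posterior, u_correct=1.0, u_incorrect=-1.0):
    """Sum the expected utility over the pairs a filtering method accepts.

    posterior[pair] is an assumed, precomputed probability that the pair is
    correctly mapped; u_correct and u_incorrect are analyst-chosen utilities.
    """
    return sum(
        posterior[p] * u_correct + (1.0 - posterior[p]) * u_incorrect
        for p in accepted
    )


def best_filter(filters, posterior, **utilities):
    """Return the name of the filter with the highest expected utility."""
    return max(
        filters,
        key=lambda name: expected_utility(filters[name], posterior, **utilities),
    )
```

Raising the penalty `u_incorrect` favors stricter filters that accept fewer but more reliable pairs, while a milder penalty rewards quantity; this is one way to express the quality versus quantity trade-off the description mentions.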