Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering

Data quality is a recognized problem for high-throughput genomics platforms, as evinced by the proliferation of methods attempting to filter out lower quality data points. Different filtering methods lead to discordant results, raising the question, which methods are best? Astonishingly, little comp...

Bibliographic Details
Main Authors: McDade, Kevin K., Chandran, Uma, Day, Roger S.
Format: Online Article Text
Language: English
Published: Libertas Academica 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4686346/
https://www.ncbi.nlm.nih.gov/pubmed/26715829
http://dx.doi.org/10.4137/CIN.S33076
_version_ 1782406429678764032
author McDade, Kevin K.
Chandran, Uma
Day, Roger S.
author_facet McDade, Kevin K.
Chandran, Uma
Day, Roger S.
author_sort McDade, Kevin K.
collection PubMed
description Data quality is a recognized problem for high-throughput genomics platforms, as evinced by the proliferation of methods attempting to filter out lower quality data points. Different filtering methods lead to discordant results, raising the question, which methods are best? Astonishingly, little computational support is offered to analysts to decide which filtering methods are optimal for the research question at hand. To evaluate them, we begin with a pair of expression data sets, transcriptomic and proteomic, on the same samples. The pair of data sets form a test-bed for the evaluation. Identifier mapping between the data sets creates a collection of feature pairs, with correlations calculated for each pair. To evaluate a filtering strategy, we estimate posterior probabilities for the correctness of probesets accepted by the method. An analyst can set expected utilities that represent the trade-off between the quality and quantity of accepted features. We tested nine published probeset filtering methods and combination strategies. We used two test-beds from cancer studies providing transcriptomic and proteomic data. For reasonable utility settings, the Jetset filtering method was optimal for probeset filtering on both test-beds, even though both assay platforms were different. Further intersection with a second filtering method was indicated on one test-bed but not the other.
format Online
Article
Text
id pubmed-4686346
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Libertas Academica
record_format MEDLINE/PubMed
spelling pubmed-46863462015-12-29 Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering McDade, Kevin K. Chandran, Uma Day, Roger S. Cancer Inform Original Research Data quality is a recognized problem for high-throughput genomics platforms, as evinced by the proliferation of methods attempting to filter out lower quality data points. Different filtering methods lead to discordant results, raising the question, which methods are best? Astonishingly, little computational support is offered to analysts to decide which filtering methods are optimal for the research question at hand. To evaluate them, we begin with a pair of expression data sets, transcriptomic and proteomic, on the same samples. The pair of data sets form a test-bed for the evaluation. Identifier mapping between the data sets creates a collection of feature pairs, with correlations calculated for each pair. To evaluate a filtering strategy, we estimate posterior probabilities for the correctness of probesets accepted by the method. An analyst can set expected utilities that represent the trade-off between the quality and quantity of accepted features. We tested nine published probeset filtering methods and combination strategies. We used two test-beds from cancer studies providing transcriptomic and proteomic data. For reasonable utility settings, the Jetset filtering method was optimal for probeset filtering on both test-beds, even though both assay platforms were different. Further intersection with a second filtering method was indicated on one test-bed but not the other. Libertas Academica 2015-12-16 /pmc/articles/PMC4686346/ /pubmed/26715829 http://dx.doi.org/10.4137/CIN.S33076 Text en © 2015 the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article published under the Creative Commons CC-BY-NC 3.0 license.
spellingShingle Original Research
McDade, Kevin K.
Chandran, Uma
Day, Roger S.
Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering
title Improving Cancer Gene Expression Data Quality through a TCGA Data-Driven Evaluation of Identifier Filtering
title_sort improving cancer gene expression data quality through a tcga data-driven evaluation of identifier filtering
topic Original Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4686346/
https://www.ncbi.nlm.nih.gov/pubmed/26715829
http://dx.doi.org/10.4137/CIN.S33076
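
The evaluation outlined in the description field (feature-pair correlations, posterior probabilities of correctness, and analyst-set utilities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the utility values, and the assumption that posterior probabilities are already available as an array are all hypothetical, and the article should be consulted for the actual model.

```python
import numpy as np


def pair_correlations(mrna, protein):
    """Pearson correlation for each feature pair.

    Rows are mapped feature pairs, columns are the shared samples.
    """
    m = mrna - mrna.mean(axis=1, keepdims=True)
    p = protein - protein.mean(axis=1, keepdims=True)
    denom = np.sqrt((m ** 2).sum(axis=1) * (p ** 2).sum(axis=1))
    return (m * p).sum(axis=1) / denom


def expected_utility(posterior, accepted, u_true=1.0, u_false=-1.0):
    """Total expected utility of the features a filter accepts.

    posterior: assumed probability that each feature pair is correct
               (how it is estimated is outside this sketch).
    accepted:  boolean mask of features kept by the filter.
    u_true, u_false: hypothetical utilities for accepting a correct
               or an incorrect feature.
    """
    post = np.asarray(posterior, dtype=float)
    post = post[np.asarray(accepted, dtype=bool)]
    return float(np.sum(post * u_true + (1.0 - post) * u_false))


def best_filter(posterior, filters, u_true=1.0, u_false=-1.0):
    """Score each named filter mask and return the best name with all scores."""
    scores = {
        name: expected_utility(posterior, mask, u_true, u_false)
        for name, mask in filters.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data: two feature pairs measured on three samples.
    mrna = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
    protein = np.array([[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
    print(pair_correlations(mrna, protein))

    # Toy posteriors for three features and two hypothetical filters.
    posterior = [0.9, 0.2, 0.6]
    filters = {"keep_all": [True, True, True], "strict": [True, False, False]}
    print(best_filter(posterior, filters, u_true=1.0, u_false=-2.0))
```

Making the penalty for a false acceptance larger (a more negative `u_false`) favors stricter filters, while a smaller penalty favors filters that keep more features. This is the quality-versus-quantity trade-off the description mentions.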