A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database

BACKGROUND: Many of the most popular pre-processing methods for Affymetrix expression arrays, such as RMA, gcRMA, and PLIER, simultaneously analyze data across a set of predetermined arrays to improve precision of the final measures of expression. One problem associated with these algorithms is that...

Bibliographic Details
Main Authors:	Katz, Simon; Irizarry, Rafael A; Lin, Xue; Tripputi, Mark; Porter, Mark W
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1624855/ https://www.ncbi.nlm.nih.gov/pubmed/17059591 http://dx.doi.org/10.1186/1471-2105-7-464

author	Katz, Simon Irizarry, Rafael A Lin, Xue Tripputi, Mark Porter, Mark W
collection	PubMed
description	BACKGROUND: Many of the most popular pre-processing methods for Affymetrix expression arrays, such as RMA, gcRMA, and PLIER, simultaneously analyze data across a set of predetermined arrays to improve precision of the final measures of expression. One problem associated with these algorithms is that expression measurements for a particular sample are highly dependent on the set of samples used for normalization, and results obtained by normalization with a different set may not be comparable. A related problem is that an organization producing and/or storing large amounts of data in a sequential fashion will need to either re-run the pre-processing algorithm every time an array is added or store the arrays in batches that are pre-processed together. Furthermore, pre-processing of large numbers of arrays requires loading all the feature-level data into memory, which is a difficult task even with modern computers. We utilize a scheme that produces all the information necessary for pre-processing using a very large training set that can be used for summarization of samples outside of the training set. All subsequent pre-processing tasks can be done on an individual array basis. We demonstrate the utility of this approach by defining a new version of the Robust Multi-chip Averaging (RMA) algorithm, which we refer to as refRMA. RESULTS: We assess performance based on multiple sets of samples processed over HG U133A Affymetrix GeneChip® arrays. We show that the refRMA workflow, when used in conjunction with a large, biologically diverse training set, results in the same general characteristics as that of RMA in its classic form when comparing overall data structure, sample-to-sample correlation, and variation. Further, we demonstrate that the refRMA workflow and reference set can be robustly applied to naïve organ types and to benchmark data where its performance indicates respectable results.
CONCLUSION: Our results indicate that a biologically diverse reference database can be used to train a model for estimating probe set intensities of exclusive test sets, while retaining the overall characteristics of the base algorithm. Although the results we present are specific for RMA, similar versions of other multi-array normalization and summarization schemes can be developed.
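The train-once, apply-per-array idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' refRMA implementation: it assumes NumPy, skips background correction, and replaces the median-polish fit used by RMA with a cruder per-probe median. The function names `train_reference` and `summarize_single_array` and the `probe_sets` label array are hypothetical, introduced only for illustration.

```python
import numpy as np


def train_reference(train, probe_sets):
    """Learn frozen parameters from a (probes x arrays) matrix of raw intensities.

    probe_sets holds one probe-set label per probe (row). Hypothetical helper.
    """
    # Frozen reference distribution: mean of the sorted intensities across arrays.
    ref_quantiles = np.sort(train, axis=0).mean(axis=1)
    # Quantile-normalize every training array onto that distribution.
    ranks = train.argsort(axis=0).argsort(axis=0)
    log_norm = np.log2(ref_quantiles[ranks])
    # Crude probe effects (assumption: stands in for median polish):
    # per-probe median across arrays, centred within each probe set.
    probe_effects = np.median(log_norm, axis=1)
    for ps in np.unique(probe_sets):
        idx = probe_sets == ps
        probe_effects[idx] -= probe_effects[idx].mean()
    return ref_quantiles, probe_effects


def summarize_single_array(array, probe_sets, ref_quantiles, probe_effects):
    """Summarize one new array using only the frozen training parameters."""
    # Map the array's ranks onto the frozen reference distribution.
    ranks = array.argsort().argsort()
    log_norm = np.log2(ref_quantiles[ranks])
    # Remove the trained probe effects, then take a robust average per probe set.
    adjusted = log_norm - probe_effects
    return {str(ps): float(np.median(adjusted[probe_sets == ps]))
            for ps in np.unique(probe_sets)}


# Usage with synthetic data: 2 probe sets of 4 probes, 50 training arrays.
rng = np.random.default_rng(0)
probe_sets = np.array(["geneA"] * 4 + ["geneB"] * 4)
train = rng.lognormal(mean=7.0, sigma=1.0, size=(8, 50))
ref_q, effects = train_reference(train, probe_sets)
new_array = rng.lognormal(mean=7.0, sigma=1.0, size=8)
print(summarize_single_array(new_array, probe_sets, ref_q, effects))
```

Because the new array is processed against stored parameters only, adding it requires neither reloading the training data nor re-processing earlier arrays, which is the property the abstract emphasizes.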
format	Text
id	pubmed-1624855
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1624855 2006-10-26 A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database Katz, Simon Irizarry, Rafael A Lin, Xue Tripputi, Mark Porter, Mark W BMC Bioinformatics Methodology Article BioMed Central 2006-10-23 /pmc/articles/PMC1624855/ /pubmed/17059591 http://dx.doi.org/10.1186/1471-2105-7-464 Text en Copyright © 2006 Katz et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1624855/ https://www.ncbi.nlm.nih.gov/pubmed/17059591 http://dx.doi.org/10.1186/1471-2105-7-464