A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration

BACKGROUND: In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: “molecular identification” (MI). Biological meaning comes fr...

Bibliographic Details
Main Authors:	Day, Roger S, McDade, Kevin K
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3734162/ https://www.ncbi.nlm.nih.gov/pubmed/23855655 http://dx.doi.org/10.1186/1471-2105-14-223

author	Day, Roger S McDade, Kevin K
collection	PubMed
description	BACKGROUND: In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: “molecular identification” (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices. RESULTS: We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events. CONCLUSIONS: The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.
format	Online Article Text
id	pubmed-3734162
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3734162 2013-08-06 A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration Day, Roger S McDade, Kevin K BMC Bioinformatics Methodology Article BioMed Central 2013-07-15 /pmc/articles/PMC3734162/ /pubmed/23855655 http://dx.doi.org/10.1186/1471-2105-14-223 Text en Copyright © 2013 Day and McDade; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration
topic	Methodology Article