
Problems with the nested granularity of feature domains in bioinformatics: the eXtasy case

BACKGROUND: Data from biomedical domains often have an inherent hierarchical structure. As this structure is usually implicit, its existence can be overlooked by practitioners interested in constructing and evaluating predictive models from such data. Ignoring these constructs leads to potentially pr...


Bibliographic Details
Main authors: Popovic, Dusan; Sifrim, Alejandro; Davis, Jesse; Moreau, Yves; De Moor, Bart
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4347616/
https://www.ncbi.nlm.nih.gov/pubmed/25734591
http://dx.doi.org/10.1186/1471-2105-16-S4-S2
author Popovic, Dusan
Sifrim, Alejandro
Davis, Jesse
Moreau, Yves
De Moor, Bart
collection PubMed
description BACKGROUND: Data from biomedical domains often have an inherent hierarchical structure. As this structure is usually implicit, its existence can be overlooked by practitioners interested in constructing and evaluating predictive models from such data. Ignoring these constructs leads to potentially problematic and routinely unrecognized bias in the models and results. In this work, we discuss this bias in detail and propose a simple, sampling-based solution for it. Next, we explore its sources and extent on synthetic data. Finally, we demonstrate how the state-of-the-art variant prioritization framework, eXtasy, benefits from using the described approach in its Random forest-based core classification model. RESULTS AND CONCLUSIONS: The conducted simulations clearly indicate that the heterogeneous granularity of feature domains poses significant problems for both the standard Random forest classifier and a modification that relies on stratified bootstrapping. Conversely, using the proposed sampling scheme when training the classifier mitigates the described bias. Furthermore, when applied to the eXtasy data under a realistic class distribution scenario, a Random forest learned using the proposed sampling scheme displays much better precision than its standard version, without degrading recall. Moreover, the largest performance gains are achieved in the most important part of the operating range: the top of the prioritized gene list.
format Online
Article
Text
id pubmed-4347616
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4347616 2015-03-19 BMC Bioinformatics Research BioMed Central 2015-02-23 Text en Copyright © 2015 Popovic et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Problems with the nested granularity of feature domains in bioinformatics: the eXtasy case
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4347616/
https://www.ncbi.nlm.nih.gov/pubmed/25734591
http://dx.doi.org/10.1186/1471-2105-16-S4-S2