
An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria

Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologie...

Bibliographic Details
Main Authors: Loucoubar, Cheikh; Paul, Richard; Bar-Hen, Avner; Huret, Augustin; Tall, Adama; Sokhna, Cheikh; Trape, Jean-François; Ly, Alioune Badara; Faye, Joseph; Badiane, Abdoulaye; Diakhaby, Gaoussou; Sarr, Fatoumata Diène; Diop, Aliou; Sakuntabhai, Anavaj; Bureau, Jean-François
Format: Online Article Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170284/
https://www.ncbi.nlm.nih.gov/pubmed/21931645
http://dx.doi.org/10.1371/journal.pone.0024085
_version_ 1782211603430637568
author Loucoubar, Cheikh
Paul, Richard
Bar-Hen, Avner
Huret, Augustin
Tall, Adama
Sokhna, Cheikh
Trape, Jean-François
Ly, Alioune Badara
Faye, Joseph
Badiane, Abdoulaye
Diakhaby, Gaoussou
Sarr, Fatoumata Diène
Diop, Aliou
Sakuntabhai, Anavaj
Bureau, Jean-François
author_facet Loucoubar, Cheikh
Paul, Richard
Bar-Hen, Avner
Huret, Augustin
Tall, Adama
Sokhna, Cheikh
Trape, Jean-François
Ly, Alioune Badara
Faye, Joseph
Badiane, Abdoulaye
Diakhaby, Gaoussou
Sarr, Fatoumata Diène
Diop, Aliou
Sakuntabhai, Anavaj
Bureau, Jean-François
author_sort Loucoubar, Cheikh
collection PubMed
description Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCube®, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992–2003, aged 1–5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. 
Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems.
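The idea of a rule as a region of over density can be illustrated with a minimal sketch. It is not the HyperCube® algorithm, which this record does not describe in detail; it is a brute-force toy that assumes a rule is a conjunction of an age interval and a hemoglobin type, and that relative risk is the event rate inside the rule divided by the event rate in the whole data set. The rows, the function names, and the `min_size` threshold are all invented for illustration.

```python
# Toy observations (invented for illustration, NOT data from the study):
# each row is (age_years, hemoglobin_type, had_episode).
ROWS = [
    (2, "AA", 1), (3, "AA", 1), (4, "AA", 1), (5, "AA", 0),
    (2, "AS", 0), (4, "AS", 1), (8, "AA", 0), (12, "AA", 0),
    (9, "AS", 0), (15, "AA", 1), (20, "AS", 0), (30, "AA", 0),
]


def event_rate(rows):
    """Fraction of observations in which an episode occurred."""
    return sum(r[2] for r in rows) / len(rows)


def in_rule(row, age_lo, age_hi, hb):
    """A rule is a conjunction: age within [age_lo, age_hi] AND hemoglobin == hb."""
    return age_lo <= row[0] <= age_hi and row[1] == hb


def best_rule(rows, min_size=3):
    """Try every age interval and hemoglobin type (assumed search space).

    Returns (relative_risk, age_lo, age_hi, hb, size) for the rule with the
    highest relative risk among rules covering at least min_size rows.
    """
    ages = sorted({r[0] for r in rows})
    types = sorted({r[1] for r in rows})
    base = event_rate(rows)  # event rate in the whole (toy) population
    best = None
    for i, lo in enumerate(ages):
        for hi in ages[i:]:
            for hb in types:
                inside = [r for r in rows if in_rule(r, lo, hi, hb)]
                if len(inside) < min_size:
                    continue  # too few observations to be representative
                rr = event_rate(inside) / base
                if best is None or rr > best[0]:
                    best = (rr, lo, hi, hb, len(inside))
    return best


if __name__ == "__main__":
    rr, lo, hi, hb, size = best_rule(ROWS)
    print(f"age {lo}-{hi}, hemoglobin {hb}: RR = {rr:.2f} over {size} rows")
```

The minimum-size threshold stands in for the representativity constraint mentioned in the description: without it, a rule covering a single observation with an episode would trivially reach the highest relative risk.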
format Online
Article
Text
id pubmed-3170284
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-31702842011-09-19 An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria Loucoubar, Cheikh Paul, Richard Bar-Hen, Avner Huret, Augustin Tall, Adama Sokhna, Cheikh Trape, Jean-François Ly, Alioune Badara Faye, Joseph Badiane, Abdoulaye Diakhaby, Gaoussou Sarr, Fatoumata Diène Diop, Aliou Sakuntabhai, Anavaj Bureau, Jean-François PLoS One Research Article Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCube®, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992–2003, aged 1–5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. 
No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems. Public Library of Science 2011-09-09 /pmc/articles/PMC3170284/ /pubmed/21931645 http://dx.doi.org/10.1371/journal.pone.0024085 Text en Loucoubar et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Loucoubar, Cheikh
Paul, Richard
Bar-Hen, Avner
Huret, Augustin
Tall, Adama
Sokhna, Cheikh
Trape, Jean-François
Ly, Alioune Badara
Faye, Joseph
Badiane, Abdoulaye
Diakhaby, Gaoussou
Sarr, Fatoumata Diène
Diop, Aliou
Sakuntabhai, Anavaj
Bureau, Jean-François
An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria
title An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria
title_full An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria
title_fullStr An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria
title_full_unstemmed An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria
title_short An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria
title_sort exhaustive, non-euclidean, non-parametric data mining tool for unraveling the complexity of biological systems – novel insights into malaria
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170284/
https://www.ncbi.nlm.nih.gov/pubmed/21931645
http://dx.doi.org/10.1371/journal.pone.0024085