Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis

BACKGROUND: Gene Expression Data (GED) analysis poses a great challenge to the scientific community that can be framed into the Knowledge Discovery in Databases (KDD) and Data Mining (DM) paradigm. Biclustering has emerged as the machine learning method of choice to solve this task, but its unsuperv...

Bibliographic Details
Main authors: González-Calabozo, Jose M; Valverde-Albacete, Francisco J; Peláez-Moreno, Carmen
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5024470/
https://www.ncbi.nlm.nih.gov/pubmed/27628041
http://dx.doi.org/10.1186/s12859-016-1234-z
author González-Calabozo, Jose M
Valverde-Albacete, Francisco J
Peláez-Moreno, Carmen
collection PubMed
description BACKGROUND: Gene Expression Data (GED) analysis poses a great challenge to the scientific community that can be framed into the Knowledge Discovery in Databases (KDD) and Data Mining (DM) paradigm. Biclustering has emerged as the machine learning method of choice to solve this task, but its unsupervised nature makes result assessment problematic. This is often addressed by means of Gene Set Enrichment Analysis (GSEA). RESULTS: We put forward a framework in which GED analysis is understood as an Exploratory Data Analysis (EDA) process where we provide support for continuous human interaction with data aiming at improving the step of hypothesis abduction and assessment. We focus on the adaptation to human cognition of data interpretation and visualization of the output of EDA. First, we give a proper theoretical background to bi-clustering using Lattice Theory and provide a set of analysis tools revolving around [Formula: see text] -Formal Concept Analysis ([Formula: see text] -FCA), a lattice-theoretic unsupervised learning technique for real-valued matrices. By using different kinds of cost structures to quantify expression we obtain different sequences of hierarchical bi-clusterings for gene under- and over-expression using thresholds. Consequently, we provide a method with interleaved analysis steps and visualization devices so that the sequences of lattices for a particular experiment summarize the researcher’s vision of the data. This also allows us to define measures of persistence and robustness of biclusters to assess them. Second, the resulting biclusters are used to index external omics databases—for instance, Gene Ontology (GO)—thus offering a new way of accessing publicly available resources. This provides different flavors of gene set enrichment against which to assess the biclusters, by obtaining their p-values according to the terminology of those resources. 
We illustrate the exploration procedure on a real data example confirming results previously published. CONCLUSIONS: The GED analysis problem gets transformed into the exploration of a sequence of lattices enabling the visualization of the hierarchical structure of the biclusters with a certain degree of granularity. The ability of FCA-based bi-clustering methods to index external databases such as GO allows us to obtain a quality measure of the biclusters, to observe the evolution of a gene throughout the different biclusters it appears in, to look for relevant biclusters—by observing their genes and what their persistence is—to infer, for instance, hypotheses on their function. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1234-z) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5024470
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5024470 2016-09-20 BMC Bioinformatics Methodology Article BioMed Central 2016-09-15 /pmc/articles/PMC5024470/ /pubmed/27628041 http://dx.doi.org/10.1186/s12859-016-1234-z Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis
topic Methodology Article