
Towards enhanced and interpretable clustering/classification in integrative genomics

High-throughput technologies have led to large collections of different types of biological data that provide unprecedented opportunities to unravel molecular heterogeneity of biological processes. Nevertheless, how to jointly explore data from multiple sources into a holistic, biologically meaningf...


Bibliographic Details
Main authors: Lu, Yang Young, Lv, Jinchi, Fuhrman, Jed A., Sun, Fengzhu
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5714251/
https://www.ncbi.nlm.nih.gov/pubmed/28977511
http://dx.doi.org/10.1093/nar/gkx767
_version_ 1783283554321432576
author Lu, Yang Young
Lv, Jinchi
Fuhrman, Jed A.
Sun, Fengzhu
author_facet Lu, Yang Young
Lv, Jinchi
Fuhrman, Jed A.
Sun, Fengzhu
author_sort Lu, Yang Young
collection PubMed
description High-throughput technologies have led to large collections of different types of biological data that provide unprecedented opportunities to unravel molecular heterogeneity of biological processes. Nevertheless, how to jointly explore data from multiple sources into a holistic, biologically meaningful interpretation remains challenging. In this work, we propose a scalable and tuning-free preprocessing framework, Heterogeneity Rescaling Pursuit (Hetero-RP), which weighs important features more highly than less important ones in accord with implicitly existing auxiliary knowledge. Finally, we demonstrate effectiveness of Hetero-RP in diverse clustering and classification applications. More importantly, Hetero-RP offers an interpretation of feature importance, shedding light on the driving forces of the underlying biology. In metagenomic contig binning, Hetero-RP automatically weighs abundance and composition profiles according to the varying number of samples, resulting in markedly improved performance of contig binning. In RNA-binding protein (RBP) binding site prediction, Hetero-RP not only improves the prediction performance measured by the area under the receiver operating characteristic curves (AUC), but also uncovers the evidence supported by independent studies, including the distribution of the binding sites of IGF2BP and PUM2, the binding competition between hnRNPC and U2AF2, and the intron–exon boundary of U2AF2 [availability: https://github.com/younglululu/Hetero-RP].
format Online
Article
Text
id pubmed-5714251
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-57142512017-12-08 Towards enhanced and interpretable clustering/classification in integrative genomics Lu, Yang Young Lv, Jinchi Fuhrman, Jed A. Sun, Fengzhu Nucleic Acids Res Methods Online High-throughput technologies have led to large collections of different types of biological data that provide unprecedented opportunities to unravel molecular heterogeneity of biological processes. Nevertheless, how to jointly explore data from multiple sources into a holistic, biologically meaningful interpretation remains challenging. In this work, we propose a scalable and tuning-free preprocessing framework, Heterogeneity Rescaling Pursuit (Hetero-RP), which weighs important features more highly than less important ones in accord with implicitly existing auxiliary knowledge. Finally, we demonstrate effectiveness of Hetero-RP in diverse clustering and classification applications. More importantly, Hetero-RP offers an interpretation of feature importance, shedding light on the driving forces of the underlying biology. In metagenomic contig binning, Hetero-RP automatically weighs abundance and composition profiles according to the varying number of samples, resulting in markedly improved performance of contig binning. In RNA-binding protein (RBP) binding site prediction, Hetero-RP not only improves the prediction performance measured by the area under the receiver operating characteristic curves (AUC), but also uncovers the evidence supported by independent studies, including the distribution of the binding sites of IGF2BP and PUM2, the binding competition between hnRNPC and U2AF2, and the intron–exon boundary of U2AF2 [availability: https://github.com/younglululu/Hetero-RP]. Oxford University Press 2017-11-16 2017-08-30 /pmc/articles/PMC5714251/ /pubmed/28977511 http://dx.doi.org/10.1093/nar/gkx767 Text en © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. 
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Methods Online
Lu, Yang Young
Lv, Jinchi
Fuhrman, Jed A.
Sun, Fengzhu
Towards enhanced and interpretable clustering/classification in integrative genomics
title Towards enhanced and interpretable clustering/classification in integrative genomics
topic Methods Online
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5714251/
https://www.ncbi.nlm.nih.gov/pubmed/28977511
http://dx.doi.org/10.1093/nar/gkx767