Cargando…

Efficient Test and Visualization of Multi-Set Intersections

Identification of sets of objects with shared features is a common operation in all disciplines. Analysis of intersections among multiple sets is fundamental for in-depth understanding of their complex relationships. However, so far no method has been developed to assess statistical significance of...

Descripción completa

Detalles Bibliográficos
Autores principales: Wang, Minghui, Zhao, Yongzhong, Zhang, Bin
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Nature Publishing Group 2015
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4658477/
https://www.ncbi.nlm.nih.gov/pubmed/26603754
http://dx.doi.org/10.1038/srep16923
Descripción
Sumario:Identification of sets of objects with shared features is a common operation in all disciplines. Analysis of intersections among multiple sets is fundamental for in-depth understanding of their complex relationships. However, so far no method has been developed to assess statistical significance of intersections among three or more sets. Moreover, the state-of-the-art approaches for visualization of multi-set intersections are not scalable. Here, we first developed a theoretical framework for computing the statistical distributions of multi-set intersections based upon combinatorial theory, and then accordingly designed a procedure to efficiently calculate the exact probabilities of multi-set intersections. We further developed multiple efficient and scalable techniques to visualize multi-set intersections and the corresponding intersection statistics. We implemented both the theoretical framework and the visualization techniques in a unified R software package, SuperExactTest. We demonstrated the utility of SuperExactTest through an intensive simulation study and a comprehensive analysis of seven independently curated cancer gene sets as well as six disease or trait associated gene sets identified by genome-wide association studies. We expect SuperExactTest developed by this study will have a broad range of applications in scientific data analysis in many disciplines.