Semblance: An empirical similarity kernel on probability spaces

Bibliographic Details
Main Authors: Agarwal, Divyansh; Zhang, Nancy R.
Format: Online Article Text
Language: English
Published: American Association for the Advancement of Science 2019
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6892634/
https://www.ncbi.nlm.nih.gov/pubmed/31840051
http://dx.doi.org/10.1126/sciadv.aau9630
Description
Summary: In data science, determining proximity between observations is critical to many downstream analyses such as clustering, classification, and prediction. However, when the data’s underlying probability distribution is unclear, the function used to compute similarity between data points is often arbitrarily chosen. Here, we present a novel definition of proximity, Semblance, that uses the empirical distribution of a feature to inform the pairwise similarity between observations. The advantage of Semblance lies in its distribution-free formulation and its ability to place greater emphasis on proximity between observation pairs that fall at the outskirts of the data distribution, as opposed to those toward the center. Semblance is a valid Mercer kernel, allowing its principled use in kernel-based learning algorithms, and for any data modality. We demonstrate its consistently improved performance against conventional methods through simulations and real case studies from diverse applications in single-cell transcriptomics, image reconstruction, and financial forecasting.
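
Illustrative sketch (not part of the catalog record): the summary gives no formula, so the Python below shows only one plausible reading of an empirical-distribution similarity. It assumes that the similarity of two values of a feature is one minus the fraction of the sample lying in the closed interval between them, and that features are averaged with equal weight. The function names `semblance_1d` and `semblance` are hypothetical and are not the authors' implementation.

```python
import numpy as np


def semblance_1d(values):
    """Rank-based similarity matrix for one feature (assumed form).

    k(x, y) = 1 - (fraction of sample values in [min(x, y), max(x, y)]).
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    xs = np.sort(x)
    lo = np.minimum.outer(x, x)  # pairwise lower endpoints
    hi = np.maximum.outer(x, x)  # pairwise upper endpoints
    # Number of sample points <= hi, minus number of sample points < lo.
    count_le_hi = np.searchsorted(xs, hi, side="right")
    count_lt_lo = np.searchsorted(xs, lo, side="left")
    mass = (count_le_hi - count_lt_lo) / n
    return 1.0 - mass


def semblance(data):
    """Average the per-feature matrices (equal weights are an assumption)."""
    X = np.asarray(data, dtype=float)
    per_feature = [semblance_1d(X[:, g]) for g in range(X.shape[1])]
    return np.mean(per_feature, axis=0)


# Six observations share the common value 0; two sit in the sparse tail.
feature = [0, 0, 0, 0, 0, 0, 5, 6]
K = semblance_1d(feature)
print(K[0, 1])  # similarity of two common observations
print(K[6, 7])  # similarity of two rare observations
```

Under these assumptions, the two rare observations (5 and 6) score as more similar to each other than two observations that share the common value 0, which is the kind of emphasis on the outskirts of the distribution that the summary describes. The resulting matrix is symmetric and positive semidefinite on this example, so it could be passed to a kernel method that accepts a precomputed kernel.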