Cargando…

Finding an appropriate equation to measure similarity between binary vectors: case studies on Indonesian and Japanese herbal medicines

BACKGROUND: The binary similarity and dissimilarity measures have critical roles in the processing of data consisting of binary vectors in various fields including bioinformatics and chemometrics. These metrics express the similarity and dissimilarity values between two binary vectors in terms of th...

Descripción completa

Detalles Bibliográficos
Autores principales: Wijaya, Sony Hartono, Afendi, Farit Mochamad, Batubara, Irmanida, Darusman, Latifah K., Altaf-Ul-Amin, Md, Kanaya, Shigehiko
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2016
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5142342/
https://www.ncbi.nlm.nih.gov/pubmed/27927171
http://dx.doi.org/10.1186/s12859-016-1392-z
Descripción
Sumario:BACKGROUND: The binary similarity and dissimilarity measures have critical roles in the processing of data consisting of binary vectors in various fields including bioinformatics and chemometrics. These metrics express the similarity and dissimilarity values between two binary vectors in terms of the positive matches, absence mismatches or negative matches. To our knowledge, there is no published work presenting a systematic way of finding an appropriate equation to measure binary similarity that performs well for certain data type or application. A proper method to select a suitable binary similarity or dissimilarity measure is needed to obtain better classification results. RESULTS: In this study, we proposed a novel approach to select binary similarity and dissimilarity measures. We collected 79 binary similarity and dissimilarity equations by extensive literature search and implemented those equations as an R package called bmeasures. We applied these metrics to quantify the similarity and dissimilarity between herbal medicine formulas belonging to the Indonesian Jamu and Japanese Kampo separately. We assessed the capability of binary equations to classify herbal medicine pairs into match and mismatch efficacies based on their similarity or dissimilarity coefficients using the Receiver Operating Characteristic (ROC) curve analysis. According to the area under the ROC curve results, we found Indonesian Jamu and Japanese Kampo datasets obtained different ranking of binary similarity and dissimilarity measures. Out of all the equations, the Forbes-2 similarity and the Variant of Correlation similarity measures are recommended for studying the relationship between Jamu formulas and Kampo formulas, respectively. CONCLUSIONS: The selection of binary similarity and dissimilarity measures for multivariate analysis is data dependent. The proposed method can be used to find the most suitable binary similarity and dissimilarity equation wisely for a particular data. Our finding suggests that all four types of matching quantities in the Operational Taxonomic Unit (OTU) table are important to calculate the similarity and dissimilarity coefficients between herbal medicine formulas. Also, the binary similarity and dissimilarity measures that include the negative match quantity d achieve better capability to separate herbal medicine pairs compared to equations that exclude d. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1392-z) contains supplementary material, which is available to authorized users.