Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data
BACKGROUND: Surveys of presences and absences of specific species across multiple biogeographic units (or bioregions) are used in a broad range of biological studies, from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships...
Main authors: | Chung, Neo Christopher; Miasojedow, Błażej; Startek, Michał; Gambin, Anna |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929325/ https://www.ncbi.nlm.nih.gov/pubmed/31874610 http://dx.doi.org/10.1186/s12859-019-3118-5 |
_version_ | 1783482677917122560 |
---|---|
author | Chung, Neo Christopher; Miasojedow, Błażej; Startek, Michał; Gambin, Anna |
author_facet | Chung, Neo Christopher; Miasojedow, Błażej; Startek, Michał; Gambin, Anna |
author_sort | Chung, Neo Christopher |
collection | PubMed |
description | BACKGROUND: Surveys of presences and absences of specific species across multiple biogeographic units (or bioregions) are used in a broad range of biological studies, from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize similarity between occurrences of species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing using this similarity coefficient has seldom been used or studied. RESULTS: We introduce a hypothesis test for similarity for biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome the computational burden due to high dimensionality, we propose the bootstrap and measurement concentration algorithms to efficiently estimate the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly with increasing dimensionality. We showcase their applications in evaluating co-occurrences of bird species in 28 islands of Vanuatu and fish species in 3347 freshwater habitats in France. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard). CONCLUSION: We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data that enable straightforward incorporation of probabilistic measures in the analysis of species co-occurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science. |
format | Online Article Text |
id | pubmed-6929325 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6929325 2019-12-30 Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data Chung, Neo Christopher; Miasojedow, Błażej; Startek, Michał; Gambin, Anna BMC Bioinformatics Research BioMed Central 2019-12-24 /pmc/articles/PMC6929325/ /pubmed/31874610 http://dx.doi.org/10.1186/s12859-019-3118-5 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Chung, Neo Christopher; Miasojedow, Błażej; Startek, Michał; Gambin, Anna Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data |
title | Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data |
title_full | Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data |
title_fullStr | Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data |
title_full_unstemmed | Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data |
title_short | Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data |
title_sort | jaccard/tanimoto similarity test and estimation methods for biological presence-absence data |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929325/ https://www.ncbi.nlm.nih.gov/pubmed/31874610 http://dx.doi.org/10.1186/s12859-019-3118-5 |
work_keys_str_mv | AT chungneochristopher jaccardtanimotosimilaritytestandestimationmethodsforbiologicalpresenceabsencedata AT miasojedowbłazej jaccardtanimotosimilaritytestandestimationmethodsforbiologicalpresenceabsencedata AT startekmichał jaccardtanimotosimilaritytestandestimationmethodsforbiologicalpresenceabsencedata AT gambinanna jaccardtanimotosimilaritytestandestimationmethodsforbiologicalpresenceabsencedata |
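
The abstract in the description field defines the Jaccard/Tanimoto coefficient as the ratio of the intersection to the union of two presence-absence vectors, and mentions centered coefficients that account for occurrence probabilities. The following is a minimal Python sketch of those ideas, not the authors' implementation and not the API of the R package jaccard. It rests on stated assumptions: the expectation under independence is approximated by the plug-in formula p_x p_y / (p_x + p_y - p_x p_y), where p_x and p_y are the observed occurrence frequencies, and significance is estimated with a plain permutation test swapped in for the bootstrap and measurement concentration algorithms named in the abstract. All function names are hypothetical.

```python
import random


def jaccard(x, y):
    """Ratio of the intersection to the union of two 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Convention chosen for this sketch: two all-absent vectors give 0.
    return both / either if either else 0.0


def expected_jaccard(x, y):
    """Plug-in approximation of the expected coefficient under independence.

    Assumption: occurrence probabilities are estimated by the observed
    frequencies of presences in each vector.
    """
    px = sum(x) / len(x)
    py = sum(y) / len(y)
    denom = px + py - px * py
    return (px * py) / denom if denom else 0.0


def centered_jaccard(x, y):
    """Observed coefficient minus its approximate expectation."""
    return jaccard(x, y) - expected_jaccard(x, y)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the centered coefficient.

    Shuffling y breaks any association with x while keeping the number
    of presences in each vector fixed.
    """
    rng = random.Random(seed)
    observed = abs(centered_jaccard(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(centered_jaccard(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical presence-absence data for two species over 8 sites.
    species_a = [1, 1, 1, 0, 0, 0, 1, 0]
    species_b = [1, 1, 0, 0, 0, 0, 1, 0]
    print(jaccard(species_a, species_b))  # 0.75 (3 shared sites / 4 occupied sites)
    print(centered_jaccard(species_a, species_b))
    print(permutation_pvalue(species_a, species_b))
```

With only eight sites the permutation distribution is coarse, so the p-value from this toy input is illustrative only; the record's abstract reports that the published methods were evaluated on far larger data sets.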