A comparison of clustering methods for biogeography with fossil datasets

Cluster analysis is one of the most commonly used methods in palaeoecological studies, particularly in studies investigating biogeographic patterns. Although a number of different clustering methods are widely used, the approach and underlying assumptions of many of these methods are quite different...

Bibliographic Details
Main author: Vavrek, Matthew J.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2016
Subjects: Biogeography
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4782735/
https://www.ncbi.nlm.nih.gov/pubmed/26966658
http://dx.doi.org/10.7717/peerj.1720
author Vavrek, Matthew J.
collection PubMed
description Cluster analysis is one of the most commonly used methods in palaeoecological studies, particularly in studies investigating biogeographic patterns. Although a number of different clustering methods are widely used, the approach and underlying assumptions of many of these methods are quite different. For example, methods may be hierarchical or non-hierarchical in their approaches, and may use Euclidean distance or non-Euclidean indices to cluster the data. In order to assess the effectiveness of the different clustering methods as compared to one another, a simulation was designed that could assess each method over a range of both cluster distinctiveness and sampling intensity. Additionally, a non-hierarchical, non-Euclidean, iterative clustering method implemented in the R Statistical Language is described. This method, Non-Euclidean Relational Clustering (NERC), creates distinct clusters by dividing the data set in order to maximize the average similarity within each cluster, identifying clusters in which each data point is on average more similar to those within its own group than to those in any other group. While all the methods performed well with clearly differentiated and well-sampled datasets, when data are less than ideal the linkage methods perform poorly compared to non-Euclidean based k-means and the NERC method. Based on this analysis, Unweighted Pair Group Method with Arithmetic Mean and neighbor joining methods are less reliable with incomplete datasets like those found in palaeobiological analyses, and the k-means and NERC methods should be used in their place.
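The abstract describes NERC only in outline: an iterative, non-hierarchical method that works from non-Euclidean similarities and places each data point in the group it is, on average, most similar to. The following is a minimal Python sketch of that general idea, not the author's R implementation. The Jaccard index, the round-robin starting assignment, the function names (`jaccard`, `mean_similarity`, `relational_clusters`) and the toy taxon lists are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of taxa (a non-Euclidean index)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_similarity(i, members, sim):
    """Average similarity of site i to the other sites listed in `members`."""
    others = [j for j in members if j != i]
    if not others:
        return 0.0  # an empty group or a singleton offers no support
    return sum(sim[i][j] for j in others) / len(others)


def relational_clusters(sites, k, max_iter=100):
    """Illustrative iterative reassignment (assumed scheme, not published NERC).

    Each site is moved to the cluster whose current members it is, on
    average, most similar to. Passes repeat until no site moves or
    `max_iter` is reached, since convergence is not guaranteed in general.
    """
    n = len(sites)
    sim = [[jaccard(sites[i], sites[j]) for j in range(n)] for i in range(n)]
    labels = [i % k for i in range(n)]  # deterministic round-robin start
    for _ in range(max_iter):
        changed = False
        for i in range(n):
            groups = {c: [j for j in range(n) if labels[j] == c]
                      for c in range(k)}
            best = max(range(k),
                       key=lambda c: mean_similarity(i, groups[c], sim))
            current = mean_similarity(i, groups[labels[i]], sim)
            # Move only on a strict improvement, so ties leave the site alone.
            if mean_similarity(i, groups[best], sim) > current:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical presence lists for six localities: two faunas with no shared taxa.
sites = [
    {"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"},
    {"w", "x", "y"}, {"w", "x", "z"}, {"x", "y", "z"},
]
labels = relational_clusters(sites, k=2)
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Because the result of a scheme like this can depend on the starting assignment, a practical run would likely repeat it from several starts and compare the resulting partitions; the sketch uses a single fixed start only to keep the example reproducible.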
format Online
Article
Text
id pubmed-4782735
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-4782735 2016-03-10 A comparison of clustering methods for biogeography with fossil datasets Vavrek, Matthew J. PeerJ Biogeography PeerJ Inc. 2016-02-25 /pmc/articles/PMC4782735/ /pubmed/26966658 http://dx.doi.org/10.7717/peerj.1720 Text en
© 2016 Vavrek http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title A comparison of clustering methods for biogeography with fossil datasets
topic Biogeography