Influence of different data cleaning solutions of point‐occurrence records on downstream macroecological diversity models

Digital point‐occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time‐consuming and often unfeasible, given that...

Bibliographic Details
Main authors:	Führding‐Potschkat, Petra; Kreft, Holger; Ickert‐Bond, Stefanie M.
Format:	Online Article Text
Language:	English
Published:	John Wiley and Sons Inc. 2022
Subjects:	Research Articles
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9351331/ https://www.ncbi.nlm.nih.gov/pubmed/35949539 http://dx.doi.org/10.1002/ece3.9168

_version_	1784762422541680640
author	Führding‐Potschkat, Petra; Kreft, Holger; Ickert‐Bond, Stefanie M.
author_facet	Führding‐Potschkat, Petra; Kreft, Holger; Ickert‐Bond, Stefanie M.
author_sort	Führding‐Potschkat, Petra
collection	PubMed
description	Digital point‐occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time‐consuming and often unfeasible, given that the databases may contain thousands or millions of records. Automated data cleaning pipelines are therefore of high importance. Taking North American Ephedra as a model, we examined how different data cleaning pipelines (using, e.g., the GBIF web application, and four different R packages) affect downstream species distribution models (SDMs). We also assessed how data differed from expert data. From 13,889 North American Ephedra observations in GBIF, the pipelines removed 31.7% to 62.7% false positives, invalid coordinates, and duplicates, leading to datasets between 9484 (GBIF application) and 5196 records (manual‐guided filtering). The expert data consisted of 704 records, comparable to data from field studies. Although differences in the absolute numbers of records were relatively large, species richness models based on stacked SDMs (S‐SDM) from pipeline and expert data were strongly correlated (mean Pearson's r across the pipelines: .9986, vs. the expert data: .9173). Our results suggest that all R package‐based pipelines reliably identified invalid coordinates. In contrast, the GBIF‐filtered data still contained both spatial and taxonomic errors. Major drawbacks emerge from the fact that no pipeline fully discovered misidentified specimens without the assistance of taxonomic expert knowledge. We conclude that application‐filtered GBIF data will still need additional review to achieve higher spatial data quality. Achieving high‐quality taxonomic data will require extra effort, probably by thoroughly analyzing the data for misidentified taxa, supported by experts.
format	Online Article Text
id	pubmed-9351331
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	John Wiley and Sons Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-9351331 2022-08-09 Influence of different data cleaning solutions of point‐occurrence records on downstream macroecological diversity models Führding‐Potschkat, Petra Kreft, Holger Ickert‐Bond, Stefanie M. Ecol Evol Research Articles Digital point‐occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time‐consuming and often unfeasible, given that the databases may contain thousands or millions of records. Automated data cleaning pipelines are therefore of high importance. Taking North American Ephedra as a model, we examined how different data cleaning pipelines (using, e.g., the GBIF web application, and four different R packages) affect downstream species distribution models (SDMs). We also assessed how data differed from expert data. From 13,889 North American Ephedra observations in GBIF, the pipelines removed 31.7% to 62.7% false positives, invalid coordinates, and duplicates, leading to datasets between 9484 (GBIF application) and 5196 records (manual‐guided filtering). The expert data consisted of 704 records, comparable to data from field studies. Although differences in the absolute numbers of records were relatively large, species richness models based on stacked SDMs (S‐SDM) from pipeline and expert data were strongly correlated (mean Pearson's r across the pipelines: .9986, vs. the expert data: .9173). Our results suggest that all R package‐based pipelines reliably identified invalid coordinates. In contrast, the GBIF‐filtered data still contained both spatial and taxonomic errors. Major drawbacks emerge from the fact that no pipeline fully discovered misidentified specimens without the assistance of taxonomic expert knowledge. We conclude that application‐filtered GBIF data will still need additional review to achieve higher spatial data quality. Achieving high‐quality taxonomic data will require extra effort, probably by thoroughly analyzing the data for misidentified taxa, supported by experts. John Wiley and Sons Inc. 2022-08-04 /pmc/articles/PMC9351331/ /pubmed/35949539 http://dx.doi.org/10.1002/ece3.9168 Text en © 2022 The Authors. Ecology and Evolution published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Articles Führding‐Potschkat, Petra Kreft, Holger Ickert‐Bond, Stefanie M. Influence of different data cleaning solutions of point‐occurrence records on downstream macroecological diversity models
title	Influence of different data cleaning solutions of point‐occurrence records on downstream macroecological diversity models
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9351331/ https://www.ncbi.nlm.nih.gov/pubmed/35949539 http://dx.doi.org/10.1002/ece3.9168
