
Text-mined fossil biodiversity dynamics using machine learning

Documented occurrences of fossil taxa are the empirical foundation for understanding large-scale biodiversity changes and evolutionary dynamics in deep time. The fossil record contains vast amounts of understudied taxa. Yet the compilation of huge volumes of data remains a labour-intensive impedimen...


Bibliographic Details
Main Authors: Kopperud, Bjørn Tore; Lidgard, Scott; Liow, Lee Hsiang
Format: Online Article Text
Language: English
Published: The Royal Society 2019
Subjects: Palaeobiology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6501925/
https://www.ncbi.nlm.nih.gov/pubmed/31014224
http://dx.doi.org/10.1098/rspb.2019.0022
author Kopperud, Bjørn Tore
Lidgard, Scott
Liow, Lee Hsiang
collection PubMed
description Documented occurrences of fossil taxa are the empirical foundation for understanding large-scale biodiversity changes and evolutionary dynamics in deep time. The fossil record contains vast numbers of understudied taxa. Yet the compilation of huge volumes of data remains a labour-intensive impediment to a more complete understanding of Earth's biodiversity history. Even so, many occurrence records of species and genera in these taxa can be uncovered in the palaeontological literature. Here, we extract observations of fossils and their inferred ages from unstructured text in books and scientific articles using machine-learning approaches. We use Bryozoa, a group of marine invertebrates with a rich fossil record, as a case study. Building on recent advances in computational linguistics, we develop a pipeline to recognize taxonomic names and geologic time intervals in published literature and use supervised learning to machine-read whether the species in question occurred in a given age interval. Intermediate machine error rates appear comparable to human error rates in a simple trial, and resulting genus richness curves capture the main features of published fossil diversity studies of bryozoans. We believe our automated pipeline, which greatly reduced the time required to compile our dataset, can help others compile similar data for other taxa.
format Online
Article
Text
id pubmed-6501925
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher The Royal Society
record_format MEDLINE/PubMed
spelling pubmed-6501925 2019-05-15 Text-mined fossil biodiversity dynamics using machine learning Kopperud, Bjørn Tore; Lidgard, Scott; Liow, Lee Hsiang Proc Biol Sci Palaeobiology The Royal Society 2019-04-24 /pmc/articles/PMC6501925/ /pubmed/31014224 http://dx.doi.org/10.1098/rspb.2019.0022 Text en © 2019 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
title Text-mined fossil biodiversity dynamics using machine learning
topic Palaeobiology