From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks

Abstract. Part diary, part scientific record, biological field notebooks often contain details necessary to understanding the location and environmental conditions existent during collecting events. Despite their clear value for (and recent use in) global change studies, the text-mining outputs from...

Bibliographic Details
Main Authors:	Thomer, Andrea; Vaidya, Gaurav; Guralnick, Robert; Bloom, David; Russell, Laura
Format:	Online Article Text
Language:	English
Published:	Pensoft Publishers 2012
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3406479/ https://www.ncbi.nlm.nih.gov/pubmed/22859891 http://dx.doi.org/10.3897/zookeys.209.3247

author	Thomer, Andrea; Vaidya, Gaurav; Guralnick, Robert; Bloom, David; Russell, Laura
collection	PubMed
description	Abstract. Part diary, part scientific record, biological field notebooks often contain details necessary to understanding the location and environmental conditions existent during collecting events. Despite their clear value for (and recent use in) global change studies, the text-mining outputs from field notebooks have been idiosyncratic to specific research projects, and impossible to discover or re-use. Best practices and workflows for digitization, transcription, extraction, and integration with other sources are nascent or non-existent. In this paper, we demonstrate a workflow to generate structured outputs while also maintaining links to the original texts. The first step in this workflow was to place already digitized and transcribed field notebooks from the University of Colorado Museum of Natural History founder, Junius Henderson, on Wikisource, an open text transcription platform. Next, we created Wikisource templates to document places, dates, and taxa to facilitate annotation and wiki-linking. We then requested help from the public, through social media tools, to take advantage of volunteer efforts and energy. After three notebooks were fully annotated, content was converted into XML and annotations were extracted and cross-walked into Darwin Core compliant record sets. Finally, these recordsets were vetted, to provide valid taxon names, via a process we call “taxonomic referencing.” The result is identification and mobilization of 1,068 observations from three of Henderson’s thirteen notebooks and a publishable Darwin Core record set for use in other analyses. Although challenges remain, this work demonstrates a feasible approach to unlock observations from field notebooks that enhances their discovery and interoperability without losing the narrative context from which those observations are drawn. “Compose your notes as if you were writing a letter to someone a century in the future.” Perrine and Patton (2011)
format	Online Article Text
id	pubmed-3406479
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	Pensoft Publishers
record_format	MEDLINE/PubMed
spelling	pubmed-3406479 2012-08-02 From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks Thomer, Andrea; Vaidya, Gaurav; Guralnick, Robert; Bloom, David; Russell, Laura Zookeys Article Pensoft Publishers 2012-07-20 /pmc/articles/PMC3406479/ /pubmed/22859891 http://dx.doi.org/10.3897/zookeys.209.3247 Text en Andrea Thomer, Gaurav Vaidya, Robert Guralnick, David Bloom, Laura Russell http://creativecommons.org/licenses/by/3.0 This is an open access article distributed under the terms of the Creative Commons Attribution License 3.0 (CC-BY), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3406479/ https://www.ncbi.nlm.nih.gov/pubmed/22859891 http://dx.doi.org/10.3897/zookeys.209.3247
