
COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature

Abstract. Background Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting...

Bibliographic Details
Main Authors:	Nguyen, Nhung T.H., Gabud, Roselyn S., Ananiadou, Sophia
Format:	Online Article Text
Language:	English
Published:	Pensoft Publishers 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351503/ https://www.ncbi.nlm.nih.gov/pubmed/30700967 http://dx.doi.org/10.3897/BDJ.7.e29626

_version_	1783390584488067072
author	Nguyen, Nhung T.H. Gabud, Roselyn S. Ananiadou, Sophia
author_sort	Nguyen, Nhung T.H.
collection	PubMed
description	Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or of habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. To alleviate this issue, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities. Results: Two annotators manually annotated the corpus with five categories of entities: taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library, with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences. Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature, a useful task for monitoring species distribution and preserving biodiversity.
format	Online Article Text
id	pubmed-6351503
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Pensoft Publishers
record_format	MEDLINE/PubMed
spelling	pubmed-6351503 2019-01-30 COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature Nguyen, Nhung T.H. Gabud, Roselyn S. Ananiadou, Sophia Biodivers Data J Research Article Pensoft Publishers 2019-01-22 /pmc/articles/PMC6351503/ /pubmed/30700967 http://dx.doi.org/10.3897/BDJ.7.e29626 Text en Nhung Nguyen, Roselyn Gabud, Sophia Ananiadou http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351503/ https://www.ncbi.nlm.nih.gov/pubmed/30700967 http://dx.doi.org/10.3897/BDJ.7.e29626

Similar Items