
Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science

Data heterogeneity is a pressing issue and is further compounded if we have to deal with data from textual documents. The unstructured nature of such documents implies that collating, comparing and analysing the information contained therein can be a challenging task. Automating these processes can...


Bibliographic Details
Main authors: Nundloll, Vatsala; Smail, Robert; Stevens, Carly; Blair, Gordon
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9573881/
https://www.ncbi.nlm.nih.gov/pubmed/36262290
http://dx.doi.org/10.1016/j.heliyon.2022.e10710
author Nundloll, Vatsala
Smail, Robert
Stevens, Carly
Blair, Gordon
collection PubMed
description Data heterogeneity is a pressing issue and is further compounded if we have to deal with data from textual documents. The unstructured nature of such documents implies that collating, comparing and analysing the information contained therein can be a challenging task. Automating these processes can help to unlock insightful knowledge that otherwise remains buried in them. Moreover, integrating the extracted information from the documents with other related information can help to make more information-rich queries. In this context, the paper presents a comprehensive review of text extraction and data integration techniques to enable this automation process in an ecological context. The paper investigates the extraction of valuable floristic information from a historical botany journal. The purpose behind this extraction is to bring to light relevant pieces of information contained within the document. In addition, the paper explores the need to integrate the extracted information with other related information from disparate sources. All the information is then rendered into a queryable form in order to make unified queries. To achieve this, the paper makes use of a combination of Machine Learning, Natural Language Processing and Semantic Web techniques. The proposed approach is demonstrated through the information extracted from the journal and the information-rich queries made through the integration process. The paper shows that the approach has merit in extracting relevant information from the journal, discusses how the machine learning models have been designed to classify complex information, and gives a measure of their performance. The paper also shows that the approach has merit in terms of query time when querying floristic information from a multi-source linked data model.
format Online
Article
Text
id pubmed-9573881
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9573881 2022-10-18 Heliyon Research Article Elsevier 2022-10-04 /pmc/articles/PMC9573881/ /pubmed/36262290 http://dx.doi.org/10.1016/j.heliyon.2022.e10710 Text en © 2022 The Author(s). This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science
topic Research Article