
Terminology spectrum analysis of natural-language chemical documents: term-like phrases retrieval routine

BACKGROUND: This study seeks to develop, test and assess a methodology for automatic extraction of a complete set of ‘term-like phrases’ and to create a terminology spectrum from a collection of natural language PDF documents in the field of chemistry. The definition of ‘term-like phrases’ is one or...


Bibliographic Details
Main authors: Alperin, Boris L., Kuzmin, Andrey O., Ilina, Ludmila Yu., Gusev, Vladimir D., Salomatina, Natalia V., Parmon, Valentin N.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2016
Subjects: Methodology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4850643/
https://www.ncbi.nlm.nih.gov/pubmed/27134681
http://dx.doi.org/10.1186/s13321-016-0136-4
author Alperin, Boris L.
Kuzmin, Andrey O.
Ilina, Ludmila Yu.
Gusev, Vladimir D.
Salomatina, Natalia V.
Parmon, Valentin N.
author_sort Alperin, Boris L.
collection PubMed
description BACKGROUND: This study seeks to develop, test and assess a methodology for automatic extraction of a complete set of ‘term-like phrases’ and to create a terminology spectrum from a collection of natural language PDF documents in the field of chemistry. The definition of ‘term-like phrases’ is one or more consecutive words and/or alphanumeric string combinations with unchanged spelling which convey specific scientific meanings. A terminology spectrum for a natural language document is an indexed list of tagged entities including: recognized general scientific concepts, terms linked to existing thesauri, names of chemical substances/reactions and term-like phrases. The retrieval routine is based on n-gram textual analysis with sequential execution of various ‘accept and reject’ rules that take into account morphological and structural information. RESULTS: The retrieval process was assessed quantitatively using precision (P), recall (R) and the F1-measure, calculated manually by professional chemical scientists from a limited set of documents (the full set of text abstracts from 5 EuropaCat events was processed). The assessment demonstrated the effectiveness of the developed approach. Term-like phrase parsing efficiency is quantified by precision (P = 0.53), recall (R = 0.71) and F1-measure (F1 = 0.61). CONCLUSION: The paper suggests using such terminology spectra to perform various types of textual analysis across document collections. This sort of terminology spectrum may be employed for text information retrieval, for reference database development, for analyzing research trends in subject fields and for finding similarities between documents. [Figure: see text] ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-016-0136-4) contains supplementary material, which is available to authorized users.
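The routine summarized in the description (n-gram analysis followed by ‘accept and reject’ rules, then evaluation with P, R and F1) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list the actual rules, so the stop-word list, the two toy rules in `accept`, and all function names here are hypothetical. The sketch also ignores sentence boundaries and morphology, which the described routine takes into account.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the real rule set is not given in the abstract.
STOP_WORDS = {
    "the", "a", "an", "of", "in", "on", "for", "and", "or",
    "to", "with", "is", "are", "was", "were", "by", "at", "from",
}


def tokenize(text):
    """Lowercase the text and return word / alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def ngrams(tokens, max_n=3):
    """Yield every run of 1..max_n consecutive tokens as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def accept(gram):
    """Toy 'accept and reject' rules (assumptions, not the paper's rules)."""
    # Reject n-grams that start or end with a stop word.
    if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
        return False
    # Reject n-grams made only of digits.
    if all(tok.isdigit() for tok in gram):
        return False
    return True


def candidate_phrases(text, max_n=3):
    """Count the n-grams that survive the accept/reject rules."""
    tokens = tokenize(text)
    return Counter(" ".join(g) for g in ngrams(tokens, max_n) if accept(g))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    phrases = candidate_phrases("Oxidation of methane on the platinum catalyst")
    for phrase, count in sorted(phrases.items()):
        print(phrase, count)
    print(round(f1_score(0.53, 0.71), 2))
```

With the reported P = 0.53 and R = 0.71, `f1_score` gives about 0.607, which rounds to the F1 = 0.61 stated in the description.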
format Online
Article
Text
id pubmed-4850643
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-4850643 2016-04-30 Terminology spectrum analysis of natural-language chemical documents: term-like phrases retrieval routine Alperin, Boris L. Kuzmin, Andrey O. Ilina, Ludmila Yu. Gusev, Vladimir D. Salomatina, Natalia V. Parmon, Valentin N. J Cheminform Methodology BACKGROUND: This study seeks to develop, test and assess a methodology for automatic extraction of a complete set of ‘term-like phrases’ and to create a terminology spectrum from a collection of natural language PDF documents in the field of chemistry. The definition of ‘term-like phrases’ is one or more consecutive words and/or alphanumeric string combinations with unchanged spelling which convey specific scientific meanings. A terminology spectrum for a natural language document is an indexed list of tagged entities including: recognized general scientific concepts, terms linked to existing thesauri, names of chemical substances/reactions and term-like phrases. The retrieval routine is based on n-gram textual analysis with sequential execution of various ‘accept and reject’ rules that take into account morphological and structural information. RESULTS: The retrieval process was assessed quantitatively using precision (P), recall (R) and the F1-measure, calculated manually by professional chemical scientists from a limited set of documents (the full set of text abstracts from 5 EuropaCat events was processed). The assessment demonstrated the effectiveness of the developed approach. Term-like phrase parsing efficiency is quantified by precision (P = 0.53), recall (R = 0.71) and F1-measure (F1 = 0.61). CONCLUSION: The paper suggests using such terminology spectra to perform various types of textual analysis across document collections. This sort of terminology spectrum may be employed for text information retrieval, for reference database development, for analyzing research trends in subject fields and for finding similarities between documents. [Figure: see text] ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-016-0136-4) contains supplementary material, which is available to authorized users. Springer International Publishing 2016-04-29 /pmc/articles/PMC4850643/ /pubmed/27134681 http://dx.doi.org/10.1186/s13321-016-0136-4 Text en © Alperin et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Terminology spectrum analysis of natural-language chemical documents: term-like phrases retrieval routine
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4850643/
https://www.ncbi.nlm.nih.gov/pubmed/27134681
http://dx.doi.org/10.1186/s13321-016-0136-4