
Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding

In recent years, data-driven methods and artificial intelligence have been widely used in chemoinformatic and material informatics domains, for which the success is critically determined by the availability of training data with good quality and large quantity. A potential approach...

Bibliographic Details
Main Authors: Huang, Liyuan; Ling, Chen
Format: Online Article Text
Language: English
Published: American Chemical Society 2019
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6854573/
https://www.ncbi.nlm.nih.gov/pubmed/31737809
http://dx.doi.org/10.1021/acsomega.9b02060
author Huang, Liyuan
Ling, Chen
collection PubMed
description In recent years, data-driven methods and artificial intelligence have been widely used in the chemoinformatics and materials informatics domains, where success depends critically on the availability of training data of good quality and in large quantity. A potential approach to break this bottleneck is to leverage the chemical literature, such as papers and patents, as a data resource alternative to high-throughput experiments and simulation. Compared to other domains where natural language processing techniques have established successes, the chemical literature contains a large portion of multiword phrases, which create additional challenges for accurate identification and representation. Here, we introduce an approach suited to the chemistry domain to identify multiword chemical terms and train word representations at the phrase level. Through a series of specially designed experiments, we demonstrate that our method for identifying and representing multiword terms effectively and accurately identifies multiword chemical terms from 119,166 chemical patents and preserves the semantic meaning of chemical phrases more robustly and precisely than the conventional approach, which represents the constituent single words first and combines them afterward. Because the accurate representation of chemical terms is the first and essential step in providing learning features for downstream natural language processing tasks, our results pave the way for using the large volume of chemical literature in future data-driven studies.
format Online
Article
Text
id pubmed-6854573
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-6854573 2019-11-15 Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding Huang, Liyuan Ling, Chen ACS Omega
American Chemical Society 2019-10-31 /pmc/articles/PMC6854573/ /pubmed/31737809 http://dx.doi.org/10.1021/acsomega.9b02060 Text en Copyright © 2019 American Chemical Society This is an open access article published under an ACS AuthorChoice License (http://pubs.acs.org/page/policy/authorchoice_termsofuse.html), which permits copying and redistribution of the article or any adaptations for non-commercial purposes.
title Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6854573/
https://www.ncbi.nlm.nih.gov/pubmed/31737809
http://dx.doi.org/10.1021/acsomega.9b02060