
Mining chemical patents with an ensemble of open systems

The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/pr...


Bibliographic Details
Main authors: Leaman, Robert; Wei, Chih-Hsuan; Zou, Cherry; Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4865327/
https://www.ncbi.nlm.nih.gov/pubmed/27173521
http://dx.doi.org/10.1093/database/baw065
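
The abstract says the output of five open systems is combined using a machine learning classification approach, but this record gives no further detail. As a rough illustration only, the sketch below treats each candidate span as one classification instance with a binary vote feature per system, and fits a small logistic-regression classifier to decide which spans to keep. The span offsets, the toy system outputs, the gold set, the feature design, and all function names are assumptions made for this example, not the authors' method.

```python
import math

def vote_features(span, system_outputs):
    """One binary feature per system: 1.0 if that system proposed this exact span."""
    return [1.0 if span in output else 0.0 for output in system_outputs]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=2000, lr=0.5):
    """Fit weights and a bias with plain stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            g = p - yi  # gradient of the log loss with respect to the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def accept(w, b, x):
    """Keep a candidate span when the classifier's probability is at least 0.5."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5

# Hypothetical (start, end) character spans proposed by five systems on one text.
system_outputs = [
    {(0, 7), (10, 18)},
    {(0, 7), (20, 25)},
    {(0, 7), (10, 18)},
    {(0, 7), (30, 34)},
    {(10, 18), (30, 34)},
]
gold = {(0, 7), (10, 18)}  # hypothetical annotated mentions used as training labels

candidates = sorted(set().union(*system_outputs))
X = [vote_features(c, system_outputs) for c in candidates]
y = [1.0 if c in gold else 0.0 for c in candidates]

w, b = train_logistic(X, y)
accepted = {c for c, x in zip(candidates, X) if accept(w, b, x)}
print(sorted(accepted))
```

Unlike a fixed majority vote, a learned combiner of this kind can weight systems differently, which is one plausible way an ensemble could outperform each of its members.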
_version_ 1782431774441209856
author Leaman, Robert
Wei, Chih-Hsuan
Zou, Cherry
Lu, Zhiyong
collection PubMed
description The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/proteins) corpora provided by the CHEMDNER task at BioCreative V. Our chemical NER system is an ensemble of five open systems, including both versions of tmChem, our previous work on chemical NER. Their output is combined using a machine learning classification approach. Our chemical NER system obtained 0.8752 precision and 0.9129 recall, for 0.8937 f-score on the CEMP task. Our gene/protein NER system is an extension of our previous work for gene and protein NER, GNormPlus. This system obtained a performance of 0.8143 precision and 0.8141 recall, for 0.8137 f-score on the GPRO task. Both systems achieved the highest performance in their respective tasks at BioCreative V. We conclude that an ensemble of independently-created open systems is sufficiently diverse to significantly improve performance over any individual system, even when they use a similar approach. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/.
format Online
Article
Text
id pubmed-4865327
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4865327 2016-05-13 Mining chemical patents with an ensemble of open systems. Leaman, Robert; Wei, Chih-Hsuan; Zou, Cherry; Lu, Zhiyong. Database (Oxford). Original Article. Oxford University Press 2016-05-12 /pmc/articles/PMC4865327/ /pubmed/27173521 http://dx.doi.org/10.1093/database/baw065 Text en Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.
title Mining chemical patents with an ensemble of open systems
topic Original Article
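
The scores quoted in the description field can be sanity-checked against the usual definition of the balanced F-score as the harmonic mean of precision and recall. The sketch below assumes that definition; the record itself does not state which formula or averaging scheme was used, and the function name is chosen here for illustration.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the description field.
cemp = f_score(0.8752, 0.9129)
gpro = f_score(0.8143, 0.8141)

print(f"CEMP F-score: {cemp:.4f}")
print(f"GPRO F-score: {gpro:.4f}")
```

Under this assumption the CEMP values reproduce the reported 0.8937. The GPRO values give about 0.8142 rather than the reported 0.8137; the record does not say how the reported figure was averaged or rounded, so the difference is left as is.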