Mining chemical patents with an ensemble of open systems
The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/pr...
Main Authors: | Leaman, Robert; Wei, Chih-Hsuan; Zou, Cherry; Lu, Zhiyong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2016 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4865327/ https://www.ncbi.nlm.nih.gov/pubmed/27173521 http://dx.doi.org/10.1093/database/baw065 |
_version_ | 1782431774441209856 |
---|---|
author | Leaman, Robert Wei, Chih-Hsuan Zou, Cherry Lu, Zhiyong |
author_facet | Leaman, Robert Wei, Chih-Hsuan Zou, Cherry Lu, Zhiyong |
author_sort | Leaman, Robert |
collection | PubMed |
description | The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/proteins) corpora provided by the CHEMDNER task at BioCreative V. Our chemical NER system is an ensemble of five open systems, including both versions of tmChem, our previous work on chemical NER. Their output is combined using a machine learning classification approach. Our chemical NER system obtained 0.8752 precision and 0.9129 recall, for 0.8937 f-score on the CEMP task. Our gene/protein NER system is an extension of our previous work for gene and protein NER, GNormPlus. This system obtained a performance of 0.8143 precision and 0.8141 recall, for 0.8137 f-score on the GPRO task. Both systems achieved the highest performance in their respective tasks at BioCreative V. We conclude that an ensemble of independently-created open systems is sufficiently diverse to significantly improve performance over any individual system, even when they use a similar approach. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. |
format | Online Article Text |
id | pubmed-4865327 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-4865327 2016-05-13 Mining chemical patents with an ensemble of open systems Leaman, Robert Wei, Chih-Hsuan Zou, Cherry Lu, Zhiyong Database (Oxford) Original Article The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/proteins) corpora provided by the CHEMDNER task at BioCreative V. Our chemical NER system is an ensemble of five open systems, including both versions of tmChem, our previous work on chemical NER. Their output is combined using a machine learning classification approach. Our chemical NER system obtained 0.8752 precision and 0.9129 recall, for 0.8937 f-score on the CEMP task. Our gene/protein NER system is an extension of our previous work for gene and protein NER, GNormPlus. This system obtained a performance of 0.8143 precision and 0.8141 recall, for 0.8137 f-score on the GPRO task. Both systems achieved the highest performance in their respective tasks at BioCreative V. We conclude that an ensemble of independently-created open systems is sufficiently diverse to significantly improve performance over any individual system, even when they use a similar approach. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Oxford University Press 2016-05-12 /pmc/articles/PMC4865327/ /pubmed/27173521 http://dx.doi.org/10.1093/database/baw065 Text en Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US. |
spellingShingle | Original Article Leaman, Robert Wei, Chih-Hsuan Zou, Cherry Lu, Zhiyong Mining chemical patents with an ensemble of open systems |
title | Mining chemical patents with an ensemble of open systems |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4865327/ https://www.ncbi.nlm.nih.gov/pubmed/27173521 http://dx.doi.org/10.1093/database/baw065 |