
Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

BACKGROUND: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic a...

Bibliographic Details
Main authors: Hettne, Kristina M; Williams, Antony J; van Mulligen, Erik M; Kleinjans, Jos; Tkachenko, Valery; Kors, Jan A
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2848622/
https://www.ncbi.nlm.nih.gov/pubmed/20331846
http://dx.doi.org/10.1186/1758-2946-2-3
author Hettne, Kristina M
Williams, Antony J
van Mulligen, Erik M
Kleinjans, Jos
Tkachenko, Valery
Kors, Jan A
collection PubMed
description BACKGROUND: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text, based on a number of publicly available databases, and tested it on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. RESULTS: We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only one-quarter to one-third the size of Chemlist, at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation, and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation, and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. CONCLUSIONS: We conclude the following: (1) The ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation are necessary to achieve a high precision for both the automatically generated and the manually curated dictionary.
ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
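The abstract states that Chemlist had the best F-score but gives only precision and recall. The sketch below shows how that ranking could be checked from the reported figures. It assumes the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not say which F-score variant the authors used, and the names f_score, REPORTED, and scores are illustrative only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1.0 (assumed here) gives the balanced F1."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # no true positives at all
    return (1 + beta ** 2) * precision * recall / denominator


# (precision, recall) pairs exactly as reported in the abstract.
REPORTED = {
    ("ChemSpider", "before"): (0.43, 0.19),
    ("ChemSpider", "after"): (0.87, 0.19),
    ("Chemlist", "before"): (0.20, 0.47),
    ("Chemlist", "after"): (0.67, 0.40),
}

# F1 for each dictionary, before and after filtering and disambiguation.
scores = {key: f_score(p, r) for key, (p, r) in REPORTED.items()}

for (dictionary, stage), value in scores.items():
    print(f"{dictionary:10s} {stage:6s} F1 = {value:.2f}")
```

Under the F1 assumption, the post-filtering values come out at roughly 0.31 for ChemSpider and 0.50 for Chemlist, which agrees with the conclusion that Chemlist has the better F-score despite its lower precision.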
format Text
id pubmed-2848622
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2848622 2010-04-02 Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining Hettne, Kristina M Williams, Antony J van Mulligen, Erik M Kleinjans, Jos Tkachenko, Valery Kors, Jan A J Cheminform Research article BioMed Central 2010-03-23 /pmc/articles/PMC2848622/ /pubmed/20331846 http://dx.doi.org/10.1186/1758-2946-2-3 Text en Copyright ©2010 Hettne et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining
topic Research article