Training text chunkers on a silver standard corpus: can silver replace gold?
BACKGROUND: To train chunkers in recognizing noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of gold standard corpora (GSCs), however, is expensive and time-consuming. GSCs therefore tend to be small and to focus on specific subdomains, which limits their usefulness. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We explored two use scenarios: one in which chunkers are trained on an SSC in a new domain for which a GSC is not available, and one in which chunkers are trained on an available, although small, GSC supplemented with an SSC. RESULTS: We tested the two scenarios using three chunkers, LingPipe, OpenNLP, and YamCha, and two different corpora, GENIA and PennBioIE. For the first scenario, we showed that the systems trained for noun-phrase recognition on the SSC in one domain performed 2.7-3.1 percentage points better in terms of F-score than the systems trained on the GSC in another domain, and only 0.2-0.8 percentage points less than when they were trained on a GSC in the same domain as the SSC. When the outputs of the chunkers were combined, the combined system showed little improvement when using the SSC. For the second scenario, the systems trained on a GSC supplemented with an SSC performed considerably better than systems trained on the GSC alone, especially when the GSC was small. For example, training the chunkers on a GSC of only 10 abstracts supplemented with an SSC yielded performance similar to training them on a GSC of 100-250 abstracts. The combined system even performed better than any of the individual chunkers trained on a GSC of 500 abstracts. CONCLUSIONS: We conclude that an SSC can be a viable alternative to a GSC, or a supplement to one, when training chunkers in a biomedical domain. A combined system shows improvement only if the SSC is used to supplement a GSC. Whether the approach is applicable to other systems in a natural-language-processing pipeline remains to be investigated.
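The SSC construction described above rests on combining the per-token chunk labels of several systems. As a rough illustration of that idea, the sketch below applies a simple majority vote over BIO-encoded chunker outputs; the function name, the vote threshold, and the fallback to the `O` tag are assumptions made for illustration, not the paper's actual harmonization procedure.

```python
from collections import Counter

# Illustrative sketch: build "silver standard" chunk labels by majority vote
# over the BIO outputs of several chunkers. The 0.5 vote threshold and the
# tie handling are assumptions, not the paper's exact combination rule.

def majority_vote(tag_sequences, threshold=0.5):
    """Combine per-token BIO tags from multiple chunkers.

    tag_sequences: list of tag lists, one per chunker, all of equal length.
    A token keeps the winning tag only if its vote share exceeds the
    threshold; otherwise it falls back to 'O' (outside any chunk).
    """
    n_systems = len(tag_sequences)
    combined = []
    for token_tags in zip(*tag_sequences):
        tag, count = Counter(token_tags).most_common(1)[0]
        combined.append(tag if count / n_systems > threshold else "O")
    return combined

# Example: three chunkers labelling the same five tokens.
outputs = [
    ["B-NP", "I-NP", "O", "B-VP", "I-VP"],  # e.g. LingPipe
    ["B-NP", "I-NP", "O", "B-VP", "O"],     # e.g. OpenNLP
    ["B-NP", "O",    "O", "B-VP", "I-VP"],  # e.g. YamCha
]
print(majority_vote(outputs))
# ['B-NP', 'I-NP', 'O', 'B-VP', 'I-VP']
```

For reference, the differences quoted in the results are F-scores, the harmonic mean of precision P and recall R (F1 = 2PR / (P + R)) computed over recognized chunks. In the second scenario, "supplemented with an SSC" presumably means pooling the small GSC with SSC-labelled abstracts into one training set, though the abstract does not spell out the mixing scheme.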
Main Authors: | Kang, Ning; van Mulligen, Erik M; Kors, Jan A |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2012 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3280170/ https://www.ncbi.nlm.nih.gov/pubmed/22289351 http://dx.doi.org/10.1186/1471-2105-13-17 |
_version_ | 1782223782713229312 |
---|---|
author | Kang, Ning; van Mulligen, Erik M; Kors, Jan A |
author_facet | Kang, Ning; van Mulligen, Erik M; Kors, Jan A |
author_sort | Kang, Ning |
collection | PubMed |
description | BACKGROUND: To train chunkers in recognizing noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of gold standard corpora (GSCs), however, is expensive and time-consuming. GSCs therefore tend to be small and to focus on specific subdomains, which limits their usefulness. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We explored two use scenarios: one in which chunkers are trained on an SSC in a new domain for which a GSC is not available, and one in which chunkers are trained on an available, although small, GSC supplemented with an SSC. RESULTS: We tested the two scenarios using three chunkers, LingPipe, OpenNLP, and YamCha, and two different corpora, GENIA and PennBioIE. For the first scenario, we showed that the systems trained for noun-phrase recognition on the SSC in one domain performed 2.7-3.1 percentage points better in terms of F-score than the systems trained on the GSC in another domain, and only 0.2-0.8 percentage points less than when they were trained on a GSC in the same domain as the SSC. When the outputs of the chunkers were combined, the combined system showed little improvement when using the SSC. For the second scenario, the systems trained on a GSC supplemented with an SSC performed considerably better than systems trained on the GSC alone, especially when the GSC was small. For example, training the chunkers on a GSC of only 10 abstracts supplemented with an SSC yielded performance similar to training them on a GSC of 100-250 abstracts. The combined system even performed better than any of the individual chunkers trained on a GSC of 500 abstracts. CONCLUSIONS: We conclude that an SSC can be a viable alternative to a GSC, or a supplement to one, when training chunkers in a biomedical domain. A combined system shows improvement only if the SSC is used to supplement a GSC. Whether the approach is applicable to other systems in a natural-language-processing pipeline remains to be investigated. |
format | Online Article Text |
id | pubmed-3280170 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3280170 2012-02-16 Training text chunkers on a silver standard corpus: can silver replace gold? Kang, Ning; van Mulligen, Erik M; Kors, Jan A. BMC Bioinformatics, Research Article. BioMed Central 2012-01-30 /pmc/articles/PMC3280170/ /pubmed/22289351 http://dx.doi.org/10.1186/1471-2105-13-17 Text en Copyright ©2012 Kang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article; Kang, Ning; van Mulligen, Erik M; Kors, Jan A; Training text chunkers on a silver standard corpus: can silver replace gold? |
title | Training text chunkers on a silver standard corpus: can silver replace gold? |
title_full | Training text chunkers on a silver standard corpus: can silver replace gold? |
title_fullStr | Training text chunkers on a silver standard corpus: can silver replace gold? |
title_full_unstemmed | Training text chunkers on a silver standard corpus: can silver replace gold? |
title_short | Training text chunkers on a silver standard corpus: can silver replace gold? |
title_sort | training text chunkers on a silver standard corpus: can silver replace gold? |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3280170/ https://www.ncbi.nlm.nih.gov/pubmed/22289351 http://dx.doi.org/10.1186/1471-2105-13-17 |
work_keys_str_mv | AT kangning trainingtextchunkersonasilverstandardcorpuscansilverreplacegold AT vanmulligenerikm trainingtextchunkersonasilverstandardcorpuscansilverreplacegold AT korsjana trainingtextchunkersonasilverstandardcorpuscansilverreplacegold |