
Semi-supervised Learning for the BioNLP Gene Regulation Network

BACKGROUND: The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this f...


Bibliographic Details
Main authors:	Provoost, Thomas; Moens, Marie-Francine
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511406/ https://www.ncbi.nlm.nih.gov/pubmed/26202824 http://dx.doi.org/10.1186/1471-2105-16-S10-S4

_version_	1782382328696274944
author	Provoost, Thomas; Moens, Marie-Francine
author_facet	Provoost, Thomas; Moens, Marie-Francine
author_sort	Provoost, Thomas
collection	PubMed
description	BACKGROUND: The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this field. We propose a semi-supervised framework, leveraging a large corpus of unannotated data available to us. In this framework, the annotated data is used to find plausible candidates for positive data points, which are included in the machine learning process. As this is a method principally designed for gaining recall, we further explore additional methods to improve precision on top of this. These are: weighted regularisation in the SVM framework, and filtering out unlabelled examples based on a probabilistic rule-finding method. The latter method also allows us to add candidates for negatives from unlabelled data, a method not viable in the unfiltered approach. RESULTS: We replicate one of the original participant systems, and modify it to incorporate our methods. This allows us to test the extent of our proposed methods by applying them to the GRN task data. We find a considerable improvement in recall compared to the baseline system. We also investigate the evaluation metrics and find several mechanisms explaining a bias towards precision. Furthermore, these findings uncover an intricate precision-recall interaction, depriving recall of its habitual immediacy seen in traditional machine learning set-ups. CONCLUSION: Our contributions are twofold: 1. An exploration of a novel semi-supervised pipeline. We have succeeded in employing additional knowledge through adding unannotated data points, while responding to the inherent noise of this method by imposing an automated, rule-based pre-selection step. 2. A thorough analysis of the evaluation procedure in the Gene Regulation Shared Task. We have performed an in-depth inquiry of the Slot Error Rate, responding to arguments that led to some design choices of this task. We have furthermore uncovered complexities in the interplay of precision and recall that negate the customary behaviour commonplace to the machine learning engineer.
format	Online Article Text
id	pubmed-4511406
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4511406 2015-07-28 Semi-supervised Learning for the BioNLP Gene Regulation Network Provoost, Thomas Moens, Marie-Francine BMC Bioinformatics Research BioMed Central 2015-07-13 /pmc/articles/PMC4511406/ /pubmed/26202824 http://dx.doi.org/10.1186/1471-2105-16-S10-S4 Text en Copyright © 2015 Provoost and Moens; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Provoost, Thomas Moens, Marie-Francine Semi-supervised Learning for the BioNLP Gene Regulation Network
title	Semi-supervised Learning for the BioNLP Gene Regulation Network
title_full	Semi-supervised Learning for the BioNLP Gene Regulation Network
title_fullStr	Semi-supervised Learning for the BioNLP Gene Regulation Network
title_full_unstemmed	Semi-supervised Learning for the BioNLP Gene Regulation Network
title_short	Semi-supervised Learning for the BioNLP Gene Regulation Network
title_sort	semi-supervised learning for the bionlp gene regulation network
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511406/ https://www.ncbi.nlm.nih.gov/pubmed/26202824 http://dx.doi.org/10.1186/1471-2105-16-S10-S4
work_keys_str_mv	AT provoostthomas semisupervisedlearningforthebionlpgeneregulationnetwork AT moensmariefrancine semisupervisedlearningforthebionlpgeneregulationnetwork
