Semi-supervised Learning for the BioNLP Gene Regulation Network

BACKGROUND: The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this f...

Bibliographic Details
Main Authors: Provoost, Thomas, Moens, Marie-Francine
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511406/
https://www.ncbi.nlm.nih.gov/pubmed/26202824
http://dx.doi.org/10.1186/1471-2105-16-S10-S4
_version_ 1782382328696274944
author Provoost, Thomas
Moens, Marie-Francine
author_facet Provoost, Thomas
Moens, Marie-Francine
author_sort Provoost, Thomas
collection PubMed
description BACKGROUND: The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this field. We propose a semi-supervised framework, leveraging a large corpus of unannotated data available to us. In this framework, the annotated data is used to find plausible candidates for positive data points, which are included in the machine learning process. As this is a method principally designed for gaining recall, we further explore additional methods to improve precision on top of this. These are: weighted regularisation in the SVM framework, and filtering out unlabelled examples based on a probabilistic rule-finding method. The latter method also allows us to add candidates for negatives from unlabelled data, a method not viable in the unfiltered approach. RESULTS: We replicate one of the original participant systems, and modify it to incorporate our methods. This allows us to test the extent of our proposed methods by applying them to the GRN task data. We find a considerable improvement in recall compared to the baseline system. We also investigate the evaluation metrics and find several mechanisms explaining a bias towards precision. Furthermore, these findings uncover an intricate precision-recall interaction, depriving recall of its habitual immediacy seen in traditional machine learning set-ups. CONCLUSION: Our contributions are twofold: 1. An exploration of a novel semi-supervised pipeline. We have succeeded in employing additional knowledge through adding unannotated data points, while responding to the inherent noise of this method by imposing an automated, rule-based pre-selection step. 2. A thorough analysis of the evaluation procedure in the Gene Regulation Shared Task. 
We have performed an in-depth inquiry into the Slot Error Rate, responding to arguments that led to some design choices of this task. We have furthermore uncovered complexities in the interplay of precision and recall that negate the customary behaviour commonplace to the machine learning engineer.
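The description above refers to the Slot Error Rate and to a bias towards precision in the evaluation. The sketch below is one way to illustrate that interaction. It assumes the commonly used definition SER = (S + D + I) / N, where N is the number of reference slots, and it is not taken from the article or from the task's official scorer. The class and function names are hypothetical and the counts are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotCounts:
    """Hypothetical container for the outcome counts of one evaluation run."""

    matches: int        # predictions that agree with a reference slot
    substitutions: int  # predictions aligned to a reference slot but wrong
    deletions: int      # reference slots the system failed to predict
    insertions: int     # predictions with no reference counterpart

    @property
    def reference_total(self) -> int:
        # Every reference slot is matched, substituted, or deleted.
        return self.matches + self.substitutions + self.deletions

    @property
    def predicted_total(self) -> int:
        # Every prediction is a match, a substitution, or an insertion.
        return self.matches + self.substitutions + self.insertions


def slot_error_rate(c: SlotCounts) -> float:
    """SER = (S + D + I) / N under the assumed definition; lower is better."""
    if c.reference_total == 0:
        raise ValueError("SER is undefined without reference slots")
    return (c.substitutions + c.deletions + c.insertions) / c.reference_total


def precision(c: SlotCounts) -> float:
    """Share of predictions that are correct (0.0 if nothing was predicted)."""
    return c.matches / c.predicted_total if c.predicted_total else 0.0


def recall(c: SlotCounts) -> float:
    """Share of reference slots that were recovered."""
    return c.matches / c.reference_total if c.reference_total else 0.0


# Invented counts against 100 reference slots, for illustration only.
cautious = SlotCounts(matches=30, substitutions=0, deletions=70, insertions=5)
eager = SlotCounts(matches=50, substitutions=0, deletions=50, insertions=40)

for name, counts in (("cautious", cautious), ("eager", eager)):
    print(
        f"{name}: SER={slot_error_rate(counts):.2f} "
        f"P={precision(counts):.2f} R={recall(counts):.2f}"
    )
```

With these invented counts the eager system raises recall from 0.30 to 0.50 yet its SER worsens from 0.75 to 0.90, because every insertion adds to the numerator while the denominator stays fixed at the number of reference slots. Under this assumed definition, a gain in recall only lowers the SER if it costs fewer insertions than the matches it adds.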
format Online
Article
Text
id pubmed-4511406
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4511406 2015-07-28 Semi-supervised Learning for the BioNLP Gene Regulation Network Provoost, Thomas Moens, Marie-Francine BMC Bioinformatics Research BioMed Central 2015-07-13 /pmc/articles/PMC4511406/ /pubmed/26202824 http://dx.doi.org/10.1186/1471-2105-16-S10-S4 Text en Copyright © 2015 Provoost and Moens; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Provoost, Thomas
Moens, Marie-Francine
Semi-supervised Learning for the BioNLP Gene Regulation Network
title Semi-supervised Learning for the BioNLP Gene Regulation Network
title_full Semi-supervised Learning for the BioNLP Gene Regulation Network
title_fullStr Semi-supervised Learning for the BioNLP Gene Regulation Network
title_full_unstemmed Semi-supervised Learning for the BioNLP Gene Regulation Network
title_short Semi-supervised Learning for the BioNLP Gene Regulation Network
title_sort semi-supervised learning for the bionlp gene regulation network
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511406/
https://www.ncbi.nlm.nih.gov/pubmed/26202824
http://dx.doi.org/10.1186/1471-2105-16-S10-S4
work_keys_str_mv AT provoostthomas semisupervisedlearningforthebionlpgeneregulationnetwork
AT moensmariefrancine semisupervisedlearningforthebionlpgeneregulationnetwork