Semi-supervised Learning for the BioNLP Gene Regulation Network
BACKGROUND: The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this f...
Main authors: | Provoost, Thomas; Moens, Marie-Francine |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2015 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511406/ https://www.ncbi.nlm.nih.gov/pubmed/26202824 http://dx.doi.org/10.1186/1471-2105-16-S10-S4 |
_version_ | 1782382328696274944 |
---|---|
author | Provoost, Thomas Moens, Marie-Francine |
collection | PubMed |
description | BACKGROUND: The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this field. We propose a semi-supervised framework, leveraging a large corpus of unannotated data available to us. In this framework, the annotated data is used to find plausible candidates for positive data points, which are included in the machine learning process. As this is a method principally designed for gaining recall, we further explore additional methods to improve precision on top of this. These are: weighted regularisation in the SVM framework, and filtering out unlabelled examples based on a probabilistic rule-finding method. The latter method also allows us to add candidates for negatives from unlabelled data, a method not viable in the unfiltered approach. RESULTS: We replicate one of the original participant systems, and modify it to incorporate our methods. This allows us to test the extent of our proposed methods by applying them to the GRN task data. We find a considerable improvement in recall compared to the baseline system. We also investigate the evaluation metrics and find several mechanisms explaining a bias towards precision. Furthermore, these findings uncover an intricate precision-recall interaction, depriving recall of its habitual immediacy seen in traditional machine learning set-ups. CONCLUSION: Our contributions are twofold: 1. An exploration of a novel semi-supervised pipeline. We have succeeded in employing additional knowledge through adding unannotated data points, while responding to the inherent noise of this method by imposing an automated, rule-based pre-selection step. 2. A thorough analysis of the evaluation procedure in the Gene Regulation Shared Task. 
We have performed an in-depth inquiry of the Slot Error Rate, responding to arguments that led to some design choices of this task. We have furthermore uncovered complexities in the interplay of precision and recall that negate the customary behaviour commonplace to the machine learning engineer. |
format | Online Article Text |
id | pubmed-4511406 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-45114062015-07-28 Semi-supervised Learning for the BioNLP Gene Regulation Network Provoost, Thomas Moens, Marie-Francine BMC Bioinformatics Research BACKGROUND: The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this field. We propose a semi-supervised framework, leveraging a large corpus of unannotated data available to us. In this framework, the annotated data is used to find plausible candidates for positive data points, which are included in the machine learning process. As this is a method principally designed for gaining recall, we further explore additional methods to improve precision on top of this. These are: weighted regularisation in the SVM framework, and filtering out unlabelled examples based on a probabilistic rule-finding method. The latter method also allows us to add candidates for negatives from unlabelled data, a method not viable in the unfiltered approach. RESULTS: We replicate one of the original participant systems, and modify it to incorporate our methods. This allows us to test the extent of our proposed methods by applying them to the GRN task data. We find a considerable improvement in recall compared to the baseline system. We also investigate the evaluation metrics and find several mechanisms explaining a bias towards precision. Furthermore, these findings uncover an intricate precision-recall interaction, depriving recall of its habitual immediacy seen in traditional machine learning set-ups. CONCLUSION: Our contributions are twofold: 1. An exploration of a novel semi-supervised pipeline. We have succeeded in employing additional knowledge through adding unannotated data points, while responding to the inherent noise of this method by imposing an automated, rule-based pre-selection step. 
2. A thorough analysis of the evaluation procedure in the Gene Regulation Shared Task. We have performed an in depth inquiry of the Slot Error Rate, responding to arguments that lead to some design choices of this task. We have furthermore uncovered complexities in the interplay of precision and recall that negate the customary behaviour commonplace to the machine learning engineer. BioMed Central 2015-07-13 /pmc/articles/PMC4511406/ /pubmed/26202824 http://dx.doi.org/10.1186/1471-2105-16-S10-S4 Text en Copyright © 2015 Provoost and Moens; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Semi-supervised Learning for the BioNLP Gene Regulation Network |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511406/ https://www.ncbi.nlm.nih.gov/pubmed/26202824 http://dx.doi.org/10.1186/1471-2105-16-S10-S4 |
work_keys_str_mv | AT provoostthomas semisupervisedlearningforthebionlpgeneregulationnetwork AT moensmariefrancine semisupervisedlearningforthebionlpgeneregulationnetwork |
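
The abstract in this record refers to the Slot Error Rate and to a bias towards precision in the evaluation. As a reading aid, here is a minimal sketch of how such a metric behaves, assuming the commonly used definition of the Slot Error Rate as (substitutions + deletions + insertions) divided by the number of reference slots. The `SlotCounts` class, the function names, and all counts are hypothetical and are not taken from the article or from the shared task's evaluation tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotCounts:
    """Hypothetical tally of a system's predictions against a reference."""
    correct: int        # predicted slots that match the reference
    substitutions: int  # predicted slots matched to a reference slot with the wrong label
    deletions: int      # reference slots the system missed
    insertions: int     # predicted slots with no counterpart in the reference

    @property
    def reference_total(self) -> int:
        return self.correct + self.substitutions + self.deletions

    @property
    def predicted_total(self) -> int:
        return self.correct + self.substitutions + self.insertions


def slot_error_rate(c: SlotCounts) -> float:
    """SER = (S + D + I) / N, with N the number of reference slots (assumed definition)."""
    if c.reference_total == 0:
        raise ValueError("reference must contain at least one slot")
    return (c.substitutions + c.deletions + c.insertions) / c.reference_total


def precision(c: SlotCounts) -> float:
    return c.correct / c.predicted_total if c.predicted_total else 0.0


def recall(c: SlotCounts) -> float:
    return c.correct / c.reference_total if c.reference_total else 0.0


# Two made-up systems scored against the same 100 reference slots.
cautious = SlotCounts(correct=20, substitutions=5, deletions=75, insertions=5)
eager = SlotCounts(correct=40, substitutions=10, deletions=50, insertions=60)

for name, counts in (("cautious", cautious), ("eager", eager)):
    print(
        f"{name}: SER={slot_error_rate(counts):.2f} "
        f"P={precision(counts):.2f} R={recall(counts):.2f}"
    )
```

In this made-up example the second system doubles recall but ends up with a higher (worse) error rate, because under the assumed definition every spurious prediction costs as much as a missed one. That is one way a metric of this kind can favour precision.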