A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing

Biomedical relation extraction (RE) datasets are vital in the construction of knowledge bases and to potentiate the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the eme...

Bibliographic Details
Main authors: Sousa, Diana, Lamurias, Andre, Couto, Francisco M
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7706181/
https://www.ncbi.nlm.nih.gov/pubmed/33258966
http://dx.doi.org/10.1093/database/baaa104
author Sousa, Diana
Lamurias, Andre
Couto, Francisco M
collection PubMed
description Biomedical relation extraction (RE) datasets are vital in the construction of knowledge bases and to potentiate the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little control over who the workers on crowdsourcing platforms are, or over how and in what context they engage. Hence, combining distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype–gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. For Task 2, we also added an extra on-site rater and a domain expert to further assess the quality of the crowdsourcing validation. Here, we describe a detailed pipeline for RE crowdsourcing validation, create a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with that of the original PGR dataset, as well as combinations of the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset is available at https://github.com/lasigeBioTM/PGR-crowd.
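The two-task split described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `split_tasks` and `majority_label` are hypothetical, the random shuffle is an assumed way of dividing the data, and strict-majority voting is an assumed way of combining the seven Task 2 judgments (the abstract does not say how they were aggregated).

```python
import random
from collections import Counter


def split_tasks(items, task1_fraction=0.7, seed=0):
    """Split items into Task 1 (one worker each) and Task 2 (seven workers each).

    The 70/30 proportion comes from the abstract; shuffling with a fixed
    seed is an assumption made here for reproducibility.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * task1_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_label(votes):
    """Return the label chosen by more than half of the workers, else None.

    Strict-majority voting is an assumed aggregation rule, not one stated
    in the source.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


if __name__ == "__main__":
    task1, task2 = split_tasks(range(100))
    print(len(task1), len(task2))
    # Seven hypothetical worker judgments for one Task 2 relation.
    print(majority_label(["true", "true", "false", "true", "false", "true", "false"]))
```

With seven workers and two possible labels a strict majority always exists; returning `None` only matters if more than two labels are allowed or some judgments are missing.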
format Online
Article
Text
id pubmed-7706181
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7706181 2020-12-07 Database (Oxford) Original Article Oxford University Press 2020-12-01 /pmc/articles/PMC7706181/ /pubmed/33258966 http://dx.doi.org/10.1093/database/baaa104 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7706181/
https://www.ncbi.nlm.nih.gov/pubmed/33258966
http://dx.doi.org/10.1093/database/baaa104