
Automated assessment of biological database assertions using the scientific literature

BACKGROUND: The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore...


Bibliographic Details
Main Authors:	Bouadjenek, Mohamed Reda; Zobel, Justin; Verspoor, Karin
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6489365/ https://www.ncbi.nlm.nih.gov/pubmed/31035936 http://dx.doi.org/10.1186/s12859-019-2801-x

_version_	1783414811903655936
author	Bouadjenek, Mohamed Reda; Zobel, Justin; Verspoor, Karin
collection	PubMed
description	BACKGROUND: The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. RESULTS: Our experiments on assessing gene–disease relations and protein–protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, BARC substantially outperforms the best baselines, improving F-measure by 3.5% on gene–disease relations and by 13% on protein–protein interactions. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents. CONCLUSIONS: BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.
format	Online Article Text
id	pubmed-6489365
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6489365 2019-06-04 Automated assessment of biological database assertions using the scientific literature Bouadjenek, Mohamed Reda; Zobel, Justin; Verspoor, Karin BMC Bioinformatics Research Article BioMed Central 2019-04-29 /pmc/articles/PMC6489365/ /pubmed/31035936 http://dx.doi.org/10.1186/s12859-019-2801-x Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Automated assessment of biological database assertions using the scientific literature
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6489365/ https://www.ncbi.nlm.nih.gov/pubmed/31035936 http://dx.doi.org/10.1186/s12859-019-2801-x
