
Literature consistency of bioinformatics sequence databases is effective for assessing record quality

Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean...


Bibliographic Details
Main Authors:	Bouadjenek, Mohamed Reda; Verspoor, Karin; Zobel, Justin
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2017
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467556/ https://www.ncbi.nlm.nih.gov/pubmed/28365737 http://dx.doi.org/10.1093/database/bax021

author	Bouadjenek, Mohamed Reda; Verspoor, Karin; Zobel, Justin
author_sort	Bouadjenek, Mohamed Reda
collection	PubMed
description	Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of records that are inconsistent with the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature and then applying query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that assessing literature consistency is a meaningful strategy for identifying suspicious records.
Database URL: https://github.com/rbouadjenek/DQBioinformatics
format	Online Article Text
id	pubmed-5467556
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-5467556 2017-06-19 Literature consistency of bioinformatics sequence databases is effective for assessing record quality Bouadjenek, Mohamed Reda; Verspoor, Karin; Zobel, Justin Database (Oxford) Original Article Database URL: https://github.com/rbouadjenek/DQBioinformatics Oxford University Press 2017-03-18 /pmc/articles/PMC5467556/ /pubmed/28365737 http://dx.doi.org/10.1093/database/bax021 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Literature consistency of bioinformatics sequence databases is effective for assessing record quality
topic	Original Article
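
The abstract describes representing each record as a vector of quality indicators and projecting those vectors to two dimensions with principal component analysis. The sketch below illustrates that idea only and is not the authors' code (which is at the Database URL above). It assumes four made-up indicator values per record instead of the 24 described, invented labels ("consistent", "flagged"), and a plain power-iteration PCA standing in for whatever PCA implementation the paper used.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500):
    """Dominant eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    d = len(matrix)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        v, lam = top_eigenvector(cov)
        components.append(v)
        # Deflate: remove the component just found before searching for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    points = [[sum(r[j] * c[j] for j in range(d)) for c in components]
              for r in centered]
    return points, components


# Hypothetical indicator vectors (made-up values, 4 indicators instead of 24).
records = [
    [0.90, 0.80, 0.10, 0.50],
    [0.80, 0.90, 0.20, 0.40],
    [0.85, 0.75, 0.15, 0.60],
    [0.20, 0.10, 0.90, 0.50],
    [0.10, 0.20, 0.80, 0.40],
    [0.15, 0.25, 0.85, 0.60],
]
labels = ["consistent"] * 3 + ["flagged"] * 3  # invented labels for illustration

points, components = pca_2d(records)
for label, p in zip(labels, points):
    print(f"{label}: PC1={p[0]:+.3f} PC2={p[1]:+.3f}")
```

In this toy data the two invented groups separate along the first component, which mirrors the abstract's observation that records reported as inconsistent fall roughly in the same area of the projection. The sign of each component is arbitrary, so only the grouping is meaningful.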
