
Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by differe...

Bibliographic Details
Main Authors: Chen, Qingyu; Zobel, Justin; Verspoor, Karin
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225397/
https://www.ncbi.nlm.nih.gov/pubmed/28077566
http://dx.doi.org/10.1093/database/baw163
author Chen, Qingyu
Zobel, Justin
Verspoor, Karin
collection PubMed
description GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC (a dataset of 67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases) in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.
Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/index.php/s/Xef2fvsebBEAv9w
format Online
Article
Text
id pubmed-5225397
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5225397 2017-01-18 Database (Oxford) Original Article Oxford University Press 2017-01-10 /pmc/articles/PMC5225397/ /pubmed/28077566 http://dx.doi.org/10.1093/database/baw163 Text en © The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225397/
https://www.ncbi.nlm.nih.gov/pubmed/28077566
http://dx.doi.org/10.1093/database/baw163