
20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration

Biodiversity information is made available through numerous databases that each have their own data models, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and inte...


Bibliographic Details
Main Authors:	Thessen, Anne E., Poelen, Jorrit H., Collins, Matthew, Hammock, Jen
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2018
Subjects:	Bioinformatics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924439/ https://www.ncbi.nlm.nih.gov/pubmed/33816817 http://dx.doi.org/10.7717/peerj-cs.164

_version_	1783659089875697664
author	Thessen, Anne E. Poelen, Jorrit H. Collins, Matthew Hammock, Jen
author_facet	Thessen, Anne E. Poelen, Jorrit H. Collins, Matthew Hammock, Jen
author_sort	Thessen, Anne E.
collection	PubMed
description	Biodiversity information is made available through numerous databases that each have their own data models, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor intensive, error prone, and lengthy process relies on accessible versions of nomenclatural authorities and fuzzy-matching algorithms. To approach the challenge of linking diverse data, more than technology is needed. New social collaborations like the Global Unified Open Data Architecture (GUODA) that combines skills from diverse groups of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from EOL, and independent developers and researchers are what is needed to make concrete progress on finding relationships between biodiversity datasets. This paper will discuss a technical solution developed by the GUODA collaboration for faster linking across databases with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node, high performance computing cluster made up of about 192 threads with 12 TB of storage and 288 GB memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10–11 min. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method resulted in adding 119,957 Wikidata links in GloBI, an increase of 13.7% of all outgoing name links in GloBI. Wikidata and GloBI were compared to Open Tree of Life Reference Taxonomy to examine consistency and coverage. 
The process of parsing Wikidata, Open Tree of Life Reference Taxonomy and GloBI archives and calculating consistency metrics was done in minutes on the GUODA platform. As a model collaboration, GUODA has the potential to revolutionize biodiversity science by bringing diverse technically minded people together with high performance computing resources that are accessible from a laptop or desktop. However, participating in such a collaboration still requires basic programming skills.
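The abstract above describes linking Wikidata and GloBI through identifiers external to both systems rather than through name strings. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes each record has already been reduced to a set of external identifier strings, and it links two records when they share at least one. The function name, the record keys, and every identifier value are hypothetical placeholders.

```python
from collections import defaultdict


def link_by_shared_ids(wikidata, globi):
    """Map each GloBI key to the Wikidata keys sharing >= 1 external identifier.

    Both arguments are dicts of {record_key: set of external identifier strings}.
    This is an illustrative simplification, not the method used in the paper.
    """
    # Inverted index: external identifier -> Wikidata records that carry it.
    index = defaultdict(set)
    for wd_key, ext_ids in wikidata.items():
        for ext in ext_ids:
            index[ext].add(wd_key)

    links = {}
    for globi_key, ext_ids in globi.items():
        matches = set()
        for ext in ext_ids:
            matches |= index.get(ext, set())
        if matches:
            links[globi_key] = sorted(matches)
    return links


# Hypothetical placeholder records; none of these identifiers are real.
wikidata = {
    "Q_A": {"NCBI:1001", "ITIS:2001"},
    "Q_B": {"NCBI:1002"},
}
globi = {
    "taxon-1": {"ITIS:2001", "EOL:3001"},  # shares ITIS:2001 with Q_A
    "taxon-2": {"EOL:3002"},               # no shared identifier
}

links = link_by_shared_ids(wikidata, globi)
print(links)
```

A record with no shared identifier, such as the placeholder "taxon-2", is simply left unlinked; a fuller treatment would follow chains of identifiers across several hops, which this sketch does not attempt.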
format	Online Article Text
id	pubmed-7924439
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-7924439 2021-04-02 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration Thessen, Anne E. Poelen, Jorrit H. Collins, Matthew Hammock, Jen PeerJ Comput Sci Bioinformatics PeerJ Inc. 2018-09-17 /pmc/articles/PMC7924439/ /pubmed/33816817 http://dx.doi.org/10.7717/peerj-cs.164 Text en ©2018 Thessen et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle	Bioinformatics Thessen, Anne E. Poelen, Jorrit H. Collins, Matthew Hammock, Jen 20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration
title	20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924439/ https://www.ncbi.nlm.nih.gov/pubmed/33816817 http://dx.doi.org/10.7717/peerj-cs.164

