NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements

A wealth of viral data sits untapped in publicly available metagenomic data sets when it might be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprised of over 40 par...

Bibliographic Details
Main authors: Connor, Ryan, Brister, Rodney, Buchmann, Jan P., Deboutte, Ward, Edwards, Rob, Martí-Carreras, Joan, Tisza, Mike, Zalunin, Vadim, Andrade-Martínez, Juan, Cantu, Adrian, D’Amour, Michael, Efremov, Alexandre, Fleischmann, Lydia, Forero-Junco, Laura, Garmaeva, Sanzhima, Giluso, Melissa, Glickman, Cody, Henderson, Margaret, Kellman, Benjamin, Kristensen, David, Leubsdorf, Carl, Levi, Kyle, Levi, Shane, Pakala, Suman, Peddu, Vikas, Ponsero, Alise, Ribeiro, Eldred, Roy, Farrah, Rutter, Lindsay, Saha, Surya, Shakya, Migun, Shean, Ryan, Miller, Matthew, Tully, Benjamin, Turkington, Christopher, Youens-Clark, Ken, Vanmechelen, Bert, Busby, Ben
Format: Online Article Text
Language: English
Published: MDPI 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6771016/
https://www.ncbi.nlm.nih.gov/pubmed/31527408
http://dx.doi.org/10.3390/genes10090714
author Connor, Ryan
Brister, Rodney
Buchmann, Jan P.
Deboutte, Ward
Edwards, Rob
Martí-Carreras, Joan
Tisza, Mike
Zalunin, Vadim
Andrade-Martínez, Juan
Cantu, Adrian
D’Amour, Michael
Efremov, Alexandre
Fleischmann, Lydia
Forero-Junco, Laura
Garmaeva, Sanzhima
Giluso, Melissa
Glickman, Cody
Henderson, Margaret
Kellman, Benjamin
Kristensen, David
Leubsdorf, Carl
Levi, Kyle
Levi, Shane
Pakala, Suman
Peddu, Vikas
Ponsero, Alise
Ribeiro, Eldred
Roy, Farrah
Rutter, Lindsay
Saha, Surya
Shakya, Migun
Shean, Ryan
Miller, Matthew
Tully, Benjamin
Turkington, Christopher
Youens-Clark, Ken
Vanmechelen, Bert
Busby, Ben
author_sort Connor, Ryan
collection PubMed
description A wealth of viral data sits untapped in publicly available metagenomic data sets, although it could be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams, comprising over 40 participants from six countries, assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset consisting of 2953 SRA data sets (approximately 55 million contigs) was selected and further filtered for a minimum contig length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered, and assigned metadata. Of the 4.2 Mio contigs, 360,000 were labeled with domains, and an additional subset of 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds. Mainly: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure would facilitate its use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event. Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
format Online
Article
Text
id pubmed-6771016
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-6771016 2019-10-30 NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements Genes (Basel) Article MDPI 2019-09-16 /pmc/articles/PMC6771016/ /pubmed/31527408 http://dx.doi.org/10.3390/genes10090714 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6771016/
https://www.ncbi.nlm.nih.gov/pubmed/31527408
http://dx.doi.org/10.3390/genes10090714