PGen: large-scale genomic variations analysis workflow and browser in SoyKB

BACKGROUND: With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits....

Bibliographic Details
Main authors: Liu, Yang, Khan, Saad M., Wang, Juexin, Rynge, Mats, Zhang, Yuanxun, Zeng, Shuai, Chen, Shiyuan, Maldonado dos Santos, Joao V., Valliyodan, Babu, Calyam, Prasad P., Merchant, Nirav, Nguyen, Henry T., Xu, Dong, Joshi, Trupti
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5074001/
https://www.ncbi.nlm.nih.gov/pubmed/27766951
http://dx.doi.org/10.1186/s12859-016-1227-y
_version_ 1782461675954241536
author Liu, Yang
Khan, Saad M.
Wang, Juexin
Rynge, Mats
Zhang, Yuanxun
Zeng, Shuai
Chen, Shiyuan
Maldonado dos Santos, Joao V.
Valliyodan, Babu
Calyam, Prasad P.
Merchant, Nirav
Nguyen, Henry T.
Xu, Dong
Joshi, Trupti
author_facet Liu, Yang
Khan, Saad M.
Wang, Juexin
Rynge, Mats
Zhang, Yuanxun
Zeng, Shuai
Chen, Shiyuan
Maldonado dos Santos, Joao V.
Valliyodan, Babu
Calyam, Prasad P.
Merchant, Nirav
Nguyen, Henry T.
Xu, Dong
Joshi, Trupti
author_sort Liu, Yang
collection PubMed
description BACKGROUND: With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed “PGen”, an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way. RESULTS: We have developed both a Linux version in GitHub (https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB), (http://soykb.org/Pegasus/index.php). Using PGen, we identified 10,218,140 single-nucleotide polymorphisms (SNPs) and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions were identified from this analysis. SNPs identified using PGen from additional soybean resequencing projects adding to 500+ soybean germplasm lines in total have been integrated. These SNPs are being utilized for trait improvement using genotype to phenotype prediction approaches developed in-house. In order to browse and access NGS data easily, we have also developed an NGS resequencing data browser (http://soykb.org/NGS_Resequence/NGS_index.php) within SoyKB to provide easy access to SNP and downstream analysis results for soybean researchers. 
CONCLUSION: PGen workflow has been optimized for the most efficient analysis of soybean data using thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. PGen workflow can also be easily customized for analysis of data in other species. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1227-y) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5074001
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-50740012016-10-27 PGen: large-scale genomic variations analysis workflow and browser in SoyKB Liu, Yang Khan, Saad M. Wang, Juexin Rynge, Mats Zhang, Yuanxun Zeng, Shuai Chen, Shiyuan Maldonado dos Santos, Joao V. Valliyodan, Babu Calyam, Prasad P. Merchant, Nirav Nguyen, Henry T. Xu, Dong Joshi, Trupti BMC Bioinformatics Proceedings BACKGROUND: With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed “PGen”, an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way. RESULTS: We have developed both a Linux version in GitHub (https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB), (http://soykb.org/Pegasus/index.php). Using PGen, we identified 10,218,140 single-nucleotide polymorphisms (SNPs) and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions were identified from this analysis. SNPs identified using PGen from additional soybean resequencing projects adding to 500+ soybean germplasm lines in total have been integrated. 
These SNPs are being utilized for trait improvement using genotype to phenotype prediction approaches developed in-house. In order to browse and access NGS data easily, we have also developed an NGS resequencing data browser (http://soykb.org/NGS_Resequence/NGS_index.php) within SoyKB to provide easy access to SNP and downstream analysis results for soybean researchers. CONCLUSION: PGen workflow has been optimized for the most efficient analysis of soybean data using thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. PGen workflow can also be easily customized for analysis of data in other species. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1227-y) contains supplementary material, which is available to authorized users. BioMed Central 2016-10-06 /pmc/articles/PMC5074001/ /pubmed/27766951 http://dx.doi.org/10.1186/s12859-016-1227-y Text en © The Author(s). 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Proceedings
Liu, Yang
Khan, Saad M.
Wang, Juexin
Rynge, Mats
Zhang, Yuanxun
Zeng, Shuai
Chen, Shiyuan
Maldonado dos Santos, Joao V.
Valliyodan, Babu
Calyam, Prasad P.
Merchant, Nirav
Nguyen, Henry T.
Xu, Dong
Joshi, Trupti
PGen: large-scale genomic variations analysis workflow and browser in SoyKB
title PGen: large-scale genomic variations analysis workflow and browser in SoyKB
title_full PGen: large-scale genomic variations analysis workflow and browser in SoyKB
title_fullStr PGen: large-scale genomic variations analysis workflow and browser in SoyKB
title_full_unstemmed PGen: large-scale genomic variations analysis workflow and browser in SoyKB
title_short PGen: large-scale genomic variations analysis workflow and browser in SoyKB
title_sort pgen: large-scale genomic variations analysis workflow and browser in soykb
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5074001/
https://www.ncbi.nlm.nih.gov/pubmed/27766951
http://dx.doi.org/10.1186/s12859-016-1227-y
work_keys_str_mv AT liuyang pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT khansaadm pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT wangjuexin pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT ryngemats pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT zhangyuanxun pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT zengshuai pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT chenshiyuan pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT maldonadodossantosjoaov pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT valliyodanbabu pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT calyamprasadp pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT merchantnirav pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT nguyenhenryt pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT xudong pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb
AT joshitrupti pgenlargescalegenomicvariationsanalysisworkflowandbrowserinsoykb