
SeqHBase: a big data toolset for family based sequencing data analysis

BACKGROUND: Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our object...


Bibliographic Details
Main Authors: He, Min, Person, Thomas N, Hebbring, Scott J, Heinzen, Ethan, Ye, Zhan, Schrodi, Steven J, McPherson, Elizabeth W, Lin, Simon M, Peissig, Peggy L, Brilliant, Murray H, O'Rawe, Jason, Robison, Reid J, Lyon, Gholson J, Wang, Kai
Format: Online Article Text
Language: English
Published: BMJ Publishing Group 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4382803/
https://www.ncbi.nlm.nih.gov/pubmed/25587064
http://dx.doi.org/10.1136/jmedgenet-2014-102907
author He, Min
Person, Thomas N
Hebbring, Scott J
Heinzen, Ethan
Ye, Zhan
Schrodi, Steven J
McPherson, Elizabeth W
Lin, Simon M
Peissig, Peggy L
Brilliant, Murray H
O'Rawe, Jason
Robison, Reid J
Lyon, Gholson J
Wang, Kai
collection PubMed
description BACKGROUND: Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis.
METHODS: Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation).
RESULTS: We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with number of data nodes. With 20 data nodes, SeqHBase took about 5 secs to analyse WES familial data and approximately 1 min to analyse WGS familial data.
CONCLUSIONS: These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders.
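The three inheritance patterns named in the abstract (de novo, inherited homozygous, compound heterozygous) can be illustrated with a small in-memory sketch for a single parent-child trio. This is not SeqHBase's implementation, which runs on Hadoop and HBase; the record layout (`gene`, `pos`, `child`, `father`, `mother`), the function names and the sample data are all hypothetical, and genotypes are assumed to be pre-coded as alternate-allele counts.

```python
from collections import defaultdict

# Assumption: genotypes are alternate-allele counts
# (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate).
# Each variant is a dict with hypothetical keys: gene, pos, child, father, mother.


def is_de_novo(v):
    """Child carries an alternate allele that neither parent carries."""
    return v["child"] >= 1 and v["father"] == 0 and v["mother"] == 0


def is_inherited_homozygous(v):
    """Child is homozygous alternate and both parents are heterozygous carriers."""
    return v["child"] == 2 and v["father"] == 1 and v["mother"] == 1


def compound_het_genes(variants):
    """Return genes where the child has at least one heterozygous variant
    inherited only from the father and at least one only from the mother."""
    paternal = defaultdict(list)
    maternal = defaultdict(list)
    for v in variants:
        if v["child"] != 1:
            continue
        if v["father"] == 1 and v["mother"] == 0:
            paternal[v["gene"]].append(v["pos"])
        elif v["father"] == 0 and v["mother"] == 1:
            maternal[v["gene"]].append(v["pos"])
    return {g: (paternal[g], maternal[g]) for g in paternal if g in maternal}


# Made-up illustrative data, not taken from the article.
variants = [
    {"gene": "GENE_A", "pos": 100, "child": 1, "father": 0, "mother": 0},
    {"gene": "GENE_B", "pos": 200, "child": 2, "father": 1, "mother": 1},
    {"gene": "GENE_C", "pos": 300, "child": 1, "father": 1, "mother": 0},
    {"gene": "GENE_C", "pos": 350, "child": 1, "father": 0, "mother": 1},
    {"gene": "GENE_D", "pos": 400, "child": 1, "father": 1, "mother": 0},
]

de_novo = [v["pos"] for v in variants if is_de_novo(v)]
homozygous = [v["pos"] for v in variants if is_inherited_homozygous(v)]
compound = compound_het_genes(variants)
print(de_novo, homozygous, compound)
```

A genotype of 0 in a parent is only trustworthy if that site was actually sequenced in the parent, so a practical filter would likely also require adequate read depth at the site in every family member before accepting a de novo call.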
format Online
Article
Text
id pubmed-4382803
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BMJ Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-4382803 2015-04-02 SeqHBase: a big data toolset for family based sequencing data analysis He, Min Person, Thomas N Hebbring, Scott J Heinzen, Ethan Ye, Zhan Schrodi, Steven J McPherson, Elizabeth W Lin, Simon M Peissig, Peggy L Brilliant, Murray H O'Rawe, Jason Robison, Reid J Lyon, Gholson J Wang, Kai J Med Genet Methods BACKGROUND: Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis. METHODS: Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation). RESULTS: We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with number of data nodes. With 20 data nodes, SeqHBase took about 5 secs to analyse WES familial data and approximately 1 min to analyse WGS familial data. CONCLUSIONS: These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders.
BMJ Publishing Group 2015-04 2015-01-13 /pmc/articles/PMC4382803/ /pubmed/25587064 http://dx.doi.org/10.1136/jmedgenet-2014-102907 Text en Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
title SeqHBase: a big data toolset for family based sequencing data analysis
topic Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4382803/
https://www.ncbi.nlm.nih.gov/pubmed/25587064
http://dx.doi.org/10.1136/jmedgenet-2014-102907