
Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework

Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms, sequence gap-filled gene...


Bibliographic Details
Main Authors: Li, Miaoxin; Li, Jiang; Li, Mulin Jun; Pan, Zhicheng; Hsu, Jacob Shujui; Liu, Dajiang J.; Zhan, Xiaowei; Wang, Junwen; Song, Youqiang; Sham, Pak Chung
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5435951/
https://www.ncbi.nlm.nih.gov/pubmed/28115622
http://dx.doi.org/10.1093/nar/gkx019
author Li, Miaoxin
Li, Jiang
Li, Mulin Jun
Pan, Zhicheng
Hsu, Jacob Shujui
Liu, Dajiang J.
Zhan, Xiaowei
Wang, Junwen
Song, Youqiang
Sham, Pak Chung
author_sort Li, Miaoxin
collection PubMed
description Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms, sequence gap-filled gene feature annotation, bit-block encoded genotypes and sectional fast access to text lines to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, for integrating downstream analysis functions for whole genome sequencing data. KGGSeq has been equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In the tests with whole genome sequencing data from 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access genotypes of ∼60 million variants of 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles and achieved a speed of over 1000 times faster to calculate genotypic correlation.
format Online
Article
Text
id pubmed-5435951
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-54359512017-05-22 Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework Li, Miaoxin Li, Jiang Li, Mulin Jun Pan, Zhicheng Hsu, Jacob Shujui Liu, Dajiang J. Zhan, Xiaowei Wang, Junwen Song, Youqiang Sham, Pak Chung Nucleic Acids Res Methods Online Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms, sequence gap-filled gene feature annotation, bit-block encoded genotypes and sectional fast access to text lines to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, for integrating downstream analysis functions for whole genome sequencing data. KGGSeq has been equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In the tests with whole genome sequencing data from 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access genotypes of ∼60 million variants of 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles and achieved a speed of over 1000 times faster to calculate genotypic correlation. Oxford University Press 2017-05-19 2017-01-23 /pmc/articles/PMC5435951/ /pubmed/28115622 http://dx.doi.org/10.1093/nar/gkx019 Text en © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. 
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Methods Online
Li, Miaoxin
Li, Jiang
Li, Mulin Jun
Pan, Zhicheng
Hsu, Jacob Shujui
Liu, Dajiang J.
Zhan, Xiaowei
Wang, Junwen
Song, Youqiang
Sham, Pak Chung
Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework
title Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework
topic Methods Online
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5435951/
https://www.ncbi.nlm.nih.gov/pubmed/28115622
http://dx.doi.org/10.1093/nar/gkx019