Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework
Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms, sequence gap-filled gene...
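Main authors: | Li, Miaoxin; Li, Jiang; Li, Mulin Jun; Pan, Zhicheng; Hsu, Jacob Shujui; Liu, Dajiang J.; Zhan, Xiaowei; Wang, Junwen; Song, Youqiang; Sham, Pak Chung |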
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2017 |
Subjects: | Methods Online |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5435951/ https://www.ncbi.nlm.nih.gov/pubmed/28115622 http://dx.doi.org/10.1093/nar/gkx019 |
_version_ | 1783237313639219200 |
---|---|
author | Li, Miaoxin Li, Jiang Li, Mulin Jun Pan, Zhicheng Hsu, Jacob Shujui Liu, Dajiang J. Zhan, Xiaowei Wang, Junwen Song, Youqiang Sham, Pak Chung |
author_facet | Li, Miaoxin Li, Jiang Li, Mulin Jun Pan, Zhicheng Hsu, Jacob Shujui Liu, Dajiang J. Zhan, Xiaowei Wang, Junwen Song, Youqiang Sham, Pak Chung |
author_sort | Li, Miaoxin |
collection | PubMed |
description | Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms, sequence gap-filled gene feature annotation, bit-block encoded genotypes and sectional fast access to text lines to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, for integrating downstream analysis functions for whole genome sequencing data. KGGSeq has been equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In the tests with whole genome sequencing data from 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access genotypes of ∼60 million variants of 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles and achieved a speed of over 1000 times faster to calculate genotypic correlation. |
format | Online Article Text |
id | pubmed-5435951 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-54359512017-05-22 Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework Li, Miaoxin Li, Jiang Li, Mulin Jun Pan, Zhicheng Hsu, Jacob Shujui Liu, Dajiang J. Zhan, Xiaowei Wang, Junwen Song, Youqiang Sham, Pak Chung Nucleic Acids Res Methods Online Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms, sequence gap-filled gene feature annotation, bit-block encoded genotypes and sectional fast access to text lines to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, for integrating downstream analysis functions for whole genome sequencing data. KGGSeq has been equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In the tests with whole genome sequencing data from 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access genotypes of ∼60 million variants of 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles and achieved a speed of over 1000 times faster to calculate genotypic correlation. Oxford University Press 2017-05-19 2017-01-23 /pmc/articles/PMC5435951/ /pubmed/28115622 http://dx.doi.org/10.1093/nar/gkx019 Text en © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. 
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Methods Online Li, Miaoxin Li, Jiang Li, Mulin Jun Pan, Zhicheng Hsu, Jacob Shujui Liu, Dajiang J. Zhan, Xiaowei Wang, Junwen Song, Youqiang Sham, Pak Chung Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework |
title | Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework |
title_full | Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework |
title_fullStr | Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework |
title_full_unstemmed | Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework |
title_short | Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework |
title_sort | robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework |
topic | Methods Online |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5435951/ https://www.ncbi.nlm.nih.gov/pubmed/28115622 http://dx.doi.org/10.1093/nar/gkx019 |
work_keys_str_mv | AT limiaoxin robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT lijiang robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT limulinjun robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT panzhicheng robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT hsujacobshujui robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT liudajiangj robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT zhanxiaowei robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT wangjunwen robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT songyouqiang robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework AT shampakchung robustandrapidalgorithmsfacilitatelargescalewholegenomesequencingdownstreamanalysisinanintegrativeframework |