De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline

Bibliographic Details
Main Authors: Lin, You-Yu; Hsieh, Chia-Hung; Chen, Jiun-Hong; Lu, Xuemei; Kao, Jia-Horng; Chen, Pei-Jer; Chen, Ding-Shinn; Wang, Hurng-Yi
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5406902/
https://www.ncbi.nlm.nih.gov/pubmed/28446139
http://dx.doi.org/10.1186/s12859-017-1630-z
author Lin, You-Yu
Hsieh, Chia-Hung
Chen, Jiun-Hong
Lu, Xuemei
Kao, Jia-Horng
Chen, Pei-Jer
Chen, Ding-Shinn
Wang, Hurng-Yi
collection PubMed
description BACKGROUND: The accuracy of metagenomic assembly is usually compromised by high levels of polymorphism, because divergent reads from the same genomic region are recognized as different loci when sequenced and assembled together. A viral quasispecies is a group of abundant and diversified genetically related viruses found in a single carrier. Current mainstream assembly methods, such as Velvet and SOAPdenovo, were not originally intended for the assembly of such metagenomic data, and new methods are therefore needed to provide accurate and informative assembly results for metagenomic data. RESULTS: In this study, we present a hybrid method for assembling highly polymorphic data that combines the partial de novo-reference assembly (PDR) strategy and the BLAST-based assembly pipeline (BBAP). The PDR strategy generates in situ reference sequences through de novo assembly of a randomly extracted partial data set; these sequences are subsequently used as the reference for assembling the full data set. BBAP employs a greedy algorithm to assemble polymorphic reads. We used 12 hepatitis B virus (HBV) quasispecies NGS data sets from a previous study to assess and compare the performance of both PDR and BBAP. Analyses suggest that the high polymorphism of a full metagenomic data set leads to fragmented de novo assembly results, whereas the biased or limited representation of external reference sequences incorporated fewer reads into the assembly, with lower assembly accuracy and variation sensitivity. In comparison, the PDR-generated in situ reference sequence incorporated more reads into the final PDR assembly of the full metagenomic data set, along with greater accuracy and higher variation sensitivity. BBAP assembly results also suggest higher assembly efficiency and accuracy compared to other assembly methods. Additionally, BBAP assembly recovered HBV structural variants that were not observed among the assembly results of other methods. Together, PDR/BBAP assembly results were significantly better than those of the other methods compared. CONCLUSIONS: Both PDR and BBAP independently increased the assembly efficiency and accuracy of highly polymorphic data, and assembly performance improved further when the two were used together. BBAP also provides nucleotide frequency information. Together, PDR and BBAP provide powerful tools for metagenomic data studies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1630-z) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5406902
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5406902 2017-04-27 De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline Lin, You-Yu Hsieh, Chia-Hung Chen, Jiun-Hong Lu, Xuemei Kao, Jia-Horng Chen, Pei-Jer Chen, Ding-Shinn Wang, Hurng-Yi BMC Bioinformatics Methodology Article BioMed Central 2017-04-26 /pmc/articles/PMC5406902/ /pubmed/28446139 http://dx.doi.org/10.1186/s12859-017-1630-z Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5406902/
https://www.ncbi.nlm.nih.gov/pubmed/28446139
http://dx.doi.org/10.1186/s12859-017-1630-z