Cargando…

misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads

BACKGROUND: Because of the short read length of high throughput sequencing data, assembly errors are introduced in genome assembly, which may have adverse impact to the downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequence...

Descripción completa

Detalles Bibliográficos
Autores principales: Zhu, Xiao, Leung, Henry C. M., Wang, Rongjie, Chin, Francis Y. L., Yiu, Siu Ming, Quan, Guangri, Li, Yajie, Zhang, Rui, Jiang, Qinghua, Liu, Bo, Dong, Yucui, Zhou, Guohui, Wang, Yadong
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2015
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4647709/
https://www.ncbi.nlm.nih.gov/pubmed/26573684
http://dx.doi.org/10.1186/s12859-015-0818-3
Descripción
Sumario:BACKGROUND: Because of the short read length of high throughput sequencing data, assembly errors are introduced in genome assembly, which may have adverse impact to the downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with some similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and determining inconsistent features alone mis-assembled sequences. However, the former approach cannot distinguish real structural variations between the target genome and the reference genome while the latter approach could have many false positive detections (correctly assembled sequence being considered as mis-assembled sequence). RESULTS: We present misFinder, a tool that aims to identify the assembly errors with high accuracy in an unbiased way and correct these errors at their mis-assembled positions to improve the assembly accuracy for downstream analysis. It combines the information of reference (or close related reference) genome and aligned paired-end reads to the assembled sequence. Assembly errors and correct assemblies corresponding to structural variations can be detected by comparing the genome reference and assembled sequence. Different types of assembly errors can then be distinguished from the mis-assembled sequence by analyzing the aligned paired-end reads using multiple features derived from coverage and consistence of insert distance to obtain high confident error calls. CONCLUSIONS: We tested the performance of misFinder on both simulated and real paired-end reads data, and misFinder gave accurate error calls with only very few miscalls. And, we further compared misFinder with QUAST and REAPR. misFinder outperformed QUAST and REAPR by 1) identified more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguished the correct assemblies corresponding to structural variations from mis-assembled sequence. misFinder can be freely downloaded from https://github.com/hitbio/misFinder. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0818-3) contains supplementary material, which is available to authorized users.