Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction

BACKGROUND: With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be use...

Bibliographic Details
Main authors: Palmer, Lance E; Dejori, Mathaeus; Bolanos, Randall; Fasulo, Daniel
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2824677/
https://www.ncbi.nlm.nih.gov/pubmed/20078885
http://dx.doi.org/10.1186/1471-2105-11-33
_version_ 1782177715824099328
author Palmer, Lance E
Dejori, Mathaeus
Bolanos, Randall
Fasulo, Daniel
author_facet Palmer, Lance E
Dejori, Mathaeus
Bolanos, Randall
Fasulo, Daniel
author_sort Palmer, Lance E
collection PubMed
description BACKGROUND: With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps. RESULTS: We present an approach that extends the Minimus assembler by a data driven step to classify overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies. CONCLUSIONS: Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.
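The description above reports its main result in terms of N50. As a reading aid, here is a minimal sketch of how N50 is commonly computed: it is a length-weighted median, that is, the contig length L such that contigs of length L or longer contain at least half of all assembled bases. The function name `n50` and the example lengths are hypothetical and are not taken from the paper or from the Minimus assembler.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down, accumulating bases until
    # half of the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

For these illustrative lengths the total is 300 bases, and the two longest contigs (80 + 70) already reach half of that, so the sketch reports 70, whereas the plain median of the list is 40. The gap shows why N50 rewards assemblies with a few long contigs.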
format Text
id pubmed-2824677
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-28246772010-02-19 Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction Palmer, Lance E Dejori, Mathaeus Bolanos, Randall Fasulo, Daniel BMC Bioinformatics Research article BACKGROUND: With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps. RESULTS: We present an approach that extends the Minimus assembler by a data driven step to classify overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies. 
CONCLUSIONS: Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly. BioMed Central 2010-01-15 /pmc/articles/PMC2824677/ /pubmed/20078885 http://dx.doi.org/10.1186/1471-2105-11-33 Text en Copyright ©2010 Palmer et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research article
Palmer, Lance E
Dejori, Mathaeus
Bolanos, Randall
Fasulo, Daniel
Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction
title Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction
title_full Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction
title_fullStr Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction
title_full_unstemmed Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction
title_short Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction
title_sort improving de novo sequence assembly using machine learning and comparative genomics for overlap correction
topic Research article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2824677/
https://www.ncbi.nlm.nih.gov/pubmed/20078885
http://dx.doi.org/10.1186/1471-2105-11-33
work_keys_str_mv AT palmerlancee improvingdenovosequenceassemblyusingmachinelearningandcomparativegenomicsforoverlapcorrection
AT dejorimathaeus improvingdenovosequenceassemblyusingmachinelearningandcomparativegenomicsforoverlapcorrection
AT bolanosrandall improvingdenovosequenceassemblyusingmachinelearningandcomparativegenomicsforoverlapcorrection
AT fasulodaniel improvingdenovosequenceassemblyusingmachinelearningandcomparativegenomicsforoverlapcorrection