Scaffolding low quality genomes using orthologous protein sequences

Motivation: The ready availability of next-generation sequencing has led to a situation where it is easy to produce very fragmentary genome assemblies. We present a pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies. The prot...

Bibliographic Details
Main authors: Li, Yang I., Copley, Richard R.
Format: Online Article Text
Language: English
Published: Oxford University Press 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3546802/
https://www.ncbi.nlm.nih.gov/pubmed/23162087
http://dx.doi.org/10.1093/bioinformatics/bts661
author Li, Yang I.
Copley, Richard R.
collection PubMed
description Motivation: The ready availability of next-generation sequencing has made it easy to produce very fragmentary genome assemblies. We present a pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies. The protein sequences are used as guides to scaffold existing contigs, while simultaneously allowing the gene structure to be predicted by homology. Results: SWiPS does not depend on a high N50 or on whole proteins being encoded on a single contig. We tested our algorithm on simulated next-generation data from Ciona intestinalis, real next-generation data from Drosophila melanogaster, a complex genome assembly of Homo sapiens and the low coverage Sanger sequence assembly of Callorhinchus milii. The improvements in N50 are of the order of ∼20% for the C. intestinalis and H. sapiens assemblies, which is significant considering the large size of intergenic regions in these eukaryotes. Using the CEGMA pipeline to assess the gene space represented in the genome assemblies, the number of genes retrieved increased by >110% for C. milii and from 20 to 40% for C. intestinalis. The scaffold error rates are low: 85–90% of scaffolds are fully correct, and >95% of local contig joins are correct. Availability: SWiPS is freely available for download at http://www.well.ox.ac.uk/~yli142/swips.html. Contact: yang.li@well.ox.ac.uk or copley@well.ox.ac.uk
format Online
Article
Text
id pubmed-3546802
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3546802 2013-01-16 Scaffolding low quality genomes using orthologous protein sequences Li, Yang I. Copley, Richard R. Bioinformatics Original Papers Oxford University Press 2013-01-15 2012-11-18 /pmc/articles/PMC3546802/ /pubmed/23162087 http://dx.doi.org/10.1093/bioinformatics/bts661 Text en © The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com http://creativecommons.org/licenses/by/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Scaffolding low quality genomes using orthologous protein sequences
topic Original Papers