Scaffolding low quality genomes using orthologous protein sequences
Motivation: The ready availability of next-generation sequencing has led to a situation where it is easy to produce very fragmentary genome assemblies. We present a pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies. The prot...
Main Authors: | Li, Yang I.; Copley, Richard R. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2013 |
Subjects: | Original Papers |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3546802/ https://www.ncbi.nlm.nih.gov/pubmed/23162087 http://dx.doi.org/10.1093/bioinformatics/bts661 |
_version_ | 1782256115042484224 |
---|---|
author | Li, Yang I. Copley, Richard R. |
author_facet | Li, Yang I. Copley, Richard R. |
author_sort | Li, Yang I. |
collection | PubMed |
description | Motivation: The ready availability of next-generation sequencing has led to a situation where it is easy to produce very fragmentary genome assemblies. We present a pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies. The protein sequences are used as guides to scaffold existing contigs, while simultaneously allowing the gene structure to be predicted by homology. Results: To perform, SWiPS does not depend on a high N50 or whole proteins being encoded on a single contig. We tested our algorithm on simulated next-generation data from Ciona intestinalis, real next-generation data from Drosophila melanogaster, a complex genome assembly of Homo sapiens and the low coverage Sanger sequence assembly of Callorhinchus milii. The improvements in N50 are of the order of ∼20% for the C.intestinalis and H.sapiens assemblies, which is significant, considering the large size of intergenic regions in these eukaryotes. Using the CEGMA pipeline to assess the gene space represented in the genome assemblies, the number of genes retrieved increased by >110% for C.milii and from 20 to 40% for C.intestinalis. The scaffold error rates are low: 85–90% of scaffolds are fully correct, and >95% of local contig joins are correct. Availability: SWiPS is available freely for download at http://www.well.ox.ac.uk/~yli142/swips.html. Contact: yang.li@well.ox.ac.uk or copley@well.ox.ac.uk |
format | Online Article Text |
id | pubmed-3546802 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-35468022013-01-16 Scaffolding low quality genomes using orthologous protein sequences Li, Yang I. Copley, Richard R. Bioinformatics Original Papers Motivation: The ready availability of next-generation sequencing has led to a situation where it is easy to produce very fragmentary genome assemblies. We present a pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies. The protein sequences are used as guides to scaffold existing contigs, while simultaneously allowing the gene structure to be predicted by homology. Results: To perform, SWiPS does not depend on a high N50 or whole proteins being encoded on a single contig. We tested our algorithm on simulated next-generation data from Ciona intestinalis, real next-generation data from Drosophila melanogaster, a complex genome assembly of Homo sapiens and the low coverage Sanger sequence assembly of Callorhinchus milii. The improvements in N50 are of the order of ∼20% for the C.intestinalis and H.sapiens assemblies, which is significant, considering the large size of intergenic regions in these eukaryotes. Using the CEGMA pipeline to assess the gene space represented in the genome assemblies, the number of genes retrieved increased by >110% for C.milii and from 20 to 40% for C.intestinalis. The scaffold error rates are low: 85–90% of scaffolds are fully correct, and >95% of local contig joins are correct. Availability: SWiPS is available freely for download at http://www.well.ox.ac.uk/∼yli142/swips.html. Contact: yang.li@well.ox.ac.uk or copley@well.ox.ac.uk Oxford University Press 2013-01-15 2012-11-18 /pmc/articles/PMC3546802/ /pubmed/23162087 http://dx.doi.org/10.1093/bioinformatics/bts661 Text en © The Author 2012. Published by Oxford University Press. All rights reserved. 
For Permissions, please e-mail: journals.permissions@oup.com http://creativecommons.org/licenses/by/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Li, Yang I. Copley, Richard R. Scaffolding low quality genomes using orthologous protein sequences |
title | Scaffolding low quality genomes using orthologous protein sequences |
title_full | Scaffolding low quality genomes using orthologous protein sequences |
title_fullStr | Scaffolding low quality genomes using orthologous protein sequences |
title_full_unstemmed | Scaffolding low quality genomes using orthologous protein sequences |
title_short | Scaffolding low quality genomes using orthologous protein sequences |
title_sort | scaffolding low quality genomes using orthologous protein sequences |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3546802/ https://www.ncbi.nlm.nih.gov/pubmed/23162087 http://dx.doi.org/10.1093/bioinformatics/bts661 |
work_keys_str_mv | AT liyangi scaffoldinglowqualitygenomesusingorthologousproteinsequences AT copleyrichardr scaffoldinglowqualitygenomesusingorthologousproteinsequences |