
Error and Error Mitigation in Low-Coverage Genome Assemblies

The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential im...

Bibliographic Details
Main authors: Hubisz, Melissa J., Lin, Michael F., Kellis, Manolis, Siepel, Adam
Format: Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3038916/
https://www.ncbi.nlm.nih.gov/pubmed/21340033
http://dx.doi.org/10.1371/journal.pone.0017034
author Hubisz, Melissa J.
Lin, Michael F.
Kellis, Manolis
Siepel, Adam
collection PubMed
description The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
format Text
id pubmed-3038916
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3038916 2011-02-18 Error and Error Mitigation in Low-Coverage Genome Assemblies Hubisz, Melissa J. Lin, Michael F. Kellis, Manolis Siepel, Adam PLoS One Research Article Public Library of Science 2011-02-14 /pmc/articles/PMC3038916/ /pubmed/21340033 http://dx.doi.org/10.1371/journal.pone.0017034 Text en Hubisz et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Error and Error Mitigation in Low-Coverage Genome Assemblies
topic Research Article