Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and...

Bibliographic Details
Main authors: Sarmashghi, Shahab, Balaban, Metin, Rachtman, Eleonora, Touri, Behrouz, Mirarab, Siavash, Bafna, Vineet
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8629397/
https://www.ncbi.nlm.nih.gov/pubmed/34780468
http://dx.doi.org/10.1371/journal.pcbi.1009449
_version_ 1784607197865443328
author Sarmashghi, Shahab
Balaban, Metin
Rachtman, Eleonora
Touri, Behrouz
Mirarab, Siavash
Bafna, Vineet
author_facet Sarmashghi, Shahab
Balaban, Metin
Rachtman, Eleonora
Touri, Behrouz
Mirarab, Siavash
Bafna, Vineet
author_sort Sarmashghi, Shahab
collection PubMed
description The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
format Online
Article
Text
id pubmed-8629397
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-86293972021-11-30 Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT Sarmashghi, Shahab Balaban, Metin Rachtman, Eleonora Touri, Behrouz Mirarab, Siavash Bafna, Vineet PLoS Comput Biol Research Article The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
Public Library of Science 2021-11-15 /pmc/articles/PMC8629397/ /pubmed/34780468 http://dx.doi.org/10.1371/journal.pcbi.1009449 Text en © 2021 Sarmashghi et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Sarmashghi, Shahab
Balaban, Metin
Rachtman, Eleonora
Touri, Behrouz
Mirarab, Siavash
Bafna, Vineet
Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
title Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
title_full Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
title_fullStr Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
title_full_unstemmed Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
title_short Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
title_sort estimating repeat spectra and genome length from low-coverage genome skims with respect
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8629397/
https://www.ncbi.nlm.nih.gov/pubmed/34780468
http://dx.doi.org/10.1371/journal.pcbi.1009449