False gene and chromosome losses in genome assemblies caused by GC content variation and repeats

BACKGROUND: Many short-read genome assemblies have been found to be incomplete and contain mis-assemblies. The Vertebrate Genomes Project has been producing new reference genome assemblies with an emphasis on being as complete and error-free as possible, which requires utilizing long reads, long-ran...

Bibliographic Details
Main Authors: Kim, Juwan; Lee, Chul; Ko, Byung June; Yoo, Dong Ahn; Won, Sohyoung; Phillippy, Adam M.; Fedrigo, Olivier; Zhang, Guojie; Howe, Kerstin; Wood, Jonathan; Durbin, Richard; Formenti, Giulio; Brown, Samara; Cantin, Lindsey; Mello, Claudio V.; Cho, Seoae; Rhie, Arang; Kim, Heebal; Jarvis, Erich D.
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9516821/
https://www.ncbi.nlm.nih.gov/pubmed/36167554
http://dx.doi.org/10.1186/s13059-022-02765-0
author Kim, Juwan
Lee, Chul
Ko, Byung June
Yoo, Dong Ahn
Won, Sohyoung
Phillippy, Adam M.
Fedrigo, Olivier
Zhang, Guojie
Howe, Kerstin
Wood, Jonathan
Durbin, Richard
Formenti, Giulio
Brown, Samara
Cantin, Lindsey
Mello, Claudio V.
Cho, Seoae
Rhie, Arang
Kim, Heebal
Jarvis, Erich D.
author_sort Kim, Juwan
collection PubMed
description BACKGROUND: Many short-read genome assemblies have been found to be incomplete and contain mis-assemblies. The Vertebrate Genomes Project has been producing new reference genome assemblies with an emphasis on being as complete and error-free as possible, which requires utilizing long reads, long-range scaffolding data, new assembly algorithms, and manual curation. A more thorough evaluation of the recent references relative to prior assemblies can provide a detailed overview of the types and magnitude of improvements. RESULTS: Here we evaluate new vertebrate genome references relative to the previous assemblies for the same species and, in two cases, the same individuals, including a mammal (platypus), two birds (zebra finch, Anna’s hummingbird), and a fish (climbing perch). We find that up to 11% of genomic sequence is entirely missing in the previous assemblies. In the Vertebrate Genomes Project zebra finch assembly, we identify eight new GC- and repeat-rich micro-chromosomes with high gene density. The impact of missing sequences is biased towards GC-rich 5′-proximal promoters and 5′ exon regions of protein-coding genes and long non-coding RNAs. Between 26 and 60% of genes include structural or sequence errors that could lead to misunderstanding of their function when using the previous genome assemblies. CONCLUSIONS: Our findings reveal novel regulatory landscapes and protein coding sequences that have been greatly underestimated in previous assemblies and are now present in the Vertebrate Genomes Project reference genomes. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-022-02765-0.
format Online
Article
Text
id pubmed-9516821
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9516821 2022-09-29 False gene and chromosome losses in genome assemblies caused by GC content variation and repeats Kim, Juwan Lee, Chul Ko, Byung June Yoo, Dong Ahn Won, Sohyoung Phillippy, Adam M. Fedrigo, Olivier Zhang, Guojie Howe, Kerstin Wood, Jonathan Durbin, Richard Formenti, Giulio Brown, Samara Cantin, Lindsey Mello, Claudio V. Cho, Seoae Rhie, Arang Kim, Heebal Jarvis, Erich D. Genome Biol Research BioMed Central 2022-09-27 /pmc/articles/PMC9516821/ /pubmed/36167554 http://dx.doi.org/10.1186/s13059-022-02765-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title False gene and chromosome losses in genome assemblies caused by GC content variation and repeats
topic Research