Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

Bibliographic Details
Main authors: Tørresen, Ole K, Star, Bastiaan, Mier, Pablo, Andrade-Navarro, Miguel A, Bateman, Alex, Jarnot, Patryk, Gruca, Aleksandra, Grynberg, Marcin, Kajava, Andrey V, Promponas, Vasilis J, Anisimova, Maria, Jakobsen, Kjetill S, Linke, Dirk
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868369/
https://www.ncbi.nlm.nih.gov/pubmed/31584084
http://dx.doi.org/10.1093/nar/gkz841
author Tørresen, Ole K
Star, Bastiaan
Mier, Pablo
Andrade-Navarro, Miguel A
Bateman, Alex
Jarnot, Patryk
Gruca, Aleksandra
Grynberg, Marcin
Kajava, Andrey V
Promponas, Vasilis J
Anisimova, Maria
Jakobsen, Kjetill S
Linke, Dirk
author_sort Tørresen, Ole K
collection PubMed
description The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
format Online
Article
Text
id pubmed-6868369
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6868369 2019-11-27 Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases Tørresen, Ole K Star, Bastiaan Mier, Pablo Andrade-Navarro, Miguel A Bateman, Alex Jarnot, Patryk Gruca, Aleksandra Grynberg, Marcin Kajava, Andrey V Promponas, Vasilis J Anisimova, Maria Jakobsen, Kjetill S Linke, Dirk Nucleic Acids Res Survey and Summary The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. Oxford University Press 2019-12-02 2019-10-04 /pmc/articles/PMC6868369/ /pubmed/31584084 http://dx.doi.org/10.1093/nar/gkz841 Text en © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
title_sort tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
topic Survey and Summary
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868369/
https://www.ncbi.nlm.nih.gov/pubmed/31584084
http://dx.doi.org/10.1093/nar/gkz841