Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that ar...
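Main authors: | Tørresen, Ole K; Star, Bastiaan; Mier, Pablo; Andrade-Navarro, Miguel A; Bateman, Alex; Jarnot, Patryk; Gruca, Aleksandra; Grynberg, Marcin; Kajava, Andrey V; Promponas, Vasilis J; Anisimova, Maria; Jakobsen, Kjetill S; Linke, Dirk |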
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2019 |
Subjects: | Survey and Summary |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868369/ https://www.ncbi.nlm.nih.gov/pubmed/31584084 http://dx.doi.org/10.1093/nar/gkz841 |
_version_ | 1783472243591872512 |
---|---|
author | Tørresen, Ole K Star, Bastiaan Mier, Pablo Andrade-Navarro, Miguel A Bateman, Alex Jarnot, Patryk Gruca, Aleksandra Grynberg, Marcin Kajava, Andrey V Promponas, Vasilis J Anisimova, Maria Jakobsen, Kjetill S Linke, Dirk |
author_facet | Tørresen, Ole K Star, Bastiaan Mier, Pablo Andrade-Navarro, Miguel A Bateman, Alex Jarnot, Patryk Gruca, Aleksandra Grynberg, Marcin Kajava, Andrey V Promponas, Vasilis J Anisimova, Maria Jakobsen, Kjetill S Linke, Dirk |
author_sort | Tørresen, Ole K |
collection | PubMed |
description | The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. |
format | Online Article Text |
id | pubmed-6868369 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-68683692019-11-27 Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases Tørresen, Ole K Star, Bastiaan Mier, Pablo Andrade-Navarro, Miguel A Bateman, Alex Jarnot, Patryk Gruca, Aleksandra Grynberg, Marcin Kajava, Andrey V Promponas, Vasilis J Anisimova, Maria Jakobsen, Kjetill S Linke, Dirk Nucleic Acids Res Survey and Summary The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. Oxford University Press 2019-12-02 2019-10-04 /pmc/articles/PMC6868369/ /pubmed/31584084 http://dx.doi.org/10.1093/nar/gkz841 Text en © The Author(s) 2019. 
Published by Oxford University Press on behalf of Nucleic Acids Research. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Survey and Summary Tørresen, Ole K Star, Bastiaan Mier, Pablo Andrade-Navarro, Miguel A Bateman, Alex Jarnot, Patryk Gruca, Aleksandra Grynberg, Marcin Kajava, Andrey V Promponas, Vasilis J Anisimova, Maria Jakobsen, Kjetill S Linke, Dirk Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases |
title | Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases |
title_full | Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases |
title_fullStr | Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases |
title_full_unstemmed | Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases |
title_short | Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases |
title_sort | tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases |
topic | Survey and Summary |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868369/ https://www.ncbi.nlm.nih.gov/pubmed/31584084 http://dx.doi.org/10.1093/nar/gkz841 |