
Database of Trypanosoma cruzi repeated genes: 20 000 additional gene variants

BACKGROUND: Repeats are present in all genomes, and often have important functions. However, in large genome sequencing projects, many repetitive regions remain uncharacterized. The genome of the protozoan parasite Trypanosoma cruzi consists of more than 50% repeats. These repeats include surface mo...


Bibliographic Details
Main authors:	Arner, Erik; Kindlund, Ellen; Nilsson, Daniel; Farzana, Fatima; Ferella, Marcela; Tammi, Martti T; Andersson, Björn
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2204015/ https://www.ncbi.nlm.nih.gov/pubmed/17963481 http://dx.doi.org/10.1186/1471-2164-8-391

author	Arner, Erik; Kindlund, Ellen; Nilsson, Daniel; Farzana, Fatima; Ferella, Marcela; Tammi, Martti T; Andersson, Björn
author_sort	Arner, Erik
collection	PubMed
description	BACKGROUND: Repeats are present in all genomes, and often have important functions. However, in large genome sequencing projects, many repetitive regions remain uncharacterized. The genome of the protozoan parasite Trypanosoma cruzi consists of more than 50% repeats. These repeats include surface molecule genes and several other gene families. In the T. cruzi genome sequencing project, it was clear that not all copies of repetitive genes were present in the assembly, due to collapse of nearly identical repeats. However, at the time of publication of the T. cruzi genome, it was not clear to what extent this had occurred. RESULTS: We have developed a pipeline to estimate the genomic repeat content, where shotgun reads are aligned to the genomic sequence and the gene copy number is estimated using the average shotgun coverage. This method was applied to the genome of T. cruzi, and copy numbers of all protein coding sequences and pseudogenes were estimated. The 22 640 results were stored in a database available online. 18% of all protein coding sequences and pseudogenes were estimated to exist in 14 or more copies in the T. cruzi CL Brener genome. The average coverage of the annotated protein coding sequences and pseudogenes indicates a total gene copy number, including allelic gene variants, of over 40 000. CONCLUSION: Our results indicate that the number of protein coding sequences and pseudogenes in the T. cruzi genome may be twice the previous estimate. We have constructed a database of the T. cruzi gene repeat data that is available as a resource to the community. The main purpose of the database is to enable biologists interested in repeated, unfinished regions to closely examine and resolve these regions themselves using all available shotgun data, instead of having to rely on annotated consensus sequences that are often erroneous and possibly misleading. Five repetitive genes were studied in more detail, in order to illustrate how the database can be used to analyze and extract information about gene repeats with different characteristics in Trypanosoma cruzi.
format	Text
id	pubmed-2204015
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2204015 2008-01-17 Database of Trypanosoma cruzi repeated genes: 20 000 additional gene variants Arner, Erik; Kindlund, Ellen; Nilsson, Daniel; Farzana, Fatima; Ferella, Marcela; Tammi, Martti T; Andersson, Björn BMC Genomics Research Article BioMed Central 2007-10-26 /pmc/articles/PMC2204015/ /pubmed/17963481 http://dx.doi.org/10.1186/1471-2164-8-391 Text en Copyright © 2007 Arner et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Database of Trypanosoma cruzi repeated genes: 20 000 additional gene variants
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2204015/ https://www.ncbi.nlm.nih.gov/pubmed/17963481 http://dx.doi.org/10.1186/1471-2164-8-391
