Unexpected observations after mapping LongSAGE tags to the human genome

BACKGROUND: SAGE has been used widely to study the expression of known transcripts, but much less to annotate new transcribed regions. LongSAGE produces tags that are sufficiently long to be reliably mapped to a whole-genome sequence. Here we used this property to study the position of human LongSAG...

Bibliographic Details
Main Authors: Keime, Céline; Sémon, Marie; Mouchiroud, Dominique; Duret, Laurent; Gandrillon, Olivier
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1884178/
https://www.ncbi.nlm.nih.gov/pubmed/17504516
http://dx.doi.org/10.1186/1471-2105-8-154
author Keime, Céline
Sémon, Marie
Mouchiroud, Dominique
Duret, Laurent
Gandrillon, Olivier
author_sort Keime, Céline
collection PubMed
description BACKGROUND: SAGE has been used widely to study the expression of known transcripts, but much less to annotate new transcribed regions. LongSAGE produces tags that are sufficiently long to be reliably mapped to a whole-genome sequence. Here we used this property to study the position of human LongSAGE tags obtained from all public libraries. We focused mainly on tags that do not map to known transcripts. RESULTS: Using a published error rate in SAGE libraries, we first removed the tags likely to result from sequencing errors. We then observed that an unexpectedly large number of the remaining tags still did not match the genome sequence. Some of these correspond to parts of human mRNAs, such as polyA tails, junctions between two exons and polymorphic regions of transcripts. Another non-negligible proportion can be attributed to contamination by murine transcripts and to residual sequencing errors. After filtering our data with these screens to ensure that our dataset is highly reliable, we studied the tags that map once to the genome. Of these tags, 31% correspond to unannotated transcripts. The others map to known transcribed regions, but many of them (nearly half) are located either in antisense or in new variants of these known transcripts. CONCLUSION: We performed a comprehensive study of all publicly available human LongSAGE tags, and carefully verified the reliability of these data. We identified the potential origin of many tags that did not match the human genome sequence. The properties of the remaining tags imply that the level of sequencing error may have been underestimated. The frequency of tags that match the genome sequence once but not in an annotated exon suggests that the human transcriptome is much more complex than shown by the current human genome annotations, with many new splicing variants and antisense transcripts. SAGE data are appropriate for mapping new transcripts to the genome, as demonstrated by the high rate of cross-validation of the corresponding tags using other methods.
format Text
id pubmed-1884178
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1884178 2007-05-30 BMC Bioinformatics Research Article BioMed Central 2007-05-15 /pmc/articles/PMC1884178/ /pubmed/17504516 http://dx.doi.org/10.1186/1471-2105-8-154 Text en Copyright © 2007 Keime et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Unexpected observations after mapping LongSAGE tags to the human genome
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1884178/
https://www.ncbi.nlm.nih.gov/pubmed/17504516
http://dx.doi.org/10.1186/1471-2105-8-154