GENCODE: producing a reference annotation for ENCODE

BACKGROUND: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on thes...

Bibliographic Details
Main authors: Harrow, Jennifer, Denoeud, France, Frankish, Adam, Reymond, Alexandre, Chen, Chao-Kung, Chrast, Jacqueline, Lagarde, Julien, Gilbert, James GR, Storey, Roy, Swarbreck, David, Rossier, Colette, Ucla, Catherine, Hubbard, Tim, Antonarakis, Stylianos E, Guigo, Roderic
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1810553/
https://www.ncbi.nlm.nih.gov/pubmed/16925838
http://dx.doi.org/10.1186/gb-2006-7-s1-s4
author Harrow, Jennifer
Denoeud, France
Frankish, Adam
Reymond, Alexandre
Chen, Chao-Kung
Chrast, Jacqueline
Lagarde, Julien
Gilbert, James GR
Storey, Roy
Swarbreck, David
Rossier, Colette
Ucla, Catherine
Hubbard, Tim
Antonarakis, Stylianos E
Guigo, Roderic
collection PubMed
description BACKGROUND: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results. RESULTS: The GENCODE gene features are divided into eight different categories, of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions. CONCLUSION: In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, which is a reflection of the high number of alternative splice forms with unique exons annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of 1% of the human genome with the aid of experimental validation.
format Text
id pubmed-1810553
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1810553 2007-03-07 GENCODE: producing a reference annotation for ENCODE Genome Biol Research BioMed Central 2006 2006-08-07 /pmc/articles/PMC1810553/ /pubmed/16925838 http://dx.doi.org/10.1186/gb-2006-7-s1-s4 Text en Copyright © 2006 Harrow et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title GENCODE: producing a reference annotation for ENCODE
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1810553/
https://www.ncbi.nlm.nih.gov/pubmed/16925838
http://dx.doi.org/10.1186/gb-2006-7-s1-s4