CpGcluster: a distance-based algorithm for CpG-island detection

Bibliographic Details
Main Authors: Hackenberg, Michael; Previti, Christopher; Luque-Escamilla, Pedro Luis; Carpena, Pedro; Martínez-Aroza, José; Oliver, José L
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1617122/
https://www.ncbi.nlm.nih.gov/pubmed/17038168
http://dx.doi.org/10.1186/1471-2105-7-446
_version_ 1782130509106642944
author Hackenberg, Michael
Previti, Christopher
Luque-Escamilla, Pedro Luis
Carpena, Pedro
Martínez-Aroza, José
Oliver, José L
author_sort Hackenberg, Michael
collection PubMed
description BACKGROUND: Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.
RESULTS: Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome.
CONCLUSION: CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
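The distance-based idea in the description above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_gap` threshold, the `min_cpgs` cutoff, and the choice to measure distance between CpG start positions are all assumptions made for illustration, and the p-value step mentioned in the abstract is left out because the record does not say how it is computed. The sketch uses only integer arithmetic and returns clusters that begin and end on a CpG, matching the two properties the abstract highlights.

```python
from typing import List, Tuple


def cpg_positions(seq: str) -> List[int]:
    """Return the 0-based start position of every CG dinucleotide in seq."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i] == "C" and s[i + 1] == "G"]


def cluster_cpgs(seq: str, max_gap: int, min_cpgs: int = 2) -> List[Tuple[int, int, int]]:
    """Group neighboring CpGs whose spacing is at most max_gap (hypothetical threshold).

    Distance is taken as the difference between CpG start positions, which is
    an assumption of this sketch. Returns (start, end, n_cpgs) tuples with
    start inclusive and end exclusive, so seq[start:end] begins and ends
    with "CG".
    """
    positions = cpg_positions(seq)
    clusters: List[Tuple[int, int, int]] = []
    if not positions:
        return clusters

    first = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev <= max_gap:
            # Still inside the current cluster.
            count += 1
        else:
            # Gap too large: close the current cluster and start a new one.
            if count >= min_cpgs:
                clusters.append((first, prev + 2, count))
            first = pos
            count = 1
        prev = pos

    # Close the final cluster.
    if count >= min_cpgs:
        clusters.append((first, prev + 2, count))
    return clusters


if __name__ == "__main__":
    example = "CGACG" + "T" * 10 + "CGCGACG"
    for start, end, n in cluster_cpgs(example, max_gap=3):
        print(start, end, n, example[start:end])
```

Because the gap is the only parameter here, no length, G+C content, or CpG-fraction threshold enters the search; ranking the resulting clusters by statistical significance would be a separate step.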
format Text
id pubmed-1617122
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1617122 2006-10-20 CpGcluster: a distance-based algorithm for CpG-island detection Hackenberg, Michael Previti, Christopher Luque-Escamilla, Pedro Luis Carpena, Pedro Martínez-Aroza, José Oliver, José L BMC Bioinformatics Research Article BioMed Central 2006-10-12 /pmc/articles/PMC1617122/ /pubmed/17038168 http://dx.doi.org/10.1186/1471-2105-7-446 Text en Copyright © 2006 Hackenberg et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title CpGcluster: a distance-based algorithm for CpG-island detection
topic Research Article