
Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions

BACKGROUND: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the co...


Bibliographic Details
Main authors: Levitsky, Victor G; Ignatieva, Elena V; Ananko, Elena A; Turnaev, Igor I; Merkulova, Tatyana I; Kolchanov, Nikolay A; Hodgman, TC
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2265442/
https://www.ncbi.nlm.nih.gov/pubmed/18093302
http://dx.doi.org/10.1186/1471-2105-8-481
author Levitsky, Victor G
Ignatieva, Elena V
Ananko, Elena A
Turnaev, Igor I
Merkulova, Tatyana I
Kolchanov, Nikolay A
Hodgman, TC
collection PubMed
description BACKGROUND: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered. RESULTS: To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to biomedically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), and the regulation of steroidogenic (SF-1) and cell cycle (E2F) gene expression. We have also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies. To ensure higher confidence in our approach, we applied resampling tests (jackknife and bootstrap) for the comparison; optimized PWMs and SiteGA showed similar recognition performance. We then applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for binding sites using the web tool, SiteGA. Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which span both core and flanking regions. When SiteGA and optimized PWM models were applied together, false positives were substantially reduced, at least at higher stringencies. CONCLUSION: Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
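For readers unfamiliar with the PWM scoring mentioned in the abstract, the following is a minimal sketch of a log-odds Position Weight Matrix built from aligned sites and slid along a sequence. It is not the authors' implementation and not SiteGA. The function names (build_pwm, scan), the pseudocount of 0.5, the uniform background of 0.25 and the three example sites are all assumptions made for illustration only.

```python
import math


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.

    Returns a list with one dict per position, mapping each base
    to log2(smoothed frequency / background frequency).
    pseudocount and background are illustrative assumptions.
    """
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(pwm, sequence):
    """Score every window of the sequence; return (best_score, best_offset).

    The sequence must contain only A, C, G and T, and must be at least
    as long as the PWM; otherwise None is returned.
    """
    width = len(pwm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


# Hypothetical aligned sites, made up for this example (not data from the paper).
example_sites = ["GGGACTTTCC", "GGGAATTTCC", "GGGACTTCCC"]
example_pwm = build_pwm(example_sites)
best_score, best_offset = scan(example_pwm, "TTTTGGGACTTTCCAAAA")
print(best_offset, round(best_score, 2))
```

In practice a score threshold decides which windows are reported as candidate sites. The abstract's point is that a PWM scores each position independently, which is the limitation SiteGA addresses by modelling dinucleotide dependencies.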
format Text
id pubmed-2265442
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2265442 2008-05-09 BMC Bioinformatics Research Article BioMed Central 2007-12-19 /pmc/articles/PMC2265442/ /pubmed/18093302 http://dx.doi.org/10.1186/1471-2105-8-481 Text en Copyright © 2007 Levitsky et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2265442/
https://www.ncbi.nlm.nih.gov/pubmed/18093302
http://dx.doi.org/10.1186/1471-2105-8-481