A systematic study of genome context methods: calibration, normalization and combination

Bibliographic Details
Main authors: Ferrer, Luciana; Dale, Joseph M; Karp, Peter D
Format: Online Article Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3247869/
https://www.ncbi.nlm.nih.gov/pubmed/20920312
http://dx.doi.org/10.1186/1471-2105-11-493
_version_ 1782220183393271808
author Ferrer, Luciana
Dale, Joseph M
Karp, Peter D
author_facet Ferrer, Luciana
Dale, Joseph M
Karp, Peter D
author_sort Ferrer, Luciana
collection PubMed
description BACKGROUND: Genome context methods have been introduced in the last decade as automatic methods to predict functional relatedness between genes in a target genome using the patterns of existence and relative locations of the homologs of those genes in a set of reference genomes. Much work has been done in the application of these methods to different bioinformatics tasks, but few papers present a systematic study of the methods and their combination necessary for their optimal use. RESULTS: We present a thorough study of the four main families of genome context methods found in the literature: phylogenetic profile, gene fusion, gene cluster, and gene neighbor. We find that for most organisms the gene neighbor method outperforms the phylogenetic profile method by as much as 40% in sensitivity, being competitive with the gene cluster method at low sensitivities. Gene fusion is generally the worst performing of the four methods. A thorough exploration of the parameter space for each method is performed and results across different target organisms are presented. We propose the use of normalization procedures like those used on microarray data for the genome context scores. We show that substantial gains can be achieved from the use of a simple normalization technique. In particular, the sensitivity of the phylogenetic profile method is improved by around 25% after normalization, resulting, to our knowledge, in the best-performing phylogenetic profile system in the literature. Finally, we show results from combining the various genome context methods into a single score. When using a cross-validation procedure to train the combiners, with both original and normalized scores as input, a decision tree combiner results in gains of up to 20% with respect to the gene neighbor method. Overall, this represents a gain of around 15% over what can be considered the state of the art in this area: the four original genome context methods combined using a procedure like that used in the STRING database. Unfortunately, we find that these gains disappear when the combiner is trained only with organisms that are phylogenetically distant from the target organism. CONCLUSIONS: Our experiments indicate that gene neighbor is the best individual genome context method and that gains from the combination of individual methods are very sensitive to the training data used to obtain the combiner's parameters. If adequate training data is not available, using the gene neighbor score by itself instead of a combined score might be the best choice.
format Online
Article
Text
id pubmed-3247869
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3247869 2011-12-30 A systematic study of genome context methods: calibration, normalization and combination Ferrer, Luciana Dale, Joseph M Karp, Peter D BMC Bioinformatics Methodology Article BioMed Central 2010-10-01 /pmc/articles/PMC3247869/ /pubmed/20920312 http://dx.doi.org/10.1186/1471-2105-11-493 Text en Copyright ©2010 Ferrer et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Ferrer, Luciana
Dale, Joseph M
Karp, Peter D
A systematic study of genome context methods: calibration, normalization and combination
title A systematic study of genome context methods: calibration, normalization and combination
title_full A systematic study of genome context methods: calibration, normalization and combination
title_fullStr A systematic study of genome context methods: calibration, normalization and combination
title_full_unstemmed A systematic study of genome context methods: calibration, normalization and combination
title_short A systematic study of genome context methods: calibration, normalization and combination
title_sort systematic study of genome context methods: calibration, normalization and combination
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3247869/
https://www.ncbi.nlm.nih.gov/pubmed/20920312
http://dx.doi.org/10.1186/1471-2105-11-493