High-precision high-coverage functional inference from integrated data sources


Bibliographic Details
Main authors: Linghu, Bolan; Snitkin, Evan S; Holloway, Dustin T; Gustafson, Adam M; Xia, Yu; DeLisi, Charles
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2292694/
https://www.ncbi.nlm.nih.gov/pubmed/18298847
http://dx.doi.org/10.1186/1471-2105-9-119
_version_ 1782152511363219456
author Linghu, Bolan
Snitkin, Evan S
Holloway, Dustin T
Gustafson, Adam M
Xia, Yu
DeLisi, Charles
author_sort Linghu, Bolan
collection PubMed
description BACKGROUND: Information obtained from diverse data sources can be combined in a principled manner using various machine learning methods to increase the reliability and range of knowledge about protein function. The result is a weighted functional linkage network (FLN) in which linked neighbors share at least one function with high probability. Precision is, however, low. Aiming to provide precise functional annotation for as many proteins as possible, we explore and propose a two-step framework for functional annotation: (1) construction of a high-coverage and reliable FLN via machine learning techniques; (2) development of a decision rule for the constructed FLN to optimize functional annotation. RESULTS: We first apply this framework to Saccharomyces cerevisiae. In the first step, we demonstrate that four commonly used machine learning methods, Linear SVM, Linear Discriminant Analysis, Naïve Bayes, and Neural Network, all combine heterogeneous data to produce reliable and high-coverage FLNs, in which the linkage weight estimates the functional coupling of linked proteins more accurately than do individual data sources used alone. In the second step, empirical tuning of an adjustable decision rule on the constructed FLN reveals that basing annotation on maximum edge weight results in the most precise annotation at high coverage. In particular, at low coverage, all rules evaluated perform comparably. At coverage above approximately 50%, however, they diverge rapidly. At full coverage, the maximum weight decision rule still has a precision of approximately 70%, whereas for other methods, precision ranges from a high of slightly more than 30% down to 3%. In addition, a scoring scheme to estimate the precision of individual predictions is provided. Finally, tests of the robustness of the framework indicate that it can be successfully applied to less studied organisms.
CONCLUSION: We provide a general two-step function-annotation framework, and show that high-coverage, high-precision annotations can be achieved by constructing a high-coverage and reliable FLN via data integration, followed by applying a maximum weight decision rule.
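The maximum weight decision rule named in the abstract can be illustrated with a minimal sketch. The abstract does not give the rule's exact form, so the version below is an assumption: an unannotated protein inherits the functions of the annotated neighbor to which it is linked by the largest edge weight. The function name `max_weight_annotate`, the dictionary layout, and the toy proteins and weights are all hypothetical and are not taken from the article.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical data layout: edges[p][q] is the FLN linkage weight between p and q;
# annotations[q] is the set of known functions of protein q.
Edges = Dict[str, Dict[str, float]]
Annotations = Dict[str, Set[str]]


def max_weight_annotate(
    protein: str, edges: Edges, annotations: Annotations
) -> Optional[Tuple[Set[str], float]]:
    """Assumed form of a maximum weight rule: return the functions of the
    annotated neighbor with the largest edge weight, plus that weight.
    Returns None when the protein has no annotated neighbor."""
    best_neighbor = None
    best_weight = float("-inf")
    for neighbor, weight in edges.get(protein, {}).items():
        if neighbor in annotations and weight > best_weight:
            best_neighbor, best_weight = neighbor, weight
    if best_neighbor is None:
        return None
    return annotations[best_neighbor], best_weight


# Toy network (made-up weights, for illustration only).
edges = {
    "P1": {"A": 0.9, "B": 0.4},
    "P2": {"B": 0.2},
    "P3": {},
}
annotations = {"A": {"transport"}, "B": {"DNA repair"}}

for p in ("P1", "P2", "P3"):
    print(p, max_weight_annotate(p, edges, annotations))
```

Returning the winning edge weight alongside the predicted functions makes it possible to rank predictions or to keep only those above a chosen weight threshold, which is one plausible way to trade coverage for precision under these assumptions.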
format Text
id pubmed-2292694
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2292694 2008-04-14 High-precision high-coverage functional inference from integrated data sources Linghu, Bolan Snitkin, Evan S Holloway, Dustin T Gustafson, Adam M Xia, Yu DeLisi, Charles BMC Bioinformatics Research Article BACKGROUND: Information obtained from diverse data sources can be combined in a principled manner using various machine learning methods to increase the reliability and range of knowledge about protein function. The result is a weighted functional linkage network (FLN) in which linked neighbors share at least one function with high probability. Precision is, however, low. Aiming to provide precise functional annotation for as many proteins as possible, we explore and propose a two-step framework for functional annotation: (1) construction of a high-coverage and reliable FLN via machine learning techniques; (2) development of a decision rule for the constructed FLN to optimize functional annotation. RESULTS: We first apply this framework to Saccharomyces cerevisiae. In the first step, we demonstrate that four commonly used machine learning methods, Linear SVM, Linear Discriminant Analysis, Naïve Bayes, and Neural Network, all combine heterogeneous data to produce reliable and high-coverage FLNs, in which the linkage weight estimates the functional coupling of linked proteins more accurately than do individual data sources used alone. In the second step, empirical tuning of an adjustable decision rule on the constructed FLN reveals that basing annotation on maximum edge weight results in the most precise annotation at high coverage. In particular, at low coverage, all rules evaluated perform comparably. At coverage above approximately 50%, however, they diverge rapidly. At full coverage, the maximum weight decision rule still has a precision of approximately 70%, whereas for other methods, precision ranges from a high of slightly more than 30% down to 3%.
In addition, a scoring scheme to estimate the precision of individual predictions is provided. Finally, tests of the robustness of the framework indicate that it can be successfully applied to less studied organisms. CONCLUSION: We provide a general two-step function-annotation framework, and show that high-coverage, high-precision annotations can be achieved by constructing a high-coverage and reliable FLN via data integration, followed by applying a maximum weight decision rule. BioMed Central 2008-02-25 /pmc/articles/PMC2292694/ /pubmed/18298847 http://dx.doi.org/10.1186/1471-2105-9-119 Text en Copyright © 2008 Linghu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Linghu, Bolan
Snitkin, Evan S
Holloway, Dustin T
Gustafson, Adam M
Xia, Yu
DeLisi, Charles
High-precision high-coverage functional inference from integrated data sources
title High-precision high-coverage functional inference from integrated data sources
title_sort high-precision high-coverage functional inference from integrated data sources
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2292694/
https://www.ncbi.nlm.nih.gov/pubmed/18298847
http://dx.doi.org/10.1186/1471-2105-9-119