Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid–Binding Proteins

Function annotation efforts provide a foundation to our understanding of cellular processes and the functioning of the living cell. This motivates high-throughput computational methods to characterize new protein members of a particular function. Research work has focused on discriminative machine-l...

Bibliographic Details
Main authors: Wang, Wenchuan, Langlois, Robert, Langlois, Marina, Genchev, Georgi Z., Wang, Xiaolei, Lu, Hui
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6729729/
https://www.ncbi.nlm.nih.gov/pubmed/31543893
http://dx.doi.org/10.3389/fgene.2019.00729
author Wang, Wenchuan
Langlois, Robert
Langlois, Marina
Genchev, Georgi Z.
Wang, Xiaolei
Lu, Hui
collection PubMed
description Function annotation efforts provide a foundation to our understanding of cellular processes and the functioning of the living cell. This motivates high-throughput computational methods to characterize new protein members of a particular function. Research work has focused on discriminative machine-learning methods, which promise to make efficient, de novo predictions of protein function. Furthermore, available function annotation exists predominantly for individual proteins rather than for residues, of which only a subset is necessary for the conveyance of a particular function. This limits discriminative approaches to predicting functions for which there is sufficient residue-level annotation, e.g., identification of DNA-binding proteins, or for which an excellent global representation can be divined. Complete understanding of the various functions of proteins requires discovery and functional annotation at the residue level. Herein, we cast this problem into the setting of multiple-instance learning, which only requires knowledge of the protein’s function yet identifies functionally relevant residues and need not rely on homology. We developed a new multiple-instance learning algorithm derived from AdaBoost and benchmarked this algorithm on two well-studied protein function prediction tasks: annotating proteins that bind DNA and RNA. This algorithm outperforms certain previous approaches in annotating protein function while identifying functionally relevant residues involved in binding both DNA and RNA, and on one protein-DNA benchmark, it achieves near-perfect classification.
format Online
Article
Text
id pubmed-6729729
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-6729729 2019-09-20 Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid–Binding Proteins Wang, Wenchuan Langlois, Robert Langlois, Marina Genchev, Georgi Z. Wang, Xiaolei Lu, Hui Front Genet Genetics Frontiers Media S.A.
2019-08-30 /pmc/articles/PMC6729729/ /pubmed/31543893 http://dx.doi.org/10.3389/fgene.2019.00729 Text en Copyright © 2019 Wang, Langlois, Langlois, Genchev, Wang and Lu http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid–Binding Proteins
topic Genetics