GODoc: high-throughput protein function prediction using novel k-nearest-neighbor and voting algorithms

Bibliographic Details
Main authors: Liu, Yi-Wei; Hsu, Tz-Wei; Chang, Che-Yu; Liao, Wen-Hung; Chang, Jia-Ming
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7672824/
https://www.ncbi.nlm.nih.gov/pubmed/33203348
http://dx.doi.org/10.1186/s12859-020-03556-9
_version_ 1783611212477497344
author Liu, Yi-Wei
Hsu, Tz-Wei
Chang, Che-Yu
Liao, Wen-Hung
Chang, Jia-Ming
collection PubMed
description BACKGROUND: Biological data has grown explosively with the advance of next-generation sequencing. However, annotating protein function with wet lab experiments is time-consuming. Fortunately, computational function prediction can help wet labs formulate biological hypotheses and prioritize experiments. Gene Ontology (GO) is a framework for unifying the representation of protein function in a hierarchical tree composed of GO terms. RESULTS: We propose GODoc, a general sequence-based protein GO prediction framework that combines feature engineering, feature reduction, and a novel k-nearest-neighbor algorithm to address the multi-label GO prediction problem. Comprehensive evaluation on CAFA2 shows that GODoc performs better than two baseline models. In the CAFA3 competition (68 teams), GODoc ranks 10th in Cellular Component Ontology. In the species-specific task, the proposed method ranks 10th in the eukaryotic Cellular Component Ontology and 8th in the prokaryotic Molecular Function Ontology. In the term-centric task, GODoc ranks third for biofilm formation of Pseudomonas aeruginosa and ties for first for long-term memory of Drosophila melanogaster. CONCLUSIONS: We have developed a novel and effective strategy for incorporating a training procedure into the k-nearest-neighbor algorithm (instance-based learning). The resulting method can solve the Gene Ontology multi-label prediction problem, which is especially notable given the thousands of Gene Ontology terms.
format Online
Article
Text
id pubmed-7672824
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7672824 2020-11-19 GODoc: high-throughput protein function prediction using novel k-nearest-neighbor and voting algorithms Liu, Yi-Wei Hsu, Tz-Wei Chang, Che-Yu Liao, Wen-Hung Chang, Jia-Ming BMC Bioinformatics Research BACKGROUND: Biological data has grown explosively with the advance of next-generation sequencing. However, annotating protein function with wet lab experiments is time-consuming. Fortunately, computational function prediction can help wet labs formulate biological hypotheses and prioritize experiments. Gene Ontology (GO) is a framework for unifying the representation of protein function in a hierarchical tree composed of GO terms. RESULTS: We propose GODoc, a general sequence-based protein GO prediction framework that combines feature engineering, feature reduction, and a novel k-nearest-neighbor algorithm to address the multi-label GO prediction problem. Comprehensive evaluation on CAFA2 shows that GODoc performs better than two baseline models. In the CAFA3 competition (68 teams), GODoc ranks 10th in Cellular Component Ontology. In the species-specific task, the proposed method ranks 10th in the eukaryotic Cellular Component Ontology and 8th in the prokaryotic Molecular Function Ontology. In the term-centric task, GODoc ranks third for biofilm formation of Pseudomonas aeruginosa and ties for first for long-term memory of Drosophila melanogaster. CONCLUSIONS: We have developed a novel and effective strategy for incorporating a training procedure into the k-nearest-neighbor algorithm (instance-based learning). The resulting method can solve the Gene Ontology multi-label prediction problem, which is especially notable given the thousands of Gene Ontology terms.
BioMed Central 2020-11-18 /pmc/articles/PMC7672824/ /pubmed/33203348 http://dx.doi.org/10.1186/s12859-020-03556-9 Text en © The Author(s) 2020. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title GODoc: high-throughput protein function prediction using novel k-nearest-neighbor and voting algorithms
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7672824/
https://www.ncbi.nlm.nih.gov/pubmed/33203348
http://dx.doi.org/10.1186/s12859-020-03556-9