
Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct

Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in...


Bibliographic Details
Main Authors: Funk, Christopher S, Kahanda, Indika, Ben-Hur, Asa, Verspoor, Karin M
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4441003/
https://www.ncbi.nlm.nih.gov/pubmed/26005564
http://dx.doi.org/10.1186/s13326-015-0006-4
author Funk, Christopher S
Kahanda, Indika
Ben-Hur, Asa
Verspoor, Karin M
author_sort Funk, Christopher S
collection PubMed
description Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature-based features are useful for predicting human protein function (F-max: Molecular Function = 0.408, Biological Process = 0.461, Cellular Component = 0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. We find through manual inspection of misclassifications that some false positive predictions could be biologically valid predictions based upon support extracted from the literature. Additionally, we present a “medium-throughput” pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13326-015-0006-4) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4441003
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4441003 2015-05-23 Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct Funk, Christopher S Kahanda, Indika Ben-Hur, Asa Verspoor, Karin M J Biomed Semantics Research BioMed Central 2015-03-18 /pmc/articles/PMC4441003/ /pubmed/26005564 http://dx.doi.org/10.1186/s13326-015-0006-4 Text en © Funk et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Funk, Christopher S
Kahanda, Indika
Ben-Hur, Asa
Verspoor, Karin M
Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct
title Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4441003/
https://www.ncbi.nlm.nih.gov/pubmed/26005564
http://dx.doi.org/10.1186/s13326-015-0006-4